# Incident Response in Platform Engineering

> **CNPA Domain:** Continuous Delivery & Platform Engineering (16%)\
> **Topic:** Incident Response in Platform Engineering

## Overview

Platform engineers are responsible for the reliability of the platform itself — the infrastructure, pipelines, and services that all other teams depend on. When something goes wrong, the platform team's response directly affects every team in the organization. Mature **incident response** practices minimize the blast radius, restore service quickly, and prevent recurrence.

***

## Why Platform Incident Response Is Different

A platform incident is unlike a typical application incident:

| Aspect           | Application Incident      | Platform Incident                         |
| ---------------- | ------------------------- | ----------------------------------------- |
| **Blast radius** | One service / one team    | Every team using the platform             |
| **Visibility**   | Affects a subset of users | Often affects all developers              |
| **Dependencies** | Known service graph       | May affect CI/CD, deployments, monitoring |
| **Pressure**     | Business impact           | Developer productivity at scale           |

A CI/CD pipeline being down, a cluster API server being unresponsive, or a shared service mesh failing can block the entire organization from shipping.

***

## Service Level Objectives (SLOs)

Before you can respond to incidents, you need to define what "healthy" means. **SLOs** are the primary mechanism.

```
SLI (Service Level Indicator): The measured metric
  → e.g., % of CI pipeline runs completing successfully

SLO (Service Level Objective): The target threshold
  → e.g., 99.5% of CI runs succeed over a rolling 28-day window

SLA (Service Level Agreement): The contract (if external)
  → e.g., Platform team commits to 99.5% CI success to product teams
```

### Platform SLO Examples

| Platform Capability       | SLI                            | SLO    |
| ------------------------- | ------------------------------ | ------ |
| **CI pipeline**           | % runs completing in < 15 min  | 95%    |
| **Cluster API server**    | % requests < 500ms             | 99.9%  |
| **GitOps reconciliation** | % changes applied within 5 min | 99.5%  |
| **Secret injection**      | % pod starts succeeding        | 99.9%  |
| **Container registry**    | % pulls succeeding             | 99.95% |
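
SLIs like these are usually computed as Prometheus recording rules rather than ad-hoc queries. A minimal sketch for the CI pipeline SLI, reusing the illustrative `ci_pipeline_failures_total` and `ci_pipeline_runs_total` counters from the alerting rules later in this chapter (substitute your own metric names):

```yaml
# Recording rule: CI pipeline success ratio over a rolling 28-day window.
# Metric names are illustrative — adapt them to your CI system's exporter.
groups:
  - name: platform.sli
    rules:
      - record: sli:ci_pipeline_success:ratio_28d
        expr: |
          1 - (
            sum(rate(ci_pipeline_failures_total[28d]))
            /
            sum(rate(ci_pipeline_runs_total[28d]))
          )
```

Alerting on this recorded series (instead of the raw counters) keeps the SLO definition in one place.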

### Error Budget

```
Monthly error budget = (1 - SLO) × total window
Example: (1 - 0.999) × 43,200 min = 43.2 minutes/month downtime allowed

If budget is consumed → freeze non-essential changes, focus on reliability
If budget has headroom → enable risk-taking, deploy new features
```
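
The arithmetic above can be scripted for any SLO; a quick shell check using the example's own values:

```shell
#!/bin/sh
# Error budget for a 99.9% SLO over a 30-day (43,200-minute) window.
slo=0.999
window_min=43200

# awk handles the floating-point math; (1 - SLO) x window = allowed downtime
budget=$(awk -v s="$slo" -v w="$window_min" 'BEGIN { printf "%.1f", (1 - s) * w }')
echo "Error budget: ${budget} minutes/month"
```

Running this prints `Error budget: 43.2 minutes/month`, matching the example above.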

***

## Incident Response Lifecycle

```
Detection → Triage → Mitigation → Resolution → Post-Incident Review
```

### 1. Detection

Detection should be automated via alerting, not discovered by frustrated developers:

```yaml
# Prometheus alert: GitOps reconciliation lag
groups:
  - name: platform.gitops
    rules:
      - alert: GitOpsReconciliationLag
        expr: |
          (time() - gitops_reconciliation_last_success_timestamp) > 300
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "GitOps reconciliation is lagging > 5 minutes"
          runbook: "https://runbooks.example.com/gitops-lag"

      - alert: CIPipelineFailureRate
        expr: |
          rate(ci_pipeline_failures_total[5m]) /
          rate(ci_pipeline_runs_total[5m]) > 0.10
        for: 10m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "CI pipeline failure rate > 10%"
          runbook: "https://runbooks.example.com/ci-failure-rate"
```

### 2. Triage

**Incident Commander (IC):** One person owns the incident. Others assist.

```
IC: Declares incident, opens incident channel (#incident-YYYY-MM-DD-ci-failure)
IC: Assigns roles:
    - Communications lead (updates stakeholders)
    - Technical lead (investigates root cause)
    - Scribe (documents timeline in real time)
```

**Severity Levels:**

| Severity  | Definition                             | Response Time     |
| --------- | -------------------------------------- | ----------------- |
| **SEV-1** | Platform down, blocking all teams      | Immediate         |
| **SEV-2** | Major degradation, blocking most teams | < 15 min          |
| **SEV-3** | Partial impact, workaround available   | < 1 hour          |
| **SEV-4** | Minor impact, monitored                | Next business day |

### 3. Mitigation (Not Root Cause)

**Mitigation goal: restore service.** Root cause is for later.

Common mitigation strategies for platform incidents:

```bash
# Rollback a bad GitOps change
git revert HEAD
git push origin main
# ArgoCD/Flux reconciles automatically

# Scale up an overloaded component
kubectl scale deployment argocd-server -n argocd --replicas=5

# Restart a crashing controller
kubectl rollout restart deployment flux-source-controller -n flux-system

# Circuit break: pause GitOps sync while investigating
kubectl patch app payment-service -n argocd \
  --type merge -p '{"spec":{"syncPolicy":{"automated":null}}}'
```

### 4. Resolution

Once service is restored, document:

* Timeline of events
* Actions taken
* Mitigation vs fix status
* Impact metrics (duration, teams affected)
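
A lightweight summary template keeps these records consistent across incidents (the field names are a suggestion, not a standard):

```markdown
# Incident Summary: <short title>

- **Severity:** SEV-2
- **Duration:** 14:05–14:50 UTC (45 min)
- **Teams affected:** all teams using the shared CI runners
- **Detection:** CIPipelineFailureRate alert (automated)
- **Mitigation:** reverted workflow change <commit-sha>
- **Permanent fix:** tracked in <ticket>, not yet deployed
- **Error budget consumed:** 45 min of the monthly budget
```

Filling this in while the incident channel is still fresh makes the later post-mortem far easier to write.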

***

## Runbooks

A **runbook** is a documented procedure for responding to a specific alert or condition. Runbooks let any on-call engineer respond effectively, not just the original author.

### Runbook Template

```markdown
# Runbook: CI Pipeline High Failure Rate

## Alert
CIPipelineFailureRate > 10% for 10 minutes

## Impact
Developers cannot merge or deploy changes.

## Investigation Steps
1. Check CI runner health: https://github.com/org/settings/actions/runners
2. Check GitHub Actions status: https://www.githubstatus.com/
3. Check recent workflow changes: `git log --since=1h .github/workflows/`
4. Check runner resource usage: Prometheus query `ci_runner_cpu_usage`

## Mitigation Options
- If runner crash: restart runner container
  `kubectl rollout restart deployment github-runner -n ci`
- If GitHub outage: check status page, communicate ETA to teams
- If bad workflow commit: revert and merge
  `git revert <commit-sha>`

## Escalation
If not resolved in 30 min: escalate to Platform Lead
Slack: #platform-oncall

## Post-Incident
File post-mortem within 48 hours.
```

***

## Post-Incident Reviews (Post-Mortems)

Post-mortems are **blameless** analyses of what happened and how to prevent recurrence.

### Post-Mortem Structure

```
1. Summary (1 paragraph)
2. Timeline (what happened, when, who noticed)
3. Root cause analysis (5 Whys / Fishbone diagram)
4. Impact (duration, teams affected, SLO consumption)
5. What went well
6. What went poorly
7. Action items (owner + due date)
```

### 5 Whys Example

```
Problem: ArgoCD failed to deploy to production for 45 minutes

Why 1: ArgoCD application was stuck in "Progressing"
Why 2: The Deployment never became Ready
Why 3: Pods were failing with OOMKilled
Why 4: Memory limit was set too low (128Mi for a 200Mi heap app)
Why 5: No enforced memory limit policy + no load testing before rollout

Root cause: No minimum memory requirement policy + no staging performance test

Action items:
  - Add Kyverno policy: minimum memory limit 256Mi (owner: @alice, due: 2026-03-20)
  - Add load test stage to CD pipeline (owner: @bob, due: 2026-04-01)
```
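
The first action item could look roughly like the Kyverno policy below. This is a sketch, not a production policy: the 256Mi threshold comes from the example above, and the pattern syntax should be validated against your Kyverno version.

```yaml
# Kyverno policy sketch: reject Pods whose containers omit a memory
# limit or set one below 256Mi (threshold from the post-mortem action item).
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-minimum-memory-limit
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-memory-limit
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Every container must set a memory limit of at least 256Mi."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: ">=256Mi"
```

Enforcing the policy in `Audit` mode first is a common rollout path, so existing workloads can be fixed before violations become blocking.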

***

## On-Call Best Practices for Platform Teams

| Practice                           | Description                                             |
| ---------------------------------- | ------------------------------------------------------- |
| **Page only on actionable alerts** | Tune thresholds; noisy alerts cause alert fatigue       |
| **On-call rotation**               | Distribute burden; no single-person bus factor          |
| **Escalation policies**            | Clear chain: alert → oncall → lead → manager            |
| **On-call handoff**                | 15-min handoff meeting; document current state          |
| **Measure on-call health**         | Track pages/shift, time-to-acknowledge, resolution time |

***

## Key Takeaways

* Platform incidents have wide blast radius — they affect every team using the platform
* **SLOs + error budgets** define "healthy" and guide reliability investment priorities
* **Automated alerting with runbook links** enables fast, consistent incident response
* Declare a clear **Incident Commander** to avoid confusion during high-stress events
* Focus on **mitigation first** (restore service), root cause second
* **Blameless post-mortems** with concrete action items prevent recurrence

***

## Further Reading

* [Google SRE Book — Incident Management](https://sre.google/sre-book/managing-incidents/)
* [DORA — Incident Management](https://dora.dev/)
* [PagerDuty Incident Response Guide](https://response.pagerduty.com/)
* [OpenSLO Specification](https://openslo.com/)
* → Next: [Building a Platform Team](https://blog.htunnthuthu.com/getting-started/fundamentals/platform-engineering-101/platform-engineering-101-platform-team)
