Incident Response in Platform Engineering

CNPA Domain: Continuous Delivery & Platform Engineering (16%)
Topic: Incident Response in Platform Engineering

Overview

Platform engineers are responsible for the reliability of the platform itself: the infrastructure, pipelines, and services that all other teams depend on. When something goes wrong, the platform team's response directly affects every team in the organization. Mature incident response practices minimize the blast radius, restore service quickly, and prevent recurrence.


Why Platform Incident Response is Different

A platform incident is unlike a typical application incident:

| Aspect | Application Incident | Platform Incident |
|---|---|---|
| Blast radius | One service / one team | Every team using the platform |
| Visibility | Affects a subset of users | Often affects all developers |
| Dependencies | Known service graph | May affect CI/CD, deployments, monitoring |
| Pressure | Business impact | Developer productivity at scale |

A down CI/CD pipeline, an unresponsive cluster API server, or a failing shared service mesh can block the entire organization from shipping.


Service Level Objectives (SLOs)

Before you can respond to incidents, you need to define what "healthy" means. SLOs are the primary mechanism.

SLI (Service Level Indicator): The measured metric
  → e.g., % of CI pipeline runs completing successfully

SLO (Service Level Objective): The target threshold
  → e.g., 99.5% of CI runs succeed over a rolling 28-day window

SLA (Service Level Agreement): The contract (if external)
  → e.g., Platform team commits to 99.5% CI success to product teams
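The SLI/SLO relationship above can be sketched in a few lines of code. This is a minimal illustration with hypothetical run data; the function names and the 200-run sample are assumptions, not part of any real platform API.

```python
# Sketch: computing an SLI from CI pipeline run outcomes and checking it
# against an SLO target. The run data below is hypothetical example data.

def sli_success_rate(runs: list) -> float:
    """SLI: fraction of CI pipeline runs that completed successfully."""
    return sum(runs) / len(runs)

def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: is the measured SLI at or above the target threshold?"""
    return sli >= slo_target

# 200 runs in the rolling window, 2 failures -> SLI = 99.0%
runs = [True] * 198 + [False] * 2
sli = sli_success_rate(runs)
print(f"SLI: {sli:.3%}, meets 99.5% SLO: {meets_slo(sli, 0.995)}")
```

With 2 failures out of 200 runs, the 99.0% SLI falls below the 99.5% target, so the check fails and error budget is being burned.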

Platform SLO Examples

| Platform Capability | SLI | SLO |
|---|---|---|
| CI pipeline | % runs completing in < 15 min | 95% |
| Cluster API server | % requests < 500ms | 99.9% |
| GitOps reconciliation | % changes applied within 5 min | 99.5% |
| Secret injection | % pod starts succeeding | 99.9% |
| Container registry | % pulls succeeding | 99.95% |

Error Budget
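The error budget is the complement of the SLO: the amount of failure the target permits (1 − SLO) over the window. A sketch of the arithmetic, using the 99.5% / 28-day example from above (function names are illustrative):

```python
# Sketch: deriving an error budget from an SLO target over a rolling window.

def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability in the window: (1 - SLO) * window."""
    return (1 - slo_target) * window_days * 24 * 60

def budget_remaining(slo_target: float, window_days: int,
                     downtime_minutes: float) -> float:
    """How much of the budget is left after observed downtime."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes

# A 99.5% SLO over a 28-day window allows roughly 201.6 minutes of downtime
budget = error_budget_minutes(0.995, 28)
print(f"Budget: {budget:.1f} min; remaining after a 90-min outage: "
      f"{budget_remaining(0.995, 28, 90):.1f} min")
```

When the budget is healthy, the team can spend it on risky changes; when it is nearly exhausted, reliability work takes priority over new features.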


Incident Response Lifecycle

1. Detection

Detection should be automated via alerting; incidents should not be discovered by frustrated developers reporting that the platform is broken.
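As a sketch, detection for the CI-pipeline SLO above could be expressed as a Prometheus alerting rule. The metric name (`ci_pipeline_runs_total`) and the runbook URL are hypothetical, assuming such a counter is exported by the CI system:

```yaml
# Hypothetical Prometheus alerting rule: page when the CI pipeline
# success rate over the last hour drops below the 99.5% SLO target.
groups:
  - name: platform-slo-alerts
    rules:
      - alert: CIPipelineSuccessRateLow
        expr: |
          sum(rate(ci_pipeline_runs_total{status="success"}[1h]))
            / sum(rate(ci_pipeline_runs_total[1h])) < 0.995
        for: 10m
        labels:
          severity: sev-2
        annotations:
          summary: "CI pipeline success rate is below the SLO target"
          runbook_url: https://runbooks.example.com/ci-pipeline-success  # hypothetical
```

Linking the runbook directly in the alert annotation means the on-call engineer lands on the remediation steps without searching.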

2. Triage

Incident Commander (IC): One person owns the incident. Others assist.

Severity Levels:

| Severity | Definition | Response Time |
|---|---|---|
| SEV-1 | Platform down, blocking all teams | Immediate |
| SEV-2 | Major degradation, blocking most teams | < 15 min |
| SEV-3 | Partial impact, workaround available | < 1 hour |
| SEV-4 | Minor impact, monitored | Next business day |
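The severity table can be encoded so triage tooling (e.g., a chatops bot) can suggest a level. This is an illustrative heuristic with simplified inputs, not a standard classification API:

```python
# Sketch: mapping rough impact signals to the severity levels in the
# table above. The inputs and thresholds are illustrative assumptions.

def classify_severity(teams_blocked_pct: float, workaround: bool) -> str:
    """Suggest a severity level from fraction of teams blocked and
    whether a workaround exists."""
    if teams_blocked_pct >= 0.9 and not workaround:
        return "SEV-1"   # platform down, blocking all teams
    if teams_blocked_pct >= 0.5:
        return "SEV-2"   # major degradation, blocking most teams
    if teams_blocked_pct > 0 and workaround:
        return "SEV-3"   # partial impact, workaround available
    return "SEV-4"       # minor impact, monitored

print(classify_severity(1.0, workaround=False))  # SEV-1
```

A suggestion like this is a starting point; the Incident Commander still owns the final severity call.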

3. Mitigation (Not Root Cause)

Mitigation goal: restore service. Root cause is for later.

Common mitigation strategies for platform incidents:

  • Roll back the failing component to the last known-good version

  • Fail over to a standby replica, node pool, or region

  • Scale out capacity to absorb load

  • Disable the failing feature or integration behind a flag

  • Restart or recreate affected workloads

4. Resolution

Once service is restored, document:

  • Timeline of events

  • Actions taken

  • Mitigation vs fix status

  • Impact metrics (duration, teams affected)


Runbooks

A runbook is a documented procedure for responding to a specific alert or condition. Runbooks let any on-call engineer respond effectively, not just the original author.

Runbook Template

A typical runbook entry includes:

  • Alert: the alert name and severity that triggers this runbook

  • Impact: what is broken and which teams or users are affected

  • Diagnosis: dashboards to check and commands to run

  • Mitigation: step-by-step actions to restore service

  • Escalation: who to page if the steps above fail


Post-Incident Reviews (Post-Mortems)

Post-mortems are blameless analyses of what happened and how to prevent recurrence.

Post-Mortem Structure

  • Summary: what happened, in one paragraph

  • Impact: duration, teams affected, error budget consumed

  • Timeline: detection, triage, mitigation, resolution

  • Root cause analysis (e.g., 5 Whys)

  • What went well / what went poorly

  • Action items: concrete, owned, with due dates

5 Whys Example

A hypothetical example for a container registry outage:

1. Why did deployments fail? The container registry was unreachable.

2. Why was it unreachable? Its TLS certificate had expired.

3. Why did the certificate expire? Automatic renewal was never configured.

4. Why was renewal not configured? The registry was provisioned manually, outside infrastructure-as-code.

5. Why was it provisioned manually? No policy requires shared services to be managed as code.

Resulting action item: bring the registry under infrastructure-as-code with automated certificate renewal.


On-Call Best Practices for Platform Teams

| Practice | Description |
|---|---|
| Page only on actionable alerts | Tune thresholds; noisy alerts cause alert fatigue |
| On-call rotation | Distribute the burden; no single-person bus factor |
| Escalation policies | Clear chain: alert → on-call → lead → manager |
| On-call handoff | 15-minute handoff meeting; document current state |
| Measure on-call health | Track pages per shift, time-to-acknowledge, resolution time |
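The on-call health metrics above are straightforward to compute from page records. A minimal sketch with hypothetical data (the record shape and field names are assumptions):

```python
# Sketch: computing on-call health metrics from page records for one shift.
# Each record holds hypothetical acknowledge/resolve offsets in minutes.

from statistics import mean

pages = [  # hypothetical pages during one on-call shift
    {"ack_min": 3, "resolve_min": 42},
    {"ack_min": 7, "resolve_min": 95},
    {"ack_min": 2, "resolve_min": 18},
]

pages_per_shift = len(pages)
mtta = mean(p["ack_min"] for p in pages)      # mean time to acknowledge
mttr = mean(p["resolve_min"] for p in pages)  # mean time to resolve

print(f"pages/shift={pages_per_shift}, MTTA={mtta:.1f} min, MTTR={mttr:.1f} min")
```

Tracking these per rotation makes it visible when a shift is unsustainable and thresholds need retuning.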


Key Takeaways

  • Platform incidents have a wide blast radius: they affect every team using the platform

  • SLOs + error budgets define "healthy" and guide reliability investment priorities

  • Automated alerting with runbook links enables fast, consistent incident response

  • Declare a clear Incident Commander to avoid confusion during high-stress events

  • Focus on mitigation first (restore service), root cause second

  • Blameless post-mortems with concrete action items prevent recurrence

