Incident Response in Platform Engineering
Overview
Why Platform Incident Response is Different
Aspect
Application Incident
Platform Incident
Service Level Objectives (SLOs)
SLI (Service Level Indicator): The measured metric
β e.g., % of CI pipeline runs completing successfully
SLO (Service Level Objective): The target threshold
β e.g., 99.5% of CI runs succeed over a rolling 28-day window
SLA (Service Level Agreement): The contract (if external)
β e.g., Platform team commits to 99.5% CI success to product teamsPlatform SLO Examples
Platform Capability
SLI
SLO
Error Budget
Incident Response Lifecycle
1. Detection
2. Triage
Severity
Definition
Response Time
3. Mitigation (Not Root Cause)
4. Resolution
Runbooks
Runbook Template
Post-Incident Reviews (Post-Mortems)
Post-Mortem Structure
5 Whys Example
On-Call Best Practices for Platform Teams
Practice
Description
Key Takeaways
Further Reading
Last updated