Part 7: Incident Response and Management

The 3 AM Wake-Up Call That Changed Everything

It was 3:17 AM when my phone started buzzing. PagerDuty alert: payment processing down. I fumbled for my laptop, VPN'd in, and stared at dashboards trying to understand what was happening. Twenty minutes of chaos later, I realized our database had run out of connections. By the time we fixed it, we'd lost $50,000 in failed transactions.

The worst part? Two weeks earlier, we'd seen warning signs in our metrics but didn't act on them. That incident taught me that incident response isn't just about fixing problems fast; it's about detecting them early, coordinating effectively, and learning from every failure.

The Five Phases of Incident Response

Every incident follows these phases:

  1. Detection: Identifying that something is wrong

  2. Triage: Understanding severity and impact

  3. Mitigation: Stopping the bleeding

  4. Resolution: Permanently fixing the root cause

  5. Prevention: Ensuring it doesn't happen again (postmortem)

Let me walk through each phase with real processes I use.

Phase 1: Detection

The faster you detect incidents, the less damage they cause. I use multiple detection methods.

Automated Monitoring Alerts

Prometheus alerts notify me when things go wrong:
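The exact rules depend on your stack; the rules below are a minimal illustration rather than a production config (the job label, thresholds, and the postgres_exporter connection metric are assumptions):

```yaml
groups:
  - name: payment-service
    rules:
      # Fire when more than 5% of requests fail over a 5-minute window
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="payment-service", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="payment-service"}[5m])) > 0.05
        for: 5m
        labels:
          severity: sev1
        annotations:
          summary: "Payment service error rate above 5%"

      # Connection-pool headroom alert (the kind of warning sign we ignored);
      # assumes postgres_exporter and max_connections = 100 on the database
      - alert: DatabaseConnectionsHigh
        expr: sum(pg_stat_activity_count) > 80
        for: 10m
        labels:
          severity: sev2
        annotations:
          summary: "More than 80 active Postgres connections (max 100)"
```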

Health Check Monitoring

Every service exposes health endpoints:
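As an illustration (FastAPI is used here purely as an example framework, and the dependency checks are stubs), a health endpoint that verifies real dependencies rather than just returning 200 looks something like this:

```python
from fastapi import FastAPI, Response

app = FastAPI()

async def check_database() -> bool:
    # Stub: in a real service, run a cheap query (e.g. SELECT 1) against the primary
    return True

async def check_cache() -> bool:
    # Stub: in a real service, PING the cache cluster
    return True

@app.get("/healthz")
async def healthz(response: Response):
    """Report whether critical dependencies are reachable, not just whether the process is up."""
    checks = {
        "database": await check_database(),
        "cache": await check_cache(),
    }
    healthy = all(checks.values())
    response.status_code = 200 if healthy else 503
    return {"status": "ok" if healthy else "degraded", "checks": checks}
```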

Synthetic Monitoring

I run automated tests against production every minute:
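A stripped-down version of that kind of probe is sketched below; the URLs, latency budget, and paging hook are placeholders, and in practice it runs from a scheduler once a minute:

```python
import time
import requests

# Hypothetical production endpoints; in practice these live in config
CHECKS = [
    ("homepage", "https://example.com/", 200),
    ("api health", "https://example.com/api/health", 200),
    ("checkout page", "https://example.com/checkout", 200),
]

def run_synthetic_checks(timeout: float = 5.0) -> list[str]:
    """Hit each endpoint and return a list of failure descriptions."""
    failures = []
    for name, url, expected_status in CHECKS:
        start = time.monotonic()
        try:
            resp = requests.get(url, timeout=timeout)
            elapsed = time.monotonic() - start
            if resp.status_code != expected_status:
                failures.append(f"{name}: got HTTP {resp.status_code}")
            elif elapsed > 2.0:  # latency budget is illustrative
                failures.append(f"{name}: slow response ({elapsed:.1f}s)")
        except requests.RequestException as exc:
            failures.append(f"{name}: request failed ({exc})")
    return failures

if __name__ == "__main__":
    problems = run_synthetic_checks()
    if problems:
        # In a real setup this would page via PagerDuty / Alertmanager
        print("SYNTHETIC CHECK FAILURES:", *problems, sep="\n  ")
```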

User-Reported Issues

Sometimes users notice problems before our monitoring does. I integrate customer support tools:
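One pattern that works well is turning a burst of "outage"-tagged support tickets into a page. The sketch below assumes the support tool can call a webhook; the payload handling, thresholds, and paging stub are all placeholders:

```python
import time
from collections import deque
from fastapi import FastAPI, Request

app = FastAPI()

# Sliding window of recent "something is broken" tickets
recent_reports: deque[float] = deque()
WINDOW_SECONDS = 600   # look at the last 10 minutes
THRESHOLD = 5          # illustrative: 5 reports in 10 minutes = page someone

def page_on_call(reason: str) -> None:
    # Placeholder: in a real setup this would call the paging provider's API
    print(f"PAGING ON-CALL: {reason}")

@app.post("/webhooks/support-ticket")
async def support_ticket(request: Request):
    """Called by the support tool whenever a ticket is tagged as an outage report."""
    payload = await request.json()  # payload shape depends on your support tool
    now = time.time()
    recent_reports.append(now)
    while recent_reports and recent_reports[0] < now - WINDOW_SECONDS:
        recent_reports.popleft()
    if len(recent_reports) >= THRESHOLD:
        page_on_call(f"{len(recent_reports)} outage tickets in the last 10 minutes")
    return {"received": True, "recent_reports": len(recent_reports)}
```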

Phase 2: Triage

When an alert fires, the first step is understanding severity and impact.

Incident Severity Levels

I use four severity levels:

| Severity | Definition | Response Time | Example |
| --- | --- | --- | --- |
| SEV1 (Critical) | Complete service outage or data loss | Immediate (< 5 min) | Payment processing down, database corruption |
| SEV2 (High) | Major feature degraded, workaround available | 15 minutes | Checkout slow but functional, email delivery delayed |
| SEV3 (Medium) | Minor feature broken, limited user impact | 1 hour | PDF export failing, search results incomplete |
| SEV4 (Low) | Cosmetic issue, no functional impact | Next business day | Logo misaligned, typo in email |

Triage Checklist

When I receive an alert, I follow this checklist:

Incident Command Structure

For SEV1/SEV2 incidents, I assign roles:

Incident Commander (IC): Coordinates response, makes decisions
Tech Lead: Identifies root cause and implements fix
Communications Lead: Updates stakeholders and customers
Scribe: Documents timeline and actions taken

Phase 3: Mitigation

Mitigation is about stopping the bleeding fast, not necessarily fixing the root cause.

My Mitigation Playbook

For High Error Rates:

  1. Check recent deployments → Roll back if needed

  2. Check external dependencies → Enable circuit breakers

  3. Check resource usage → Scale up if necessary

  4. Enable maintenance mode → Show a friendly error if all else fails

For Performance Degradation:

  1. Check database slow queries → Kill long-running queries (see the sketch after this list)

  2. Check cache hit rate → Warm the cache or increase TTLs

  3. Check CPU/memory → Scale horizontally

  4. Enable graceful degradation → Disable non-critical features
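Step 1 is the one I reach for most often under time pressure. A minimal sketch of it, assuming Postgres and psycopg2 (the threshold, DSN, and the read-only filter are illustrative):

```python
import psycopg2

LONG_RUNNING_SECONDS = 120  # illustrative threshold

def kill_long_running_selects(dsn: str) -> int:
    """Terminate read queries that have been running longer than the threshold."""
    terminated = 0
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT pid, query
                FROM pg_stat_activity
                WHERE state = 'active'
                  AND pid <> pg_backend_pid()
                  AND now() - query_start > %s * interval '1 second'
                  AND query ILIKE 'select%%'
                """,
                (LONG_RUNNING_SECONDS,),
            )
            for pid, query in cur.fetchall():
                # pg_terminate_backend kills the whole backend; this is safe enough
                # for reads, but think harder before doing it to writes
                cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
                terminated += 1
        conn.commit()
    finally:
        conn.close()
    return terminated
```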

For Cascading Failures:

  1. Identify the initial failure point

  2. Break the cascade → Circuit breakers, rate limiting

  3. Isolate the failing component

  4. Restore dependent services first

Fast Rollback Procedures
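Rollback should be a single command you can run half-asleep. As a sketch, assuming Kubernetes deployments and that kubectl is already pointed at the right cluster (the deployment name is a placeholder):

```python
import subprocess

def rollback(deployment: str, namespace: str = "production") -> None:
    """Roll back a Kubernetes deployment to its previous revision and wait for it."""
    # Undo the most recent rollout
    subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )
    # Block until the rollback has fully rolled out, or fail loudly
    subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{deployment}", "-n", namespace,
         "--timeout=120s"],
        check=True,
    )

if __name__ == "__main__":
    rollback("payment-service")  # hypothetical deployment name
```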

Circuit Breaker Activation
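Whether you use a library or roll your own, the core idea is small enough to sketch in a few lines; the thresholds, the payments wrapper, and the fallback behaviour below are all illustrative:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips after repeated failures and can be forced open by hand."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def force_open(self):
        # Incident mitigation: stop calling the failing dependency right now
        self.opened_at = time.monotonic()

    def allow_request(self):
        if self.opened_at is None:
            return True
        # After the reset timeout, let a probe request through ("half-open")
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


payments_breaker = CircuitBreaker()

def call_payment_provider(order_id):
    # Placeholder for the real provider call
    return "charged"

def charge_card(order_id):
    """Wrap calls to a flaky downstream; fall back instead of piling on more failures."""
    if not payments_breaker.allow_request():
        return "queued"  # illustrative fallback: queue the charge and retry later
    try:
        result = call_payment_provider(order_id)
        payments_breaker.record_success()
        return result
    except Exception:
        payments_breaker.record_failure()
        raise
```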

Feature Flag Kill Switch
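A kill switch only helps if it can be flipped without a deploy. A minimal sketch, assuming flags live in Redis (the flag name, host, and fallback behaviour are illustrative):

```python
import redis

# Flags live in Redis so they can be flipped instantly from any shell during an incident
flags = redis.Redis(host="localhost", port=6379, decode_responses=True)  # host is a placeholder

def feature_enabled(name: str, default: bool = True) -> bool:
    """Return False if the kill switch for this feature has been thrown."""
    value = flags.get(f"feature:{name}")
    if value is None:
        return default
    return value.lower() not in ("off", "0", "false")

def render_product_page(product_id: str) -> dict:
    page = {"product": product_id}
    # Non-critical feature behind a kill switch: shed it under load instead of failing the page
    if feature_enabled("recommendations"):
        page["recommendations"] = ["..."]  # expensive downstream call in real life
    return page

# During an incident, from any machine with Redis access:
#   redis-cli SET feature:recommendations off
```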

Phase 4: Resolution

After mitigation, we fix the root cause permanently.

Root Cause Analysis

I use the "Five Whys" technique:

Example: Database connection exhaustion

  1. Why did the service go down? Database connection pool exhausted

  2. Why was the connection pool exhausted? Connections weren't being released after queries

  3. Why weren't connections being released? Error handling didn't include connection cleanup

  4. Why didn't we catch this in testing? Load tests didn't simulate sustained traffic patterns

  5. Why didn't load tests simulate realistic traffic? Load testing configuration outdated

Root cause: Inadequate load testing + missing error handling in connection management

Permanent Fix
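For the connection-exhaustion example above, the permanent fix had two parts: make connection cleanup impossible to forget, and update the load tests to sustain realistic traffic long enough to surface leaks. A sketch of the first part, assuming psycopg2 and a connection pool (the DSN and pool sizes are placeholders):

```python
from contextlib import contextmanager
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(minconn=2, maxconn=20, dsn="postgresql://localhost/app")  # placeholder DSN

@contextmanager
def get_connection():
    """Always return the connection to the pool, even when the query raises."""
    conn = pool.getconn()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        pool.putconn(conn)  # the step that was effectively missing on error paths

def fetch_order(order_id: str):
    with get_connection() as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM orders WHERE id = %s", (order_id,))
            return cur.fetchone()
```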

Phase 5: Prevention (Postmortem)

The most important phase: learning from what went wrong.

Postmortem Template
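The exact fields matter less than filling them in every time. A skeleton along these lines (the field names are illustrative) covers what I need:

```markdown
# Postmortem: <incident title>

- **Date / duration:**
- **Severity:** SEV1–SEV4
- **Responders:** IC, Tech Lead, Comms, Scribe
- **Status:** draft / reviewed / action items complete

## Summary
One paragraph: what happened, who was affected, for how long.

## Impact
Failed transactions, affected users, revenue and SLO impact.

## Timeline (all times UTC)
Detection, key decisions, mitigation, resolution.

## Root cause
The Five-Whys result, not just the proximate trigger.

## What went well / what went poorly

## Action items
Owner, due date, and tracking ticket for each item.
```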

Blameless Postmortems

Critical principle: Postmortems are blameless. We focus on systems and processes, not individuals.

Bad: "Bob introduced a bug that took down production"
Good: "A code change introduced a connection leak. We should improve our testing to catch resource leaks."

We assume everyone acted with good intent given the information they had.

Postmortem Meeting

Within 48 hours of resolving a SEV1/SEV2, we hold a postmortem meeting:

Learning Library

All postmortems go into a searchable library:
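The mechanics can be simple: one file per postmortem in a repo, with a small metadata header so incidents can be filtered by service, severity, and root-cause category. The fields and values below are illustrative:

```yaml
# postmortems/<date>-<short-slug>.md  (front-matter metadata; fields are illustrative)
title: Payment processing outage - database connection exhaustion
date: YYYY-MM-DD
severity: SEV1
services: [payments, postgres]
root_cause_category: resource-exhaustion
action_items_ticket: OPS-NNNN
```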

Incident Communication

During incidents, communication is critical.

Status Page Updates
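Whatever status-page tool you use, updates follow the same lifecycle: investigating, identified, monitoring, resolved. The wording and timestamps below are an illustrative example of the level of detail I aim for (no blame, no speculation, and a concrete time for the next update):

```text
[Investigating]  14:05 UTC - We are investigating elevated errors on checkout.
                 Next update within 30 minutes.
[Identified]     14:25 UTC - We have identified a database connectivity issue
                 affecting payments and are working on a fix.
[Monitoring]     14:50 UTC - A fix has been applied; error rates are back to
                 normal. We are monitoring closely.
[Resolved]       15:30 UTC - This incident is resolved. A postmortem will follow.
```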

Internal Communication Template
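For internal updates I care about the same few fields every time. The format below is a sketch; the channel name, times, and names are placeholders:

```text
SEV1 - Payment processing outage                     (#inc-payment-outage)
Status:        Mitigating - rollback in progress
Impact:        Card payments failing since 14:02 UTC
Roles:         IC: <name> | Tech Lead: <name> | Comms: <name> | Scribe: <name>
Current theory: DB connection pool exhaustion after the 14:00 deploy
Next update:   14:45 UTC, or sooner if status changes
```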

Key Takeaways

  1. Detection speed matters: Invest in monitoring, alerting, and synthetic tests

  2. Triage quickly: Understand severity and impact before diving into fixes

  3. Mitigation first, root cause later: Stop the bleeding, then investigate

  4. Clear roles prevent chaos: IC, Tech Lead, Comms, Scribe

  5. Blameless postmortems: Focus on systems and processes, not people

  6. Learn and improve: Every incident should make you more resilient

In the final part, we'll cover operational excellence: creating runbooks, establishing on-call practices, and building documentation that actually helps during incidents.


Previous: Part 6: Service Reliability Metrics and Error Budgets
Next: Part 8: Operational Excellence - Runbooks and On-Call Practices
