Part 7: Incident Response and Management

The 3 AM Wake-Up Call That Changed Everything

It was 3:17 AM when my phone started buzzing. PagerDuty alert: payment processing down. I fumbled for my laptop, VPN'd in, and stared at dashboards trying to understand what was happening. Twenty minutes of chaos later, I realized our database had run out of connections. By the time we fixed it, we'd lost $50,000 in failed transactions.

The worst part? Two weeks earlier, we'd seen warning signs in our metrics but didn't act on them. That incident taught me that incident response isn't just about fixing problems fast; it's about detecting them early, coordinating effectively, and learning from every failure.

The Five Phases of Incident Response

Every incident follows these phases:

  1. Detection: Identifying that something is wrong

  2. Triage: Understanding severity and impact

  3. Mitigation: Stopping the bleeding

  4. Resolution: Permanently fixing the root cause

  5. Prevention: Ensuring it doesn't happen again (postmortem)

Let me walk through each phase with real processes I use.

Phase 1: Detection

The faster you detect incidents, the less damage they cause. I use multiple detection methods.

Automated Monitoring Alerts

Prometheus alerts notify me when things go wrong:
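The exact rules depend on your stack; the rules below are a minimal illustration rather than a production config (the job label, thresholds, and the postgres_exporter connection metric are assumptions):

```yaml
groups:
  - name: payment-service
    rules:
      # Fire when more than 5% of requests fail over a 5-minute window
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="payment-service", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="payment-service"}[5m])) > 0.05
        for: 5m
        labels:
          severity: sev1
        annotations:
          summary: "Payment service error rate above 5%"

      # Connection-pool headroom alert (the kind of warning sign we ignored);
      # assumes postgres_exporter and max_connections = 100 on the database
      - alert: DatabaseConnectionsHigh
        expr: sum(pg_stat_activity_count) > 80
        for: 10m
        labels:
          severity: sev2
        annotations:
          summary: "More than 80 active Postgres connections (max 100)"
```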

Health Check Monitoring

Every service exposes health endpoints:
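As an illustration (FastAPI is used here purely as an example framework, and the dependency checks are stubs), a health endpoint that verifies real dependencies rather than just returning 200 looks something like this:

```python
from fastapi import FastAPI, Response

app = FastAPI()

async def check_database() -> bool:
    # Stub: in a real service, run a cheap query (e.g. SELECT 1) against the primary
    return True

async def check_cache() -> bool:
    # Stub: in a real service, PING the cache cluster
    return True

@app.get("/healthz")
async def healthz(response: Response):
    """Report whether critical dependencies are reachable, not just whether the process is up."""
    checks = {
        "database": await check_database(),
        "cache": await check_cache(),
    }
    healthy = all(checks.values())
    response.status_code = 200 if healthy else 503
    return {"status": "ok" if healthy else "degraded", "checks": checks}
```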

Synthetic Monitoring

I run automated tests against production every minute:
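A stripped-down version of that kind of probe is sketched below; the URLs, latency budget, and paging hook are placeholders, and in practice it runs from a scheduler once a minute:

```python
import time
import requests

# Hypothetical production endpoints; in practice these live in config
CHECKS = [
    ("homepage", "https://example.com/", 200),
    ("api health", "https://example.com/api/health", 200),
    ("checkout page", "https://example.com/checkout", 200),
]

def run_synthetic_checks(timeout: float = 5.0) -> list[str]:
    """Hit each endpoint and return a list of failure descriptions."""
    failures = []
    for name, url, expected_status in CHECKS:
        start = time.monotonic()
        try:
            resp = requests.get(url, timeout=timeout)
            elapsed = time.monotonic() - start
            if resp.status_code != expected_status:
                failures.append(f"{name}: got HTTP {resp.status_code}")
            elif elapsed > 2.0:  # latency budget is illustrative
                failures.append(f"{name}: slow response ({elapsed:.1f}s)")
        except requests.RequestException as exc:
            failures.append(f"{name}: request failed ({exc})")
    return failures

if __name__ == "__main__":
    problems = run_synthetic_checks()
    if problems:
        # In a real setup this would page via PagerDuty / Alertmanager
        print("SYNTHETIC CHECK FAILURES:", *problems, sep="\n  ")
```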

User-Reported Issues

Sometimes users notice problems before our monitoring does. I integrate customer support tools:
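One pattern that works well is turning a burst of "outage"-tagged support tickets into a page. The sketch below assumes the support tool can call a webhook; the payload handling, thresholds, and paging stub are all placeholders:

```python
import time
from collections import deque
from fastapi import FastAPI, Request

app = FastAPI()

# Sliding window of recent "something is broken" tickets
recent_reports: deque[float] = deque()
WINDOW_SECONDS = 600   # look at the last 10 minutes
THRESHOLD = 5          # illustrative: 5 reports in 10 minutes = page someone

def page_on_call(reason: str) -> None:
    # Placeholder: in a real setup this would call the paging provider's API
    print(f"PAGING ON-CALL: {reason}")

@app.post("/webhooks/support-ticket")
async def support_ticket(request: Request):
    """Called by the support tool whenever a ticket is tagged as an outage report."""
    payload = await request.json()  # payload shape depends on your support tool
    now = time.time()
    recent_reports.append(now)
    while recent_reports and recent_reports[0] < now - WINDOW_SECONDS:
        recent_reports.popleft()
    if len(recent_reports) >= THRESHOLD:
        page_on_call(f"{len(recent_reports)} outage tickets in the last 10 minutes")
    return {"received": True, "recent_reports": len(recent_reports)}
```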

Phase 2: Triage

When an alert fires, the first step is understanding severity and impact.

Incident Severity Levels

I use four severity levels:

| Severity | Definition | Response Time | Example |
| --- | --- | --- | --- |
| SEV1 (Critical) | Complete service outage or data loss | Immediate (< 5 min) | Payment processing down, database corruption |
| SEV2 (High) | Major feature degraded, workaround available | 15 minutes | Checkout slow but functional, email delivery delayed |
| SEV3 (Medium) | Minor feature broken, limited user impact | 1 hour | PDF export failing, search results incomplete |
| SEV4 (Low) | Cosmetic issue, no functional impact | Next business day | Logo misaligned, typo in email |

Triage Checklist

When I receive an alert, I follow this checklist:

Incident Command Structure

For SEV1/SEV2 incidents, I assign roles:

Incident Commander (IC): Coordinates response, makes decisions
Tech Lead: Identifies root cause and implements fix
Communications Lead: Updates stakeholders and customers
Scribe: Documents timeline and actions taken

Phase 3: Mitigation

Mitigation is about stopping the bleeding fast, not necessarily fixing the root cause.

My Mitigation Playbook

For High Error Rates:

  1. Check recent deployments → Roll back if needed

  2. Check external dependencies → Enable circuit breakers

  3. Check resource usage → Scale up if necessary

  4. Enable maintenance mode → Show a friendly error if all else fails

For Performance Degradation:

  1. Check database slow queries → Kill long-running queries (see the sketch after this list)

  2. Check cache hit rate → Warm the cache or increase TTLs

  3. Check CPU/memory → Scale horizontally

  4. Enable graceful degradation → Disable non-critical features
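Step 1 is the one I reach for most often under time pressure. A minimal sketch of it, assuming Postgres and psycopg2 (the threshold, DSN, and the read-only filter are illustrative):

```python
import psycopg2

LONG_RUNNING_SECONDS = 120  # illustrative threshold

def kill_long_running_selects(dsn: str) -> int:
    """Terminate read queries that have been running longer than the threshold."""
    terminated = 0
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT pid, query
                FROM pg_stat_activity
                WHERE state = 'active'
                  AND pid <> pg_backend_pid()
                  AND now() - query_start > %s * interval '1 second'
                  AND query ILIKE 'select%%'
                """,
                (LONG_RUNNING_SECONDS,),
            )
            for pid, query in cur.fetchall():
                # pg_terminate_backend kills the whole backend; this is safe enough
                # for reads, but think harder before doing it to writes
                cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
                terminated += 1
        conn.commit()
    finally:
        conn.close()
    return terminated
```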

For Cascading Failures:

  1. Identify the initial failure point

  2. Break the cascade → Circuit breakers, rate limiting

  3. Isolate the failing component

  4. Restore dependent services first

Fast Rollback Procedures
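Rollback should be a single command you can run half-asleep. As a sketch, assuming Kubernetes deployments and that kubectl is already pointed at the right cluster (the deployment name is a placeholder):

```python
import subprocess

def rollback(deployment: str, namespace: str = "production") -> None:
    """Roll back a Kubernetes deployment to its previous revision and wait for it."""
    # Undo the most recent rollout
    subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )
    # Block until the rollback has fully rolled out, or fail loudly
    subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{deployment}", "-n", namespace,
         "--timeout=120s"],
        check=True,
    )

if __name__ == "__main__":
    rollback("payment-service")  # hypothetical deployment name
```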

Circuit Breaker Activation
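Whether you use a library or roll your own, the core idea is small enough to sketch in a few lines; the thresholds, the payments wrapper, and the fallback behaviour below are all illustrative:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips after repeated failures and can be forced open by hand."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def force_open(self):
        # Incident mitigation: stop calling the failing dependency right now
        self.opened_at = time.monotonic()

    def allow_request(self):
        if self.opened_at is None:
            return True
        # After the reset timeout, let a probe request through ("half-open")
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


payments_breaker = CircuitBreaker()

def call_payment_provider(order_id):
    # Placeholder for the real provider call
    return "charged"

def charge_card(order_id):
    """Wrap calls to a flaky downstream; fall back instead of piling on more failures."""
    if not payments_breaker.allow_request():
        return "queued"  # illustrative fallback: queue the charge and retry later
    try:
        result = call_payment_provider(order_id)
        payments_breaker.record_success()
        return result
    except Exception:
        payments_breaker.record_failure()
        raise
```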

Feature Flag Kill Switch
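A kill switch only helps if it can be flipped without a deploy. A minimal sketch, assuming flags live in Redis (the flag name, host, and fallback behaviour are illustrative):

```python
import redis

# Flags live in Redis so they can be flipped instantly from any shell during an incident
flags = redis.Redis(host="localhost", port=6379, decode_responses=True)  # host is a placeholder

def feature_enabled(name: str, default: bool = True) -> bool:
    """Return False if the kill switch for this feature has been thrown."""
    value = flags.get(f"feature:{name}")
    if value is None:
        return default
    return value.lower() not in ("off", "0", "false")

def render_product_page(product_id: str) -> dict:
    page = {"product": product_id}
    # Non-critical feature behind a kill switch: shed it under load instead of failing the page
    if feature_enabled("recommendations"):
        page["recommendations"] = ["..."]  # expensive downstream call in real life
    return page

# During an incident, from any machine with Redis access:
#   redis-cli SET feature:recommendations off
```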

Phase 4: Resolution

After mitigation, we fix the root cause permanently.

Root Cause Analysis

I use the "Five Whys" technique:

Example: Database connection exhaustion

  1. Why did the service go down? Database connection pool exhausted

  2. Why was the connection pool exhausted? Connections weren't being released after queries

  3. Why weren't connections being released? Error handling didn't include connection cleanup

  4. Why didn't we catch this in testing? Load tests didn't simulate sustained traffic patterns

  5. Why didn't load tests simulate realistic traffic? Load testing configuration outdated

Root cause: Inadequate load testing + missing error handling in connection management

Permanent Fix
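For the connection-exhaustion example above, the permanent fix had two parts: make connection cleanup impossible to forget, and update the load tests to sustain realistic traffic long enough to surface leaks. A sketch of the first part, assuming psycopg2 and a connection pool (the DSN and pool sizes are placeholders):

```python
from contextlib import contextmanager
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(minconn=2, maxconn=20, dsn="postgresql://localhost/app")  # placeholder DSN

@contextmanager
def get_connection():
    """Always return the connection to the pool, even when the query raises."""
    conn = pool.getconn()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        pool.putconn(conn)  # the step that was effectively missing on error paths

def fetch_order(order_id: str):
    with get_connection() as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM orders WHERE id = %s", (order_id,))
            return cur.fetchone()
```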

Phase 5: Prevention (Postmortem)

The most important phase: learning from what went wrong.

Postmortem Template
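The exact fields matter less than filling them in every time. A skeleton along these lines (the field names are illustrative) covers what I need:

```markdown
# Postmortem: <incident title>

- **Date / duration:**
- **Severity:** SEV1–SEV4
- **Responders:** IC, Tech Lead, Comms, Scribe
- **Status:** draft / reviewed / action items complete

## Summary
One paragraph: what happened, who was affected, for how long.

## Impact
Failed transactions, affected users, revenue and SLO impact.

## Timeline (all times UTC)
Detection, key decisions, mitigation, resolution.

## Root cause
The Five-Whys result, not just the proximate trigger.

## What went well / what went poorly

## Action items
Owner, due date, and tracking ticket for each item.
```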

Blameless Postmortems

Critical principle: Postmortems are blameless. We focus on systems and processes, not individuals.

Bad: "Bob introduced a bug that took down production"
Good: "A code change introduced a connection leak. We should improve our testing to catch resource leaks."

We assume everyone acted with good intent given the information they had.

Postmortem Meeting

Within 48 hours of resolving a SEV1/SEV2, we hold a postmortem meeting:

Learning Library

All postmortems go into a searchable library:
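The mechanics can be simple: one file per postmortem in a repo, with a small metadata header so incidents can be filtered by service, severity, and root-cause category. The fields and values below are illustrative:

```yaml
# postmortems/<date>-<short-slug>.md  (front-matter metadata; fields are illustrative)
title: Payment processing outage - database connection exhaustion
date: YYYY-MM-DD
severity: SEV1
services: [payments, postgres]
root_cause_category: resource-exhaustion
action_items_ticket: OPS-NNNN
```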

Incident Communication

During incidents, communication is critical.

Status Page Updates
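Whatever status-page tool you use, updates follow the same lifecycle: investigating, identified, monitoring, resolved. The wording and timestamps below are an illustrative example of the level of detail I aim for (no blame, no speculation, and a concrete time for the next update):

```text
[Investigating]  14:05 UTC - We are investigating elevated errors on checkout.
                 Next update within 30 minutes.
[Identified]     14:25 UTC - We have identified a database connectivity issue
                 affecting payments and are working on a fix.
[Monitoring]     14:50 UTC - A fix has been applied; error rates are back to
                 normal. We are monitoring closely.
[Resolved]       15:30 UTC - This incident is resolved. A postmortem will follow.
```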

Internal Communication Template
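For internal updates I care about the same few fields every time. The format below is a sketch; the channel name, times, and names are placeholders:

```text
SEV1 - Payment processing outage                     (#inc-payment-outage)
Status:        Mitigating - rollback in progress
Impact:        Card payments failing since 14:02 UTC
Roles:         IC: <name> | Tech Lead: <name> | Comms: <name> | Scribe: <name>
Current theory: DB connection pool exhaustion after the 14:00 deploy
Next update:   14:45 UTC, or sooner if status changes
```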

Key Takeaways

  1. Detection speed matters: Invest in monitoring, alerting, and synthetic tests

  2. Triage quickly: Understand severity and impact before diving into fixes

  3. Mitigation first, root cause later: Stop the bleeding, then investigate

  4. Clear roles prevent chaos: IC, Tech Lead, Comms, Scribe

  5. Blameless postmortems: Focus on systems and processes, not people

  6. Learn and improve: Every incident should make you more resilient

In the final part, we'll cover operational excellence: creating runbooks, establishing on-call practices, and building documentation that actually helps during incidents.


Previous: Part 6: Service Reliability Metrics and Error Budgets
Next: Part 8: Operational Excellence - Runbooks and On-Call Practices
