It was 3:17 AM when my phone started buzzing. PagerDuty alert: payment processing down. I fumbled for my laptop, VPN'd in, and stared at dashboards trying to understand what was happening. Twenty minutes of chaos later, I realized our database had run out of connections. By the time we fixed it, we'd lost $50,000 in failed transactions.
The worst part? Two weeks earlier, we'd seen warning signs in our metrics but didn't act on them. That incident taught me that incident response isn't just about fixing problems fast; it's about detecting them early, coordinating effectively, and learning from every failure.
The Five Phases of Incident Response
Every incident follows these phases:
Detection: Identifying that something is wrong
Triage: Understanding severity and impact
Mitigation: Stopping the bleeding
Resolution: Permanently fixing the root cause
Prevention: Ensuring it doesn't happen again (postmortem)
Let me walk through each phase with real processes I use.
Phase 1: Detection
The faster you detect incidents, the less damage they cause. I use multiple detection methods.
Automated Monitoring Alerts
Prometheus alert rules page me when error rates, latency, or resource saturation cross thresholds we've agreed on ahead of time.
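The alert rules themselves live in Prometheus, but they can only fire on signals the services actually export. As a minimal sketch (assuming a node-postgres pool and the prom-client library; the metric name, module path, and port are made up for illustration), exposing connection pool usage looks roughly like this:

```typescript
// metrics.ts -- illustrative sketch, not the exact production setup
import * as client from 'prom-client';
import express from 'express';
import { pool } from './lib/db'; // hypothetical shared node-postgres pool

// Gauge sampled on every Prometheus scrape via the collect() callback
new client.Gauge({
  name: 'db_pool_connections_in_use',
  help: 'Database connections currently checked out of the pool',
  collect() {
    // totalCount and idleCount are exposed by node-postgres pools
    this.set(pool.totalCount - pool.idleCount);
  },
});

const app = express();
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
app.listen(9464);
```

An alert on this gauge at 70-80% of the pool size is exactly the kind of leading indicator the postmortem later in this post calls for.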
Health Check Monitoring
Every service exposes health endpoints that load balancers and monitoring can poll.
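As a minimal sketch (assuming an Express service backed by a Postgres pool; the paths and module are illustrative, not the real service), I expose separate liveness and readiness endpoints so a flaky dependency degrades routing instead of restarting healthy processes:

```typescript
// health.ts -- illustrative sketch
import express from 'express';
import { pool } from './lib/db'; // hypothetical shared database pool

const app = express();

// Liveness: the process is up and able to answer HTTP at all
app.get('/healthz', (_req, res) => {
  res.status(200).json({ status: 'ok' });
});

// Readiness: critical dependencies are reachable; the load balancer
// stops sending traffic here when this returns 503
app.get('/readyz', async (_req, res) => {
  try {
    await pool.query('SELECT 1');
    res.status(200).json({ status: 'ok', db: 'up' });
  } catch {
    res.status(503).json({ status: 'degraded', db: 'down' });
  }
});

app.listen(8080);
```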
Synthetic Monitoring
I run automated tests against production every minute, exercising critical user paths that internal metrics alone can miss.
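A synthetic check is just a tiny client that exercises a critical path and complains loudly when it fails or gets slow. Here's a sketch (the URL, latency budget, and alert hook are assumptions; in practice something like this runs from a scheduler such as cron or a CI job):

```typescript
// synthetic-check.ts -- illustrative sketch, run once per minute by a scheduler
// Requires Node 18+ for the built-in fetch.
const CHECKOUT_URL = 'https://api.example.com/v1/checkout/health';
const LATENCY_BUDGET_MS = 2000;

async function raiseAlert(summary: string): Promise<void> {
  // Placeholder: wire this to PagerDuty, Opsgenie, or Slack in a real setup
  console.error(`ALERT: ${summary}`);
  process.exitCode = 1;
}

async function runCheck(): Promise<void> {
  const started = Date.now();
  try {
    const res = await fetch(CHECKOUT_URL, {
      headers: { 'x-synthetic-check': 'true' }, // lets the API exclude probes from business metrics
    });
    const elapsedMs = Date.now() - started;

    if (!res.ok) {
      await raiseAlert(`checkout health returned HTTP ${res.status}`);
    } else if (elapsedMs > LATENCY_BUDGET_MS) {
      await raiseAlert(`checkout health took ${elapsedMs}ms (budget ${LATENCY_BUDGET_MS}ms)`);
    }
  } catch (err) {
    await raiseAlert(`checkout health unreachable: ${(err as Error).message}`);
  }
}

runCheck();
```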
User-Reported Issues
Sometimes users notice problems before our monitoring does, so I feed customer support tools into alerting as well.
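One lightweight pattern is a webhook that watches incoming tickets and pages us when several in a short window look like the same outage. This is a sketch rather than our exact integration: the webhook shape, keywords, and threshold are assumptions, though the PagerDuty Events API v2 call is the standard one (the routing key comes from your own PagerDuty service):

```typescript
// support-webhook.ts -- illustrative sketch
import express from 'express';

const app = express();
app.use(express.json());

const matches: number[] = []; // timestamps of outage-looking tickets
const WINDOW_MS = 10 * 60 * 1000; // 10-minute rolling window
const THRESHOLD = 5;

app.post('/webhooks/support-ticket', async (req, res) => {
  const { subject = '', body = '' } = req.body ?? {};
  const text = `${subject} ${body}`.toLowerCase();

  if (/payment failed|checkout error|site down|can't log in/.test(text)) {
    const now = Date.now();
    matches.push(now);
    while (matches.length && now - matches[0] > WINDOW_MS) {
      matches.shift(); // drop tickets outside the window
    }

    if (matches.length >= THRESHOLD) {
      // PagerDuty Events API v2: creates (or deduplicates) an incident
      await fetch('https://events.pagerduty.com/v2/enqueue', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          routing_key: process.env.PAGERDUTY_ROUTING_KEY,
          event_action: 'trigger',
          dedup_key: 'support-ticket-spike',
          payload: {
            summary: `${matches.length} outage-related support tickets in the last 10 minutes`,
            source: 'support-webhook',
            severity: 'critical',
          },
        }),
      });
    }
  }

  res.sendStatus(204);
});

app.listen(3001);
```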
Phase 2: Triage
When an alert fires, the first step is understanding severity and impact.
Incident Severity Levels
I use four severity levels:
| Severity | Definition | Response Time | Example |
|----------|------------|---------------|---------|
| SEV1 (Critical) | Complete service outage or data loss | Immediate (< 5 min) | Payment processing down, database corruption |
| SEV2 (High) | Major feature degraded, workaround available | 15 minutes | Checkout slow but functional, email delivery delayed |
| SEV3 (Medium) | Minor feature broken, limited user impact | 1 hour | PDF export failing, search results incomplete |
| SEV4 (Low) | Cosmetic issue, no functional impact | Next business day | Logo misaligned, typo in email |
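These targets only help if tooling enforces them, so I like encoding the table in code that the alert router and incident bot can read. A small sketch (the channel and escalation names are invented for illustration):

```typescript
// severity-policy.ts -- illustrative sketch of the table above as data
type Severity = 'SEV1' | 'SEV2' | 'SEV3' | 'SEV4';

interface SeverityPolicy {
  responseTimeMinutes: number; // how quickly someone must acknowledge
  pageImmediately: boolean;    // wake people up vs. wait for business hours
  notify: string[];            // channels / escalation policies to alert
}

const SEVERITY_POLICIES: Record<Severity, SeverityPolicy> = {
  SEV1: { responseTimeMinutes: 5, pageImmediately: true, notify: ['#incidents', 'oncall-primary', 'eng-leads'] },
  SEV2: { responseTimeMinutes: 15, pageImmediately: true, notify: ['#incidents', 'oncall-primary'] },
  SEV3: { responseTimeMinutes: 60, pageImmediately: false, notify: ['#incidents'] },
  SEV4: { responseTimeMinutes: 24 * 60, pageImmediately: false, notify: ['#triage'] },
};

export function policyFor(severity: Severity): SeverityPolicy {
  return SEVERITY_POLICIES[severity];
}
```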
Triage Checklist
When I receive an alert, I run through a short triage checklist: confirm the alert is real and not a monitoring blip, estimate user impact and scope, check whether a recent deploy or config change lines up with the start time, assign a severity from the table above, and page whoever that severity requires.
Incident Command Structure
For SEV1/SEV2 incidents, I assign roles:
Incident Commander (IC): Coordinates response, makes decisions
Tech Lead: Identifies root cause and implements fix
Communications Lead: Updates stakeholders and customers
Scribe: Documents timeline and actions taken
Phase 3: Mitigation
Mitigation is about stopping the bleeding fast, not necessarily fixing the root cause. My usual levers are rolling back the most recent deployment, forcing a circuit breaker open so callers fail fast, and disabling the offending feature behind a flag; the scripts at the end of this post show each of these.
Phase 4: Resolution
Once the immediate impact is contained, fix the actual root cause rather than leaning on the workaround forever. In the connection-exhaustion story above, that meant patching the leak itself, not just rolling back.
Phase 5: Prevention
The most important phase: learning from what went wrong.
Postmortem Template
Every SEV1 and SEV2 gets a written postmortem that follows a standard template; a filled-in example from the connection-exhaustion incident is included at the end of this post.
Blameless Postmortems
Critical principle: Postmortems are blameless. We focus on systems and processes, not individuals.
Bad: "Bob introduced a bug that took down production"
Good: "A code change introduced a connection leak. We should improve our testing to catch resource leaks."
We assume everyone acted with good intent given the information they had.
Postmortem Meeting
Within 48 hours of resolving a SEV1/SEV2, we hold a postmortem meeting with the response team and engineering leadership; the standing agenda (timeline review, root cause, what went well, what needs improvement, action items) is included at the end of this post.
Learning Library
All postmortems go into a searchable library, so when a new incident looks familiar we can quickly pull up past write-ups with similar symptoms.
Incident Communication
During incidents, communication is critical.
Status Page Updates
We publish customer-facing updates to the public status page as soon as a SEV1 or SEV2 is confirmed, and keep it updated at each major step until the incident is resolved.
Internal Communication Template
Internally, every update posted to the incident channel follows the same template, so responders and stakeholders can scan status, impact, current action, and ETA at a glance.
Key Takeaways
Detection speed matters: Invest in monitoring, alerting, and synthetic tests
Triage quickly: Understand severity and impact before diving into fixes
Mitigation first, root cause later: Stop the bleeding, then investigate
Blameless postmortems: Focus on systems and processes, not people
Learn and improve: Every incident should make you more resilient
In the final part, we'll cover operational excellence: creating runbooks, establishing on-call practices, and building documentation that actually helps during incidents.
Appendix: Scripts and Templates
The scripts, queries, and templates referenced in the sections above are collected here.

```bash
#!/bin/bash
# scripts/emergency-rollback.sh
set -e

SERVICE=$1
NAMESPACE=${2:-production}

echo "🚨 EMERGENCY ROLLBACK: $SERVICE in $NAMESPACE"
echo "This will revert to the previous deployment"
read -p "Are you sure? (yes/no): " confirm

if [ "$confirm" != "yes" ]; then
  echo "Rollback cancelled"
  exit 0
fi

echo "📝 Recording rollback in incident log..."
# Log to incident tracking system
curl -X POST https://api.company.com/incidents/log \
  -H "Content-Type: application/json" \
  -d "{\"action\": \"rollback\", \"service\": \"$SERVICE\", \"operator\": \"$(whoami)\"}"

echo "⏪ Rolling back deployment..."
kubectl rollout undo "deployment/$SERVICE" -n "$NAMESPACE"

echo "⏳ Waiting for rollout to complete..."
kubectl rollout status "deployment/$SERVICE" -n "$NAMESPACE" --timeout=5m

echo "✅ Rollback complete"
echo "📊 Check metrics: https://grafana.company.com/d/service-overview?var-service=$SERVICE"
echo "📋 Check logs: https://kibana.company.com/app/discover#/?_a=(query:(match:($SERVICE)))"
```
```typescript
// scripts/circuit-breaker-control.ts
import { redis } from './lib/redis';

async function enableCircuitBreaker(
  service: string,
  duration: number = 300000 // 5 minutes default
) {
  await redis.set(
    `circuit-breaker:${service}:forced-open`,
    '1',
    'PX',
    duration
  );
  console.log(`✅ Circuit breaker for ${service} enabled for ${duration / 1000}s`);
  console.log(`   All requests to ${service} will fail fast`);
  console.log(`   To disable: redis-cli DEL circuit-breaker:${service}:forced-open`);
}

// Usage during incident
// enableCircuitBreaker('payment-gateway-service', 600000); // 10 minutes
```
```typescript
// src/feature-flags.ts
import * as LaunchDarkly from 'launchdarkly-node-server-sdk';

const ld = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY!);

export async function disableFeature(flagKey: string, reason: string) {
  // This would be done through the LaunchDarkly UI during an incident
  // Just showing how code respects flags
  const isEnabled = await ld.variation(flagKey, {
    key: 'system'
  }, false);
  console.log(`Feature ${flagKey} is ${isEnabled ? 'enabled' : 'disabled'}`);
  console.log(`Reason: ${reason}`);
}

// In application code (PaymentData and processPaymentV1/V2 live elsewhere in the payment module)
async function processPayment(data: PaymentData) {
  const newPaymentFlowEnabled = await ld.variation(
    'new-payment-flow',
    { key: data.userId },
    false
  );

  if (newPaymentFlowEnabled) {
    return processPaymentV2(data); // New code path
  } else {
    return processPaymentV1(data); // Old stable code path
  }
}
```
```typescript
// Before (buggy code)
async function getUser(id: string) {
  const connection = await pool.connect();
  const result = await connection.query('SELECT * FROM users WHERE id = $1', [id]);
  connection.release(); // ⚠️ Not called if query throws error
  return result.rows[0];
}

// After (fixed)
async function getUser(id: string) {
  const connection = await pool.connect();
  try {
    const result = await connection.query('SELECT * FROM users WHERE id = $1', [id]);
    return result.rows[0];
  } finally {
    connection.release(); // ✅ Always called
  }
}

// Even better: Use a connection pooling library that handles this
async function getUser(id: string) {
  return database.one('SELECT * FROM users WHERE id = $1', [id]);
  // Library handles connection lifecycle
}
```
# Incident Postmortem: 2026-02-17 Database Connection Exhaustion
**Date**: 2026-02-17
**Duration**: 23 minutes
**Severity**: SEV1
**Impact**: 100% of API requests failed
**Incident Commander**: Alice Johnson
**Responders**: Bob Smith, Charlie Davis
## Summary
On February 17, 2026 at 14:32 UTC, our API began returning 503 errors due to database connection pool exhaustion. All payment processing was down for 23 minutes, resulting in approximately $50,000 in lost revenue.
## Timeline (all times UTC)
| Time | Event |
|-------|-------|
| 14:28 | Database connection pool usage reaches 70% |
| 14:30 | New deployment rolls out with connection leak |
| 14:32 | Connection pool reaches 100%, API starts failing |
| 14:33 | PagerDuty alert fires, IC acknowledges |
| 14:35 | IC declares SEV1 incident, assembles response team |
| 14:37 | Team identifies database connection exhaustion |
| 14:40 | Decision made to rollback deployment |
| 14:42 | Rollback initiated |
| 14:48 | Rollback complete, connection pool draining |
| 14:52 | Connection pool back to healthy levels, service restored |
| 14:55 | SEV1 downgraded to SEV3 (monitoring) |
| 15:30 | Incident closed, all systems normal |
## Root Cause
A code change deployed at 14:30 introduced a bug where database connections weren't released when queries threw errors. Under normal load this wasn't noticeable, but a spike in traffic caused enough errors to exhaust the connection pool.
## What Went Well
- ✅ Monitoring detected the issue quickly (< 1 minute)
- ✅ Team assembled and coordinated effectively
- ✅ Rollback was fast and successful
- ✅ Communication with stakeholders was clear and timely

## What Went Wrong

- ❌ Bug wasn't caught in code review or testing
- ❌ Load tests didn't simulate realistic error scenarios
- ❌ No circuit breaker to prevent connection exhaustion
- ❌ No automated alerts for connection pool usage trends
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|---------|
| Fix connection leak in error handling | Bob | 2026-02-18 | ✅ Done |
| Add connection pool usage alerts | Alice | 2026-02-20 | ✅ Done |
| Update load tests to simulate errors | Charlie | 2026-02-24 | 🟡 In Progress |
| Implement connection pool circuit breaker | Diana | 2026-02-27 | 🔴 Not Started |
| Add connection leak tests to CI | Bob | 2026-03-01 | 🔴 Not Started |
| Update code review checklist (connection mgmt) | Alice | 2026-02-19 | ✅ Done |
## Lessons Learned
1. **Error paths need as much attention as happy paths**
The bug only manifested during errors, which we didn't test thoroughly
2. **Observability needs leading indicators, not just lagging**
We got alerted when connections hit 100%, but should have alerted at 70%
3. **Load testing must include failure scenarios**
Our tests assumed everything succeeded, missing this entire class of bugs
4. **Circuit breakers prevent resource exhaustion**
If we'd had a circuit breaker on the database connection pool, the blast radius would have been smaller
## References
- Incident Doc: https://docs.company.com/incidents/2026-02-17-001
- PR with fix: https://github.com/company/api/pull/1234
- Code review checklist update: https://docs.company.com/checklists/code-review
## Postmortem Meeting Agenda
**Attendance**: Response team + engineering leadership
1. **Review timeline** (10 min)
Walk through what happened when
2. **Discuss root cause** (15 min)
Five Whys analysis, contributing factors
3. **What went well** (10 min)
Processes that worked, decisions that helped
4. **What needs improvement** (20 min)
Gaps in monitoring, testing, process, tooling
5. **Define action items** (15 min)
Specific, assigned, time-boxed improvements
6. **Close** (5 min)
Confirm postmortem will be published, schedule follow-up
```typescript
// src/models/postmortem.ts
// ActionItem and Incident are assumed to be defined alongside this model.
interface Postmortem {
  id: string;
  date: Date;
  title: string;
  severity: 'SEV1' | 'SEV2' | 'SEV3' | 'SEV4';
  duration: number; // minutes
  impact: string;
  rootCause: string;
  actionItems: ActionItem[];
  tags: string[]; // 'database', 'deployment', 'network', etc.
  documentUrl: string;
}

// Searchable by tag or text
async function searchPostmortems(query: string): Promise<Postmortem[]> {
  return database.query(`
    SELECT * FROM postmortems
    WHERE to_tsvector('english', title || ' ' || root_cause) @@ plainto_tsquery('english', $1)
    ORDER BY date DESC
  `, [query]);
}

// When a similar incident occurs, surface relevant past postmortems
async function findSimilarIncidents(currentIncident: Incident): Promise<Postmortem[]> {
  return searchPostmortems(currentIncident.symptoms.join(' '));
}
```
```typescript
// src/services/status-page.ts
import { Statuspage } from 'statuspage.io';

const statuspage = new Statuspage(process.env.STATUSPAGE_API_KEY);

export async function createIncident(data: {
  title: string;
  impact: 'none' | 'minor' | 'major' | 'critical';
  body: string;
}) {
  const incident = await statuspage.incidents.create({
    name: data.title,
    status: 'investigating',
    impact: data.impact,
    body: data.body,
    component_ids: ['payment-api'], // Affected components
    deliver_notifications: true
  });
  return incident;
}

export async function updateIncident(incidentId: string, message: string, status: string) {
  await statuspage.incidents.update(incidentId, {
    status,
    body: message
  });
}

// Example usage during incident
const incident = await createIncident({
  title: 'Payment Processing Degraded',
  impact: 'major',
  body: 'We are investigating reports of slow payment processing. Our team is actively working on this.'
});

// Later...
await updateIncident(incident.id,
  'We have identified the issue and are rolling back a recent deployment.',
  'identified'
);

await updateIncident(incident.id,
  'The issue has been resolved. Payment processing is back to normal.',
  'resolved'
);
```
🟢 **INCIDENT UPDATE - T+15 minutes**
**Status**: Mitigation in progress
**Impact**: Payment processing at 50% capacity
**Current Action**: Rolling back deployment v2.5.1
**ETA**: 5-10 minutes
**Next Update**: T+25 minutes or when resolved
**Details**: Rollback initiated at 14:40. Kubernetes reporting 4/10 pods on old version, 6/10 on new. Traffic shifting to healthy pods.
**Help Needed**: None at this time
**IC**: @alice