Part 4: Incident Management - From Chaos to Coordinated Response

What You'll Learn: This article shares my journey from panicking during production incidents to building a calm, systematic response process. You'll learn how to define incident severity levels, set up on-call rotations for small teams, create an effective incident response workflow, write blameless post-mortems that drive improvement, and build runbooks that actually help during the chaos. By the end, you'll have a framework for managing incidents professionally instead of frantically.

My First Real Production Incident

It was 10:47 PM on a Saturday when my phone started buzzing. My personal project - a Go-based expense tracking API that about 50 friends used - was completely down. Users couldn't access the app, and I had no idea why.

I SSH'd into my DigitalOcean droplet in a panic. My hands were shaking as I ran commands randomly:

# What's running?
ps aux | grep expense-api

# Is there disk space?
df -h

# What about logs?
tail /var/log/expense-api.log

# Maybe just restart it?
systemctl restart expense-api

The service came back up for 30 seconds, then crashed again. I spent two hours in this panicked cycle:

  1. Restart service

  2. Watch it crash

  3. Guess at the problem

  4. Try a random fix

  5. Repeat

Finally, at 1 AM, I discovered the issue: my database ran out of disk space because I wasn't rotating logs. A problem that should have taken 5 minutes to fix took over 2 hours because I had no process.
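The fix itself was a few lines of log rotation config. A minimal sketch, assuming the log path from the commands above (tune the retention to your disk budget):

# Set up daily rotation for the API log, keeping 7 compressed days
sudo tee /etc/logrotate.d/expense-api > /dev/null <<'EOF'
/var/log/expense-api.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    # copytruncate lets the running service keep its open file handle
    copytruncate
}
EOF

# Dry run to confirm logrotate parses the new rule
sudo logrotate --debug /etc/logrotate.d/expense-api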

That night taught me that having a systematic incident response process matters more than technical skill when you're under pressure.

What is an Incident?

After that chaotic experience, I defined what constitutes an "incident" for my services:

An incident is any event that causes or could cause:

  • Service degradation that impacts users

  • Violation of SLOs

  • Security breach or data exposure

  • Significant risk to system stability

Not incidents (these are just operational work):

  • Planned maintenance

  • Expected behavior (like rate limiting)

  • Single user issues

  • Development environment problems

This definition helps me decide: "Is this worth paging someone at 2 AM?"

Incident Severity Levels

I learned to categorize incidents by severity, which determines response urgency. Here's the framework I use:

SEV-1 (Critical)

Impact: Complete service outage or critical security breach

Response: Immediate, all hands on deck

Examples:

  • API completely down

  • Database corrupted

  • Active security breach

  • Payment processing failing

My response:

  • Page on-call immediately

  • Update status page within 5 minutes

  • All hands working to resolve

  • Executive communication if prolonged

SEV-2 (High)

Impact: Major feature degraded, affecting many users

Response: Urgent, within 30 minutes

Examples:

  • 50% error rate on critical endpoint

  • Authentication slow (5+ seconds)

  • File uploads failing

My response:

  • Page on-call within 30 minutes

  • Update status page

  • Focused team working to resolve

  • Can escalate to SEV-1 if worsening

SEV-3 (Medium)

Impact: Minor degradation, limited user impact

Response: During business hours

Examples:

  • Non-critical feature failing

  • Slow performance on rarely-used endpoint

  • Minor data inconsistency

My response:

  • Create ticket, investigate during business hours

  • Fix within 24-48 hours

  • Monitor to ensure it doesn't worsen

SEV-4 (Low)

Impact: Cosmetic issues, no user impact

Response: Backlog

Examples:

  • Typo in log message

  • Metrics dashboard formatting issue

  • Documentation outdated

My response:

  • Add to backlog

  • Fix when convenient

Setting Up On-Call: A Solo Developer's Approach

When I first started, I was the only person supporting my services. Here's how I made on-call sustainable:

1. Define On-Call Expectations

I documented what "on-call" means for me:
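An illustrative version for a solo developer (the specific numbers are examples, not rules):

  • Acknowledge pages within 15 minutes during waking hours, best effort overnight

  • SEV-1 and SEV-2 get an immediate response; everything else waits for business hours

  • If I'll be unreachable, I announce reduced support or a maintenance window in advance

  • On-call includes updating the status page and users, not just fixing the problem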

2. Set Up Alerting

I use PagerDuty to manage on-call, but you can start with basic email/SMS alerts.
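If you're not ready for a paging service, even a cron job that checks the API and emails you covers the basics. A minimal sketch; the /health endpoint, the URL, and a working mail command are assumptions:

#!/usr/bin/env bash
# check-health.sh -- run from cron every minute.
# Emails an alert when the API health endpoint stops responding.

URL="https://api.example.com/health"   # hypothetical endpoint
TO="you@example.com"

if ! curl --silent --fail --max-time 10 "$URL" > /dev/null; then
    echo "Health check failed for $URL at $(date -u)" \
        | mail -s "ALERT: expense-api health check failing" "$TO"
fi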

3. Alert on What Matters

I learned the hard way: alert only on user-impacting issues, not every anomaly.

Bad alerts (noise):
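  • CPU briefly spiking above 80%

  • A single 5xx response

  • Disk at 70% with weeks of headroom

  • A background retry that eventually succeeds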

Good alerts (actionable):
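  • Error rate burning through the SLO budget for 5+ minutes

  • External health checks failing

  • p99 latency sustained above the SLO target

  • Disk on track to fill within hours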

My Incident Response Workflow

After several chaotic incidents, I developed this systematic workflow:

Phase 1: Detection (T+0 to T+5 minutes)

Goal: Recognize and acknowledge the incident

My checklist:
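  • Confirm the alert is real (check the dashboard, hit the endpoint yourself)

  • Declare the incident and assign a severity level

  • Acknowledge the page so it doesn't keep escalating

  • Open an incident channel or doc and start a timeline

  • Post the initial status page update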

Phase 2: Response (T+5 to resolution)

Goal: Stop the bleeding and restore service

My response template:
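  • What changed recently? (deploy, config change, traffic spike)

  • What is the fastest path to mitigation? (rollback, restart, scale up, disable the feature)

  • What are users seeing right now?

  • Who is doing what, and when is the next update due?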

Phase 3: Communication

Goal: Keep stakeholders informed

For my personal projects, "stakeholders" means users. I post updates to a simple status page.

My communication cadence:

  • Initial update: Within 5 minutes of detection

  • Progress updates: Every 30 minutes during active incident

  • Resolution update: When service is stable

  • Post-mortem: Within 48 hours

Phase 4: Resolution

Goal: Confirm stability and close incident
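My resolution checklist:

  • Error rates and latency have been back to normal for at least 15-30 minutes

  • No queued work or retries are still draining

  • The status page shows a resolution update

  • For SEV-1/SEV-2, a post-mortem is scheduled within 48 hours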

Building Runbooks

A runbook is a step-by-step guide for handling a specific scenario. Runbooks are lifesavers during incidents, when your brain is foggy from stress or sleep deprivation.

Runbook Example: Database Connection Pool Exhausted

1. Check current connection usage

2. Check for slow queries
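A sketch of both checks, assuming PostgreSQL and a psql that connects with your usual credentials; swap in the equivalent views for your database:

# How many connections are open, and in what state?
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# Which queries have been running the longest?
psql -c "SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
         FROM pg_stat_activity
         WHERE state = 'active'
         ORDER BY runtime DESC
         LIMIT 10;"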

Immediate Mitigation

Option 1: Increase connection pool (quick fix)
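How you raise the limit depends on where the pool size is configured. If it comes from an environment variable (DB_MAX_OPEN_CONNS below is a hypothetical name), a systemd override is the quickest path. This only buys time; it doesn't fix whatever is holding connections:

# Open an override and add the larger pool size under [Service]:
#   Environment=DB_MAX_OPEN_CONNS=50
sudo systemctl edit expense-api

# Restart and confirm the service stays up
sudo systemctl restart expense-api
systemctl status expense-api --no-pager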

Option 2: Kill long-running queries
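Still assuming PostgreSQL: terminate anything that has been active longer than a threshold you're comfortable with. This cancels those queries, so use judgment:

# Kill active queries running longer than 5 minutes (excluding this session)
psql -c "SELECT pg_terminate_backend(pid)
         FROM pg_stat_activity
         WHERE state = 'active'
           AND now() - query_start > interval '5 minutes'
           AND pid <> pg_backend_pid();"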

Verification
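Re-run the diagnosis checks and confirm the API responds from the outside (the /health URL below is an assumption):

# Connection count should be back under the pool limit
psql -c "SELECT count(*) FROM pg_stat_activity;"

# The API should respond again
curl --silent --fail --max-time 10 https://api.example.com/health && echo "API OK"

# Watch error rate and latency for 15-30 minutes before standing down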

Long-term Fix
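  • Fix or index the slow queries that were holding connections open

  • Add connection pool utilization to dashboards and alert before it's exhausted

  • Set query and connection timeouts so one bad query can't pin the whole pool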

Related runbooks:

  • Runbook: High Database CPU

  • Runbook: Slow Query Investigation

Post-Mortem: Learning from Incidents

The most valuable part of incident management isn't the response - it's what you learn afterwards. I write a post-mortem for every SEV-1 and SEV-2 incident.

My Post-Mortem Template
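Any template with roughly these sections works; the point is to fill it in within 48 hours, while details are fresh:

  • Summary: what happened and the user impact, in one paragraph

  • Timeline: detection, key decisions, mitigation, and resolution, with timestamps

  • Root cause: described as a system failure, not a person's mistake

  • What went well and what went poorly during the response

  • Action items: specific, owned by someone, with a due date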

Blameless Culture

The most important lesson I learned: never blame people in post-mortems.

Bad post-mortem:

"Engineer X forgot to load test the connection pool change."

Good post-mortem:

"Connection pool sizing wasn't validated under load because we lacked a load testing process in our deployment checklist."

The goal is to identify system failures, not human failures. If a human made a mistake, ask: "What process failed that allowed this mistake?"

Incident Response Toolkit

Here are the tools I use during incidents:

1. Incident Command Checklist
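  • Declare the incident and set a severity

  • Assign roles if others are available (one person fixes, one communicates)

  • Open the incident channel and start the timeline

  • Post the first status update within 5 minutes

  • Set a timer for the next update (every 30 minutes during an active incident)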

2. Incident Chat Template

I use Slack for incident coordination. Here's my template:
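A stripped-down version, with placeholders to fill in:

:rotating_light: INCIDENT | SEV-<n> | <one-line summary>
Status: Investigating / Mitigating / Monitoring / Resolved
Impact: <who is affected and how>
Started: <time, UTC>
Incident lead: <who is driving>
Next update: <time>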

3. Quick Commands Script

I keep a script of common diagnostic commands:
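A minimal version, built around the same commands from that first incident (adjust the service name and paths to your setup):

#!/usr/bin/env bash
# triage.sh -- the first five minutes of diagnostics in one place
set -u

echo "== Service status =="
systemctl status expense-api --no-pager | head -n 20

echo "== Recent service logs =="
journalctl -u expense-api --since "15 minutes ago" --no-pager | tail -n 50

echo "== Application log =="
tail -n 50 /var/log/expense-api.log

echo "== Disk space =="
df -h

echo "== Memory =="
free -m

echo "== Top processes by CPU =="
ps aux --sort=-%cpu | head -n 10

echo "== Listening sockets =="
ss -tlnp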

Growing Beyond Solo On-Call

As my services grew and I added teammates, I evolved my on-call approach:

Rotation Schedule
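An example for a two- or three-person team: weekly rotations, Monday to Monday, with a primary and a secondary. Swaps are fine as long as they're recorded in the scheduling tool so pages reach the right phone.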

Escalation Policy
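A simple example: the primary has 15 minutes to acknowledge, then the secondary is paged; if nobody acknowledges within 30 minutes, the whole team is notified. SEV-1 pages go to both primary and secondary from the start.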

On-Call Handoff

Every Monday morning, we do a 15-minute handoff:
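  • Incidents and notable alerts from the past week

  • Anything still being monitored or only partially fixed

  • Planned deploys or maintenance in the coming week

  • Confirmation that the pager has actually switched to the new primary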

Common Mistakes I Made

Mistake 1: Not Declaring Incidents Soon Enough

Early on, I'd try to fix issues quietly without declaring an incident. This led to:

  • Delayed response

  • No communication to users

  • No post-mortem or learning

Fix: When in doubt, declare an incident. You can always downgrade severity.

Mistake 2: Focusing on Root Cause During Active Incident

I'd waste time during outages trying to understand exactly why something failed instead of just fixing it.

Fix: During active incident, focus on mitigation. Root cause analysis comes later in the post-mortem.

Mistake 3: No Runbooks

I'd try to remember how to fix issues from memory, often forgetting critical steps.

Fix: Write a runbook after every incident. Next time it happens, you have a checklist.

Mistake 4: Skipping Post-Mortems

"I know what went wrong, I'll just fix it."

Fix: Always write post-mortems for SEV-1/SEV-2. The act of writing surfaces insights you'd otherwise miss.

Key Takeaways

  1. Have a process before you need it. When you're stressed during an incident, you fall back to your processes. Make them good.

  2. Severity levels drive response urgency. Not everything is SEV-1. Appropriate classification prevents burnout.

  3. Runbooks are force multipliers. A good runbook means anyone can handle the incident, not just the person who built it.

  4. Blameless post-mortems drive improvement. Blame people, and they hide mistakes. Blame systems, and everyone fixes them.

  5. Communication is as important as technical response. Keep stakeholders informed, even if the news is "still investigating."

What's Next

With a solid incident management process in place, you can handle production problems professionally. In Part 5, we'll cover:

  • Capacity planning and forecasting

  • Performance optimization for Go services

  • Load testing strategies

  • Cost-effective scaling

Conclusion

That first chaotic incident at 10:47 PM taught me that technical skills alone don't make you good at SRE - process and preparation do. Now when incidents happen, I'm calm because I have:

  • Clear severity levels to guide response

  • A systematic workflow to follow

  • Runbooks for common issues

  • A blameless post-mortem process to learn

Incidents will always be stressful, but they don't have to be chaotic. Build your processes now, while you're calm. You'll thank yourself later when you're paged at 2 AM.
