Part 4: Incident Management - From Chaos to Coordinated Response

What You'll Learn: This article shares my journey from panicking during production incidents to building a calm, systematic response process. You'll learn how to define incident severity levels, set up on-call rotations for small teams, create an effective incident response workflow, write blameless post-mortems that drive improvement, and build runbooks that actually help during the chaos. By the end, you'll have a framework for managing incidents professionally instead of frantically.

My First Real Production Incident

It was 10:47 PM on a Saturday when my phone started buzzing. My personal project - a Go-based expense tracking API that about 50 friends used - was completely down. Users couldn't access the app, and I had no idea why.

I SSH'd into my DigitalOcean droplet in a panic. My hands were shaking as I ran commands randomly:

# What's running?
ps aux | grep expense-api

# Is there disk space?
df -h

# What about logs?
tail /var/log/expense-api.log

# Maybe just restart it?
systemctl restart expense-api

The service came back up for 30 seconds, then crashed again. I spent two hours in this panicked cycle:

  1. Restart service

  2. Watch it crash

  3. Guess at the problem

  4. Try a random fix

  5. Repeat

Finally, at 1 AM, I discovered the issue: my database ran out of disk space because I wasn't rotating logs. A problem that should have taken 5 minutes to fix took over 2 hours because I had no process.
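The fix itself was a few lines of log rotation config. A minimal sketch, assuming the log path from the commands above (tune the retention to your disk budget):

# Set up daily rotation for the API log, keeping 7 compressed days
sudo tee /etc/logrotate.d/expense-api > /dev/null <<'EOF'
/var/log/expense-api.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    # copytruncate lets the running service keep its open file handle
    copytruncate
}
EOF

# Dry run to confirm logrotate parses the new rule
sudo logrotate --debug /etc/logrotate.d/expense-api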

That night taught me that having a systematic incident response process matters more than technical skill when you're under pressure.

What is an Incident?

After that chaotic experience, I defined what constitutes an "incident" for my services:

An incident is any event that causes or could cause:

  • Service degradation that impacts users

  • Violation of SLOs

  • Security breach or data exposure

  • Significant risk to system stability

Not incidents (these are just operational work):

  • Planned maintenance

  • Expected behavior (like rate limiting)

  • Single user issues

  • Development environment problems

This definition helps me decide: "Is this worth paging someone at 2 AM?"

Incident Severity Levels

I learned to categorize incidents by severity, which determines response urgency. Here's the framework I use:

SEV-1 (Critical)

Impact: Complete service outage or critical security breach

Response: Immediate, all hands on deck

Examples:

  • API completely down

  • Database corrupted

  • Active security breach

  • Payment processing failing

My response:

  • Page on-call immediately

  • Update status page within 5 minutes

  • All hands working to resolve

  • Executive communication if prolonged

SEV-2 (High)

Impact: Major feature degraded, affecting many users

Response: Urgent, within 30 minutes

Examples:

  • 50% error rate on critical endpoint

  • Authentication slow (5+ seconds)

  • File uploads failing

My response:

  • Page on-call within 30 minutes

  • Update status page

  • Focused team working to resolve

  • Can escalate to SEV-1 if worsening

SEV-3 (Medium)

Impact: Minor degradation, limited user impact

Response: During business hours

Examples:

  • Non-critical feature failing

  • Slow performance on rarely-used endpoint

  • Minor data inconsistency

My response:

  • Create ticket, investigate during business hours

  • Fix within 24-48 hours

  • Monitor to ensure it doesn't worsen

SEV-4 (Low)

Impact: Cosmetic issues, no user impact

Response: Backlog

Examples:

  • Typo in log message

  • Metrics dashboard formatting issue

  • Documentation outdated

My response:

  • Add to backlog

  • Fix when convenient

Setting Up On-Call: A Solo Developer's Approach

When I first started, I was the only person supporting my services. Here's how I made on-call sustainable:

1. Define On-Call Expectations

I documented what "on-call" means for me:
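An illustrative version for a solo developer (the specific numbers are examples, not rules):

  • Acknowledge pages within 15 minutes during waking hours, best effort overnight

  • SEV-1 and SEV-2 get an immediate response; everything else waits for business hours

  • If I'll be unreachable, I announce reduced support or a maintenance window in advance

  • On-call includes updating the status page and users, not just fixing the problem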

2. Set Up Alerting

I use PagerDuty to manage on-call, but you can start with basic email/SMS alerts.
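If you're not ready for a paging service, even a cron job that checks the API and emails you covers the basics. A minimal sketch; the /health endpoint, the URL, and a working mail command are assumptions:

#!/usr/bin/env bash
# check-health.sh -- run from cron every minute.
# Emails an alert when the API health endpoint stops responding.

URL="https://api.example.com/health"   # hypothetical endpoint
TO="you@example.com"

if ! curl --silent --fail --max-time 10 "$URL" > /dev/null; then
    echo "Health check failed for $URL at $(date -u)" \
        | mail -s "ALERT: expense-api health check failing" "$TO"
fi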

3. Alert on What Matters

I learned the hard way: alert only on user-impacting issues, not every anomaly.

Bad alerts (noise):
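  • CPU briefly spiking above 80%

  • A single 5xx response

  • Disk at 70% with weeks of headroom

  • A background retry that eventually succeeds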

Good alerts (actionable):
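  • Error rate burning through the SLO budget for 5+ minutes

  • External health checks failing

  • p99 latency sustained above the SLO target

  • Disk on track to fill within hours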

My Incident Response Workflow

After several chaotic incidents, I developed this systematic workflow:

Phase 1: Detection (T+0 to T+5 minutes)

Goal: Recognize and acknowledge the incident

My checklist:
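  • Confirm the alert is real (check the dashboard, hit the endpoint yourself)

  • Declare the incident and assign a severity level

  • Acknowledge the page so it doesn't keep escalating

  • Open an incident channel or doc and start a timeline

  • Post the initial status page update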

Phase 2: Response (T+5 to resolution)

Goal: Stop the bleeding and restore service

My response template:
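  • What changed recently? (deploy, config change, traffic spike)

  • What is the fastest path to mitigation? (rollback, restart, scale up, disable the feature)

  • What are users seeing right now?

  • Who is doing what, and when is the next update due?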

Phase 3: Communication

Goal: Keep stakeholders informed

For my personal projects, "stakeholders" means users. I post updates to a simple status page.

My communication cadence:

  • Initial update: Within 5 minutes of detection

  • Progress updates: Every 30 minutes during active incident

  • Resolution update: When service is stable

  • Post-mortem: Within 48 hours

Phase 4: Resolution

Goal: Confirm stability and close incident
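My resolution checklist:

  • Error rates and latency have been back to normal for at least 15-30 minutes

  • No queued work or retries are still draining

  • The status page shows a resolution update

  • For SEV-1/SEV-2, a post-mortem is scheduled within 48 hours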

Building Runbooks

A runbook is a step-by-step guide for handling a specific scenario. Runbooks are lifesavers during incidents, when your brain is foggy from stress or sleep deprivation.

Runbook Example: Database Connection Pool Exhausted

1. Check current connection usage

2. Check for slow queries
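A sketch of both checks, assuming PostgreSQL and a psql that connects with your usual credentials; swap in the equivalent views for your database:

# How many connections are open, and in what state?
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# Which queries have been running the longest?
psql -c "SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
         FROM pg_stat_activity
         WHERE state = 'active'
         ORDER BY runtime DESC
         LIMIT 10;"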

Immediate Mitigation

Option 1: Increase connection pool (quick fix)
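How you raise the limit depends on where the pool size is configured. If it comes from an environment variable (DB_MAX_OPEN_CONNS below is a hypothetical name), a systemd override is the quickest path. This only buys time; it doesn't fix whatever is holding connections:

# Open an override and add the larger pool size under [Service]:
#   Environment=DB_MAX_OPEN_CONNS=50
sudo systemctl edit expense-api

# Restart and confirm the service stays up
sudo systemctl restart expense-api
systemctl status expense-api --no-pager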

Option 2: Kill long-running queries
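Still assuming PostgreSQL: terminate anything that has been active longer than a threshold you're comfortable with. This cancels those queries, so use judgment:

# Kill active queries running longer than 5 minutes (excluding this session)
psql -c "SELECT pg_terminate_backend(pid)
         FROM pg_stat_activity
         WHERE state = 'active'
           AND now() - query_start > interval '5 minutes'
           AND pid <> pg_backend_pid();"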

Verification
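Re-run the diagnosis checks and confirm the API responds from the outside (the /health URL below is an assumption):

# Connection count should be back under the pool limit
psql -c "SELECT count(*) FROM pg_stat_activity;"

# The API should respond again
curl --silent --fail --max-time 10 https://api.example.com/health && echo "API OK"

# Watch error rate and latency for 15-30 minutes before standing down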

Long-term Fix
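  • Fix or index the slow queries that were holding connections open

  • Add connection pool utilization to dashboards and alert before it's exhausted

  • Set query and connection timeouts so one bad query can't pin the whole pool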

Related runbooks:

  • Runbook: High Database CPU

  • Runbook: Slow Query Investigation

Post-Mortem: Learning from Incidents

The most valuable part of incident management isn't the response - it's what you learn afterwards. I write a post-mortem for every SEV-1 and SEV-2 incident.

My Post-Mortem Template
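Any template with roughly these sections works; the point is to fill it in within 48 hours, while details are fresh:

  • Summary: what happened and the user impact, in one paragraph

  • Timeline: detection, key decisions, mitigation, and resolution, with timestamps

  • Root cause: described as a system failure, not a person's mistake

  • What went well and what went poorly during the response

  • Action items: specific, owned by someone, with a due date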

Blameless Culture

The most important lesson I learned: never blame people in post-mortems.

Bad post-mortem:

"Engineer X forgot to load test the connection pool change."

Good post-mortem:

"Connection pool sizing wasn't validated under load because we lacked a load testing process in our deployment checklist."

The goal is to identify system failures, not human failures. If a human made a mistake, ask: "What process failed that allowed this mistake?"

Incident Response Toolkit

Here are the tools I use during incidents:

1. Incident Command Checklist
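  • Declare the incident and set a severity

  • Assign roles if others are available (one person fixes, one communicates)

  • Open the incident channel and start the timeline

  • Post the first status update within 5 minutes

  • Set a timer for the next update (every 30 minutes during an active incident)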

2. Incident Chat Template

I use Slack for incident coordination. Here's my template:
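A stripped-down version, with placeholders to fill in:

:rotating_light: INCIDENT | SEV-<n> | <one-line summary>
Status: Investigating / Mitigating / Monitoring / Resolved
Impact: <who is affected and how>
Started: <time, UTC>
Incident lead: <who is driving>
Next update: <time>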

3. Quick Commands Script

I keep a script of common diagnostic commands:
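A minimal version, built around the same commands from that first incident (adjust the service name and paths to your setup):

#!/usr/bin/env bash
# triage.sh -- the first five minutes of diagnostics in one place
set -u

echo "== Service status =="
systemctl status expense-api --no-pager | head -n 20

echo "== Recent service logs =="
journalctl -u expense-api --since "15 minutes ago" --no-pager | tail -n 50

echo "== Application log =="
tail -n 50 /var/log/expense-api.log

echo "== Disk space =="
df -h

echo "== Memory =="
free -m

echo "== Top processes by CPU =="
ps aux --sort=-%cpu | head -n 10

echo "== Listening sockets =="
ss -tlnp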

Growing Beyond Solo On-Call

As my services grew and I added teammates, I evolved my on-call approach:

Rotation Schedule
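An example for a two- or three-person team: weekly rotations, Monday to Monday, with a primary and a secondary. Swaps are fine as long as they're recorded in the scheduling tool so pages reach the right phone.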

Escalation Policy
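A simple example: the primary has 15 minutes to acknowledge, then the secondary is paged; if nobody acknowledges within 30 minutes, the whole team is notified. SEV-1 pages go to both primary and secondary from the start.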

On-Call Handoff

Every Monday morning, we do a 15-minute handoff:
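  • Incidents and notable alerts from the past week

  • Anything still being monitored or only partially fixed

  • Planned deploys or maintenance in the coming week

  • Confirmation that the pager has actually switched to the new primary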

Common Mistakes I Made

Mistake 1: Not Declaring Incidents Soon Enough

Early on, I'd try to fix issues quietly without declaring an incident. This led to:

  • Delayed response

  • No communication to users

  • No post-mortem or learning

Fix: When in doubt, declare an incident. You can always downgrade severity.

Mistake 2: Focusing on Root Cause During Active Incident

I'd waste time during outages trying to understand exactly why something failed instead of just fixing it.

Fix: During active incident, focus on mitigation. Root cause analysis comes later in the post-mortem.

Mistake 3: No Runbooks

I'd try to remember how to fix issues from memory, often forgetting critical steps.

Fix: Write a runbook after every incident. Next time it happens, you have a checklist.

Mistake 4: Skipping Post-Mortems

"I know what went wrong, I'll just fix it."

Fix: Always write post-mortems for SEV-1/SEV-2. The act of writing surfaces insights you'd otherwise miss.

Key Takeaways

  1. Have a process before you need it. When you're stressed during an incident, you fall back to your processes. Make them good.

  2. Severity levels drive response urgency. Not everything is SEV-1. Appropriate classification prevents burnout.

  3. Runbooks are force multipliers. A good runbook means anyone can handle the incident, not just the person who built it.

  4. Blameless post-mortems drive improvement. Blame people, and they hide mistakes. Blame systems, and everyone fixes them.

  5. Communication is as important as technical response. Keep stakeholders informed, even if the news is "still investigating."

What's Next

With a solid incident management process in place, you can handle production problems professionally. In Part 5, we'll cover:

  • Capacity planning and forecasting

  • Performance optimization for Go services

  • Load testing strategies

  • Cost-effective scaling

Conclusion

That first chaotic incident at 10:47 PM taught me that technical skills alone don't make you good at SRE - process and preparation do. Now when incidents happen, I'm calm because I have:

  • Clear severity levels to guide response

  • A systematic workflow to follow

  • Runbooks for common issues

  • A blameless post-mortem process to learn

Incidents will always be stressful, but they don't have to be chaotic. Build your processes now, while you're calm. You'll thank yourself later when you're paged at 2 AM.
