Part 6: Incident Management and On-Call Automation

Part of the SRE Playbook series

What You'll Learn: This article covers how I wire up the SLO-based alerts from Part 5 into an actionable incident management workflow. You'll see my Alertmanager routing configuration, Slack and PagerDuty integration, how I store runbooks as Markdown in the GitOps repository, a Go CLI tool I built for incident response tasks, and how I conduct post-incident reviews. The goal is to make the 2 AM page as short as possible.

The First Real Incident

A few months into running the GoReliable platform, I got paged at 11 PM. The API Gateway's availability SLO burn-rate alert fired with a burn rate of 18×. Orders were failing.

I had just set up the SLO alerting from Part 5. Without it, I might not have noticed until the next morning. With it, I was investigating within minutes.

What I found: the Order Service's database connection pool was exhausted. A long-running query held a table lock, and 25 connections had piled up behind it, all waiting. New order requests were timing out waiting for a free connection. I killed the blocking query, the connections freed up, and orders started succeeding.

The incident itself took 22 minutes to resolve. But the investigation took 15 of those minutes because I had no runbook — I was recreating the debugging steps from memory.

After writing the post-incident review, I built the incident tooling this article describes. The next time a similar pattern fires, the runbook tells me exactly where to look.

Alertmanager Configuration

Alertmanager routes alerts from Prometheus to notification channels. I configure it with a few routes: critical alerts go to PagerDuty (which triggers a page) and also to Slack, warning alerts go to Slack only, and the always-firing Watchdog alert goes to a null receiver.

# infrastructure/prometheus-stack/alertmanager-config.yaml
# Applied as a Secret to the kube-prometheus-stack Helm release
global:
  resolve_timeout: 5m
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

route:
  receiver: 'null'          # Default: silence alerts not matched by any route
  group_by: ['alertname', 'sloth_slo', 'severity']
  group_wait: 30s           # Wait 30s before sending first notification (group related alerts)
  group_interval: 5m        # Wait 5m before sending updates on already-firing alerts
  repeat_interval: 4h       # Re-notify every 4h if alert is still firing

  routes:
    # Critical SLO burn-rate alerts → PagerDuty
    - matchers:
        - severity = critical
        - alertname =~ ".*Page$"
      receiver: pagerduty-critical
      continue: true         # Also send to Slack

    # Critical alerts → Slack
    - matchers:
        - severity = critical
      receiver: slack-critical

    # Warning alerts → Slack only
    - matchers:
        - severity = warning
      receiver: slack-warning

    # Watchdog — always firing; routed to 'null' here so it never notifies
    # (point it at a dead man's switch service to detect a silent pipeline)
    - matchers:
        - alertname = Watchdog
      receiver: 'null'

receivers:
  - name: 'null'

  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: '{{ .Values.pagerduty.serviceKey }}'  # Events API v2 key (service_key is the legacy v1 field)
        severity: critical
        description: '{{ template "pagerduty.default.description" . }}'
        details:
          slo: '{{ .CommonLabels.sloth_slo }}'
          burn_rate: '{{ .CommonLabels.sloth_burn_rate }}'
          runbook: '{{ .CommonAnnotations.runbook }}'

  - name: slack-critical
    slack_configs:
      - api_url: '{{ .Values.slack.webhookURL }}'
        channel: '#incidents'
        color: 'danger'
        title: '🔴 Critical: {{ .CommonLabels.alertname }}'
        text: |
          *SLO:* {{ .CommonLabels.sloth_slo }}
          *Burn Rate:* {{ .CommonLabels.sloth_burn_rate }}×
          *Summary:* {{ .CommonAnnotations.summary }}
          *Runbook:* {{ .CommonAnnotations.runbook }}
        send_resolved: true

  - name: slack-warning
    slack_configs:
      - api_url: '{{ .Values.slack.webhookURL }}'
        channel: '#sre-alerts'
        color: 'warning'
        title: '🟡 Warning: {{ .CommonLabels.alertname }}'
        text: |
          *SLO:* {{ .CommonLabels.sloth_slo }}
          *Summary:* {{ .CommonAnnotations.summary }}
          *Runbook:* {{ .CommonAnnotations.runbook }}
        send_resolved: true

inhibit_rules:
  # Suppress warnings when a critical alert for the same SLO is already firing
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['sloth_slo']

The inhibit_rules section is important. Without it, when the critical alert fires, the warning also fires — resulting in duplicate notifications for the same problem. The inhibit rule suppresses the warning when the critical is already active.

Runbooks as Code

Every alert in the Alertmanager config references a runbook URL. I store runbooks as Markdown files in the go-reliable-gitops repository under a runbooks/ directory.

Storing runbooks in Git means:

  • They're versioned — I can see when a runbook was last updated and why

  • They go through code review — bad runbooks get caught before an incident

  • They're co-located with the system they describe — no separate wiki to keep in sync
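The link travels as an alert annotation, which the Alertmanager templates read via {{ .CommonAnnotations.runbook }}. A fragment of a Prometheus alert rule might look like this (the alert name, expression, and URL are illustrative, not the repo's actual values):

```yaml
# Illustrative fragment; not the actual rule from the repo.
- alert: OrderServiceDBPoolExhaustedPage
  expr: vector(1)  # placeholder; the real expression is the burn-rate query from Part 5
  labels:
    severity: critical
  annotations:
    summary: "Order Service database connection pool exhausted"
    runbook: "https://github.com/<org>/go-reliable-gitops/blob/main/runbooks/order-service-db-pool.md"
```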

Here's the runbook for the database connection pool alert:

Diagnosis

1. Confirm the pool is exhausted (active connection count at the pool limit)

2. Identify blocking queries in PostgreSQL

3. Kill the blocking query (if safe)
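Steps 2 and 3 map onto standard PostgreSQL catalog queries. This is my sketch, not the runbook's exact commands; `<blocking_pid>` is a placeholder you fill in from the first query's output:

```sql
-- Step 2: find sessions that are blocked, and which session is blocking them.
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;

-- Step 3: terminate the blocking backend (only if it is safe to do so).
SELECT pg_terminate_backend(<blocking_pid>);
```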

Resolution

  • If blocking query: terminate it, investigate root cause

  • If traffic spike: check if HPA has scaled — if not, manually scale: kubectl scale deployment order-service -n go-reliable-production --replicas=4

  • If connection leak: this requires a code fix — restart the service as a temporary measure

Escalation

If not resolved within 15 minutes, escalate. An Order Service that can't create orders is direct revenue impact.

Post-Incident

File a post-incident review using the template in /runbooks/templates/post-incident.md.

During an incident, instead of typing out a series of kubectl and psql commands from memory, I run a single subcommand of the Go CLI that gathers the same diagnostics. The CLI is deployed as a Kubernetes Job when I need it from within the cluster, and as a local binary when running from my laptop with KUBECONFIG set.
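As a sketch of the idea (the tool name `incidentctl`, its flags, and the exact kubectl calls here are hypothetical, not the actual binary from this series): a `diagnose` subcommand expands one short invocation into the full set of kubectl diagnostics I'd otherwise type by hand.

```go
// Sketch of an incident CLI subcommand; names and commands are illustrative.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// diagCommands expands a single "diagnose <service>" invocation into the
// kubectl calls I would otherwise type by hand during an incident.
func diagCommands(namespace, service string) [][]string {
	return [][]string{
		{"kubectl", "get", "pods", "-n", namespace, "-l", "app=" + service},
		{"kubectl", "top", "pods", "-n", namespace, "-l", "app=" + service},
		{"kubectl", "logs", "-n", namespace, "deploy/" + service, "--tail=100"},
		{"kubectl", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"},
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: incidentctl diagnose <service>")
		os.Exit(1)
	}
	for _, args := range diagCommands("go-reliable-production", os.Args[1]) {
		fmt.Println("$", strings.Join(args, " "))
		cmd := exec.Command(args[0], args[1:]...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		_ = cmd.Run() // best-effort: show whatever diagnostics we can get
	}
}
```

The point is not the specific commands; it's that the sequence is written down once, reviewed, and identical every time, rather than reconstructed from memory at 2 AM.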

Toil Identification and Reduction

I define toil as manual, repetitive work that doesn't improve reliability — it just keeps the lights on. I track it with a simple log.
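The log itself is nothing fancy; a Markdown table in the repo works. The columns and the sample entry here are my own illustration, based on the Notification Worker case this section describes:

```markdown
| Date       | Task                                  | Time spent | Frequency | Automatable? |
|------------|---------------------------------------|------------|-----------|--------------|
| YYYY-MM-DD | Restart Notification Worker (backlog) | 30m        | weekly    | yes          |
```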

The most expensive toil I found: restarting the Notification Worker every few days when it accumulated a message processing backlog due to a slow email provider. The fix took 3 hours to implement (add a circuit breaker and a metric-based alert), but eliminated 30 minutes of weekly manual work — it paid for itself in six weeks.

The circuit breaker wraps the email sending call. If the email provider fails 5 consecutive times, the circuit opens and messages are NAK'd back to NATS for redelivery rather than accumulating in a stuck worker.

Post-Incident Review Template

After every significant incident I write a post-incident review, using the template kept in the GitOps repo at runbooks/templates/post-incident.md.
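A sketch of what such a template covers (the section names are my approximation; the repo's actual template may differ):

```markdown
# Post-Incident Review: <incident title>

- **Date / duration:**
- **Severity and SLO impact:** (error budget consumed)
- **Detection:** how and when the alert fired

## Timeline

## Root cause

## What went well / what went poorly

## Action items

- [ ] <action> (file as a GitHub issue with the `post-incident` label)
```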

The action items are the most important part. Without them, post-incident reviews are academic exercises. I track them as GitHub issues with a post-incident label.

In Part 7, I take a proactive approach to reliability: load testing the services before incidents find the limits, profiling Go binaries in production, and intentionally breaking things with Chaos Mesh to verify the system holds.
