Part 6: Incident Management and On-Call Automation

Part of the SRE Playbook series

What You'll Learn: This article covers how I wire up the SLO-based alerts from Part 5 into an actionable incident management workflow. You'll see my Alertmanager routing configuration, Slack and PagerDuty integration, how I store runbooks as Markdown in the GitOps repository, a Go CLI tool I built for incident response tasks, and how I conduct post-incident reviews. The goal is to make the 2 AM page as short as possible.

The First Real Incident

A few months after running the GoReliable platform, I got paged at 11 PM. The API Gateway's availability SLO burn-rate alert fired — burn rate at 18×. Orders were failing.

I had just set up the SLO alerting from Part 5. Without it, I might not have noticed until the next morning. With it, I was investigating within minutes.

What I found: the Order Service's database connection pool was exhausted. A long-running query had accumulated 25 connections, all waiting for a table lock. New order requests were timing out waiting for a connection. I killed the blocking query, connections freed up, orders started succeeding.

The incident itself took 22 minutes to resolve. But the investigation took 15 of those minutes because I had no runbook — I was recreating the debugging steps from memory.

After writing the post-incident review, I built the incident tooling this article describes. The next time a similar pattern fires, the runbook tells me exactly where to look.

Alertmanager Configuration

Alertmanager handles routing alerts from Prometheus to notification channels. I configure it with two routes: critical alerts go to PagerDuty (triggers a page) and warning alerts go to Slack only.

# infrastructure/prometheus-stack/alertmanager-config.yaml
# Applied as a Secret to the kube-prometheus-stack Helm release
global:
  resolve_timeout: 5m
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

route:
  receiver: 'null'          # Default: silence alerts not matched by any route
  group_by: ['alertname', 'sloth_slo', 'severity']
  group_wait: 30s           # Wait 30s before sending first notification (group related alerts)
  group_interval: 5m        # Wait 5m before sending updates on already-firing alerts
  repeat_interval: 4h       # Re-notify every 4h if alert is still firing

  routes:
    # Critical SLO burn-rate alerts → PagerDuty
    - matchers:
        - severity = critical
        - alertname =~ ".*Page$"
      receiver: pagerduty-critical
      continue: true         # Also send to Slack

    # Critical alerts → Slack
    - matchers:
        - severity = critical
      receiver: slack-critical

    # Warning alerts → Slack only
    - matchers:
        - severity = warning
      receiver: slack-warning

    # Watchdog — always fires to confirm alerting pipeline is working
    - matchers:
        - alertname = Watchdog
      receiver: 'null'

receivers:
  - name: 'null'

  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: '{{ .Values.pagerduty.serviceKey }}'
        severity: critical
        description: '{{ template "pagerduty.default.description" . }}'
        details:
          slo: '{{ .CommonLabels.sloth_slo }}'
          burn_rate: '{{ .CommonLabels.sloth_burn_rate }}'
          runbook: '{{ .CommonAnnotations.runbook }}'

  - name: slack-critical
    slack_configs:
      - api_url: '{{ .Values.slack.webhookURL }}'
        channel: '#incidents'
        color: 'danger'
        title: '🔴 Critical: {{ .CommonLabels.alertname }}'
        text: |
          *SLO:* {{ .CommonLabels.sloth_slo }}
          *Burn Rate:* {{ .CommonLabels.sloth_burn_rate }}×
          *Summary:* {{ .CommonAnnotations.summary }}
          *Runbook:* {{ .CommonAnnotations.runbook }}
        send_resolved: true

  - name: slack-warning
    slack_configs:
      - api_url: '{{ .Values.slack.webhookURL }}'
        channel: '#sre-alerts'
        color: 'warning'
        title: '🟡 Warning: {{ .CommonLabels.alertname }}'
        text: |
          *SLO:* {{ .CommonLabels.sloth_slo }}
          *Summary:* {{ .CommonAnnotations.summary }}
          *Runbook:* {{ .CommonAnnotations.runbook }}
        send_resolved: true

inhibit_rules:
  # Suppress warnings when a critical alert for the same SLO is already firing
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['sloth_slo']

The inhibit_rules section is important. Without it, when the critical alert fires, the warning also fires — resulting in duplicate notifications for the same problem. The inhibit rule suppresses the warning when the critical is already active.

Runbooks as Code

Every alert in the Alertmanager config references a runbook URL. I store runbooks as Markdown files in the go-reliable-gitops repository under a runbooks/ directory.

Storing runbooks in Git means:

They're versioned — I can see when a runbook was last updated and why
They go through code review — bad runbooks get caught before an incident
They're co-located with the system they describe — no separate wiki to keep in sync

Here's the runbook for the database connection pool alert:

# Runbook: Order Service — Database Connection Pool Exhaustion

**Alert:** OrderServiceHighDBConnectionUsage
**Severity:** Warning → Critical
**SLO Impact:** Order creation success rate

## Symptoms

- Order creation requests timing out or returning 500 errors
- `goreliable_order_service_db_active_connections` metric near the `DB_MAX_OPEN_CONNS` limit
- Prometheus alert: `OrderServiceDBConnectionPoolExhausted`

## Likely Causes (in order of frequency)

1. **Long-running query holding connections** — A slow query locks rows and holds connections for extended periods
2. **Sudden traffic spike** — More concurrent requests than the pool can support
3. **Connection leak** — Connections not being properly released (check for missing `rows.Close()` calls)
4. **Migration running** — Database migrations consume extra connections

## Investigation Steps

### 1. Check current connection count

```bash
kubectl exec -it -n go-reliable-production \
  $(kubectl get pod -n go-reliable-production -l app=order-service -o jsonpath='{.items[0].metadata.name}') \
  -- /bin/sh -c "wget -q -O- http://localhost:9090/metrics | grep db_active_connections"

2. Identify blocking queries in PostgreSQL

SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
ORDER BY duration DESC;

3. Kill blocking query (if safe)

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid = <blocking_pid>;

Resolution

If blocking query: terminate it, investigate root cause
If traffic spike: check if HPA has scaled — if not, manually scale: kubectl scale deployment order-service -n go-reliable-production --replicas=4
If connection leak: this requires a code fix — restart the service as a temporary measure

Escalation

If not resolved within 15 minutes, escalate. The Order Service not creating orders is direct revenue impact.

Post-Incident

File a post-incident review using the template in /runbooks/templates/post-incident.md.


Every runbook follows this format: symptoms, likely causes, investigation steps, resolution, escalation, post-incident.

## The Go Incident CLI

During incidents I found myself running the same sequences of kubectl commands. I built a small Go CLI — `gorel-incident` — that automates the common investigation and response tasks.

```go
// cmd/gorel-incident/main.go
package main

import (
    "os"

    "github.com/spf13/cobra"
    "github.com/htunn/go-reliable/internal/incident"
)

func main() {
    rootCmd := &cobra.Command{
        Use:   "gorel-incident",
        Short: "GoReliable incident response CLI",
    }

    rootCmd.AddCommand(
        incident.NewStatusCmd(),
        incident.NewScaleCmd(),
        incident.NewRestartCmd(),
        incident.NewDrainCmd(),
        incident.NewProfileCmd(),
    )

    if err := rootCmd.Execute(); err != nil {
        os.Exit(1)
    }
}

// internal/incident/status.go
package incident

import (
    "context"
    "fmt"
    "os"
    "text/tabwriter"

    "github.com/spf13/cobra"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func NewStatusCmd() *cobra.Command {
    var namespace string

    cmd := &cobra.Command{
        Use:   "status",
        Short: "Show status of all GoReliable services",
        RunE: func(cmd *cobra.Command, args []string) error {
            return runStatus(namespace)
        },
    }

    cmd.Flags().StringVarP(&namespace, "namespace", "n", "go-reliable-production", "Kubernetes namespace")
    return cmd
}

func runStatus(namespace string) error {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        return fmt.Errorf("build kubeconfig: %w", err)
    }

    cs, err := kubernetes.NewForConfig(config)
    if err != nil {
        return fmt.Errorf("create k8s client: %w", err)
    }

    deployments, err := cs.AppsV1().Deployments(namespace).List(context.Background(), metav1.ListOptions{})
    if err != nil {
        return fmt.Errorf("list deployments: %w", err)
    }

    w := tabwriter.NewWriter(os.Stdout, 0, 0, 2, ' ', 0)
    fmt.Fprintln(w, "NAME\tREADY\tUP-TO-DATE\tAVAILABLE\tAGE")
    for _, d := range deployments.Items {
        fmt.Fprintf(w, "%s\t%d/%d\t%d\t%d\t%s\n",
            d.Name,
            d.Status.ReadyReplicas, d.Spec.Replicas,
            d.Status.UpdatedReplicas,
            d.Status.AvailableReplicas,
            d.CreationTimestamp.String(),
        )
    }
    return w.Flush()
}

// internal/incident/scale.go
func NewScaleCmd() *cobra.Command {
    var namespace string
    var replicas int32

    cmd := &cobra.Command{
        Use:   "scale [deployment]",
        Short: "Emergency scale a deployment",
        Args:  cobra.ExactArgs(1),
        RunE: func(cmd *cobra.Command, args []string) error {
            return runScale(namespace, args[0], replicas)
        },
    }

    cmd.Flags().StringVarP(&namespace, "namespace", "n", "go-reliable-production", "Kubernetes namespace")
    cmd.Flags().Int32VarP(&replicas, "replicas", "r", 0, "Target replica count")
    cmd.MarkFlagRequired("replicas")

    return cmd
}

func runScale(namespace, deployment string, replicas int32) error {
    // Build client, scale deployment
    // ...
    fmt.Printf("Scaled %s to %d replicas in %s\n", deployment, replicas, namespace)
    return nil
}

During an incident, instead of:

kubectl get deployments -n go-reliable-production
kubectl scale deployment order-service -n go-reliable-production --replicas=4
kubectl rollout status deployment/order-service -n go-reliable-production

I run:

gorel-incident status
gorel-incident scale order-service -r 4

The CLI is deployed as a Kubernetes Job when I need it from within the cluster, and as a local binary when running from my laptop with KUBECONFIG set.

Toil Identification and Reduction

I define toil as manual, repetitive work that doesn't improve reliability — it just keeps the lights on. I track it with a simple log.

The most expensive toil I found: restarting the Notification Worker every few days when it accumulated a message processing backlog due to a slow email provider. The fix took 3 hours to implement (add a circuit breaker and a metric-based alert), but eliminated 30 minutes of weekly manual work.

// internal/notification/circuitbreaker.go
package notification

import (
    "errors"
    "sync"
    "time"
)

type CircuitState int

const (
    StateClosed   CircuitState = iota // Normal operation
    StateOpen                         // Rejecting calls
    StateHalfOpen                     // Testing if service recovered
)

type CircuitBreaker struct {
    mu           sync.RWMutex
    state        CircuitState
    failures     int
    threshold    int
    resetAfter   time.Duration
    lastFailure  time.Time
}

func NewCircuitBreaker(threshold int, resetAfter time.Duration) *CircuitBreaker {
    return &CircuitBreaker{
        threshold:  threshold,
        resetAfter: resetAfter,
        state:      StateClosed,
    }
}

var ErrCircuitOpen = errors.New("circuit breaker is open")

func (cb *CircuitBreaker) Call(fn func() error) error {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    switch cb.state {
    case StateOpen:
        if time.Since(cb.lastFailure) > cb.resetAfter {
            cb.state = StateHalfOpen
        } else {
            return ErrCircuitOpen
        }
    }

    err := fn()

    if err != nil {
        cb.failures++
        cb.lastFailure = time.Now()
        if cb.failures >= cb.threshold {
            cb.state = StateOpen
        }
    } else {
        cb.failures = 0
        cb.state = StateClosed
    }

    return err
}

The circuit breaker wraps the email sending call. If the email provider fails 5 consecutive times, the circuit opens and messages are NAK'd back to NATS for redelivery rather than accumulating in a stuck worker.

Post-Incident Review Template

After every significant incident I write a post-incident review. I keep the template in the GitOps repo:

# Post-Incident Review: [SERVICE] [DATE]

## Summary

**Duration:** [start] to [end] ([X] minutes)
**Impact:** [User-facing impact description]
**SLO Impact:** [Error budget consumed]

## Timeline

| Time | Event |
|------|-------|
| HH:MM | Alert fired |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Service restored |

## Root Cause

[Technical description of what went wrong]

## Contributing Factors

- [Factor 1]
- [Factor 2]

## What Went Well

- [Positive 1]

## What Went Badly

- [Negative 1]

## Action Items

| Item | Owner | Due Date |
|------|-------|----------|
| [Action] | [Person] | [Date] |

The action items are the most important part. Without them, post-incident reviews are academic exercises. I track them as GitHub issues with a post-incident label.

In Part 7, I take a proactive approach to reliability: load testing the services before incidents find the limits, profiling Go binaries in production, and intentionally breaking things with Chaos Mesh to verify the system holds.

PreviousPart 5: SLIs, SLOs, and Error Budgets in Practice NextPart 7: Capacity Planning, Performance, and Chaos Engineering

Last updated 4 days ago

hashtagThe First Real Incident

hashtagAlertmanager Configuration

hashtagRunbooks as Code

hashtag2. Identify blocking queries in PostgreSQL

hashtag3. Kill blocking query (if safe)

hashtagResolution

hashtagEscalation

hashtagPost-Incident

hashtagToil Identification and Reduction

hashtagPost-Incident Review Template