Part 8: Operational Excellence

The Runbook That Saved My Weekend

It was Saturday at 2 PM when a critical alert fired. I was on-call, but thankfully I wasn't alone: the runbook I'd written three months earlier walked me through the exact steps to resolve the issue. Fifteen minutes later, the incident was closed, and I was back to my weekend.

That's the power of good operational documentation. But I learned this the hard way. Early in my career, I'd get paged at 3 AM, scramble to remember what to do, and waste precious minutes searching through Slack history and old tickets. Now, our team has comprehensive runbooks, clear on-call practices, and documentation that actually helps during emergencies.

The Three Pillars of Operational Excellence

  1. Runbooks: Step-by-step guides for resolving common issues

  2. On-Call Practices: Sustainable rotation and response protocols

  3. Operational Documentation: Architecture diagrams, dependencies, and decision logs

Let me show you how I build each pillar.

Runbooks: Your 3 AM Friend

A runbook is a documented procedure for handling operational tasks, particularly during incidents. Good runbooks are:

  • Action-oriented: Steps to take, not concepts to understand

  • Tested regularly: Run through them during drills

  • Easy to find: Linked from alerts and dashboards

  • Maintained: Updated after every incident

Runbook Template

Every runbook follows the same shape: a list of likely causes, and for each cause its symptoms, a resolution, an expected time, and a success criterion. Here's an excerpt from our payment-api high-error-rate runbook, picking up at the end of the first cause:

Expected time: 3-5 minutes
Success criteria: Error rate drops below 1%

Cause 2: Database Connection Issues

Symptoms:

  • Errors are "connection timeout" or "too many connections"

  • Database dashboard shows high connection count

Resolution: Restart pods to clear connection pool
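
A sketch of that restart, assuming the service runs as a Deployment called payment-api in a prod namespace (adjust the names to your environment):

```bash
# A rolling restart replaces pods one at a time, so the service stays up
# while stale database connections are dropped along with the old pods.
kubectl rollout restart deployment/payment-api -n prod
kubectl rollout status deployment/payment-api -n prod --timeout=180s

# Then confirm on the database dashboard that the connection count is falling.
```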

Expected time: 2-3 minutes
Success criteria: Connection pool below 80%, error rate normal

Cause 3: Downstream Service Failure

Symptoms:

  • Errors are 502 Bad Gateway or 504 Gateway Timeout

  • One specific endpoint failing

  • Dependency dashboard shows failures

Resolution: Enable circuit breaker
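
How you force the breaker open depends on your stack; as one hedged sketch, assume the flag lives in a ConfigMap the service reads at startup (the names and key below are illustrative, not a real API):

```bash
# Hypothetical: payment-api reads CIRCUIT_BREAKER_FORCE_OPEN at startup and,
# when it is set, serves cached/fallback responses instead of calling the
# failing downstream dependency.
kubectl patch configmap payment-api-config -n prod \
  --type merge -p '{"data":{"CIRCUIT_BREAKER_FORCE_OPEN":"true"}}'
kubectl rollout restart deployment/payment-api -n prod
```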

Expected time: 1 minute
Success criteria: No more 502s, service degraded but functional

Cause 4: Resource Exhaustion

Symptoms:

  • Pods showing high CPU/memory usage

  • Slow response times before errors

  • OOMKilled in pod events

Resolution: Scale up immediately
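
A sketch, assuming the usual Deployment name and that an autoscaler may be pinned at its ceiling (both names are illustrative):

```bash
# Add capacity right away; the exact replica count depends on current load.
kubectl scale deployment/payment-api -n prod --replicas=10

# If an HPA is already at maxReplicas it will fight a manual scale,
# so raise the ceiling too (assumes an HPA named payment-api).
kubectl patch hpa payment-api -n prod --type merge -p '{"spec":{"maxReplicas":20}}'

# Watch the new pods come up and pass readiness checks.
kubectl get pods -n prod -l app=payment-api -w
```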

Expected time: 2-4 minutes
Success criteria: CPU/memory below 70%, error rate normal

Still Not Resolved?

If none of the above worked:

  1. Escalate: Page tech lead via PagerDuty: pd escalate

  2. Enable maintenance mode: Buy time to investigate

  3. Check recent changes (see the sketch after this list):

    • Database migrations: ./scripts/check-migrations.sh

    • Infrastructure changes: Check Terraform Cloud runs

    • Configuration changes: Check ArgoCD sync history
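
A sketch of that sweep from the terminal, assuming the app is registered in ArgoCD as payment-api and the argocd CLI is logged in (names are illustrative):

```bash
# Recent rollouts of the service itself
kubectl rollout history deployment/payment-api -n prod

# What ArgoCD synced recently (assumes the app name)
argocd app history payment-api | tail -n 5

# Pending or recent schema changes (script referenced above)
./scripts/check-migrations.sh

# Recent cluster events often point at infrastructure or config churn
kubectl get events -n prod --sort-by=.lastTimestamp | tail -n 20
```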

After Resolution

  1. Document what happened: Add comment to PagerDuty incident

  2. Update metrics: Note resolution time and method

  3. Schedule postmortem: If SEV1/SEV2, create postmortem doc

  4. Update this runbook: If you found new information

Useful Commands
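
The command list in the real runbook is specific to our environment; this is the general shape of it, with the service, namespace, and health endpoint as placeholders:

```bash
# Recent errors (kubectl picks one pod when given a deployment)
kubectl logs -n prod deployment/payment-api --since=15m | grep -i error | tail -n 50

# Pod status and restart counts
kubectl get pods -n prod -l app=payment-api

# Resource usage at a glance (requires metrics-server)
kubectl top pods -n prod

# Hit the health endpoint from inside the cluster (service name and path assumed)
kubectl run curl-check --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s http://payment-api.prod.svc.cluster.local/healthz
```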

Contact Information

  • Team Slack: #platform-team

  • Incident Channel: #incidents

  • Tech Lead: @alice (primary), @bob (backup)

  • PagerDuty: Platform Team escalation policy

Last Updated: 2026-02-17
Last Tested: 2026-02-10 (during monthly drill)

Runbooks live in the repo under a predictable structure, so alerts and dashboards can link straight to the right file:

```
docs/runbooks/
├── README.md                        # Runbook index
├── services/
│   ├── payment-api/
│   │   ├── high-error-rate.md
│   │   ├── high-latency.md
│   │   ├── pod-crashloop.md
│   │   └── deployment-issues.md
│   └── user-service/
│       └── ...
├── infrastructure/
│   ├── database-connection-pool.md
│   ├── redis-cache-miss.md
│   ├── kubernetes-node-not-ready.md
│   └── load-balancer-issues.md
├── operations/
│   ├── deployment-rollback.md
│   ├── maintenance-mode.md
│   ├── scaling-services.md
│   └── database-migration.md
└── templates/
    └── runbook-template.md
```

Runbook Testing

I test every runbook at least quarterly during scheduled drills.
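
A drill doesn't need to be elaborate: inject a realistic fault in staging, let the alert fire, and have the on-call engineer work the runbook end to end while a facilitator times each step and notes anything missing or stale. A minimal sketch of the fault injection (staging names are placeholders):

```bash
# Example game-day fault: take the staging service down so the alert fires,
# then resolve it using only the runbook.
kubectl scale deployment/payment-api -n staging --replicas=0

# After the drill, restore staging.
kubectl scale deployment/payment-api -n staging --replicas=3
```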

After each drill, update the runbook based on findings.

On-Call Practices

Being on-call can be stressful. Good practices make it sustainable.

On-Call Rotation

I use a follow-the-sun rotation when possible:

  • US shifts: 9 AM - 5 PM EST (primary), 5 PM - 9 AM EST (secondary)

  • EU shifts: 9 AM - 5 PM CET (primary), 5 PM - 9 AM CET (secondary)

  • Weekend: 24-hour shifts with higher compensation

Rotation schedule: One week on-call, two weeks off

This ensures:

  • No one is on-call for more than one week at a time

  • Always two people on-call (primary and secondary)

  • Minimal middle-of-the-night pages for primary
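
With a fixed weekly cadence, working out who's primary is just arithmetic. A small sketch (the roster and anchor date are placeholders, not our real schedule):

```bash
#!/usr/bin/env bash
# One week on, two weeks off implies a three-person cycle per role.
ROSTER=(alice bob carol)
ANCHOR="2026-01-05"   # a Monday when ROSTER[0] started a shift (assumed)

now=$(date +%s)
start=$(date -d "$ANCHOR" +%s)        # GNU date; adjust for macOS
weeks=$(( (now - start) / 604800 ))
echo "Primary this week: ${ROSTER[$(( weeks % ${#ROSTER[@]} ))]}"
```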

On-Call Expectations

Before your shift:

During your shift:

After your shift:

Handoff Template

On-Call Compensation

Fair compensation is critical for sustainable on-call:

  • Base on-call pay: $200/week (just for being on-call)

  • Incident pay: $50/hour for time spent on incidents

  • Weekend premium: 1.5x incident pay

  • Comp time: Option to take time off after heavy on-call weeks

I track this automatically.
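
The real tracking hooks into our incident tooling; as a rough sketch, assume incident time is exported to a CSV with start/end timestamps and a weekend flag (a hypothetical format, not our actual export):

```bash
#!/usr/bin/env bash
# Sketch: compute one week's on-call pay from an incident export.
# Hypothetical CSV columns: incident_id,start_iso,end_iso,weekend(0 or 1)
# Pay model from this section: $200/week base, $50/hour, 1.5x on weekends.
set -euo pipefail

CSV="${1:-oncall-week.csv}"
BASE=200
HOURLY=50
total=$BASE

while IFS=, read -r id start end weekend; do
  secs=$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))   # GNU date
  rate=$HOURLY
  [[ "$weekend" == "1" ]] && rate=$(awk -v r="$HOURLY" 'BEGIN { print r * 1.5 }')
  pay=$(awk -v s="$secs" -v r="$rate" 'BEGIN { printf "%.2f", s / 3600 * r }')
  total=$(awk -v t="$total" -v p="$pay" 'BEGIN { printf "%.2f", t + p }')
  echo "$id: $(( secs / 60 )) min at \$$rate/hr -> \$$pay"
done < <(tail -n +2 "$CSV")   # skip the header row

echo "Week total (including \$$BASE base): \$$total"
```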

Reducing Alert Fatigue

Problem: Too many alerts train people to ignore them.

My solution: Alert on symptoms, not causes. Alert on impact, not potential impact.

Bad alerts:

  • ❌ "CPU usage above 80%"

  • ❌ "Disk space above 70%"

  • ❌ "Memory usage trending up"

Good alerts:

  • βœ… "Error rate above 5% (users affected)"

  • βœ… "P95 latency above SLO (user experience degraded)"

  • βœ… "Disk space will be full in < 4 hours (action required)"

Alert severity criteria:

  • Critical (page immediately): Customer impact now

  • Warning (Slack notification): Will become critical if not addressed in 4+ hours

  • Info (dashboard only): Good to know, no action required

I track alert quality.
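
Again, the real version hooks into our alerting history; a rough sketch, assuming each page is logged to a CSV along with whether it actually required action (a hypothetical format):

```bash
#!/usr/bin/env bash
# Sketch: per-alert "actionable rate" from a hypothetical page log.
# Assumed CSV columns: alert_name,fired_at,actionable(0 or 1)
# Alerts that fire often but rarely require action are candidates for
# demotion to a Slack warning or for deletion.
set -euo pipefail

CSV="${1:-pages.csv}"

tail -n +2 "$CSV" | awk -F, '
  { fired[$1]++; acted[$1] += $3 }
  END {
    printf "%-40s %6s %12s\n", "alert", "pages", "actionable"
    for (a in fired)
      printf "%-40s %6d %11.0f%%\n", a, fired[a], 100 * acted[a] / fired[a]
  }'
```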

Operational Documentation

Beyond runbooks, maintain these key documents:

Architecture Diagrams

Dependency Matrix

Configuration Reference

Decision Log

Document important architectural decisions as they're made.
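
One lightweight way to keep the log alive is to make starting an entry trivial. A sketch of a scaffolding helper (the docs/decisions path and the template fields are assumptions, not a prescribed format):

```bash
#!/usr/bin/env bash
# Sketch: scaffold a new decision-log entry (ADR-style).
set -euo pipefail

title="${1:?usage: new-decision.sh \"Short decision title\"}"
dir="docs/decisions"
mkdir -p "$dir"

count=$(find "$dir" -name '[0-9]*.md' | wc -l)
num=$(printf "%04d" $(( count + 1 )))
slug=$(echo "$title" | tr 'A-Z ' 'a-z-')
file="$dir/$num-$slug.md"

cat > "$file" <<EOF
# $num. $title

Date: $(date +%F)
Status: Proposed

## Context
What problem are we solving, and under what constraints?

## Decision
What we chose and why.

## Consequences
What gets easier, what gets harder, and what we'll revisit.
EOF

echo "Created $file"
```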

Knowledge Sharing

Create a culture of documentation:

Weekly Knowledge Sharing

Every Friday, someone presents a 15-minute topic:

  • "Deep dive: How our circuit breaker works"

  • "Postmortem review: Last week's database incident"

  • "New tool: Introduction to k9s for Kubernetes"

  • "Architecture walkthrough: Payment processing flow"

Record these sessions and add to the knowledge base.

New Engineer Onboarding

Checklist for new platform engineers:

Key Takeaways

  1. Runbooks should be action-oriented: Steps to take, not concepts to learn

  2. On-call should be sustainable: Fair compensation, reasonable rotation, good handoffs

  3. Documentation decays: Set reminders to review quarterly

  4. Test your runbooks: Regular drills ensure they work when you need them

  5. Reduce toil: Automate repetitive operational tasks

  6. Share knowledge: Documentation is good, but teaching is better

Conclusion: Release Engineering as a Practice

Over the course of this series, we've covered the full spectrum of release and reliability engineering:

  1. Introduction: Philosophy and principles

  2. Deployment Strategies: Blue/green, canary, rollbacks

  3. CI/CD Pipelines: Testing gates and promotion flows

  4. Release Management: Integrating Jira, GitHub, ArgoCD, Kubernetes

  5. Standardization: Reproducible deployments and configuration as code

  6. Reliability Metrics: SLOs, error budgets, and uptime practices

  7. Incident Response: Detection, triage, mitigation, and prevention

  8. Operational Excellence: Runbooks, on-call, and documentation

These practices didn't happen overnight. They evolved through years of incidents, postmortems, and incremental improvements. Start small: pick one area that causes the most pain and improve it. Then move to the next.

Remember: the goal isn't perfection; it's resilience. Systems will fail. The question is how quickly you detect, respond, and learn from those failures.

Good luck building reliable, operable systems. May your deployments be boring and your on-call shifts quiet.


Previous: Part 7: Incident Response and Management | Back to Series Overview
