Part 8: Operational Excellence

The Runbook That Saved My Weekend

It was Saturday at 2 PM when a critical alert fired. I was on-call, but thankfully I wasn't alone: the runbook I'd written three months earlier walked me through the exact steps to resolve the issue. Fifteen minutes later, the incident was closed, and I was back to my weekend.

That's the power of good operational documentation. But I learned this the hard way. Early in my career, I'd get paged at 3 AM, scramble to remember what to do, and waste precious minutes searching through Slack history and old tickets. Now, our team has comprehensive runbooks, clear on-call practices, and documentation that actually helps during emergencies.

The Three Pillars of Operational Excellence

  1. Runbooks: Step-by-step guides for resolving common issues

  2. On-Call Practices: Sustainable rotation and response protocols

  3. Operational Documentation: Architecture diagrams, dependencies, and decision logs

Let me show you how I build each pillar.

Runbooks: Your 3 AM Friend

A runbook is a documented procedure for handling operational tasks, particularly during incidents. Good runbooks are:

  • Action-oriented: Steps to take, not concepts to understand

  • Tested regularly: Run through them during drills

  • Easy to find: Linked from alerts and dashboards

  • Maintained: Updated after every incident

Runbook Template

Every runbook follows the same shape: a list of likely causes, and for each cause its symptoms, a resolution, an expected time, and a success criterion. Here's an excerpt from our payment-api high-error-rate runbook, picking up at the end of the first cause:

Expected time: 3-5 minutes
Success criteria: Error rate drops below 1%

Cause 2: Database Connection Issues

Symptoms:

  • Errors are "connection timeout" or "too many connections"

  • Database dashboard shows high connection count

Resolution: Restart pods to clear connection pool
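
A sketch of that restart, assuming the service runs as a Deployment called payment-api in a prod namespace (adjust the names to your environment):

```bash
# A rolling restart replaces pods one at a time, so the service stays up
# while stale database connections are dropped along with the old pods.
kubectl rollout restart deployment/payment-api -n prod
kubectl rollout status deployment/payment-api -n prod --timeout=180s

# Then confirm on the database dashboard that the connection count is falling.
```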

Expected time: 2-3 minutes
Success criteria: Connection pool below 80%, error rate normal

Cause 3: Downstream Service Failure

Symptoms:

  • Errors are 502 Bad Gateway or 504 Gateway Timeout

  • One specific endpoint failing

  • Dependency dashboard shows failures

Resolution: Enable circuit breaker
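
How you force the breaker open depends on your stack; as one hedged sketch, assume the flag lives in a ConfigMap the service reads at startup (the names and key below are illustrative, not a real API):

```bash
# Hypothetical: payment-api reads CIRCUIT_BREAKER_FORCE_OPEN at startup and,
# when it is set, serves cached/fallback responses instead of calling the
# failing downstream dependency.
kubectl patch configmap payment-api-config -n prod \
  --type merge -p '{"data":{"CIRCUIT_BREAKER_FORCE_OPEN":"true"}}'
kubectl rollout restart deployment/payment-api -n prod
```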

Expected time: 1 minute
Success criteria: No more 502s, service degraded but functional

Cause 4: Resource Exhaustion

Symptoms:

  • Pods showing high CPU/memory usage

  • Slow response times before errors

  • OOMKilled in pod events

Resolution: Scale up immediately
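
A sketch, assuming the usual Deployment name and that an autoscaler may be pinned at its ceiling (both names are illustrative):

```bash
# Add capacity right away; the exact replica count depends on current load.
kubectl scale deployment/payment-api -n prod --replicas=10

# If an HPA is already at maxReplicas it will fight a manual scale,
# so raise the ceiling too (assumes an HPA named payment-api).
kubectl patch hpa payment-api -n prod --type merge -p '{"spec":{"maxReplicas":20}}'

# Watch the new pods come up and pass readiness checks.
kubectl get pods -n prod -l app=payment-api -w
```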

Expected time: 2-4 minutes
Success criteria: CPU/memory below 70%, error rate normal

Still Not Resolved?

If none of the above worked:

  1. Escalate: Page tech lead via PagerDuty: pd escalate

  2. Enable maintenance mode: Buy time to investigate

  3. Check recent changes (see the sketch after this list):

    • Database migrations: ./scripts/check-migrations.sh

    • Infrastructure changes: Check Terraform Cloud runs

    • Configuration changes: Check ArgoCD sync history
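
A sketch of that sweep from the terminal, assuming the app is registered in ArgoCD as payment-api and the argocd CLI is logged in (names are illustrative):

```bash
# Recent rollouts of the service itself
kubectl rollout history deployment/payment-api -n prod

# What ArgoCD synced recently (assumes the app name)
argocd app history payment-api | tail -n 5

# Pending or recent schema changes (script referenced above)
./scripts/check-migrations.sh

# Recent cluster events often point at infrastructure or config churn
kubectl get events -n prod --sort-by=.lastTimestamp | tail -n 20
```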

After Resolution

  1. Document what happened: Add comment to PagerDuty incident

  2. Update metrics: Note resolution time and method

  3. Schedule postmortem: If SEV1/SEV2, create postmortem doc

  4. Update this runbook: If you found new information

Useful Commands
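
The command list in the real runbook is specific to our environment; this is the general shape of it, with the service, namespace, and health endpoint as placeholders:

```bash
# Recent errors (kubectl picks one pod when given a deployment)
kubectl logs -n prod deployment/payment-api --since=15m | grep -i error | tail -n 50

# Pod status and restart counts
kubectl get pods -n prod -l app=payment-api

# Resource usage at a glance (requires metrics-server)
kubectl top pods -n prod

# Hit the health endpoint from inside the cluster (service name and path assumed)
kubectl run curl-check --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s http://payment-api.prod.svc.cluster.local/healthz
```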

Contact Information

  • Team Slack: #platform-team

  • Incident Channel: #incidents

  • Tech Lead: @alice (primary), @bob (backup)

  • PagerDuty: Platform Team escalation policy

Last Updated: 2026-02-17
Last Tested: 2026-02-10 (during monthly drill)

Runbooks live in the repo under a predictable structure, so alerts and dashboards can link straight to the right file:

```
docs/runbooks/
├── README.md                        # Runbook index
├── services/
│   ├── payment-api/
│   │   ├── high-error-rate.md
│   │   ├── high-latency.md
│   │   ├── pod-crashloop.md
│   │   └── deployment-issues.md
│   └── user-service/
│       └── ...
├── infrastructure/
│   ├── database-connection-pool.md
│   ├── redis-cache-miss.md
│   ├── kubernetes-node-not-ready.md
│   └── load-balancer-issues.md
├── operations/
│   ├── deployment-rollback.md
│   ├── maintenance-mode.md
│   ├── scaling-services.md
│   └── database-migration.md
└── templates/
    └── runbook-template.md
```

Runbook Testing

I test every runbook at least quarterly during scheduled drills.
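
A drill doesn't need to be elaborate: inject a realistic fault in staging, let the alert fire, and have the on-call engineer work the runbook end to end while a facilitator times each step and notes anything missing or stale. A minimal sketch of the fault injection (staging names are placeholders):

```bash
# Example game-day fault: take the staging service down so the alert fires,
# then resolve it using only the runbook.
kubectl scale deployment/payment-api -n staging --replicas=0

# After the drill, restore staging.
kubectl scale deployment/payment-api -n staging --replicas=3
```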

After each drill, update the runbook based on findings.

On-Call Practices

Being on-call can be stressful. Good practices make it sustainable.

On-Call Rotation

I use a follow-the-sun rotation when possible:

  • US shifts: 9 AM - 5 PM EST (primary), 5 PM - 9 AM EST (secondary)

  • EU shifts: 9 AM - 5 PM CET (primary), 5 PM - 9 AM CET (secondary)

  • Weekend: 24-hour shifts with higher compensation

Rotation schedule: One week on-call, two weeks off

This ensures:

  • No one is on-call for more than one week at a time

  • Always two people on-call (primary and secondary)

  • Minimal middle-of-the-night pages for primary
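
With a fixed weekly cadence, working out who's primary is just arithmetic. A small sketch (the roster and anchor date are placeholders, not our real schedule):

```bash
#!/usr/bin/env bash
# One week on, two weeks off implies a three-person cycle per role.
ROSTER=(alice bob carol)
ANCHOR="2026-01-05"   # a Monday when ROSTER[0] started a shift (assumed)

now=$(date +%s)
start=$(date -d "$ANCHOR" +%s)        # GNU date; adjust for macOS
weeks=$(( (now - start) / 604800 ))
echo "Primary this week: ${ROSTER[$(( weeks % ${#ROSTER[@]} ))]}"
```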

On-Call Expectations

Before your shift:

During your shift:

After your shift:

Handoff Template

On-Call Compensation

Fair compensation is critical for sustainable on-call:

  • Base on-call pay: $200/week (just for being on-call)

  • Incident pay: $50/hour for time spent on incidents

  • Weekend premium: 1.5x incident pay

  • Comp time: Option to take time off after heavy on-call weeks

I track this automatically.
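
The real tracking hooks into our incident tooling; as a rough sketch, assume incident time is exported to a CSV with start/end timestamps and a weekend flag (a hypothetical format, not our actual export):

```bash
#!/usr/bin/env bash
# Sketch: compute one week's on-call pay from an incident export.
# Hypothetical CSV columns: incident_id,start_iso,end_iso,weekend(0 or 1)
# Pay model from this section: $200/week base, $50/hour, 1.5x on weekends.
set -euo pipefail

CSV="${1:-oncall-week.csv}"
BASE=200
HOURLY=50
total=$BASE

while IFS=, read -r id start end weekend; do
  secs=$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))   # GNU date
  rate=$HOURLY
  [[ "$weekend" == "1" ]] && rate=$(awk -v r="$HOURLY" 'BEGIN { print r * 1.5 }')
  pay=$(awk -v s="$secs" -v r="$rate" 'BEGIN { printf "%.2f", s / 3600 * r }')
  total=$(awk -v t="$total" -v p="$pay" 'BEGIN { printf "%.2f", t + p }')
  echo "$id: $(( secs / 60 )) min at \$$rate/hr -> \$$pay"
done < <(tail -n +2 "$CSV")   # skip the header row

echo "Week total (including \$$BASE base): \$$total"
```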

Reducing Alert Fatigue

Problem: Too many alerts train people to ignore them.

My solution: Alert on symptoms, not causes. Alert on impact, not potential impact.

Bad alerts:

  • ❌ "CPU usage above 80%"

  • ❌ "Disk space above 70%"

  • ❌ "Memory usage trending up"

Good alerts:

  • βœ… "Error rate above 5% (users affected)"

  • βœ… "P95 latency above SLO (user experience degraded)"

  • βœ… "Disk space will be full in < 4 hours (action required)"

Alert severity criteria:

  • Critical (page immediately): Customer impact now

  • Warning (Slack notification): Will become critical if not addressed in 4+ hours

  • Info (dashboard only): Good to know, no action required

I track alert quality.
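
Again, the real version hooks into our alerting history; a rough sketch, assuming each page is logged to a CSV along with whether it actually required action (a hypothetical format):

```bash
#!/usr/bin/env bash
# Sketch: per-alert "actionable rate" from a hypothetical page log.
# Assumed CSV columns: alert_name,fired_at,actionable(0 or 1)
# Alerts that fire often but rarely require action are candidates for
# demotion to a Slack warning or for deletion.
set -euo pipefail

CSV="${1:-pages.csv}"

tail -n +2 "$CSV" | awk -F, '
  { fired[$1]++; acted[$1] += $3 }
  END {
    printf "%-40s %6s %12s\n", "alert", "pages", "actionable"
    for (a in fired)
      printf "%-40s %6d %11.0f%%\n", a, fired[a], 100 * acted[a] / fired[a]
  }'
```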

Operational Documentation

Beyond runbooks, maintain these key documents:

Architecture Diagrams

Dependency Matrix

Configuration Reference

Decision Log

Document important architectural decisions as they're made.
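
One lightweight way to keep the log alive is to make starting an entry trivial. A sketch of a scaffolding helper (the docs/decisions path and the template fields are assumptions, not a prescribed format):

```bash
#!/usr/bin/env bash
# Sketch: scaffold a new decision-log entry (ADR-style).
set -euo pipefail

title="${1:?usage: new-decision.sh \"Short decision title\"}"
dir="docs/decisions"
mkdir -p "$dir"

count=$(find "$dir" -name '[0-9]*.md' | wc -l)
num=$(printf "%04d" $(( count + 1 )))
slug=$(echo "$title" | tr 'A-Z ' 'a-z-')
file="$dir/$num-$slug.md"

cat > "$file" <<EOF
# $num. $title

Date: $(date +%F)
Status: Proposed

## Context
What problem are we solving, and under what constraints?

## Decision
What we chose and why.

## Consequences
What gets easier, what gets harder, and what we'll revisit.
EOF

echo "Created $file"
```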

Knowledge Sharing

Create a culture of documentation:

Weekly Knowledge Sharing

Every Friday, someone presents a 15-minute topic:

  • "Deep dive: How our circuit breaker works"

  • "Postmortem review: Last week's database incident"

  • "New tool: Introduction to k9s for Kubernetes"

  • "Architecture walkthrough: Payment processing flow"

Record these sessions and add to the knowledge base.

New Engineer Onboarding

Checklist for new platform engineers:

Key Takeaways

  1. Runbooks should be action-oriented: Steps to take, not concepts to learn

  2. On-call should be sustainable: Fair compensation, reasonable rotation, good handoffs

  3. Documentation decays: Set reminders to review quarterly

  4. Test your runbooks: Regular drills ensure they work when you need them

  5. Reduce toil: Automate repetitive operational tasks

  6. Share knowledge: Documentation is good, but teaching is better

Conclusion: Release Engineering as a Practice

Over the course of this series, we've covered the full spectrum of release and reliability engineering:

  1. Introduction: Philosophy and principles

  2. Deployment Strategies: Blue/green, canary, rollbacks

  3. CI/CD Pipelines: Testing gates and promotion flows

  4. Release Management: Integrating Jira, GitHub, ArgoCD, Kubernetes

  5. Standardization: Reproducible deployments and configuration as code

  6. Reliability Metrics: SLOs, error budgets, and uptime practices

  7. Incident Response: Detection, triage, mitigation, and prevention

  8. Operational Excellence: Runbooks, on-call, and documentation

These practices didn't happen overnight. They evolved through years of incidents, postmortems, and incremental improvements. Start small: pick one area that causes the most pain and improve it. Then move to the next.

Remember: the goal isn't perfection; it's resilience. Systems will fail. The question is how quickly you detect, respond, and learn from those failures.

Good luck building reliable, operable systems. May your deployments be boring and your on-call shifts quiet.


Previous: Part 7: Incident Response and Management | Back to Series Overview
