Part 8: Operational Excellence
The Runbook That Saved My Weekend
It was Saturday at 2 PM when a critical alert fired. I was on-call, but thankfully I wasn't alone: the runbook I'd written three months earlier walked me through the exact steps to take. Fifteen minutes later, the incident was resolved, and I was back to my weekend.
That's the power of good operational documentation. But I learned this the hard way. Early in my career, I'd get paged at 3 AM, scramble to remember what to do, and waste precious minutes searching through Slack history and old tickets. Now, our team has comprehensive runbooks, clear on-call practices, and documentation that actually helps during emergencies.
The Three Pillars of Operational Excellence
Runbooks: Step-by-step guides for resolving common issues
On-Call Practices: Sustainable rotation and response protocols
Operational Documentation: Architecture diagrams, dependencies, and decision logs
Let me show you how I build each pillar.
Runbooks: Your 3 AM Friend
A runbook is a documented procedure for handling operational tasks, particularly during incidents. Good runbooks are:
Action-oriented: Steps to take, not concepts to understand
Tested regularly: Run through them during drills
Easy to find: Linked from alerts and dashboards
Maintained: Updated after every incident
Runbook Template
Expected time: 3-5 minutes
Success criteria: Error rate drops below 1%
Cause 2: Database Connection Issues
Symptoms:
Errors are "connection timeout" or "too many connections"
Database dashboard shows high connection count
Resolution: Restart pods to clear connection pool
Expected time: 2-3 minutes
Success criteria: Connection pool below 80%, error rate normal
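For a service running as a Kubernetes Deployment, that restart might look like this (the payment-api name and payments namespace are illustrative, not from the original runbook):

```bash
# Confirm the symptom before acting: connection errors in recent logs
kubectl -n payments logs deployment/payment-api --since=10m | grep -ci "connection"

# Rolling restart drops and rebuilds the connection pool without downtime
kubectl -n payments rollout restart deployment/payment-api
kubectl -n payments rollout status deployment/payment-api --timeout=180s
```

If the error count climbs right back up after the restart, the pool size or the database itself is the real problem.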
Cause 3: Downstream Service Failure
Symptoms:
Errors are 502 Bad Gateway or 504 Gateway Timeout
One specific endpoint failing
Dependency dashboard shows failures
Resolution: Enable circuit breaker
Expected time: 1 minute
Success criteria: No more 502s, service degraded but functional
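The exact command depends on how your circuit breaker is exposed. One plausible sketch, assuming it can be forced open with an environment flag on the Deployment (the flag name is hypothetical):

```bash
# Force the breaker open so calls to the failing dependency short-circuit
# to a fallback instead of surfacing 502/504s to users.
kubectl -n payments set env deployment/payment-api CIRCUIT_BREAKER_FORCE_OPEN=true

# The env change triggers a rollout; wait for it to settle
kubectl -n payments rollout status deployment/payment-api --timeout=120s
```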
Cause 4: Resource Exhaustion
Symptoms:
Pods showing high CPU/memory usage
Slow response times before errors
OOMKilled in pod events
Resolution: Scale up immediately
Expected time: 2-4 minutes
Success criteria: CPU/memory below 70%, error rate normal
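A sketch of the scale-up, with the same assumed names; if a HorizontalPodAutoscaler owns the replica count, raise its floor instead so it doesn't scale you straight back down:

```bash
# Check where the pressure is
kubectl -n payments top pods -l app=payment-api

# Immediate relief: add replicas
kubectl -n payments scale deployment/payment-api --replicas=10

# If an HPA manages replicas, raise its minimum instead
kubectl -n payments patch hpa payment-api --type merge -p '{"spec":{"minReplicas":10}}'
```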
Still Not Resolved?
If none of the above worked:
Escalate: Page tech lead via PagerDuty: `pd escalate`
Enable maintenance mode: Buy time to investigate
Check recent changes:
Database migrations: `./scripts/check-migrations.sh`
Infrastructure changes: Check Terraform Cloud runs
Configuration changes: Check ArgoCD sync history
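The deployment and ArgoCD checks can also be run from the CLI; a sketch, assuming the ArgoCD application shares the service's name:

```bash
# What actually rolled out recently, and in which revision
kubectl -n payments rollout history deployment/payment-api

# ArgoCD sync history: which revisions synced, and when
argocd app history payment-api
```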
After Resolution
Document what happened: Add comment to PagerDuty incident
Update metrics: Note resolution time and method
Schedule postmortem: If SEV1/SEV2, create postmortem doc
Update this runbook: If you found new information
Related Runbooks
Useful Commands
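The original command list isn't reproduced here, but a representative set for this kind of runbook might look like this (namespace and label selector are assumptions):

```bash
# Current state of the service
kubectl -n payments get pods -l app=payment-api -o wide

# Recent cluster events, newest last (OOMKills, failed probes, scheduling issues)
kubectl -n payments get events --sort-by=.lastTimestamp | tail -20

# Tail recent logs across the service's pods
kubectl -n payments logs -l app=payment-api --since=15m --tail=100 --prefix

# Resource usage per pod
kubectl -n payments top pods -l app=payment-api

# Roll back the deployment if a bad release is the culprit
kubectl -n payments rollout undo deployment/payment-api
```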
Contact Information
Team Slack: #platform-team
Incident Channel: #incidents
Tech Lead: @alice (primary), @bob (backup)
PagerDuty: Platform Team escalation policy
Last Updated: 2026-02-17
Last Tested: 2026-02-10 (during monthly drill)
All runbooks live under docs/runbooks/, organized by service and area:

```
docs/runbooks/
├── README.md                        # Runbook index
├── services/
│   ├── payment-api/
│   │   ├── high-error-rate.md
│   │   ├── high-latency.md
│   │   ├── pod-crashloop.md
│   │   └── deployment-issues.md
│   └── user-service/
│       └── ...
├── infrastructure/
│   ├── database-connection-pool.md
│   ├── redis-cache-miss.md
│   ├── kubernetes-node-not-ready.md
│   └── load-balancer-issues.md
├── operations/
│   ├── deployment-rollback.md
│   ├── maintenance-mode.md
│   ├── scaling-services.md
│   └── database-migration.md
└── templates/
    └── runbook-template.md
```
Runbook Testing
I test runbooks quarterly during scheduled drills:
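A drill is most useful when the alert actually fires. One way to stage that in a pre-production environment (all names here are illustrative):

```bash
# In staging only: take down a dependency so the high-error-rate alert fires,
# then resolve the incident by following the runbook word for word.
kubectl -n payments-staging scale deployment/payment-gateway-mock --replicas=0

# Time each runbook step, note anything stale or missing, then restore.
kubectl -n payments-staging scale deployment/payment-gateway-mock --replicas=2
```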
After each drill, update the runbook based on findings.
On-Call Practices
Being on-call can be stressful. Good practices make it sustainable.
On-Call Rotation
I use a follow-the-sun rotation when possible:
US shifts: 9 AM - 5 PM EST (primary), 5 PM - 9 AM EST (secondary)
EU shifts: 9 AM - 5 PM CET (primary), 5 PM - 9 AM CET (secondary)
Weekend: 24-hour shifts with higher compensation
Rotation schedule: One week on-call, two weeks off
This ensures:
No one is on-call for more than one week at a time
Always two people on-call (primary and secondary)
Minimal middle-of-the-night pages for primary
On-Call Expectations
Before your shift:
During your shift:
After your shift:
Handoff Template
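A minimal sketch of the fields a handoff note typically covers (a reconstruction, not the original template):

```
On-call handoff: week of <YYYY-MM-DD>

Open incidents:      <IDs, current status, next action and owner>
Alerts that fired:   <count, which were actionable, anything silenced and why>
Ongoing risks:       <deploy freezes, pending migrations, flaky dependencies>
Follow-ups:          <things for the next shift to watch, tickets filed>
Runbook gaps found:  <steps that were missing, wrong, or out of date>
```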
On-Call Compensation
Fair compensation is critical for sustainable on-call:
Base on-call pay: $200/week (just for being on-call)
Incident pay: $50/hour for time spent on incidents
Weekend premium: 1.5x incident pay
Comp time: Option to take time off after heavy on-call weeks
I track this automatically:
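A sketch of the kind of script that can pull these numbers, assuming PagerDuty's REST API and jq; the token, team ID, and $50/hour rate are placeholders:

```bash
#!/usr/bin/env bash
# Rough incident hours (and incident pay) for the past week, from PagerDuty.
set -euo pipefail

SINCE=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)

curl -s \
  -H "Authorization: Token token=${PD_TOKEN}" \
  -H "Accept: application/vnd.pagerduty+json;version=2" \
  "https://api.pagerduty.com/incidents?since=${SINCE}&statuses[]=resolved&team_ids[]=PTEAM123&limit=100" |
jq -r '
  [.incidents[]
   | (.last_status_change_at | fromdateiso8601) - (.created_at | fromdateiso8601)]
  | (add // 0) / 3600
  | "Incident hours this week: \(. * 100 | round / 100) (~$\(. * 50 | round) incident pay)"'
```

The created-to-resolved duration overstates hands-on time, so treat this as a starting point for the conversation, not a payroll system.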
Reducing Alert Fatigue
Problem: Too many alerts lead to ignoring them.
My solution: Alert on symptoms, not causes. Alert on impact, not potential impact.
Bad alerts:
β "CPU usage above 80%"
β "Disk space above 70%"
β "Memory usage trending up"
Good alerts:
β "Error rate above 5% (users affected)"
β "P95 latency above SLO (user experience degraded)"
β "Disk space will be full in < 4 hours (action required)"
Alert severity criteria:
Critical (page immediately): Customer impact now
Warning (Slack notification): Will become critical if not addressed in 4+ hours
Info (dashboard only): Good to know, no action required
I track alert quality:
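A sketch of one way to pull those numbers from PagerDuty (the team ID is a placeholder); alerts that consistently page after hours or resolve themselves are the first candidates to demote to warnings:

```bash
# Pages in the last 30 days: total, high-urgency, and after-hours (UTC)
SINCE=$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)

curl -s \
  -H "Authorization: Token token=${PD_TOKEN}" \
  -H "Accept: application/vnd.pagerduty+json;version=2" \
  "https://api.pagerduty.com/incidents?since=${SINCE}&team_ids[]=PTEAM123&limit=100" |
jq '{
  total:        (.incidents | length),
  high_urgency: ([.incidents[] | select(.urgency == "high")] | length),
  after_hours:  ([.incidents[]
                  | (.created_at[11:13] | tonumber) as $hour
                  | select($hour < 9 or $hour >= 17)] | length)
}'
```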
Operational Documentation
Beyond runbooks, maintain these key documents:
Architecture Diagrams
Dependency Matrix
Configuration Reference
Decision Log
Document important architectural decisions:
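The format matters less than the habit of writing it down; a lightweight ADR-style entry is enough. Here's a sketch of one entry, with the content invented purely for illustration:

```
ADR-014: Deploy all services through ArgoCD
Date: 2025-06-02        Status: Accepted

Context:      Deployments were a mix of manual kubectl applies and ad hoc CI
              scripts, which made rollbacks and audits unreliable.
Decision:     Every service deploys via ArgoCD, synced from the config repo.
Alternatives: Flux, CI-driven kubectl, Spinnaker.
Consequences: Rollback becomes a git revert; the config repo is the audit
              trail; engineers need ArgoCD access and onboarding.
```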
Knowledge Sharing
Create a culture of documentation:
Weekly Knowledge Sharing
Every Friday, someone presents a 15-minute topic:
"Deep dive: How our circuit breaker works"
"Postmortem review: Last week's database incident"
"New tool: Introduction to k9s for Kubernetes"
"Architecture walkthrough: Payment processing flow"
Record these sessions and add to the knowledge base.
New Engineer Onboarding
Checklist for new platform engineers:
Key Takeaways
Runbooks should be action-oriented: Steps to take, not concepts to learn
On-call should be sustainable: Fair compensation, reasonable rotation, good handoffs
Documentation decays: Set reminders to review quarterly
Test your runbooks: Regular drills ensure they work when you need them
Reduce toil: Automate repetitive operational tasks
Share knowledge: Documentation is good, but teaching is better
Conclusion: Release Engineering as a Practice
Over the course of this series, we've covered the full spectrum of release and reliability engineering:
Introduction: Philosophy and principles
Deployment Strategies: Blue/green, canary, rollbacks
CI/CD Pipelines: Testing gates and promotion flows
Release Management: Integrating Jira, GitHub, ArgoCD, Kubernetes
Standardization: Reproducible deployments and configuration as code
Reliability Metrics: SLOs, error budgets, and uptime practices
Incident Response: Detection, triage, mitigation, and prevention
Operational Excellence: Runbooks, on-call, and documentation
These practices didn't happen overnight. They evolved through years of incidents, postmortems, and incremental improvements. Start small: pick one area that causes the most pain and improve it. Then move to the next.
Remember: the goal isn't perfection; it's resilience. Systems will fail. The question is how quickly you detect, respond, and learn from those failures.
Good luck building reliable, operable systems. May your deployments be boring and your on-call shifts quiet.
Previous: Part 7: Incident Response and Management | Back to Series Overview