Part 6: Automation and Toil Reduction - Working Smarter, Not Harder
The Wake-Up Call: My Toil Spreadsheet
Manual deployments (SSH + commands): 3.5 hours
Responding to known issues: 2.0 hours
Manual database backups: 1.5 hours
Checking logs for errors: 2.5 hours
Restarting hung processes: 1.0 hour
SSL certificate renewal: 0.5 hour
---------------------------------------------------
Total toil: 11.0 hours
Feature development: 4.0 hours

Toil (still manual): 1.5 hours
Feature development: 10.0 hours
Automation improvements: 3.5 hours

What is Toil?
1. Manual
2. Repetitive
3. Automatable
4. Tactical
5. No Enduring Value
6. Scales Linearly
What is NOT Toil?
Engineering Work
Project Work
Overhead
Learning
Measuring Toil in Your Workflow
My Toil Tracking Sheet
Date | Task | Time Spent | Category | Automatable?
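A sheet with these columns can also be kept as plain records and summarized in code. The following is a minimal sketch, not a definitive implementation: `ToilEntry`, `summarize`, the dates, the category labels, and the automatable flags are all hypothetical, while the task names and hours are taken from the spreadsheet at the top of this part.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class ToilEntry:
    """One row of the tracking sheet; fields mirror the column headers."""
    day: date
    task: str
    hours: float
    category: str
    automatable: bool


def summarize(entries):
    """Return (total hours, hours per category, automatable hours)."""
    per_category = defaultdict(float)
    automatable_hours = 0.0
    for entry in entries:
        per_category[entry.category] += entry.hours
        if entry.automatable:
            automatable_hours += entry.hours
    total = sum(per_category.values())
    return total, dict(per_category), automatable_hours


# Dates, categories, and flags are placeholders (assumptions);
# tasks and hours come from the toil spreadsheet above.
SAMPLE_WEEK = [
    ToilEntry(date(2024, 1, 1), "Manual deployments", 3.5, "deployment", True),
    ToilEntry(date(2024, 1, 2), "Responding to known issues", 2.0, "incident", True),
    ToilEntry(date(2024, 1, 3), "Manual database backups", 1.5, "maintenance", True),
    ToilEntry(date(2024, 1, 4), "Checking logs for errors", 2.5, "monitoring", True),
    ToilEntry(date(2024, 1, 5), "Restarting hung processes", 1.0, "incident", True),
    ToilEntry(date(2024, 1, 5), "SSL certificate renewal", 0.5, "maintenance", True),
]

total_hours, hours_by_category, automatable = summarize(SAMPLE_WEEK)
```

Dividing `total_hours` by all tracked hours for the week (11.0 toil plus 4.0 feature development in the spreadsheet above) gives the share of time lost to toil, roughly 73% in that example.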
Toil Categories I Track
Automating Deployments: My Biggest Win
Before: Manual Deployment Process
After: Automated CI/CD Pipeline
Safe Deployments with Health Checks
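One way to gate a deployment on a health check can be sketched as follows. This is an illustration under stated assumptions: `is_healthy` and `deploy_with_health_check` are hypothetical helper names, the health endpoint is assumed to return HTTP 200 when the service is ready, and the `deploy` and `rollback` callables stand in for whatever the pipeline actually runs.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout=2.0):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


def deploy_with_health_check(deploy, rollback, check, attempts=5, delay=1.0):
    """Run deploy(), then poll check(); call rollback() if it never passes.

    Returns True when the new version became healthy, False after a rollback.
    """
    deploy()
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the service time to start
    rollback()
    return False
```

In use, `check` would typically be something like `lambda: is_healthy("http://localhost:8080/health")`, where the URL is a placeholder for the service's own health endpoint.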
Building Self-Healing Systems
Self-Healing Pattern 1: Automatic Restarts
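The automatic-restart pattern can be illustrated with a small supervisor loop. This is a minimal sketch, assuming the process signals failure with a non-zero exit code; `supervise` and its parameters are hypothetical names chosen for the example.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts=3, delay=1.0):
    """Run command and restart it after each non-zero exit.

    Returns the number of restarts performed once the process exits
    cleanly. Raises RuntimeError when the restart budget is used up,
    so a crash loop gets escalated instead of hidden.
    """
    restarts = 0
    while True:
        result = subprocess.run(command)
        if result.returncode == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"{command!r} kept failing after {restarts} restarts")
        restarts += 1
        print(f"exit code {result.returncode}; restart {restarts}/{max_restarts}",
              file=sys.stderr)
        time.sleep(delay)
```

The restart budget is the important design choice: unlimited restarts can mask a real fault. On Linux hosts a process manager usually does this job, for example systemd's `Restart=on-failure` setting.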
Self-Healing Pattern 2: Circuit Breakers
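A circuit breaker stops calling a failing dependency for a while so it has room to recover. The sketch below is one simple variant, assuming a consecutive-failure threshold and a fixed reset timeout; `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are hypothetical names for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each outbound request in `breaker.call(...)` and treat `CircuitOpenError` as a signal to fall back or fail fast.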
Self-Healing Pattern 3: Automatic Retry with Backoff
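Retry with backoff can be sketched in a few lines. This version assumes exponential backoff with full jitter (a random wait between zero and the current cap), which is one common choice rather than the only one; `retry_with_backoff` and its parameters are hypothetical names.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    The wait cap doubles after each failure (base_delay, 2x, 4x, ...)
    up to max_delay. The last failure is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

Passing a narrow `retry_on` tuple, such as `(ConnectionError, TimeoutError)`, keeps the helper from retrying errors that will never succeed, like bad input.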
Automating Incident Response
Auto-Remediation Example: Out of Memory
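A memory-pressure remediation can be sketched as a threshold check followed by a restart action. This is an illustration only, assuming a Linux host where `/proc/meminfo` reports `MemTotal` and `MemAvailable`; the function names, the 90% threshold, and the `restart` callable are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_used_fraction(meminfo_text):
    """Fraction of memory in use, based on MemAvailable and MemTotal."""
    info = parse_meminfo(meminfo_text)
    return 1.0 - info["MemAvailable"] / info["MemTotal"]


def remediate_if_needed(used_fraction, restart, threshold=0.9):
    """Call restart() when usage is at or above the threshold.

    Returns True if the remediation ran, so the caller can log or alert.
    """
    if used_fraction >= threshold:
        restart()
        return True
    return False
```

On a real host the text would come from `open("/proc/meminfo").read()`, and `restart` would wrap whatever command restarts the affected service. Recording every automatic remediation keeps a recurring leak visible instead of silently papered over.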
Auto-Remediation Example: Stuck Processes
When NOT to Automate
Anti-Pattern 1: Automating Before Understanding
Anti-Pattern 2: Over-Engineered Automation
Anti-Pattern 3: Automating Judgment Calls
My Automation Decision Framework
My Toil Reduction Roadmap
Phase 1: Quick Wins (Month 1)
Phase 2: Common Toil (Months 2-3)
Phase 3: Advanced Automation (Months 4-6)
Measuring Success
Key Takeaways
Conclusion
Resources
Final Thoughts on the SRE Journey