Part 6: Automation and Toil Reduction - Working Smarter, Not Harder

What You'll Learn: This article shares my journey from spending 10+ hours per week on repetitive manual tasks to automating almost everything. You'll learn what counts as toil (and what doesn't), how to measure and track toil in your workflow, how to implement automated deployments with CI/CD, how to build self-healing systems in Go, and, crucially, when NOT to automate. By the end, you'll know how to systematically eliminate toil and reclaim your time for meaningful work.

The Wake-Up Call: My Toil Spreadsheet

Six months into running my personal expense tracking API, I decided to track how I spent my time for one week. The results shocked me:

Manual deployments (SSH + commands):        3.5 hours
Responding to known issues:                 2.0 hours
Manual database backups:                    1.5 hours
Checking logs for errors:                   2.5 hours
Restarting hung processes:                  1.0 hour
SSL certificate renewal:                    0.5 hour
---------------------------------------------------
Total toil:                                11.0 hours

Feature development:                        4.0 hours

I was spending 73% of my time on repetitive manual work that provided zero lasting value. Every week, I'd do the same tasks again. It was a hamster wheel.

That week, I committed to a mission: automate everything that doesn't require human judgment.

Three months later, my weekly time breakdown looked like this:

Toil (still manual):                        1.5 hours
Feature development:                       10.0 hours
Automation improvements:                    3.5 hours

I got 8.5 hours back per week. That's 442 hours per year - more than 11 work weeks.

What is Toil?

Google's SRE book defines toil as:

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly with service growth.

Let me break down each characteristic:

1. Manual

Requires a human to execute. Examples:

  • SSH-ing into servers to deploy

  • Manually running database migrations

  • Clicking through a UI to restart services

2. Repetitive

You do it over and over. Examples:

  • Deploying code (multiple times per week)

  • Responding to the same alert (same root cause)

  • Running the same diagnostic commands

3. Automatable

A machine could do it. Examples:

  • Running a deployment script

  • Restarting a service when memory is high

  • Rotating logs

4. Tactical

Interrupt-driven, reactive. Examples:

  • Responding to alerts

  • Firefighting incidents

  • Emergency patches

5. No Enduring Value

After you do it, nothing has permanently improved. Examples:

  • Manually restarting a service (you'll do it again tomorrow)

  • Manually deploying (you'll deploy again next week)

  • Manually checking logs (you'll check again later)

6. Scales Linearly

As your service grows, the work grows proportionally. Examples:

  • More services = more manual deployments

  • More users = more support tickets

  • More servers = more manual configuration

What is NOT Toil?

Not all operational work is toil. These are valuable:

Engineering Work

Building automation, improving architecture, fixing root causes.

Example: Writing a script to automate deployments is NOT toil. Running deployments manually IS toil.

Project Work

Planned improvements with lasting value.

Example: Migrating to Kubernetes, redesigning database schema, implementing new features.

Overhead

Necessary coordination and communication.

Example: Team meetings, code reviews, documentation, planning.

Learning

Debugging novel issues, researching solutions.

Example: Investigating a new type of incident, learning a new technology.

Measuring Toil in Your Workflow

Before automating, measure your toil. You can't improve what you don't measure.

My Toil Tracking Sheet

I created a simple spreadsheet to track toil for one month:

Date         Task                    Time Spent   Category      Automatable?
2024-02-01   Deploy API update       15 min       Deployment    Yes
2024-02-01   Restart API (OOM)       10 min       Incident      Yes
2024-02-02   Deploy bug fix          15 min       Deployment    Yes
2024-02-03   Manual DB backup        20 min       Maintenance   Yes
2024-02-05   Check logs for errors   30 min       Monitoring    Yes
2024-02-06   Debug new error         2 hours      Engineering   No
2024-02-07   Team meeting            1 hour       Overhead      No

After one month, I tallied the totals by category to see where the time was actually going.

Toil Categories I Track

  1. Deployment toil: Manual deploy steps

  2. Incident toil: Responding to known issues

  3. Maintenance toil: Backups, log rotation, cert renewal

  4. Monitoring toil: Manually checking dashboards

  5. Configuration toil: Manually updating config files

Automating Deployments: My Biggest Win

Manual deployments were my #1 toil category at 3.5 hours per week. Here's how I automated them completely.

Before: Manual Deployment Process

My old deployment process took 15-20 minutes per deploy, all of it by hand: SSH into the server, run a series of build and migration commands, restart the service, and check that it came back up.

Boring, repetitive, error-prone (I once skipped a step and wondered why the service was down).

After: Automated CI/CD Pipeline

Now I push code and a GitHub Actions workflow handles everything: running the tests, building and pushing the container image, rolling the new version out to Kubernetes, and posting the result to Slack.

Now deployment is:

  1. git push origin main

  2. Watch GitHub Actions (or do something else)

  3. Get Slack notification when done

Time saved: 15 minutes per deploy × 10 deploys per month = 2.5 hours/month

Safe Deployments with Health Checks

My Kubernetes deployment includes automated health checks: readiness and liveness probes that hit a health endpoint on the service.

If a new version fails its health checks, the rollout halts and the old version keeps serving traffic. Zero manual intervention.
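
On the application side, the probes just hit an HTTP health endpoint. Here's a minimal sketch of what that looks like in Go; the /healthz route, the Postgres driver, and the connection string are illustrative, not the service's actual code:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"net/http"
	"time"

	_ "github.com/lib/pq" // illustrative choice; any database/sql driver works
)

// healthHandler returns 200 when the service can reach its database and 503
// otherwise, so a bad rollout never passes its readiness probe.
func healthHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "dependency unavailable", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	}
}

func main() {
	// Connection string is illustrative.
	db, err := sql.Open("postgres", "postgres://localhost/expenses?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/healthz", healthHandler(db))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```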

Building Self-Healing Systems

The best automation is the kind that fixes problems without waking you up.

Self-Healing Pattern 1: Automatic Restarts

My Go services restart automatically if they crash: Kubernetes restarts any container that exits or fails its liveness probe, so a crash turns into a brief blip instead of an outage.
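
The same idea can also be applied inside the process to long-running background workers. Here's a simplified, illustrative sketch of a supervisor loop (the worker shown and the restart delay are placeholders, not the service's actual code):

```go
package main

import (
	"log"
	"time"
)

// supervise runs fn in a loop. If fn panics or returns, the panic is logged
// and the worker is started again after a short delay, so one bad iteration
// doesn't take the whole service down.
func supervise(name string, fn func()) {
	go func() {
		for {
			func() {
				defer func() {
					if r := recover(); r != nil {
						log.Printf("worker %q panicked: %v (restarting in 5s)", name, r)
					}
				}()
				fn()
			}()
			time.Sleep(5 * time.Second)
		}
	}()
}

func main() {
	supervise("report-generator", func() {
		// Placeholder worker; in a real service this would be a queue
		// consumer, scheduler, or similar background job.
		for {
			log.Println("doing periodic work")
			time.Sleep(time.Minute)
		}
	})
	select {} // block forever; a real service would start its HTTP server here
}
```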

Self-Healing Pattern 2: Circuit Breakers

When downstream services fail, my Go apps protect themselves with a circuit breaker.
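
Here's a minimal hand-rolled sketch of the pattern; the names and thresholds are illustrative. The breaker opens after a run of consecutive failures, rejects calls while open, and lets requests probe the dependency again after a cool-down:

```go
package resilience

import (
	"errors"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker is open, so callers fail fast
// instead of waiting on a dependency that is known to be down.
var ErrOpen = errors.New("circuit breaker is open")

type Breaker struct {
	mu          sync.Mutex
	maxFailures int           // consecutive failures before opening
	cooldown    time.Duration // how long to stay open before trying again
	failures    int
	open        bool
	openedAt    time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Do runs fn unless the breaker is open. Success resets the failure count;
// a failure increments it and may open the breaker.
func (b *Breaker) Do(fn func() error) error {
	b.mu.Lock()
	if b.open {
		if time.Since(b.openedAt) < b.cooldown {
			b.mu.Unlock()
			return ErrOpen
		}
		b.open = false // cool-down elapsed: let requests probe the dependency again
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.open = true
			b.openedAt = time.Now()
		}
		return err
	}
	b.failures = 0
	return nil
}
```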

Usage in my API:
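
Something like the sketch below, in the same package as the breaker above (the endpoint, request body, and thresholds are illustrative, not the real values):

```go
package resilience

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"time"
)

// paymentBreaker trips after 5 consecutive failures and probes again after 30s.
var paymentBreaker = NewBreaker(5, 30*time.Second)

// chargePayment wraps the downstream payment-service call in the breaker.
func chargePayment(ctx context.Context, payload []byte) error {
	return paymentBreaker.Do(func() error {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost,
			"http://payment-service/charge", bytes.NewReader(payload))
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err // network errors count as failures
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 500 {
			return fmt.Errorf("payment service returned %d", resp.StatusCode)
		}
		return nil
	})
}
```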

When the payment service is down, the circuit breaker opens and my API fails fast instead of hanging.

Self-Healing Pattern 3: Automatic Retry with Backoff

For transient failures, an automatic retry with exponential backoff fixes most issues.
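
Here's a simplified sketch of the helper; the attempt count and delays are illustrative:

```go
package resilience

import (
	"context"
	"time"
)

// Retry runs fn up to attempts times, sleeping with exponential backoff
// between tries, and gives up early if the context is cancelled.
func Retry(ctx context.Context, attempts int, baseDelay time.Duration, fn func() error) error {
	var err error
	delay := baseDelay
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
			delay *= 2 // 100ms, 200ms, 400ms, ...
		}
	}
	return err
}
```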

Usage:
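
A sketch of a call site wrapping a database write; the table and retry settings are illustrative, and a production version would also check that the error is actually transient before retrying:

```go
package resilience

import (
	"context"
	"database/sql"
	"time"
)

// saveExpense retries a write that occasionally hits transient errors such
// as connection timeouts or deadlocks.
func saveExpense(ctx context.Context, db *sql.DB, userID int, amountCents int64) error {
	return Retry(ctx, 3, 100*time.Millisecond, func() error {
		_, err := db.ExecContext(ctx,
			"INSERT INTO expenses (user_id, amount_cents) VALUES ($1, $2)",
			userID, amountCents)
		return err
	})
}
```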

Transient database errors (connection timeout, deadlock) are automatically retried.

Automating Incident Response

Some incidents can be fully automated away.

Auto-Remediation Example: Out of Memory

Before automation, an "API OOM" alert would page me at 2 AM, and the fix was always the same: SSH in, confirm the process had exhausted its memory, and restart it.

Now Kubernetes handles it automatically. The container runs with a memory limit, so when it blows past that limit the pod is OOM-killed and restarted on its own, and the Prometheus alert is written to fire only on repeated restarts within a short window.

Single OOM events are auto-remediated. Only repeated OOMs page me.

Auto-Remediation Example: Stuck Processes

I had an issue where some goroutines would hang, causing a slow memory leak. I automated both the detection and the remediation.
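
The detector is a small watchdog that samples the goroutine count and triggers a graceful shutdown once it climbs past a threshold. Here's a simplified, illustrative sketch (the limit and interval are placeholders):

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// watchGoroutines samples runtime.NumGoroutine every interval and calls
// shutdown once the count exceeds limit. A leak shows up as a count that
// climbs steadily and never comes back down.
func watchGoroutines(limit int, interval time.Duration, shutdown func()) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for range ticker.C {
			if n := runtime.NumGoroutine(); n > limit {
				log.Printf("goroutine leak suspected: %d goroutines (limit %d), shutting down", n, limit)
				shutdown()
				return
			}
		}
	}()
}
```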

In main:
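
Roughly like this, building on the watchdog above; newRouter and the numbers are placeholders for the real server setup:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

func newRouter() http.Handler { return http.NewServeMux() } // placeholder for the real routes

func main() {
	srv := &http.Server{Addr: ":8080", Handler: newRouter()}

	// If the goroutine count climbs past 500, shut down gracefully;
	// Kubernetes notices the pod exiting and starts a fresh one.
	watchGoroutines(500, time.Minute, func() {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		if err := srv.Shutdown(ctx); err != nil {
			log.Printf("graceful shutdown failed: %v", err)
		}
	})

	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Fatal(err)
	}
}
```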

When a goroutine leak is detected, the process gracefully shuts down and Kubernetes restarts it. Problem fixed automatically.

When NOT to Automate

I learned this the hard way: not everything should be automated.

Anti-Pattern 1: Automating Before Understanding

Early on, I automated database vacuuming without understanding when it was needed. Result: the vacuum ran during peak traffic and caused performance issues.

Lesson: Understand a task thoroughly before automating it. Run it manually a few times first.

Anti-Pattern 2: Over-Engineered Automation

I once spent 3 weeks building a complex auto-scaling system for a task that ran once per month. Building the automation took longer than doing the task by hand would have taken over several years.

Lesson: Calculate ROI before automating. If a manual task takes 10 minutes a month and the automation takes 2 weeks (80 working hours) to build, it needs 480 months to break even.

Anti-Pattern 3: Automating Judgment Calls

Some decisions require human judgment. I tried to automate incident severity classification and it constantly got it wrong.

Lesson: Automate mechanical tasks, not judgment calls.

My Automation Decision Framework

Before automating anything, I ask:

  1. Frequency: How often do I do this?

    • Daily/Weekly → High priority to automate

    • Monthly → Medium priority

    • Yearly → Low priority (probably not worth it)

  2. Time per occurrence: How long does it take?

    • > 30 min → High priority

    • 10-30 min → Medium priority

    • < 10 min → Only if very frequent

  3. Risk if done wrong: What happens if it fails?

    • Low risk → Automate aggressively

    • High risk → Add safeguards and human approval

    • Critical → Maybe keep manual

  4. Complexity to automate: How hard is it?

    • Easy script → Do it now

    • Medium complexity → Plan it

    • Very complex → Question if worth it

  5. ROI calculation: compare the time the automation will save per year against the time it will take to build and maintain.

Example calculation: automating deployments saves 15 minutes per deploy at roughly 10 deploys per month, about 30 hours per year. If the pipeline takes a day to build, it pays for itself in about three months.

My Toil Reduction Roadmap

Here's the order I automated tasks in:

Phase 1: Quick Wins (Month 1)

  • CI/CD pipeline for deployments (GitHub Actions)

  • Automated nightly database backups

Impact: Saved 5 hours/week

Phase 2: Common Toil (Months 2-3)

  • Error-rate alerting instead of manually checking logs

  • Automatic SSL certificate renewal

Impact: Saved an additional 3 hours/week

Phase 3: Advanced Automation (Months 4-6)

  • Self-healing: health checks, automatic restarts, circuit breakers, retries

  • Auto-remediation for known recurring issues (OOMs, stuck processes)

Impact: Saved an additional 2 hours/week

Measuring Success

I track a few simple metrics: hours of toil per week (from the tracking sheet), toil as a share of total working time, and the number of alerts that still needed a human response. My dashboard shows all three trending down, with weekly toil now holding under 2 hours.

Key Takeaways

  1. Measure toil first. Track your time for a month to find the biggest opportunities.

  2. Not all operational work is toil. Engineering, projects, and learning are valuable - only repetitive, manual, automatable tasks are toil.

  3. Start with deployments. For most teams, this is the biggest time sink and easiest to automate.

  4. Build self-healing systems. The best automation is the kind that fixes problems without human intervention.

  5. Calculate ROI before automating. Don't spend 3 weeks automating a 5-minute monthly task.

  6. Some things shouldn't be automated. Judgment calls, critical decisions, and rarely-performed tasks often aren't worth automating.

Conclusion

When I started tracking my time, I was shocked to find 73% was toil. Now it's under 10%. That's 8+ hours per week I got back - time I now spend building features, improving reliability, and honestly, not working weekends.

The key is to be systematic:

  1. Measure your toil

  2. Prioritize by ROI

  3. Automate ruthlessly

  4. Build self-healing systems

  5. Keep measuring

Start small. Pick one repetitive task this week and automate it. Then do another next week. In a few months, you'll wonder how you ever did things manually.

Resources

  • Site Reliability Engineering (the Google SRE book), particularly the chapter on eliminating toil - the source of the toil definition quoted earlier

Final Thoughts on the SRE Journey

This series started with my 2 AM wake-up call and the realization that I needed to treat operations as a software problem. Through the journey, we covered:

  • Part 1: SRE fundamentals and building reliability into Go services from the start

  • Part 2: Defining meaningful SLIs, SLOs, and using error budgets to guide decisions

  • Part 3: Building comprehensive observability with metrics, logs, and traces

  • Part 4: Managing incidents professionally with processes and post-mortems

  • Part 5: Planning capacity and optimizing performance proactively

  • Part 6: Eliminating toil through systematic automation

The transformation from reactive firefighting to proactive reliability engineering doesn't happen overnight. But each step - instrumenting one service, writing one runbook, automating one task - compounds over time.

You don't need to be Google-scale to benefit from SRE practices. Start small, measure everything, and improve systematically. Your future self will thank you.

Now go build reliable systems.
