Part 6: Automation and Toil Reduction - Working Smarter, Not Harder

What You'll Learn: This article shares my journey from spending 10+ hours per week on repetitive manual tasks to automating almost everything. You'll learn what counts as toil (and what doesn't), how to measure and track toil in your workflow, how to implement automated deployments with CI/CD, how to build self-healing systems in Go, and, crucially, when NOT to automate. By the end, you'll know how to systematically eliminate toil and reclaim your time for meaningful work.

The Wake-Up Call: My Toil Spreadsheet

Six months into running my personal expense tracking API, I decided to track how I spent my time for one week. The results shocked me:

Manual deployments (SSH + commands):        3.5 hours
Responding to known issues:                 2.0 hours
Manual database backups:                    1.5 hours
Checking logs for errors:                   2.5 hours
Restarting hung processes:                  1.0 hour
SSL certificate renewal:                    0.5 hour
---------------------------------------------------
Total toil:                                11.0 hours

Feature development:                        4.0 hours

I was spending 73% of my time on repetitive manual work that provided zero lasting value. Every week, I'd do the same tasks again. It was a hamster wheel.

That week, I committed to a mission: automate everything that doesn't require human judgment.

Three months later, my weekly time breakdown looked like this:

Toil (still manual):                        1.5 hours
Feature development:                       10.0 hours
Automation improvements:                    3.5 hours

I got 8.5 hours back per week. That's 442 hours per year - more than 11 work weeks.

What is Toil?

Google's SRE book defines toil as:

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly with service growth.

Let me break down each characteristic:

1. Manual

Requires a human to execute. Examples:

  • SSH-ing into servers to deploy

  • Manually running database migrations

  • Clicking through a UI to restart services

2. Repetitive

You do it over and over. Examples:

  • Deploying code (multiple times per week)

  • Responding to the same alert (same root cause)

  • Running the same diagnostic commands

3. Automatable

A machine could do it. Examples:

  • Running a deployment script

  • Restarting a service when memory is high

  • Rotating logs

4. Tactical

Interrupt-driven, reactive. Examples:

  • Responding to alerts

  • Firefighting incidents

  • Emergency patches

5. No Enduring Value

After you do it, nothing has permanently improved. Examples:

  • Manually restarting a service (you'll do it again tomorrow)

  • Manually deploying (you'll deploy again next week)

  • Manually checking logs (you'll check again later)

6. Scales Linearly

As your service grows, the work grows proportionally. Examples:

  • More services = more manual deployments

  • More users = more support tickets

  • More servers = more manual configuration

What is NOT Toil?

Not all operational work is toil. These are valuable:

Engineering Work

Building automation, improving architecture, fixing root causes.

Example: Writing a script to automate deployments is NOT toil. Running deployments manually IS toil.

Project Work

Planned improvements with lasting value.

Example: Migrating to Kubernetes, redesigning database schema, implementing new features.

Overhead

Necessary coordination and communication.

Example: Team meetings, code reviews, documentation, planning.

Learning

Debugging novel issues, researching solutions.

Example: Investigating a new type of incident, learning a new technology.

Measuring Toil in Your Workflow

Before automating, measure your toil. You can't improve what you don't measure.

My Toil Tracking Sheet

I created a simple spreadsheet to track toil for one month:

Date         Task                    Time Spent   Category      Automatable?
2024-02-01   Deploy API update       15 min       Deployment    Yes
2024-02-01   Restart API (OOM)       10 min       Incident      Yes
2024-02-02   Deploy bug fix          15 min       Deployment    Yes
2024-02-03   Manual DB backup        20 min       Maintenance   Yes
2024-02-05   Check logs for errors   30 min       Monitoring    Yes
2024-02-06   Debug new error         2 hours      Engineering   No
2024-02-07   Team meeting            1 hour       Overhead      No

After one month, I tallied the totals by category to see where the time was actually going.

Toil Categories I Track

  1. Deployment toil: Manual deploy steps

  2. Incident toil: Responding to known issues

  3. Maintenance toil: Backups, log rotation, cert renewal

  4. Monitoring toil: Manually checking dashboards

  5. Configuration toil: Manually updating config files

Automating Deployments: My Biggest Win

Manual deployments were my #1 toil category at 3.5 hours per week. Here's how I automated them completely.

Before: Manual Deployment Process

My old deployment process took 15-20 minutes per deploy, all of it by hand: SSH into the server, run a series of build and migration commands, restart the service, and check that it came back up.

Boring, repetitive, error-prone (I once skipped a step and wondered why the service was down).

After: Automated CI/CD Pipeline

Now I push code and a GitHub Actions workflow handles everything: running the tests, building and pushing the container image, rolling the new version out to Kubernetes, and posting the result to Slack.

Now deployment is:

  1. git push origin main

  2. Watch GitHub Actions (or do something else)

  3. Get Slack notification when done

Time saved: 15 minutes per deploy × 10 deploys per month = 2.5 hours/month

Safe Deployments with Health Checks

My Kubernetes deployment includes automated health checks: readiness and liveness probes that hit a health endpoint on the service.

If a new version fails its health checks, the rollout halts and the old version keeps serving traffic. Zero manual intervention.
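
On the application side, the probes just hit an HTTP health endpoint. Here's a minimal sketch of what that looks like in Go; the /healthz route, the Postgres driver, and the connection string are illustrative, not the service's actual code:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"net/http"
	"time"

	_ "github.com/lib/pq" // illustrative choice; any database/sql driver works
)

// healthHandler returns 200 when the service can reach its database and 503
// otherwise, so a bad rollout never passes its readiness probe.
func healthHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "dependency unavailable", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	}
}

func main() {
	// Connection string is illustrative.
	db, err := sql.Open("postgres", "postgres://localhost/expenses?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/healthz", healthHandler(db))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```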

Building Self-Healing Systems

The best automation is the kind that fixes problems without waking you up.

Self-Healing Pattern 1: Automatic Restarts

My Go services restart automatically if they crash: Kubernetes restarts any container that exits or fails its liveness probe, so a crash turns into a brief blip instead of an outage.
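
The same idea can also be applied inside the process to long-running background workers. Here's a simplified, illustrative sketch of a supervisor loop (the worker shown and the restart delay are placeholders, not the service's actual code):

```go
package main

import (
	"log"
	"time"
)

// supervise runs fn in a loop. If fn panics or returns, the panic is logged
// and the worker is started again after a short delay, so one bad iteration
// doesn't take the whole service down.
func supervise(name string, fn func()) {
	go func() {
		for {
			func() {
				defer func() {
					if r := recover(); r != nil {
						log.Printf("worker %q panicked: %v (restarting in 5s)", name, r)
					}
				}()
				fn()
			}()
			time.Sleep(5 * time.Second)
		}
	}()
}

func main() {
	supervise("report-generator", func() {
		// Placeholder worker; in a real service this would be a queue
		// consumer, scheduler, or similar background job.
		for {
			log.Println("doing periodic work")
			time.Sleep(time.Minute)
		}
	})
	select {} // block forever; a real service would start its HTTP server here
}
```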

Self-Healing Pattern 2: Circuit Breakers

When downstream services fail, my Go apps protect themselves with a circuit breaker.
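
Here's a minimal hand-rolled sketch of the pattern; the names and thresholds are illustrative. The breaker opens after a run of consecutive failures, rejects calls while open, and lets requests probe the dependency again after a cool-down:

```go
package resilience

import (
	"errors"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker is open, so callers fail fast
// instead of waiting on a dependency that is known to be down.
var ErrOpen = errors.New("circuit breaker is open")

type Breaker struct {
	mu          sync.Mutex
	maxFailures int           // consecutive failures before opening
	cooldown    time.Duration // how long to stay open before trying again
	failures    int
	open        bool
	openedAt    time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Do runs fn unless the breaker is open. Success resets the failure count;
// a failure increments it and may open the breaker.
func (b *Breaker) Do(fn func() error) error {
	b.mu.Lock()
	if b.open {
		if time.Since(b.openedAt) < b.cooldown {
			b.mu.Unlock()
			return ErrOpen
		}
		b.open = false // cool-down elapsed: let requests probe the dependency again
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.open = true
			b.openedAt = time.Now()
		}
		return err
	}
	b.failures = 0
	return nil
}
```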

Usage in my API:
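
Something like the sketch below, in the same package as the breaker above (the endpoint, request body, and thresholds are illustrative, not the real values):

```go
package resilience

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"time"
)

// paymentBreaker trips after 5 consecutive failures and probes again after 30s.
var paymentBreaker = NewBreaker(5, 30*time.Second)

// chargePayment wraps the downstream payment-service call in the breaker.
func chargePayment(ctx context.Context, payload []byte) error {
	return paymentBreaker.Do(func() error {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost,
			"http://payment-service/charge", bytes.NewReader(payload))
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err // network errors count as failures
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 500 {
			return fmt.Errorf("payment service returned %d", resp.StatusCode)
		}
		return nil
	})
}
```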

When the payment service is down, the circuit breaker opens and my API fails fast instead of hanging.

Self-Healing Pattern 3: Automatic Retry with Backoff

For transient failures, an automatic retry with exponential backoff fixes most issues.
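
Here's a simplified sketch of the helper; the attempt count and delays are illustrative:

```go
package resilience

import (
	"context"
	"time"
)

// Retry runs fn up to attempts times, sleeping with exponential backoff
// between tries, and gives up early if the context is cancelled.
func Retry(ctx context.Context, attempts int, baseDelay time.Duration, fn func() error) error {
	var err error
	delay := baseDelay
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
			delay *= 2 // 100ms, 200ms, 400ms, ...
		}
	}
	return err
}
```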

Usage:
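
A sketch of a call site wrapping a database write; the table and retry settings are illustrative, and a production version would also check that the error is actually transient before retrying:

```go
package resilience

import (
	"context"
	"database/sql"
	"time"
)

// saveExpense retries a write that occasionally hits transient errors such
// as connection timeouts or deadlocks.
func saveExpense(ctx context.Context, db *sql.DB, userID int, amountCents int64) error {
	return Retry(ctx, 3, 100*time.Millisecond, func() error {
		_, err := db.ExecContext(ctx,
			"INSERT INTO expenses (user_id, amount_cents) VALUES ($1, $2)",
			userID, amountCents)
		return err
	})
}
```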

Transient database errors (connection timeout, deadlock) are automatically retried.

Automating Incident Response

Some incidents can be fully automated away.

Auto-Remediation Example: Out of Memory

Before automation, an "API OOM" alert would page me at 2 AM, and the fix was always the same: SSH in, confirm the process had exhausted its memory, and restart it.

Now Kubernetes handles it automatically. The container runs with a memory limit, so when it blows past that limit the pod is OOM-killed and restarted on its own, and the Prometheus alert is written to fire only on repeated restarts within a short window.

Single OOM events are auto-remediated. Only repeated OOMs page me.

Auto-Remediation Example: Stuck Processes

I had an issue where some goroutines would hang, causing a slow memory leak. I automated both the detection and the remediation.
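
The detector is a small watchdog that samples the goroutine count and triggers a graceful shutdown once it climbs past a threshold. Here's a simplified, illustrative sketch (the limit and interval are placeholders):

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// watchGoroutines samples runtime.NumGoroutine every interval and calls
// shutdown once the count exceeds limit. A leak shows up as a count that
// climbs steadily and never comes back down.
func watchGoroutines(limit int, interval time.Duration, shutdown func()) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for range ticker.C {
			if n := runtime.NumGoroutine(); n > limit {
				log.Printf("goroutine leak suspected: %d goroutines (limit %d), shutting down", n, limit)
				shutdown()
				return
			}
		}
	}()
}
```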

In main:
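
Roughly like this, building on the watchdog above; newRouter and the numbers are placeholders for the real server setup:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

func newRouter() http.Handler { return http.NewServeMux() } // placeholder for the real routes

func main() {
	srv := &http.Server{Addr: ":8080", Handler: newRouter()}

	// If the goroutine count climbs past 500, shut down gracefully;
	// Kubernetes notices the pod exiting and starts a fresh one.
	watchGoroutines(500, time.Minute, func() {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		if err := srv.Shutdown(ctx); err != nil {
			log.Printf("graceful shutdown failed: %v", err)
		}
	})

	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Fatal(err)
	}
}
```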

When a goroutine leak is detected, the process gracefully shuts down and Kubernetes restarts it. Problem fixed automatically.

When NOT to Automate

I learned this the hard way: not everything should be automated.

Anti-Pattern 1: Automating Before Understanding

Early on, I automated database vacuuming without understanding when it was needed. Result: the vacuum ran during peak traffic and caused performance issues.

Lesson: Understand a task thoroughly before automating it. Run it manually a few times first.

Anti-Pattern 2: Over-Engineered Automation

I once spent 3 weeks building a complex auto-scaling system for a task that ran once per month. Building the automation took longer than doing the task by hand would have taken over several years.

Lesson: Calculate ROI before automating. If a manual task takes 10 minutes a month and the automation takes 2 weeks (80 working hours) to build, it needs 480 months to break even.

Anti-Pattern 3: Automating Judgment Calls

Some decisions require human judgment. I tried to automate incident severity classification and it constantly got it wrong.

Lesson: Automate mechanical tasks, not judgment calls.

My Automation Decision Framework

Before automating anything, I ask:

  1. Frequency: How often do I do this?

    • Daily/Weekly → High priority to automate

    • Monthly → Medium priority

    • Yearly → Low priority (probably not worth it)

  2. Time per occurrence: How long does it take?

    • > 30 min → High priority

    • 10-30 min → Medium priority

    • < 10 min → Only if very frequent

  3. Risk if done wrong: What happens if it fails?

    • Low risk → Automate aggressively

    • High risk → Add safeguards and human approval

    • Critical → Maybe keep manual

  4. Complexity to automate: How hard is it?

    • Easy script → Do it now

    • Medium complexity → Plan it

    • Very complex → Question if worth it

  5. ROI calculation: compare the time the automation will save per year against the time it will take to build and maintain.

Example calculation: automating deployments saves 15 minutes per deploy at roughly 10 deploys per month, about 30 hours per year. If the pipeline takes a day to build, it pays for itself in about three months.

My Toil Reduction Roadmap

Here's the order I automated tasks in:

Phase 1: Quick Wins (Month 1)

  • CI/CD pipeline for deployments (GitHub Actions)

  • Automated nightly database backups

Impact: Saved 5 hours/week

Phase 2: Common Toil (Months 2-3)

  • Error-rate alerting instead of manually checking logs

  • Automatic SSL certificate renewal

Impact: Saved an additional 3 hours/week

Phase 3: Advanced Automation (Months 4-6)

  • Self-healing: health checks, automatic restarts, circuit breakers, retries

  • Auto-remediation for known recurring issues (OOMs, stuck processes)

Impact: Saved an additional 2 hours/week

Measuring Success

I track a few simple metrics: hours of toil per week (from the tracking sheet), toil as a share of total working time, and the number of alerts that still needed a human response. My dashboard shows all three trending down, with weekly toil now holding under 2 hours.

Key Takeaways

  1. Measure toil first. Track your time for a month to find the biggest opportunities.

  2. Not all operational work is toil. Engineering, projects, and learning are valuable - only repetitive, manual, automatable tasks are toil.

  3. Start with deployments. For most teams, this is the biggest time sink and easiest to automate.

  4. Build self-healing systems. The best automation is the kind that fixes problems without human intervention.

  5. Calculate ROI before automating. Don't spend 3 weeks automating a 5-minute monthly task.

  6. Some things shouldn't be automated. Judgment calls, critical decisions, and rarely-performed tasks often aren't worth automating.

Conclusion

When I started tracking my time, I was shocked to find 73% was toil. Now it's under 10%. That's 8+ hours per week I got back - time I now spend building features, improving reliability, and honestly, not working weekends.

The key is to be systematic:

  1. Measure your toil

  2. Prioritize by ROI

  3. Automate ruthlessly

  4. Build self-healing systems

  5. Keep measuring

Start small. Pick one repetitive task this week and automate it. Then do another next week. In a few months, you'll wonder how you ever did things manually.

Resources

  • Site Reliability Engineering (the Google SRE book), particularly the chapter on eliminating toil - the source of the toil definition quoted earlier

Final Thoughts on the SRE Journey

This series started with my 2 AM wake-up call and the realization that I needed to treat operations as a software problem. Through the journey, we covered:

  • Part 1: SRE fundamentals and building reliability into Go services from the start

  • Part 2: Defining meaningful SLIs, SLOs, and using error budgets to guide decisions

  • Part 3: Building comprehensive observability with metrics, logs, and traces

  • Part 4: Managing incidents professionally with processes and post-mortems

  • Part 5: Planning capacity and optimizing performance proactively

  • Part 6: Eliminating toil through systematic automation

The transformation from reactive firefighting to proactive reliability engineering doesn't happen overnight. But each step - instrumenting one service, writing one runbook, automating one task - compounds over time.

You don't need to be Google-scale to benefit from SRE practices. Start small, measure everything, and improve systematically. Your future self will thank you.

Now go build reliable systems.
