Understanding Toil in SRE
When I first stepped into the role of a Site Reliability Engineer at a fast-growing fintech startup, I was excited about building scalable infrastructure and solving complex availability challenges. What I wasn't prepared for was spending 70% of my time on mundane, repetitive tasks that left me feeling like a human button-pusher rather than an engineer. That's when I learned about the concept of "toil" in SRE, and it completely transformed how I approach operations work.
What Toil Really Means: Lessons from the Trenches
In the SRE world, toil isn't just "work we don't like." It's specifically the manual, repetitive, automatable work that scales linearly with service growth and adds no lasting value to the system. After years of battling toil across multiple cloud platforms, I've come to recognize it immediately by these characteristics:
It's mind-numbingly manual - You're doing the same thing over and over with human hands
It's frustratingly repetitive - You've done this exact task many times before
It's obviously automatable - You know a script could do this, if only you had time to write it
It's purely tactical - It solves today's problem but does nothing for tomorrow
It creates no lasting improvement - The system isn't better after you're done
It grows linearly with your service - Double the users, double the toil
The most insidious thing about toil? It creates a vicious cycle. The more toil you have, the less time you have to eliminate it. Break this cycle, and you unlock your team's true potential.
My Personal Battle with Toil on AWS
Early in my SRE career, I was part of a team managing over 200 EC2 instances across multiple AWS accounts. Our typical week included these toil-heavy activities:
Manually adjusting Auto Scaling groups based on expected traffic
SSH-ing into instances to check logs when alerts fired
Rotating credentials and updating them across multiple services
Running the same database maintenance queries on dozens of RDS instances
Manually approving and implementing routine infrastructure changes
I was spending 30+ hours a week on these tasks alone. Something had to change.
How I Eliminated 80% of My AWS Toil
1. Auto Scaling on Steroids with Lambda Functions
My first major toil-reduction win came from automating our EC2 scaling operations. Instead of manually adjusting Auto Scaling groups, I built a Lambda function that analyzed CloudWatch metrics and historical patterns to predict and adjust capacity proactively:
import boto3
import datetime
import numpy as np

def lambda_handler(event, context):
    # Get current and historical CloudWatch metrics
    cloudwatch = boto3.client('cloudwatch')
    autoscaling = boto3.client('autoscaling')

    # Get historical CPU data for the last 14 days
    end_time = datetime.datetime.utcnow()
    start_time = end_time - datetime.timedelta(days=14)

    response = cloudwatch.get_metric_data(
        MetricDataQueries=[
            {
                'Id': 'cpu_utilization',
                'MetricStat': {
                    'Metric': {
                        'Namespace': 'AWS/EC2',
                        'MetricName': 'CPUUtilization',
                        'Dimensions': [
                            {'Name': 'AutoScalingGroupName', 'Value': 'web-production-asg'},
                        ]
                    },
                    'Period': 3600,  # 1 hour granularity
                    'Stat': 'Average',
                }
            },
        ],
        StartTime=start_time,
        EndTime=end_time
    )

    # Analyze pattern to predict needed capacity
    cpu_data = response['MetricDataResults'][0]['Values']
    hour_of_day = end_time.hour

    # Find similar historical periods (the same hour on each previous day)
    similar_periods = []
    for i in range(0, len(cpu_data), 24):  # Step through each day
        if i + 24 <= len(cpu_data):  # Ensure we have a full day of data
            similar_periods.append(cpu_data[i + hour_of_day])

    # Predict needed capacity based on historical patterns
    predicted_cpu = np.mean(similar_periods) * 1.2  # Add 20% buffer

    # Calculate instances needed based on CPU
    instances_needed = max(2, int(predicted_cpu / 30))  # Assume 30% CPU target per instance

    # Update the Auto Scaling group
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName='web-production-asg',
        MinSize=instances_needed,
        MaxSize=instances_needed * 2,
        DesiredCapacity=instances_needed
    )

    return {
        'statusCode': 200,
        'body': f'Updated ASG to {instances_needed} instances based on predicted CPU of {predicted_cpu:.2f}%'
    }
This single Lambda function eliminated about 10 hours of toil per week by running every hour, analyzing historical CloudWatch metrics, and adjusting our Auto Scaling groups accordingly. The best part? It actually did a better job than our manual adjustments because it could analyze patterns we hadn't even noticed.
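The post doesn't show how the hourly trigger was wired up, but one straightforward way is an EventBridge rule pointed at the function. Here's a rough boto3 sketch with placeholder names (the function and rule names are assumptions, not the original setup):

import boto3

# Hypothetical names; substitute your own function and rule
FUNCTION_NAME = 'asg-capacity-forecaster'
RULE_NAME = 'hourly-capacity-forecast'

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Create (or update) a rule that fires once an hour
rule_arn = events.put_rule(
    Name=RULE_NAME,
    ScheduleExpression='rate(1 hour)',
    State='ENABLED',
)['RuleArn']

# Allow EventBridge to invoke the forecasting Lambda
function_arn = lambda_client.get_function(
    FunctionName=FUNCTION_NAME
)['Configuration']['FunctionArn']
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId='allow-eventbridge-hourly',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn,
)

# Point the rule at the function
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{'Id': 'capacity-forecaster', 'Arn': function_arn}],
)

In practice you'd likely manage this in the same infrastructure-as-code as the rest of the stack rather than running it by hand.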
2. Centralized Logging with CloudWatch Insights
The next big win came from eliminating the need to SSH into instances for troubleshooting. I implemented a comprehensive CloudWatch Logs setup with structured logging:
Resources:
  LoggingLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
        - arn:aws:iam::aws:policy/CloudWatchLogsFullAccess

  LogProcessorFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt LoggingLambdaRole.Arn
      Runtime: nodejs14.x
      Code:
        ZipFile: |
          exports.handler = async (event) => {
            const logEvents = event.awslogs.data;
            // Process and structure logs
            // Send alerts if needed
            return { status: 'success' };
          }
      Description: Process application logs
With structured logging in place, we created a CloudWatch dashboard with pre-configured queries for common issues. No more SSH required, a time savings of about 8 hours per week.
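The dashboard queries themselves aren't reproduced here, but the same kind of query can be run ad hoc through the CloudWatch Logs Insights API. This is a minimal sketch assuming a structured log group; the log group name and field names (level, message, requestId) are placeholders for whatever your logging schema uses:

import time
import boto3

logs = boto3.client('logs')

# Placeholder log group and fields; adjust to your structured logging schema
query_id = logs.start_query(
    logGroupName='/app/web-production',
    startTime=int(time.time()) - 3600,  # last hour
    endTime=int(time.time()),
    queryString=(
        'fields @timestamp, level, message, requestId '
        '| filter level = "ERROR" '
        '| sort @timestamp desc '
        '| limit 50'
    ),
)['queryId']

# Poll until the query finishes, then print matching log lines
while True:
    result = logs.get_query_results(queryId=query_id)
    if result['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)

for row in result.get('results', []):
    print({field['field']: field['value'] for field in row})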
3. Credential Management Automation with Secrets Manager and GitLab CI
Credential rotation was another massive time sink. I automated this using AWS Secrets Manager and GitLab CI pipelines:
# GitLab CI Pipeline (.gitlab-ci.yml)
stages:
  - rotate
  - test
  - deploy

rotate-credentials:
  stage: rotate
  image:
    name: amazon/aws-cli:latest
    entrypoint: [""]
  script:
    - aws secretsmanager rotate-secret --secret-id production/api/keys
    - aws secretsmanager get-secret-value --secret-id production/api/keys --query 'SecretString' --output text > .env.new
    - python compare_secrets.py .env.current .env.new
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  artifacts:
    paths:
      - .env.new

test-with-new-credentials:
  stage: test
  script:
    - cp .env.new .env
    - npm ci
    - npm test
  needs:
    - rotate-credentials

deploy-new-credentials:
  stage: deploy
  script:
    - aws lambda update-function-configuration --function-name api-handler --environment "Variables={$(cat .env.new | tr '\n' ',')}"
    - aws ecs update-service --cluster production --service api-service --force-new-deployment
  needs:
    - test-with-new-credentials
  when: manual
This pipeline automatically rotates credentials on a schedule, tests systems with the new credentials, and then deploys them across our services. This eliminated another 6 hours of weekly toil.
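The compare_secrets.py helper the rotate job calls isn't included in the pipeline snippet. A minimal sketch of what it might do is below: parse both dotenv-style files and fail the job if the set of keys changed, so a broken rotation never reaches the test stage. This is illustrative, not the original script:

# compare_secrets.py (illustrative sketch; the real script isn't shown above)
import sys


def parse_env(path):
    """Parse a dotenv-style file into a dict of KEY -> value."""
    values = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith('#') or '=' not in line:
                continue
            key, _, value = line.partition('=')
            values[key.strip()] = value.strip()
    return values


current = parse_env(sys.argv[1])  # .env.current
new = parse_env(sys.argv[2])      # .env.new

missing = sorted(set(current) - set(new))
added = sorted(set(new) - set(current))
unchanged = sorted(k for k in current if k in new and current[k] == new[k])

if missing:
    print(f"Keys missing after rotation: {missing}")
    sys.exit(1)  # fail the job so bad rotations never reach the test stage
if unchanged:
    print(f"Warning: values unchanged for: {unchanged}")
print(f"Rotation OK: {len(new)} keys ({len(added)} new)")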
Building a Toil-Reduction Culture
After seeing the impact of these initial automation efforts, I worked to embed toil reduction into our team's culture:
We instituted "Toil Budgets" - We tracked toil hours and allocated 20% of each sprint specifically to toil-reduction projects
We created a "Toil Leaderboard" - Engineers who automated away the most toil got recognition and rewards
We added a "Toil Review" to our postmortems - Every incident review included explicit discussion of what toil was introduced or revealed by the incident
My AWS Toil-Busting Toolkit
Over time, I've built a toolkit of AWS and GitLab resources specifically for eliminating SRE toil:
1. AWS Lambda Functions for Common Tasks
# Example Lambda for cleaning up old CloudWatch Log Groups
import boto3
import datetime

def lambda_handler(event, context):
    client = boto3.client('logs')

    # Get all log groups (the API is paginated)
    log_groups = []
    response = client.describe_log_groups()
    log_groups.extend(response['logGroups'])
    while 'nextToken' in response:
        response = client.describe_log_groups(nextToken=response['nextToken'])
        log_groups.extend(response['logGroups'])

    # Delete log groups whose newest stream hasn't received events in 60+ days.
    # describe_log_groups doesn't report ingestion time, so check the latest stream.
    deleted = 0
    for log_group in log_groups:
        streams = client.describe_log_streams(
            logGroupName=log_group['logGroupName'],
            orderBy='LastEventTime',
            descending=True,
            limit=1
        )['logStreams']
        last_event = streams[0].get('lastEventTimestamp') if streams else None
        if last_event is None:
            continue  # skip groups with no streams or no recorded events
        last_ingestion = datetime.datetime.fromtimestamp(last_event / 1000)
        if (datetime.datetime.now() - last_ingestion).days > 60:
            print(f"Deleting unused log group: {log_group['logGroupName']}")
            client.delete_log_group(logGroupName=log_group['logGroupName'])
            deleted += 1

    return {
        'statusCode': 200,
        'body': f'Cleaned up {deleted} of {len(log_groups)} log groups'
    }
2. GitLab CI Templates for Infrastructure Validation
# .gitlab/ci/infra-validation.yml
validate-infra:
  image:
    name: hashicorp/terraform:latest
    entrypoint: [""]
  script:
    - cd terraform
    - terraform init
    - terraform validate
    - terraform plan -out=plan.tfplan
    - terraform show -json plan.tfplan > plan.json
    - python ../scripts/validate_plan.py plan.json
  artifacts:
    paths:
      - terraform/plan.tfplan
      - terraform/plan.json
    expire_in: 1 week
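The validate_plan.py step is where the guardrail lives, and the actual script isn't shown in the template. A hedged sketch of the idea: parse Terraform's JSON plan and fail the job if anything would be destroyed, forcing a human to approve deletions.

# scripts/validate_plan.py (illustrative sketch; the real script isn't shown above)
import json
import sys

with open(sys.argv[1]) as handle:
    plan = json.load(handle)

# Terraform's JSON plan lists every resource change with its planned actions
destructive = []
for change in plan.get('resource_changes', []):
    actions = change.get('change', {}).get('actions', [])
    if 'delete' in actions:
        destructive.append(f"{change['address']} ({'+'.join(actions)})")

if destructive:
    print("Plan contains destructive changes:")
    for item in destructive:
        print(f"  - {item}")
    sys.exit(1)  # fail the pipeline; require explicit human approval for deletions

print(f"Plan OK: {len(plan.get('resource_changes', []))} resource changes, none destructive")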
3. AWS Systems Manager for Batch Operations
# AWS CloudFormation Template for Systems Manager Maintenance Window
Resources:
  MaintenanceWindow:
    Type: AWS::SSM::MaintenanceWindow
    Properties:
      Name: WeeklyPatching
      Schedule: cron(0 2 ? * SUN *)
      Duration: 3
      Cutoff: 1
      AllowUnassociatedTargets: false

  MaintenanceWindowTarget:
    Type: AWS::SSM::MaintenanceWindowTarget
    Properties:
      WindowId: !Ref MaintenanceWindow
      ResourceType: INSTANCE
      Targets:
        - Key: tag:Environment
          Values:
            - Production

  MaintenanceWindowTask:
    Type: AWS::SSM::MaintenanceWindowTask
    Properties:
      WindowId: !Ref MaintenanceWindow
      TaskArn: AWS-RunPatchBaseline
      # MaintenanceWindowRole (an IAM role SSM can assume) is defined elsewhere in the template
      ServiceRoleArn: !GetAtt MaintenanceWindowRole.Arn
      TaskType: RUN_COMMAND
      Targets:
        - Key: WindowTargetIds
          Values:
            - !Ref MaintenanceWindowTarget
      TaskParameters:
        Operation:
          Values:
            - Install
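After the window runs, you still want to confirm the patching actually landed. One option (not part of the template above, just an assumed follow-up) is a small script that pulls patch state for the same tagged instances:

import boto3

ec2 = boto3.client('ec2')
ssm = boto3.client('ssm')

# Find the same production instances the maintenance window targets
instance_ids = []
paginator = ec2.get_paginator('describe_instances')
for page in paginator.paginate(
    Filters=[
        {'Name': 'tag:Environment', 'Values': ['Production']},
        {'Name': 'instance-state-name', 'Values': ['running']},
    ]
):
    for reservation in page['Reservations']:
        instance_ids.extend(i['InstanceId'] for i in reservation['Instances'])

# Pull patch state in small batches and flag anything still missing or failed
for start in range(0, len(instance_ids), 50):
    states = ssm.describe_instance_patch_states(
        InstanceIds=instance_ids[start:start + 50]
    )['InstancePatchStates']
    for state in states:
        if state['MissingCount'] or state['FailedCount']:
            print(
                f"{state['InstanceId']}: {state['MissingCount']} missing, "
                f"{state['FailedCount']} failed"
            )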
The ROI of Toil Reduction
After a year of dedicated toil reduction efforts, here's what changed for my team:
Time spent on repetitive tasks dropped from 70% to 15%
Mean time to resolve incidents decreased by 45%
New feature delivery increased by 60%
Team morale improved dramatically (our internal satisfaction scores rose from 6.2 to 8.9/10)
The most impressive change? When a major AWS region experienced disruption, our automated systems detected the issue, rerouted traffic, and adjusted capacity without any human intervention. Our services remained 100% available while competitors experienced significant downtime.
Starting Your Own Toil Reduction Journey
If you're drowning in AWS operational toil, here's my advice for getting started:
Measure your toil first - Track exactly where your time is going for 2-3 weeks
Target the high-volume, low-complexity tasks - These give the best return on automation investment
Start with AWS Lambda for automation - It's perfect for small automation tasks
Use GitLab CI/CD for orchestration - Build pipelines that link your automations together
Document your wins - Quantify the time saved to justify further investment
Remember: The goal isn't to eliminate all operational workโit's to eliminate the manual, repetitive work that doesn't leverage your engineering skills. This frees you to focus on the creative and complex work that truly delivers value.
The most valuable lesson I've learned? Don't wait for permission to start eliminating toil. Every hour invested in automation will pay dividends for years to come.
Next in this series: I'll share my experience with implementing Error Budgets in AWS environments, and how they transformed our approach to reliability.