Understanding Toil in SRE
When I first stepped into the role of a Site Reliability Engineer at a fast-growing fintech startup, I was excited about building scalable infrastructure and solving complex availability challenges. What I wasn't prepared for was spending 70% of my time on mundane, repetitive tasks that left me feeling like a human button-pusher rather than an engineer. That's when I learned about the concept of "toil" in SRE, and it completely transformed how I approach operations work.
What Toil Really Means: Lessons from the Trenches
In the SRE world, toil isn't just "work we don't like." It's specifically the manual, repetitive, automatable work that scales linearly with service growth and adds no lasting value to the system. After years of battling toil across multiple cloud platforms, I've come to recognize it immediately by these characteristics:
It's mind-numbingly manual - You're doing the same thing over and over with human hands
It's frustratingly repetitive - You've done this exact task many times before
It's obviously automatable - You know a script could do this, if only you had time to write it
It's purely tactical - It solves today's problem but does nothing for tomorrow
It creates no lasting improvement - The system isn't better after you're done
It grows linearly with your service - Double the users, double the toil
The most insidious thing about toil? It creates a vicious cycle. The more toil you have, the less time you have to eliminate it. Break this cycle, and you unlock your team's true potential.
My Personal Battle with Toil on AWS
Early in my SRE career, I was part of a team managing over 200 EC2 instances across multiple AWS accounts. Our typical week included these toil-heavy activities:
Manually adjusting Auto Scaling groups based on expected traffic
SSH-ing into instances to check logs when alerts fired
Rotating credentials and updating them across multiple services
Running the same database maintenance queries on dozens of RDS instances
Manually approving and implementing routine infrastructure changes
I was spending 30+ hours a week on these tasks alone. Something had to change.
How I Eliminated 80% of My AWS Toil
1. Auto Scaling on Steroids with Lambda Functions
My first major toil-reduction win came from automating our EC2 scaling operations. Instead of manually adjusting Auto Scaling groups, I built a Lambda function that analyzed CloudWatch metrics and historical patterns to predict and adjust capacity proactively:
import boto3
import datetime
import numpy as np

def lambda_handler(event, context):
    # Get current and historical CloudWatch metrics
    cloudwatch = boto3.client('cloudwatch')
    autoscaling = boto3.client('autoscaling')

    # Get historical CPU data for the last 14 days
    end_time = datetime.datetime.utcnow()
    start_time = end_time - datetime.timedelta(days=14)

    response = cloudwatch.get_metric_data(
        MetricDataQueries=[
            {
                'Id': 'cpu_utilization',
                'MetricStat': {
                    'Metric': {
                        'Namespace': 'AWS/EC2',
                        'MetricName': 'CPUUtilization',
                        'Dimensions': [
                            {'Name': 'AutoScalingGroupName', 'Value': 'web-production-asg'},
                        ]
                    },
                    'Period': 3600,  # 1 hour granularity
                    'Stat': 'Average',
                }
            },
        ],
        StartTime=start_time,
        EndTime=end_time
    )

    # Analyze pattern to predict needed capacity
    cpu_data = response['MetricDataResults'][0]['Values']
    hour_of_day = end_time.hour

    # Find similar historical periods (the same hour on each previous day)
    similar_periods = []
    for i in range(0, len(cpu_data), 24):  # Step through each day
        if i + 24 <= len(cpu_data):  # Ensure we have a full day of data
            similar_periods.append(cpu_data[i + hour_of_day])

    # Predict needed capacity based on historical patterns
    predicted_cpu = np.mean(similar_periods) * 1.2  # Add 20% buffer

    # Calculate instances needed based on CPU
    instances_needed = max(2, int(predicted_cpu / 30))  # Assume 30% CPU target per instance

    # Update the Auto Scaling group
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName='web-production-asg',
        MinSize=instances_needed,
        MaxSize=instances_needed * 2,
        DesiredCapacity=instances_needed
    )

    return {
        'statusCode': 200,
        'body': f'Updated ASG to {instances_needed} instances based on predicted CPU of {predicted_cpu:.2f}%'
    }
This single Lambda function eliminated about 10 hours of toil per week by running every hour, analyzing historical CloudWatch metrics, and adjusting our Auto Scaling groups accordingly. The best part? It actually did a better job than our manual adjustments because it could analyze patterns we hadn't even noticed.
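The post doesn't show how the hourly trigger was wired up, but one straightforward way is an EventBridge rule pointed at the function. Here's a rough boto3 sketch with placeholder names (the function and rule names are assumptions, not the original setup):

import boto3

# Hypothetical names; substitute your own function and rule
FUNCTION_NAME = 'asg-capacity-forecaster'
RULE_NAME = 'hourly-capacity-forecast'

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Create (or update) a rule that fires once an hour
rule_arn = events.put_rule(
    Name=RULE_NAME,
    ScheduleExpression='rate(1 hour)',
    State='ENABLED',
)['RuleArn']

# Allow EventBridge to invoke the forecasting Lambda
function_arn = lambda_client.get_function(
    FunctionName=FUNCTION_NAME
)['Configuration']['FunctionArn']
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId='allow-eventbridge-hourly',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn,
)

# Point the rule at the function
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{'Id': 'capacity-forecaster', 'Arn': function_arn}],
)

In practice you'd likely manage this in the same infrastructure-as-code as the rest of the stack rather than running it by hand.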
2. Centralized Logging with CloudWatch Insights
The next big win came from eliminating the need to SSH into instances for troubleshooting. I implemented a comprehensive CloudWatch Logs setup with structured logging:
Resources:
  LoggingLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
        - arn:aws:iam::aws:policy/CloudWatchLogsFullAccess

  LogProcessorFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt LoggingLambdaRole.Arn
      Runtime: nodejs14.x
      Code:
        ZipFile: |
          exports.handler = async (event) => {
            const logEvents = event.awslogs.data;
            // Process and structure logs
            // Send alerts if needed
            return { status: 'success' };
          }
      Description: Process application logs
With structured logging in place, we created a CloudWatch dashboard with pre-configured queries for common issues. No more SSH required, a time savings of about 8 hours per week.
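The dashboard queries themselves aren't reproduced here, but the same kind of query can be run ad hoc through the CloudWatch Logs Insights API. This is a minimal sketch assuming a structured log group; the log group name and field names (level, message, requestId) are placeholders for whatever your logging schema uses:

import time
import boto3

logs = boto3.client('logs')

# Placeholder log group and fields; adjust to your structured logging schema
query_id = logs.start_query(
    logGroupName='/app/web-production',
    startTime=int(time.time()) - 3600,  # last hour
    endTime=int(time.time()),
    queryString=(
        'fields @timestamp, level, message, requestId '
        '| filter level = "ERROR" '
        '| sort @timestamp desc '
        '| limit 50'
    ),
)['queryId']

# Poll until the query finishes, then print matching log lines
while True:
    result = logs.get_query_results(queryId=query_id)
    if result['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)

for row in result.get('results', []):
    print({field['field']: field['value'] for field in row})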
3. Credential Management Automation with Secrets Manager and GitLab CI
Credential rotation was another massive time sink. I automated this using AWS Secrets Manager and GitLab CI pipelines:
# GitLab CI Pipeline (.gitlab-ci.yml)
stages:
  - rotate
  - test
  - deploy

rotate-credentials:
  stage: rotate
  image:
    name: amazon/aws-cli:latest
    entrypoint: [""]
  script:
    - aws secretsmanager rotate-secret --secret-id production/api/keys
    - aws secretsmanager get-secret-value --secret-id production/api/keys --query 'SecretString' --output text > .env.new
    - python compare_secrets.py .env.current .env.new
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  artifacts:
    paths:
      - .env.new

test-with-new-credentials:
  stage: test
  script:
    - cp .env.new .env
    - npm ci
    - npm test
  needs:
    - rotate-credentials

deploy-new-credentials:
  stage: deploy
  script:
    - aws lambda update-function-configuration --function-name api-handler --environment "Variables={$(cat .env.new | tr '\n' ',')}"
    - aws ecs update-service --cluster production --service api-service --force-new-deployment
  needs:
    - test-with-new-credentials
  when: manual
This pipeline automatically rotates credentials on a schedule, tests systems with the new credentials, and then deploys them across our services. This eliminated another 6 hours of weekly toil.
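The compare_secrets.py helper the rotate job calls isn't included in the pipeline snippet. A minimal sketch of what it might do is below: parse both dotenv-style files and fail the job if the set of keys changed, so a broken rotation never reaches the test stage. This is illustrative, not the original script:

# compare_secrets.py (illustrative sketch; the real script isn't shown above)
import sys


def parse_env(path):
    """Parse a dotenv-style file into a dict of KEY -> value."""
    values = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith('#') or '=' not in line:
                continue
            key, _, value = line.partition('=')
            values[key.strip()] = value.strip()
    return values


current = parse_env(sys.argv[1])  # .env.current
new = parse_env(sys.argv[2])      # .env.new

missing = sorted(set(current) - set(new))
added = sorted(set(new) - set(current))
unchanged = sorted(k for k in current if k in new and current[k] == new[k])

if missing:
    print(f"Keys missing after rotation: {missing}")
    sys.exit(1)  # fail the job so bad rotations never reach the test stage
if unchanged:
    print(f"Warning: values unchanged for: {unchanged}")
print(f"Rotation OK: {len(new)} keys ({len(added)} new)")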
Building a Toil-Reduction Culture
After seeing the impact of these initial automation efforts, I worked to embed toil reduction into our team's culture:
We instituted "Toil Budgets" - We tracked toil hours and allocated 20% of each sprint specifically to toil-reduction projects
We created a "Toil Leaderboard" - Engineers who automated away the most toil got recognition and rewards
We added a "Toil Review" to our postmortems - Every incident review included explicit discussion of what toil was introduced or revealed by the incident
My AWS Toil-Busting Toolkit
Over time, I've built a toolkit of AWS and GitLab resources specifically for eliminating SRE toil:
1. AWS Lambda Functions for Common Tasks
# Example Lambda for cleaning up old CloudWatch Log Groups
import boto3
import datetime

def lambda_handler(event, context):
    client = boto3.client('logs')

    # Get all log groups (the API is paginated)
    log_groups = []
    response = client.describe_log_groups()
    log_groups.extend(response['logGroups'])
    while 'nextToken' in response:
        response = client.describe_log_groups(nextToken=response['nextToken'])
        log_groups.extend(response['logGroups'])

    # Delete log groups whose newest stream hasn't received events in 60+ days.
    # describe_log_groups doesn't report ingestion time, so check the latest stream.
    deleted = 0
    for log_group in log_groups:
        streams = client.describe_log_streams(
            logGroupName=log_group['logGroupName'],
            orderBy='LastEventTime',
            descending=True,
            limit=1
        )['logStreams']
        last_event = streams[0].get('lastEventTimestamp') if streams else None
        if last_event is None:
            continue  # skip groups with no streams or no recorded events
        last_ingestion = datetime.datetime.fromtimestamp(last_event / 1000)
        if (datetime.datetime.now() - last_ingestion).days > 60:
            print(f"Deleting unused log group: {log_group['logGroupName']}")
            client.delete_log_group(logGroupName=log_group['logGroupName'])
            deleted += 1

    return {
        'statusCode': 200,
        'body': f'Cleaned up {deleted} of {len(log_groups)} log groups'
    }
2. GitLab CI Templates for Infrastructure Validation
# .gitlab/ci/infra-validation.yml
validate-infra:
  image:
    name: hashicorp/terraform:latest
    entrypoint: [""]
  script:
    - cd terraform
    - terraform init
    - terraform validate
    - terraform plan -out=plan.tfplan
    - terraform show -json plan.tfplan > plan.json
    - python ../scripts/validate_plan.py plan.json
  artifacts:
    paths:
      - terraform/plan.tfplan
      - terraform/plan.json
    expire_in: 1 week
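The validate_plan.py step is where the guardrail lives, and the actual script isn't shown in the template. A hedged sketch of the idea: parse Terraform's JSON plan and fail the job if anything would be destroyed, forcing a human to approve deletions.

# scripts/validate_plan.py (illustrative sketch; the real script isn't shown above)
import json
import sys

with open(sys.argv[1]) as handle:
    plan = json.load(handle)

# Terraform's JSON plan lists every resource change with its planned actions
destructive = []
for change in plan.get('resource_changes', []):
    actions = change.get('change', {}).get('actions', [])
    if 'delete' in actions:
        destructive.append(f"{change['address']} ({'+'.join(actions)})")

if destructive:
    print("Plan contains destructive changes:")
    for item in destructive:
        print(f"  - {item}")
    sys.exit(1)  # fail the pipeline; require explicit human approval for deletions

print(f"Plan OK: {len(plan.get('resource_changes', []))} resource changes, none destructive")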
3. AWS Systems Manager for Batch Operations
# AWS CloudFormation Template for Systems Manager Maintenance Window
Resources:
  MaintenanceWindow:
    Type: AWS::SSM::MaintenanceWindow
    Properties:
      Name: WeeklyPatching
      Schedule: cron(0 2 ? * SUN *)
      Duration: 3
      Cutoff: 1
      AllowUnassociatedTargets: false

  MaintenanceWindowTarget:
    Type: AWS::SSM::MaintenanceWindowTarget
    Properties:
      WindowId: !Ref MaintenanceWindow
      ResourceType: INSTANCE
      Targets:
        - Key: tag:Environment
          Values:
            - Production

  MaintenanceWindowTask:
    Type: AWS::SSM::MaintenanceWindowTask
    Properties:
      WindowId: !Ref MaintenanceWindow
      TaskArn: AWS-RunPatchBaseline
      # MaintenanceWindowRole (an IAM role SSM can assume) is defined elsewhere in the template
      ServiceRoleArn: !GetAtt MaintenanceWindowRole.Arn
      TaskType: RUN_COMMAND
      Targets:
        - Key: WindowTargetIds
          Values:
            - !Ref MaintenanceWindowTarget
      TaskParameters:
        Operation:
          Values:
            - Install
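After the window runs, you still want to confirm the patching actually landed. One option (not part of the template above, just an assumed follow-up) is a small script that pulls patch state for the same tagged instances:

import boto3

ec2 = boto3.client('ec2')
ssm = boto3.client('ssm')

# Find the same production instances the maintenance window targets
instance_ids = []
paginator = ec2.get_paginator('describe_instances')
for page in paginator.paginate(
    Filters=[
        {'Name': 'tag:Environment', 'Values': ['Production']},
        {'Name': 'instance-state-name', 'Values': ['running']},
    ]
):
    for reservation in page['Reservations']:
        instance_ids.extend(i['InstanceId'] for i in reservation['Instances'])

# Pull patch state in small batches and flag anything still missing or failed
for start in range(0, len(instance_ids), 50):
    states = ssm.describe_instance_patch_states(
        InstanceIds=instance_ids[start:start + 50]
    )['InstancePatchStates']
    for state in states:
        if state['MissingCount'] or state['FailedCount']:
            print(
                f"{state['InstanceId']}: {state['MissingCount']} missing, "
                f"{state['FailedCount']} failed"
            )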
The ROI of Toil Reduction
After a year of dedicated toil reduction efforts, here's what changed for my team:
Time spent on repetitive tasks dropped from 70% to 15%
Mean time to resolve incidents decreased by 45%
New feature delivery increased by 60%
Team morale improved dramatically (our internal satisfaction scores rose from 6.2 to 8.9/10)
The most impressive change? When a major AWS region experienced disruption, our automated systems detected the issue, rerouted traffic, and adjusted capacity without any human intervention. Our services remained 100% available while competitors experienced significant downtime.
Starting Your Own Toil Reduction Journey
If you're drowning in AWS operational toil, here's my advice for getting started:
Measure your toil first - Track exactly where your time is going for 2-3 weeks
Target the high-volume, low-complexity tasks - These give the best return on automation investment
Start with AWS Lambda for automation - It's perfect for small automation tasks
Use GitLab CI/CD for orchestration - Build pipelines that link your automations together
Document your wins - Quantify the time saved to justify further investment
Remember: The goal isn't to eliminate all operational workโit's to eliminate the manual, repetitive work that doesn't leverage your engineering skills. This frees you to focus on the creative and complex work that truly delivers value.
The most valuable lesson I've learned? Don't wait for permission to start eliminating toil. Every hour invested in automation will pay dividends for years to come.
Next in this series: I'll share my experience with implementing Error Budgets in AWS environments, and how they transformed our approach to reliability.