Error Budgets in SRE
When I first stepped into my role as a DevSecOps lead at a fast-growing fintech startup, I inherited a chaotic incident management process. Our development teams were shipping features at breakneck speed, while our operations team was drowning in alerts and unplanned work. That's when I introduced error budgets - a concept that transformed how we balanced reliability and innovation.
How I Discovered the Power of Error Budgets
After our third major outage in two months, our CTO called me into his office. "We need reliability, but we also can't slow down feature development," he said. "How do we solve this?"
That night, I sketched out our first error budget framework on a napkin at a coffee shop. The next morning, I presented a simple idea:
"What if we agree on how unreliable our system is allowed to be, and use that to make engineering decisions?"
That question kicked off our error budget journey - one that saved our company from the reliability vs. speed tug-of-war that plagues so many engineering organizations.
Breaking Down Error Budgets From My Experience
In essence, an error budget is a quantitative measure of unreliability that we're willing to tolerate. It creates a common language between product and engineering teams by converting reliability into a currency that everyone understands. Here's how I explain it to new team members:
Start with your SLO (Service Level Objective): This is your reliability target, like "Our API should respond successfully 99.95% of the time"
Calculate your error budget: It's simply 100% minus your SLO. For our 99.95% SLO, our error budget was 0.05%
Translate to meaningful units: For us, this meant we could have about 21.6 minutes of downtime per month (0.05% of a 30-day month) - see the quick calculation below
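Here's that conversion as a quick Python sketch (the 99.99% line is just there for comparison):

def error_budget_minutes(slo_percent, window_days=30):
    """Convert an SLO into the downtime minutes allowed over a window."""
    error_budget_fraction = (100.0 - slo_percent) / 100.0
    return error_budget_fraction * window_days * 24 * 60

print(error_budget_minutes(99.95))  # 21.6 minutes per 30-day month
print(error_budget_minutes(99.99))  # ~4.3 minutes per 30-day month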
Once we had this framework, everything changed. When an incident consumed a large chunk of our error budget, we'd automatically shift focus to reliability improvements. When we had plenty of budget remaining, we'd accelerate feature development.
Implementing Error Budgets with AWS Tools
Let me share exactly how we set up our error budget monitoring in AWS, as this was a game-changer for us:
1. Measuring SLIs with CloudWatch Metrics
First, we needed reliable Service Level Indicators (SLIs). For our main API service running on ECS, we created CloudWatch custom metrics:
// Inside our Express.js API middleware
const AWS = require('aws-sdk'); // AWS SDK for JavaScript v2
const cloudwatch = new AWS.CloudWatch();

app.use((req, res, next) => {
  const startTime = Date.now();

  // When the response is sent, record latency and success
  res.on('finish', () => {
    const duration = Date.now() - startTime;
    const successful = res.statusCode < 500; // Simplified success criteria

    // Push custom metrics to CloudWatch (fire-and-forget so we never block the request)
    cloudwatch.putMetricData({
      Namespace: 'API/Services',
      MetricData: [
        {
          MetricName: 'RequestLatency',
          Dimensions: [{ Name: 'Service', Value: 'PaymentAPI' }],
          Value: duration,
          Unit: 'Milliseconds'
        },
        {
          MetricName: 'RequestSuccess',
          Dimensions: [{ Name: 'Service', Value: 'PaymentAPI' }],
          Value: successful ? 1 : 0,
          Unit: 'Count'
        }
      ]
    }).promise().catch((err) => console.error('Failed to publish SLI metrics', err));
  });

  next();
});
2. Creating SLO Dashboards in CloudWatch
With reliable metrics flowing in, I created a CloudWatch dashboard specifically for tracking our SLOs and error budgets:
First, I created a metric math expression to calculate our success rate:
(m1/m2)*100
Where:
m1 = sum of successful requests
m2 = total requests
Then, I added a horizontal annotation line at our SLO threshold (99.95%)
Finally, I created a widget showing our remaining error budget as a percentage:
(s1-MIN(PERIOD(m1/m2*100,1d)))/s1*100
Where s1 is our SLO percentage (99.95)
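As a sanity check on the dashboard math, the same remaining-budget calculation is easy to reproduce offline. Here's an illustrative Python sketch (not CloudWatch syntax); it assumes you already have success and total request counts for the window:

def remaining_error_budget_percent(successful, total, slo=99.95):
    """Percentage of the error budget still unspent for a given window."""
    success_rate = successful / total * 100.0
    error_budget = 100.0 - slo          # 0.05 percentage points for a 99.95% SLO
    budget_used = 100.0 - success_rate  # observed unreliability, in percentage points
    return max(0.0, (error_budget - budget_used) / error_budget * 100.0)

# 10 million requests with 3,000 failures -> 99.97% success, 40% of the budget left
print(remaining_error_budget_percent(9_997_000, 10_000_000))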
3. Setting Up Error Budget Burn Rate Alarms
The most valuable alarms we created were for "burn rate" - how quickly we were consuming our error budget:
# Sample AWS CloudFormation snippet
ErrorBudgetBurnRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: ErrorBudgetBurnRate-Critical
    AlarmDescription: "Error budget consumption rate is too high"
    MetricName: RequestSuccess
    Namespace: API/Services
    Dimensions:
      - Name: Service
        Value: PaymentAPI
    Statistic: Average
    Period: 300
    EvaluationPeriods: 3
    Threshold: 0.9950  # Success rate below 99.5% = roughly 10x the normal burn rate for our 99.95% SLO
    ComparisonOperator: LessThanThreshold
    TreatMissingData: notBreaching
    AlarmActions:
      - !Ref PagerDutyAlarmTopic
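The threshold value falls out of the burn-rate arithmetic: at a given burn rate, the success rate sits below 100% by that multiple of the error budget. A small sketch of the mapping for a 99.95% SLO (the 2x line is just for comparison):

def success_rate_threshold(slo_percent, burn_rate):
    """Success-rate alarm threshold (as a fraction) for a target burn rate."""
    error_budget_fraction = (100.0 - slo_percent) / 100.0  # 0.0005 for 99.95%
    return 1.0 - burn_rate * error_budget_fraction

print(success_rate_threshold(99.95, 10))  # 0.995 -> the critical alarm above
print(success_rate_threshold(99.95, 2))   # 0.999 -> a gentler warning-level alarm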
4. Integrating with AWS Lambda for Error Budget Reports
I wrote a Lambda function that ran weekly to calculate our remaining error budget and email it to all engineering leads:
import boto3
import calendar
import datetime

def lambda_handler(event, context):
    cloudwatch = boto3.client('cloudwatch')
    sns = boto3.client('sns')

    # Get the start of the month and how much of it has elapsed
    now = datetime.datetime.utcnow()
    start_of_month = datetime.datetime(now.year, now.month, 1)
    days_in_month = calendar.monthrange(now.year, now.month)[1]
    month_percent_elapsed = (now - start_of_month).total_seconds() / (days_in_month * 24 * 60 * 60) * 100

    # Get our success rate for the month so far
    response = cloudwatch.get_metric_statistics(
        Namespace='API/Services',
        MetricName='RequestSuccess',
        Dimensions=[{'Name': 'Service', 'Value': 'PaymentAPI'}],
        StartTime=start_of_month,
        EndTime=now,
        Period=86400,  # Daily datapoints
        Statistics=['Average']
    )

    # Calculate the current success rate (simple mean of the daily averages)
    datapoints = response['Datapoints']
    if not datapoints:
        return
    success_rate = sum(point['Average'] for point in datapoints) / len(datapoints) * 100

    # Our SLO is 99.95%
    slo = 99.95
    error_budget = 100 - slo

    # Calculate how much of the monthly budget we've used and how fast we're burning it
    budget_used_percent = (100 - success_rate) / error_budget * 100
    expected_budget_used = month_percent_elapsed
    burn_rate = budget_used_percent / expected_budget_used if expected_budget_used > 0 else 0

    # Format our message
    message = f"""
Error Budget Report: {now.strftime('%B %d, %Y')}
SLO: {slo}%
Current Success Rate: {success_rate:.3f}%
Error Budget Used: {budget_used_percent:.1f}%
Month Elapsed: {month_percent_elapsed:.1f}%
Burn Rate: {burn_rate:.2f}x
Status: {'🔴 CRITICAL' if burn_rate > 1.5 else '🟡 WARNING' if burn_rate > 1.0 else '🟢 HEALTHY'}
{'ALERT: Error budget being consumed faster than expected. Engineering focus should shift to reliability improvements.' if burn_rate > 1 else 'Error budget consumption on track. Proceed with planned feature development.'}
"""

    # Send the report
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:ErrorBudgetReports',
        Subject=f"Error Budget Report: {'🔴' if burn_rate > 1 else '🟢'} {budget_used_percent:.1f}% Used",
        Message=message
    )
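To run this weekly, one straightforward option is an EventBridge schedule. Here's a minimal boto3 sketch of that wiring, assuming the function is already deployed (the rule name and function ARN below are placeholders, not our real ones):

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

rule_name = 'weekly-error-budget-report'  # placeholder
function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:error-budget-report'  # placeholder

# Fire every Monday at 09:00 UTC
rule = events.put_rule(Name=rule_name, ScheduleExpression='cron(0 9 ? * MON *)')

# Allow EventBridge to invoke the function, then register it as the rule's target
lambda_client.add_permission(
    FunctionName=function_arn,
    StatementId='allow-weekly-error-budget-report',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn'],
)
events.put_targets(Rule=rule_name, Targets=[{'Id': 'error-budget-lambda', 'Arn': function_arn}])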
Real-world Impact: How Error Budgets Changed Our Culture
The most profound change came from how error budgets shifted our engineering culture:
1. From Blame to Objective Discussion
Before error budgets, our incident retrospectives were tense, with product managers and engineers often pointing fingers. After implementing error budgets, our discussions became data-driven. I'll never forget a product manager saying "I see we've used 80% of our monthly error budget. Let's postpone the new checkout feature and fix those database timeouts first."
2. From Panic to Planned Responses
We established clear policies tied to our error budget consumption (a small sketch of this mapping follows the list):
Budget Used < 50%: Full speed ahead on features
Budget Used 50-75%: Proceed with caution, high-risk changes require extra review
Budget Used 75-90%: One reliability improvement must be prioritized alongside any new feature
Budget Used > 90%: Feature freeze, focus exclusively on reliability
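Here is that mapping as an illustrative Python function, handy if you want a dashboard or bot to report the current stage automatically (the function itself is a sketch; the thresholds are the ones above):

def error_budget_policy(budget_used_percent):
    """Map monthly error-budget consumption to the agreed engineering response."""
    if budget_used_percent < 50:
        return 'Full speed ahead on features'
    if budget_used_percent < 75:
        return 'Proceed with caution; high-risk changes require extra review'
    if budget_used_percent < 90:
        return 'Prioritize one reliability improvement alongside any new feature'
    return 'Feature freeze: focus exclusively on reliability'

print(error_budget_policy(80))  # -> reliability improvement required alongside features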
3. From Arbitrary Targets to Business Alignment
We aligned our SLOs with business metrics. For example, when analyzing our payment API data, we found that a 99.95% reliability target (rather than trying for 99.99%) gave us the optimal balance between engineering effort and user satisfaction. This saved us from over-engineering and allowed for quicker innovation.
My Hard-earned Lessons on AWS Error Budget Implementation
If you're implementing error budgets with AWS, here are some hard-won insights:
1. CloudWatch Metric Resolution Matters
Initially, we used the default 5-minute resolution for our metrics and couldn't understand why our error budget calculations were off. Switching to 1-minute resolution made a huge difference in accuracy, especially for detecting brief outages.
# Use high-resolution metrics
MetricDetail:
  Type: AWS::CloudWatch::MetricStream
  Properties:
    OutputFormat: json
    FirehoseArn: !GetAtt MetricsFirehose.Arn
    RoleArn: !GetAtt MetricsRole.Arn
    IncludeFilters:
      - Namespace: API/Services
    StatisticsConfigurations:
      - IncludeMetrics:
          - Namespace: API/Services
            MetricName: RequestSuccess
        AdditionalStatistics:
          - p90
          - p99
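To see the effect in practice, here's a rough, illustrative boto3 sketch comparing the same SLI at 5-minute versus 1-minute periods (the metric name and dimensions match the middleware above; everything else is an assumption):

import datetime
import boto3

cloudwatch = boto3.client('cloudwatch')
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=6)

def worst_success_datapoint(period_seconds):
    """Lowest RequestSuccess average over the last six hours at the given period."""
    resp = cloudwatch.get_metric_statistics(
        Namespace='API/Services',
        MetricName='RequestSuccess',
        Dimensions=[{'Name': 'Service', 'Value': 'PaymentAPI'}],
        StartTime=start,
        EndTime=end,
        Period=period_seconds,
        Statistics=['Average'],
    )
    return min((p['Average'] for p in resp['Datapoints']), default=None)

# Brief dips get diluted in a 5-minute average; 1-minute datapoints show
# how deep and how long an outage actually was.
print('worst 5-minute datapoint:', worst_success_datapoint(300))
print('worst 1-minute datapoint:', worst_success_datapoint(60))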
2. Don't Try to Monitor Everything
We initially tried to create SLOs for every microservice (we had over 30!). It was overwhelming. I learned to focus on customer journey-based SLOs instead. For example, we monitored the critical payment flow end-to-end rather than each individual service.
3. Use Multiple Time Windows
One month is too long to wait before detecting problems. We ended up using cascading time windows for our error budgets:
1-hour window (very sensitive, used for immediate alerts)
24-hour window (balanced sensitivity, used for daily standups)
30-day window (the "official" error budget for planning)
This approach gave us early warnings while preventing alert fatigue.
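To make the windows concrete, here's a small sketch of how a success rate measured over each window translates into a burn rate (the sample rates are purely illustrative):

SLO = 99.95
ERROR_BUDGET = 100.0 - SLO  # percentage points of allowed unreliability

def burn_rate(success_rate_percent):
    """How many times faster than 'exactly on budget' we are burning."""
    return (100.0 - success_rate_percent) / ERROR_BUDGET

# Hypothetical success rates for each window
windows = {
    '1-hour (paging)': 99.40,    # 12x  -> alert immediately
    '24-hour (standup)': 99.85,  # 3x   -> raise at the daily standup
    '30-day (planning)': 99.93,  # 1.4x -> schedule reliability work
}
for name, rate in windows.items():
    print(f'{name}: {burn_rate(rate):.1f}x burn rate')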
Creating an Error Budget Policy That Actually Works
The key to success was creating a clear error budget policy document. Here's a simplified version of what worked for us:
# Error Budget Policy
## Purpose
This policy establishes how we use error budgets to balance reliability and innovation.
## Definitions
- SLO: 99.95% API request success rate
- Error Budget: 0.05% failed requests (21.6 minutes downtime per month)
- Burn Rate: Rate of error budget consumption relative to time elapsed
## Response Actions
| Burn Rate | Action |
|-----------|--------|
| <1.0x | Proceed normally |
| 1.0x-2.0x | Add one reliability task to next sprint |
| 2.0x-10.0x | Pause non-essential feature work until burn rate decreases |
| >10.0x | Emergency response mode, all hands on reliability |
## Circuit Breakers
If >50% of error budget is consumed by a single incident, leadership team must approve resuming feature work.
## Exceptions
Error budget enforcement may be temporarily suspended during:
- Major launches (pre-approved)
- Black Friday and holiday shopping period
Conclusion: Error Budgets Transform SRE Culture
Looking back, introducing error budgets was the single most important change I made as an SRE lead. It translated reliability from a vague goal into a concrete metric that both technical and non-technical stakeholders could understand and act upon.
On AWS, we were able to implement a sophisticated error budget system using native tools like CloudWatch, Lambda, and SNS. This gave us real-time visibility into our reliability posture without expensive third-party solutions.
The most rewarding moment came six months after implementation, when I overheard our product manager tell a new hire: "Before we plan this feature, let's check our error budget to see if we have the reliability headroom for it." That's when I knew the cultural change had truly taken hold.
If you're considering implementing error budgets in your organization, start simple, focus on metrics that matter to your customers, and use the tools AWS provides to automate as much as possible. Your future self (and your on-call engineers) will thank you!