Error Budgets in SRE

When I first stepped into my role as a DevSecOps lead at a fast-growing fintech startup, I inherited a chaotic incident management process. Our development teams were shipping features at breakneck speed, while our operations team was drowning in alerts and unplanned work. That's when I introduced error budgets - a concept that transformed how we balanced reliability and innovation.

How I Discovered the Power of Error Budgets

After our third major outage in two months, our CTO called me into his office. "We need reliability, but we also can't slow down feature development," he said. "How do we solve this?"

That night, I sketched out our first error budget framework on a napkin at a coffee shop. The next morning, I presented a simple idea:

"What if we agree on how unreliable our system is allowed to be, and use that to make engineering decisions?"

That question kicked off our error budget journey - one that saved our company from the reliability vs. speed tug-of-war that plagues so many engineering organizations.

Breaking Down Error Budgets From My Experience

In essence, an error budget is a quantitative measure of unreliability that we're willing to tolerate. It creates a common language between product and engineering teams by converting reliability into a currency that everyone understands. Here's how I explain it to new team members:

  1. Start with your SLO (Service Level Objective): This is your reliability target, like "Our API should respond successfully 99.95% of the time"

  2. Calculate your error budget: It's simply 100% minus your SLO. For our 99.95% SLO, our error budget was 0.05%

  3. Translate to meaningful units: For us, this meant we could have about 21.6 minutes of downtime per month (0.05% of a 30-day month)
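
To make step 3 concrete, here's the arithmetic as a minimal sketch (the 99.95% SLO and 30-day window are the same numbers used above):

```python
# Error budget arithmetic for a 99.95% SLO over a 30-day window.
slo = 99.95                        # Service Level Objective, in percent
window_minutes = 30 * 24 * 60      # 43,200 minutes in a 30-day month

error_budget_pct = 100.0 - slo                                # 0.05%
allowed_downtime = window_minutes * error_budget_pct / 100.0  # minutes of full downtime

print(f"Error budget: {error_budget_pct:.2f}% = {allowed_downtime:.1f} minutes/month")
# -> Error budget: 0.05% = 21.6 minutes/month
```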

Once we had this framework, everything changed. When an incident consumed a large chunk of our error budget, we'd automatically shift focus to reliability improvements. When we had plenty of budget remaining, we'd accelerate feature development.

Implementing Error Budgets with AWS Tools

Let me share exactly how we set up our error budget monitoring in AWS, as this was a game-changer for us:

1. Measuring SLIs with CloudWatch Metrics

First, we needed reliable Service Level Indicators (SLIs). For our main API service running on ECS, we created CloudWatch custom metrics:
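
The snippet below is a simplified sketch of that idea rather than our production code: the Custom/SLI namespace, the metric names, and the Service dimension are illustrative placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_request(success: bool, service: str = "payments-api") -> None:
    """Publish one SLI data point per request: a total count plus a success count."""
    metric_data = [
        {
            "MetricName": "RequestCount",
            "Dimensions": [{"Name": "Service", "Value": service}],
            "Value": 1.0,
            "Unit": "Count",
        }
    ]
    if success:
        metric_data.append(
            {
                "MetricName": "SuccessfulRequestCount",
                "Dimensions": [{"Name": "Service", "Value": service}],
                "Value": 1.0,
                "Unit": "Count",
            }
        )
    cloudwatch.put_metric_data(Namespace="Custom/SLI", MetricData=metric_data)
```

In practice you'd batch these calls or use the CloudWatch embedded metric format rather than making one API call per request, but the shape of the data is the same: a success counter and a total counter that dashboards and alarms can divide.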

2. Creating SLO Dashboards in CloudWatch

With reliable metrics flowing in, I created a CloudWatch dashboard specifically for tracking our SLOs and error budgets:

  1. First, I created a metric math expression to calculate our success rate from two metrics (one way to write it is shown in the sketch after this list), where:

    • m1 = sum of successful requests

    • m2 = total requests

  2. Then, I added a horizontal annotation line at our SLO threshold (99.95%)

  3. Finally, I created a widget showing our remaining error budget as a percentage, where s1 is our SLO percentage (99.95); one way to express that calculation appears in the same sketch
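
The dashboard definition is easier to show than to describe, so here's a hedged sketch of how those widgets could be expressed with CloudWatch metric math, reusing the illustrative Custom/SLI metrics from earlier. The exact expressions we ran may have differed; this is one reasonable way to write them.

```python
import json
import boto3

cloudwatch = boto3.client("cloudwatch")
SLO = 99.95  # the "s1" value referred to above

dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "API success rate vs. SLO",
                "view": "timeSeries",
                "region": "us-east-1",
                "metrics": [
                    ["Custom/SLI", "SuccessfulRequestCount", "Service", "payments-api",
                     {"id": "m1", "stat": "Sum", "visible": False}],
                    ["Custom/SLI", "RequestCount", "Service", "payments-api",
                     {"id": "m2", "stat": "Sum", "visible": False}],
                    # Success rate: successful requests divided by total requests.
                    [{"expression": "100 * m1 / m2", "label": "Success rate (%)", "id": "e1"}],
                ],
                # The horizontal annotation line at the SLO threshold (item 2 above).
                "annotations": {"horizontal": [{"label": f"SLO {SLO}%", "value": SLO}]},
            },
        },
        {
            "type": "metric",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Remaining error budget (%)",
                "view": "timeSeries",
                "region": "us-east-1",
                "metrics": [
                    ["Custom/SLI", "SuccessfulRequestCount", "Service", "payments-api",
                     {"id": "m1", "stat": "Sum", "visible": False}],
                    ["Custom/SLI", "RequestCount", "Service", "payments-api",
                     {"id": "m2", "stat": "Sum", "visible": False}],
                    # Remaining budget = 100% minus the share of allowed errors consumed.
                    [{"expression": f"100 - 100 * ((m2 - m1) / m2) / ((100 - {SLO}) / 100)",
                      "label": "Remaining error budget (%)", "id": "e2"}],
                ],
            },
        },
    ],
}

cloudwatch.put_dashboard(
    DashboardName="error-budget-slo",
    DashboardBody=json.dumps(dashboard_body),
)
```

One caveat: with this form the percentage is computed over whatever period the widget uses, so for the official 30-day figure you still want an aggregation over the full window.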

3. Setting Up Error Budget Burn Rate Alarms

The most valuable alarms we created were for "burn rate" - how quickly we were consuming our error budget:
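
The alarm below is a hedged sketch of a fast-burn alarm built with metric math, again using the illustrative Custom/SLI metrics. The 14.4 threshold is the commonly cited rule of thumb for burning roughly 2% of a 30-day budget in one hour, and the SNS topic ARN is a hypothetical placeholder.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

ALLOWED_ERROR_RATE = (100 - 99.95) / 100  # 0.0005 for a 99.95% SLO

# Burn rate = observed error rate / allowed error rate.
cloudwatch.put_metric_alarm(
    AlarmName="error-budget-fast-burn",
    AlarmDescription="Error budget burn rate over the last hour is too high",
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "Custom/SLI",
                    "MetricName": "SuccessfulRequestCount",
                    "Dimensions": [{"Name": "Service", "Value": "payments-api"}],
                },
                "Period": 3600,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "m2",
            "MetricStat": {
                "Metric": {
                    "Namespace": "Custom/SLI",
                    "MetricName": "RequestCount",
                    "Dimensions": [{"Name": "Service", "Value": "payments-api"}],
                },
                "Period": 3600,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "burn_rate",
            "Expression": f"((m2 - m1) / m2) / {ALLOWED_ERROR_RATE}",
            "Label": "1h error budget burn rate",
            "ReturnData": True,
        },
    ],
    ComparisonOperator="GreaterThanThreshold",
    Threshold=14.4,          # ~2% of a 30-day budget consumed in one hour
    EvaluationPeriods=1,
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:error-budget-alerts"],  # hypothetical topic
)
```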

4. Integrating with AWS Lambda for Error Budget Reports

I wrote a Lambda function that ran weekly to calculate our remaining error budget and email it to all engineering leads:
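
Here's a simplified sketch of that kind of function: pull 30 days of the (illustrative) SLI metrics, compute the remaining budget, and publish a summary to an SNS topic the engineering leads subscribe to by email. The metric names and topic ARN are placeholders, and the weekly trigger would come from something like a scheduled EventBridge rule.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
sns = boto3.client("sns")

SLO = 99.95
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:error-budget-reports"  # hypothetical topic

def lambda_handler(event, context):
    now = datetime.datetime.now(datetime.timezone.utc)
    start = now - datetime.timedelta(days=30)

    def total(metric_name: str) -> float:
        """Sum a Custom/SLI counter over the trailing 30 days."""
        resp = cloudwatch.get_metric_data(
            MetricDataQueries=[{
                "Id": "q",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "Custom/SLI",
                        "MetricName": metric_name,
                        "Dimensions": [{"Name": "Service", "Value": "payments-api"}],
                    },
                    "Period": 86400,   # one datapoint per day
                    "Stat": "Sum",
                },
            }],
            StartTime=start,
            EndTime=now,
        )
        return sum(resp["MetricDataResults"][0]["Values"])

    successes = total("SuccessfulRequestCount")
    requests = total("RequestCount")

    success_rate = 100 * successes / requests if requests else 100.0
    consumed = (100 - success_rate) / (100 - SLO) * 100   # % of budget used
    remaining = max(0.0, 100 - consumed)

    message = (
        "30-day SLO report\n"
        f"Success rate: {success_rate:.3f}% (SLO {SLO}%)\n"
        f"Error budget consumed: {consumed:.1f}%\n"
        f"Error budget remaining: {remaining:.1f}%\n"
    )
    sns.publish(TopicArn=TOPIC_ARN, Subject="Weekly error budget report", Message=message)
    return {"remaining_budget_pct": remaining}
```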

Real-world Impact: How Error Budgets Changed Our Culture

The most profound change came from how error budgets shifted our engineering culture:

1. From Blame to Objective Discussion

Before error budgets, our incident retrospectives were tense, with product managers and engineers often pointing fingers. After implementing error budgets, our discussions became data-driven. I'll never forget a product manager saying "I see we've used 80% of our monthly error budget. Let's postpone the new checkout feature and fix those database timeouts first."

2. From Panic to Planned Responses

We established clear policies tied to our error budget consumption:

  • Budget Used < 50%: Full speed ahead on features

  • Budget Used 50-75%: Proceed with caution, high-risk changes require extra review

  • Budget Used 75-90%: One reliability improvement must be prioritized alongside any new feature

  • Budget Used > 90%: Feature freeze, focus exclusively on reliability

3. From Arbitrary Targets to Business Alignment

We aligned our SLOs with business metrics. For example, when analyzing our payment API data, we found that a 99.95% reliability target (rather than trying for 99.99%) gave us the optimal balance between engineering effort and user satisfaction. This saved us from over-engineering and allowed for quicker innovation.

My Hard-earned Lessons on AWS Error Budget Implementation

If you're implementing error budgets with AWS, here are some hard-won insights:

1. CloudWatch Metric Resolution Matters

Initially, we used the default 5-minute resolution for our metrics and couldn't understand why our error budget calculations were off. Switching to 1-minute resolution made a huge difference in accuracy, especially for detecting brief outages.
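
In CloudWatch terms, that mostly means asking for 60-second periods in dashboards, alarms, and queries (and making sure the metrics themselves are stored at one-minute or finer granularity). A minimal illustration, reusing the placeholder metric names:

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.now(datetime.timezone.utc)

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        "Id": "m2",
        "MetricStat": {
            "Metric": {
                "Namespace": "Custom/SLI",
                "MetricName": "RequestCount",
                "Dimensions": [{"Name": "Service", "Value": "payments-api"}],
            },
            # Was 300: 1-minute datapoints let a short outage show up as its own
            # bad datapoint instead of being diluted across a 5-minute bucket.
            "Period": 60,
            "Stat": "Sum",
        },
    }],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
)
```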

2. Don't Try to Monitor Everything

We initially tried to create SLOs for every microservice (we had over 30!). It was overwhelming. I learned to focus on customer journey-based SLOs instead. For example, we monitored the critical payment flow end-to-end rather than each individual service.
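
One concrete way to get a journey-level SLI is a small synthetic probe that exercises the flow end to end and publishes a single success metric; the endpoint, namespace, and metric names below are purely illustrative.

```python
import time
import urllib.error
import urllib.request

import boto3

cloudwatch = boto3.client("cloudwatch")

def probe_payment_flow() -> None:
    """Exercise the payment journey once and publish a single journey-level SLI."""
    start = time.monotonic()
    try:
        # Hypothetical health-check style endpoint for the end-to-end payment flow.
        with urllib.request.urlopen(
            "https://api.example.com/v1/payments/healthcheck", timeout=5
        ) as resp:
            success = resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        success = False

    cloudwatch.put_metric_data(
        Namespace="Custom/SLI",
        MetricData=[
            {
                "MetricName": "PaymentJourneySuccess",
                "Dimensions": [{"Name": "Journey", "Value": "checkout"}],
                "Value": 1.0 if success else 0.0,
                "Unit": "Count",
            },
            {
                "MetricName": "PaymentJourneyLatencySeconds",
                "Dimensions": [{"Name": "Journey", "Value": "checkout"}],
                "Value": time.monotonic() - start,
                "Unit": "Seconds",
            },
        ],
    )
```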

3. Use Multiple Time Windows

One month is too long to wait before detecting problems. We ended up using cascading time windows for our error budgets:

  • 1-hour window (very sensitive, used for immediate alerts)

  • 24-hour window (balanced sensitivity, used for daily standups)

  • 30-day window (the "official" error budget for planning)

This approach gave us early warnings while preventing alert fatigue.
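
A quick sketch of why the shorter windows are so much more sensitive: the same 0.05% budget shrinks to well under a minute of allowed downtime at the 1-hour and 24-hour scales.

```python
# How a 0.05% error budget (99.95% SLO) translates across the three windows.
SLO = 99.95
allowed_fraction = (100 - SLO) / 100  # 0.0005

windows_minutes = {"1 hour": 60, "24 hours": 24 * 60, "30 days": 30 * 24 * 60}

for name, minutes in windows_minutes.items():
    print(f"{name:>8}: {minutes * allowed_fraction:.2f} minutes of full downtime allowed")

# ->   1 hour: 0.03 minutes
# -> 24 hours: 0.72 minutes
# ->  30 days: 21.60 minutes
```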

Creating an Error Budget Policy That Actually Works

The key to success was creating a clear error budget policy document. Here's a simplified version of what worked for us:
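
At its core, the policy mapped the consumption thresholds listed earlier to agreed actions. Encoding that mapping (a minimal sketch, with the thresholds and wording taken from the list above) also makes it easy for tooling like the weekly report to attach the right action automatically:

```python
def policy_action(budget_used_pct: float) -> str:
    """Map error budget consumption to the engineering response agreed in the policy."""
    if budget_used_pct < 50:
        return "Full speed ahead on features"
    if budget_used_pct < 75:
        return "Proceed with caution; high-risk changes require extra review"
    if budget_used_pct < 90:
        return "Prioritize one reliability improvement alongside any new feature"
    return "Feature freeze; focus exclusively on reliability"

# Example: 80% of the budget consumed -> reliability work ships alongside any new feature.
print(policy_action(80.0))
```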

Conclusion: Error Budgets Transform SRE Culture

Looking back, introducing error budgets was the single most important change I made as an SRE lead. It translated reliability from a vague goal into a concrete metric that both technical and non-technical stakeholders could understand and act upon.

On AWS, we were able to implement a sophisticated error budget system using native tools like CloudWatch, Lambda, and SNS. This gave us real-time visibility into our reliability posture without expensive third-party solutions.

The most rewarding moment came six months after implementation, when I overheard our product manager tell a new hire: "Before we plan this feature, let's check our error budget to see if we have the reliability headroom for it." That's when I knew the cultural change had truly taken hold.

If you're considering implementing error budgets in your organization, start simple, focus on metrics that matter to your customers, and use the tools AWS provides to automate as much as possible. Your future self (and your on-call engineers) will thank you!
