Part 2: SLIs, SLOs, and SLAs - Building a Reliability Framework
What You'll Learn: This article shares my journey from vague "99.9% uptime" promises to building a meaningful reliability framework. You'll learn how to choose the right Service Level Indicators (SLIs) for your Go applications, set realistic Service Level Objectives (SLOs), understand Service Level Agreements (SLAs), and implement error budgets that guide engineering decisions. By the end, you'll have concrete methods to measure and communicate reliability.
When "It Works on My Machine" Isn't Good Enough
Three months after launching my Go-based URL shortener service, I got a frustrated message from a friend: "Your service is so slow today!" I checked my basic monitoring dashboard - CPU usage looked fine, memory was normal, no errors in the logs. According to my metrics, everything was "working."
But clearly, it wasn't working well enough for my users.
The problem? I was measuring server health, not user experience. I had no way to answer basic questions like:
How fast should my API respond?
What error rate is acceptable?
When should I wake up at 3 AM, and when can it wait until tomorrow?
That week, I learned about Service Level Indicators, Objectives, and Agreements - the foundation of reliability engineering. This framework gave me a way to define "reliable" in measurable terms and make data-driven decisions about my service.
The Reliability Triangle: SLIs, SLOs, and SLAs
Let me explain these concepts through my URL shortener service, which I'll call goto.link.
Service Level Indicators (SLIs)
SLIs are the metrics that matter to your users. They're not server metrics like CPU usage - they're measurements of the service from the user's perspective.
For my URL shortener, I identified these key SLIs:
Availability: Can users access the service?
Latency: How fast does it respond?
Error Rate: How often do requests fail?
Here's the critical insight I learned: Choose SLIs that directly impact user experience. Server CPU at 80% doesn't matter if users are getting fast responses. Response time over 500ms matters even if CPU is at 20%.
Service Level Objectives (SLOs)
SLOs are targets for your SLIs. They define what "good enough" means for your service.
For goto.link, I set these SLOs:
Availability: 99.9% of requests should succeed (measured over 30 days)
Latency: 95% of requests should complete within 200ms
Error Rate: Less than 0.1% of requests should return 5xx errors
These weren't random numbers. I based them on:
Analysis of my actual traffic patterns
What my users told me was acceptable
The cost/effort required to achieve higher reliability
Service Level Agreements (SLAs)
SLAs are promises to your users with consequences if you break them. They're typically lower than your internal SLOs to give you a safety buffer.
For my free URL shortener, I don't have formal SLAs with financial consequences. But if I were running a paid service, my SLA might be:
99.5% availability (lower than my 99.9% SLO)
If I breach this, users get service credits
The gap between SLO (99.9%) and SLA (99.5%) is my safety buffer. I can miss my internal target without breaking promises to users.
The Relationship
SLIs are the measurements, SLOs are the internal targets you set on those measurements, and SLAs are the external promises backed by consequences. Each SLA should sit behind a stricter internal SLO, which in turn is computed from one or more SLIs.
Implementing SLIs in Go
Let me show you how I measure SLIs in my Go applications. I'll use my URL shortener as the example.
1. Availability SLI
Availability is the percentage of successful requests over total requests.
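To get that data, I count every request with a Prometheus counter, labeled by status code. Here's a minimal sketch using the Prometheus Go client (github.com/prometheus/client_golang); the metric and label names are my own conventions, and the full file appears in the middleware section below:

```go
// Excerpt (uses the prometheus and promauto packages from client_golang).
// requestsTotal counts every request, labeled by status code so successes
// and failures can be separated when computing availability.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests by method, path, and status code.",
	},
	[]string{"method", "path", "status"},
)
```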
In Prometheus, I can then calculate availability:
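With that counter in place, a query along these lines gives availability over the SLO window, treating anything that isn't a 5xx as a success:

```promql
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
```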
2. Latency SLI
For latency, I use histograms to track the distribution of response times.
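The histogram looks like this (same caveats as above: names and bucket boundaries are illustrative, chosen so my 200ms and 500ms targets fall on bucket edges):

```go
// Excerpt. requestDuration tracks the response-time distribution; the 0.2s
// and 0.5s buckets line up with the P95 and P99 SLO thresholds.
var requestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency in seconds.",
		Buckets: []float64{0.025, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2.5},
	},
	[]string{"method", "path"},
)
```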
PromQL to check if we're meeting our 95th percentile latency SLO:
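A query of roughly this shape does it; the comparison is against 0.2 because the histogram is recorded in seconds:

```promql
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
) <= 0.2
```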
3. Complete Middleware Implementation
Here's how I integrate SLI recording into my HTTP middleware:
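A sketch of the whole thing in one file, assuming the Prometheus Go client; metric names, labels, and buckets match the excerpts above and are illustrative rather than the exact code I run:

```go
// Package sli records request-level SLIs (availability and latency).
package sli

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	requestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests by method, path, and status code.",
		},
		[]string{"method", "path", "status"},
	)
	requestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency in seconds.",
			Buckets: []float64{0.025, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2.5},
		},
		[]string{"method", "path"},
	)
)

// statusRecorder captures the status code the wrapped handler writes.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Middleware records one counter increment and one latency observation per
// request - the raw data both the availability and latency SLIs need.
// Note: labeling by raw URL path can explode cardinality for a URL shortener;
// in practice, normalize to a route pattern like "/:code".
func Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()

		next.ServeHTTP(rec, req)

		requestsTotal.WithLabelValues(req.Method, req.URL.Path, strconv.Itoa(rec.status)).Inc()
		requestDuration.WithLabelValues(req.Method, req.URL.Path).Observe(time.Since(start).Seconds())
	})
}
```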
4. Application Integration
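Wiring it up is just a matter of wrapping the router and exposing /metrics. A sketch assuming the sli package above lives at a hypothetical module path and that promhttp serves the scrape endpoint:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"

	// Hypothetical import path; sli is the middleware package sketched above.
	"example.com/gotolink/internal/sli"
)

func redirectHandler(w http.ResponseWriter, r *http.Request) {
	// Placeholder for the real short-code lookup and redirect.
	http.Redirect(w, r, "https://example.com", http.StatusFound)
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler()) // Prometheus scrape endpoint
	mux.HandleFunc("/", redirectHandler)       // the URL shortener itself

	// Wrap the whole mux so every user-facing request is measured.
	log.Fatal(http.ListenAndServe(":8080", sli.Middleware(mux)))
}
```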
Defining Your SLOs: The Process I Follow
When I start a new service, here's my process for defining SLOs:
Step 1: Understand Your Users
For my URL shortener, I asked:
Who uses this service? (Friends, family, my blog readers)
What do they expect? (Fast redirects, high availability)
What's their tolerance for downtime? (A few minutes is fine, hours is not)
Step 2: Measure Current Performance
Before setting targets, I ran my service for a month and measured actual performance:
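Queries along these lines (against the metrics from the middleware above) told me where I actually stood:

```promql
# Availability actually delivered over the last 30 days
sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))

# Where P95 and P99 latency actually sit
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[30d])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[30d])) by (le))
```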
Step 3: Set Realistic SLOs
I don't aim for perfection. Instead, I ask: "What's good enough?"
My SLOs for goto.link:
| SLI | Target | Why this target |
| --- | --- | --- |
| Availability | 99.9% | ~43 min of downtime per month is acceptable for a free service |
| P95 Latency | 200ms | Fast enough for good UX, achievable with current architecture |
| P99 Latency | 500ms | Handles outliers, allows for occasional slow requests |
| Error Rate | < 0.1% | Most users won't encounter errors |
Key insight: I didn't pick 99.99% availability because:
It would require significant infrastructure investment
My users don't need that level of reliability
It would slow down feature development
Step 4: Define the Error Budget
The error budget is the difference between 100% and your SLO.
For 99.9% availability:
Error budget: 0.1% = 43.2 minutes of downtime per month
That's my allowance for failures, deployments, experiments
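The 43.2-minute figure falls straight out of the window length:

```
error budget = (1 - 0.999) × 30 days
             = 0.001 × 43,200 minutes
             = 43.2 minutes of downtime per month
```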
Using Error Budgets to Make Decisions
Error budgets changed how I make engineering decisions. Here's how I use them:
Scenario 1: Plenty of Error Budget Remaining
Decision: I have budget to spare! I can:
Deploy new features more aggressively
Experiment with new infrastructure
Take calculated risks
Scenario 2: Error Budget Nearly Exhausted
Decision: Slow down! I should:
Freeze feature deployments
Focus on reliability improvements
Review recent incidents
Only deploy critical bug fixes
Error Budget Policy I Created
I documented this policy for my projects:
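As a sketch of what such a policy can look like (the thresholds below are illustrative, not a quote of my actual document):

```
Error Budget Policy - 30-day rolling window, 99.9% availability SLO

  > 50% budget remaining   normal operations; ship features freely
  25-50% remaining         deploys allowed, but risky changes get extra review
  10-25% remaining         feature freeze; reliability work only
  < 10% remaining          only critical fixes ship; write up what burned the budget
  budget exhausted         all non-emergency changes stop until the window recovers
```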
Tracking SLOs with Prometheus and Grafana
I built a dashboard to track my SLOs in real time.
Prometheus Recording Rules
First, I create recording rules to pre-calculate SLI values:
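A sketch of the rules, assuming the http_requests_total and http_request_duration_seconds metrics from earlier; the rule names follow the common level:metric:operation convention but are otherwise my own choice:

```yaml
groups:
  - name: slo_recording_rules
    rules:
      # Availability SLI: share of non-5xx requests over the last 5 minutes.
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # Latency SLI: P95 response time over the last 5 minutes.
      - record: sli:latency:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )
```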
Alerting Rules
I alert when I'm burning through error budget too quickly:
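A sketch of a fast-burn alert built on the recording rule above. The 14.4 multiplier is the commonly used threshold at which a 30-day, 99.9% budget would be gone in roughly two days; tune it to your own tolerance:

```yaml
groups:
  - name: slo_alerts
    rules:
      - alert: ErrorBudgetFastBurn
        # Error rate is 14.4x the level that would exactly spend the 0.1%
        # budget over 30 days, i.e. the whole budget disappears in ~2 days.
        expr: (1 - sli:availability:ratio_rate5m) > 14.4 * 0.001
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "goto.link is burning error budget too fast"
```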
Grafana Dashboard
Here's the JSON for my SLO dashboard panel:
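A trimmed-down sketch of the kind of stat panel I mean; the exact JSON shape varies between Grafana versions, so treat this as illustrative:

```json
{
  "title": "Availability SLI (5m) vs 99.9% SLO",
  "type": "stat",
  "datasource": "Prometheus",
  "targets": [
    { "refId": "A", "expr": "sli:availability:ratio_rate5m" }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "red", "value": null },
          { "color": "green", "value": 0.999 }
        ]
      }
    }
  }
}
```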
Real-World Example: When I Broke My SLO
In March, I deployed a new feature to my URL shortener that cached redirect URLs in Redis. Within 2 hours, an error-budget burn-rate alert fired.
I checked my logs and found the Redis connection pool was exhausted, causing 5xx errors. I had two choices:
Roll back immediately - Restore availability
Debug in production - Risk consuming more error budget
Because I was burning 5% per hour and only had 40% budget remaining, I rolled back. The decision was easy because I had the data.
After the rollback:
Availability recovered to 99.95%
I debugged the issue locally
I fixed the Redis connection pool settings (sketched below)
I re-deployed with proper load testing
Total error budget used: 8% (about 3.5 minutes of downtime). Remaining budget: 32% (still okay for the month).
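For reference, the fix was the kind of change sketched below, assuming the go-redis client (github.com/redis/go-redis/v9); the numbers are illustrative, sized to the service's peak concurrency rather than copied defaults:

```go
package shortener // hypothetical package name

import (
	"time"

	"github.com/redis/go-redis/v9"
)

// newRedisClient shows the pool settings that mattered; values are illustrative.
func newRedisClient() *redis.Client {
	return redis.NewClient(&redis.Options{
		Addr:         "localhost:6379",
		PoolSize:     50,                     // cap sized to peak concurrent requests
		MinIdleConns: 10,                     // keep warm connections ready for bursts
		PoolTimeout:  500 * time.Millisecond, // fail fast instead of queueing callers forever
	})
}
```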
Common Mistakes I Made (So You Don't Have To)
Mistake 1: Setting SLOs Too High
My first attempt: "99.99% availability!"
Result: I spent all my time on reliability and barely shipped features. I burned out.
Lesson: Choose SLOs that match your users' needs, not your ego.
Mistake 2: Measuring the Wrong Things
Initially, I measured "server availability" (is the process running?).
Result: The server was "available" but users were experiencing 10-second response times.
Lesson: Measure user experience, not server health.
Mistake 3: Not Using Error Budgets
I had SLOs but didn't track error budgets.
Result: I had no framework for deciding when to ship vs. when to focus on reliability.
Lesson: Error budgets turn reliability into a currency that guides decisions.
Mistake 4: Too Many SLIs
I tried to track 15 different SLIs.
Result: Analysis paralysis. I couldn't figure out what mattered.
Lesson: Start with 2-4 critical SLIs. You can always add more.
Choosing Your SLIs: A Decision Framework
Not sure which SLIs to track? Here's my framework:
For API Services (like my URL shortener)
Must Have:
1. Availability: % of successful requests
2. Latency: P95 or P99 response time
Nice to Have:
3. Error Rate: % of requests returning errors
4. Throughput: Requests per second (for capacity planning)
For Batch Processing Services
Must Have:
1. Success Rate: % of jobs completing successfully
2. Processing Time: Time to complete a job
Nice to Have:
3. Freshness: How old is the oldest unprocessed item?
4. Queue Depth: How many items are waiting to be processed?
For Data Pipeline Services
Must Have:
1. Data Freshness: How old is the latest data?
2. Completeness: % of expected data received
Nice to Have:
3. Processing Latency: Time from data arrival to processing
4. Error Rate: % of failed processing attempts
Key Takeaways
After implementing SLIs, SLOs, and error budgets across my Go services:
SLIs must measure user experience, not server health. If users are happy but CPU is high, that's fine. If CPU is perfect but users are frustrated, that's a problem.
SLOs should be realistic, not aspirational. Don't aim for 99.99% if you can't sustain it or don't need it.
Error budgets are powerful because they convert reliability into a currency. "We have 30% budget remaining" is more actionable than "uptime is good."
Start simple with 2-3 critical SLIs. You can always add more as you mature.
Document your policies around error budgets so everyone knows what happens at different thresholds.
What's Next
Now that you have a reliability framework with SLIs, SLOs, and error budgets, the next challenge is observability - actually seeing what's happening in your systems.
In Part 3, we'll dive deep into:
The four golden signals of monitoring
Building comprehensive observability with Prometheus, logs, and traces
Creating dashboards that help during incidents
Distributed tracing with OpenTelemetry
Resources
Google's Art of SLOs - Comprehensive guide to SLOs
SRE Book - Service Level Objectives - Chapter 4 of the SRE book
Prometheus Best Practices - Metrics and monitoring best practices
SLO Generator - Tool to compute SLO metrics
Conclusion
Before learning about SLIs, SLOs, and SLAs, I had no framework for answering "how reliable should my service be?" Now I have:
Clear metrics that represent user experience (SLIs)
Concrete targets for reliability (SLOs)
A decision-making framework based on error budgets
Data-driven conversations about reliability vs. velocity
The framework isn't perfect, and my SLOs evolve as I learn more about my users. But having this structure transformed me from guessing about reliability to measuring and improving it systematically.
Start with one service. Pick 2-3 SLIs. Set SLOs slightly below your current performance. Track your error budget. You'll be amazed at how much clarity this brings.