Part 2: SLIs, SLOs, and SLAs - Building a Reliability Framework

What You'll Learn: This article shares my journey from vague "99.9% uptime" promises to building a meaningful reliability framework. You'll learn how to choose the right Service Level Indicators (SLIs) for your Go applications, set realistic Service Level Objectives (SLOs), understand Service Level Agreements (SLAs), and implement error budgets that guide engineering decisions. By the end, you'll have concrete methods to measure and communicate reliability.

When "It Works on My Machine" Isn't Good Enough

Three months after launching my Go-based URL shortener service, I got a frustrated message from a friend: "Your service is so slow today!" I checked my basic monitoring dashboard - CPU usage looked fine, memory was normal, no errors in the logs. According to my metrics, everything was "working."

But clearly, it wasn't working well enough for my users.

The problem? I was measuring server health, not user experience. I had no way to answer basic questions like:

  • How fast should my API respond?

  • What error rate is acceptable?

  • When should an issue wake me up at 3 AM, and when can it wait until morning?

That week, I learned about Service Level Indicators, Objectives, and Agreements - the foundation of reliability engineering. This framework gave me a way to define "reliable" in measurable terms and make data-driven decisions about my service.

The Reliability Triangle: SLIs, SLOs, and SLAs

Let me explain these concepts through my URL shortener service, which I'll call goto.link.

Service Level Indicators (SLIs)

SLIs are the metrics that matter to your users. They're not server metrics like CPU usage - they're measurements of the service from the user's perspective.

For my URL shortener, I identified these key SLIs:

  1. Availability: Can users access the service?

  2. Latency: How fast does it respond?

  3. Error Rate: How often do requests fail?

Here's the critical insight I learned: Choose SLIs that directly impact user experience. Server CPU at 80% doesn't matter if users are getting fast responses. Response time over 500ms matters even if CPU is at 20%.

Service Level Objectives (SLOs)

SLOs are targets for your SLIs. They define what "good enough" means for your service.

For goto.link, I set these SLOs:

  1. Availability: 99.9% of requests should succeed (measured over 30 days)

  2. Latency: 95% of requests should complete within 200ms

  3. Error Rate: Less than 0.1% of requests should return 5xx errors

These weren't random numbers. I based them on:

  • Analysis of my actual traffic patterns

  • What my users told me was acceptable

  • The cost/effort required to achieve higher reliability

Service Level Agreements (SLAs)

SLAs are promises to your users with consequences if you break them. They're typically lower than your internal SLOs to give you a safety buffer.

For my free URL shortener, I don't have formal SLAs with financial consequences. But if I were running a paid service, my SLA might be:

  • 99.5% availability (lower than my 99.9% SLO)

  • If I breach this, users get service credits

The gap between SLO (99.9%) and SLA (99.5%) is my safety buffer. I can miss my internal target without breaking promises to users.

The Relationship

In short: SLIs are what you measure, SLOs are the internal targets you set on those measurements, and SLAs are the external promises you make, deliberately set below your SLOs so you can miss an internal target without breaking a contract.

Implementing SLIs in Go

Let me show you how I measure SLIs in my Go applications. I'll use my URL shortener as the example.

1. Availability SLI

Availability is the percentage of successful requests over total requests.
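
Instrumenting this starts with a counter that carries the status code as a label. A minimal sketch using prometheus/client_golang (the metric and label names are my own conventions, not anything the library prescribes):

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// httpRequestsTotal counts every request, labeled by method, route,
// and status code so successes can be separated from failures.
var httpRequestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests by method, route, and status code.",
	},
	[]string{"method", "route", "code"},
)
```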

In Prometheus, I can then calculate availability:
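
Something along these lines, assuming the counter sketched above:

```promql
# Fraction of non-5xx requests over the last 30 days
sum(rate(http_requests_total{code!~"5.."}[30d]))
  /
sum(rate(http_requests_total[30d]))
```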

2. Latency SLI

For latency, I use histograms to track the distribution of response times.
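
A sketch, extending the snippet above; the buckets are ones I picked to bracket my 200ms and 500ms targets, not library defaults:

```go
// httpRequestDuration tracks the response-time distribution per route.
// Buckets bracket the 200ms (P95) and 500ms (P99) targets.
var httpRequestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency in seconds.",
		Buckets: []float64{0.025, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2.5},
	},
	[]string{"method", "route"},
)
```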

PromQL to check if we're meeting our 95th percentile latency SLO:
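
Again assuming the histogram above:

```promql
# P95 latency over the last 5 minutes; the SLO says this should stay below 0.2
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```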

3. Complete Middleware Implementation

Here's how I integrate SLI recording into my HTTP middleware:
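
A condensed sketch of that middleware, building on the two metrics above (add net/http, strconv, and time to the imports):

```go
// statusRecorder wraps http.ResponseWriter to capture the status code,
// which the standard interface doesn't expose after the fact.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// sliMiddleware records one counter increment and one latency
// observation per request.
func sliMiddleware(route string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

		next.ServeHTTP(rec, req)

		httpRequestsTotal.
			WithLabelValues(req.Method, route, strconv.Itoa(rec.status)).
			Inc()
		httpRequestDuration.
			WithLabelValues(req.Method, route).
			Observe(time.Since(start).Seconds())
	})
}
```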

4. Application Integration
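
Wiring it together is then just wrapping each handler and exposing /metrics for Prometheus to scrape. A minimal sketch, continuing the same file (log and github.com/prometheus/client_golang/prometheus/promhttp join the imports; the redirect handler is a stand-in):

```go
func main() {
	mux := http.NewServeMux()

	// Stand-in for the real redirect handler.
	redirect := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "https://example.com", http.StatusFound)
	})

	// Every request through the middleware feeds the SLI metrics.
	mux.Handle("/", sliMiddleware("redirect", redirect))

	// Expose the metrics endpoint for Prometheus to scrape.
	mux.Handle("/metrics", promhttp.Handler())

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```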

Defining Your SLOs: The Process I Follow

When I start a new service, here's my process for defining SLOs:

Step 1: Understand Your Users

For my URL shortener, I asked:

  • Who uses this service? (Friends, family, my blog readers)

  • What do they expect? (Fast redirects, high availability)

  • What's their tolerance for downtime? (A few minutes is fine, hours is not)

Step 2: Measure Current Performance

Before setting targets, I ran my service for a month and measured actual performance:
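
The measurements came from the SLI queries above, evaluated over a 30-day window, something like:

```promql
# Baseline availability over 30 days
sum(rate(http_requests_total{code!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))

# Baseline P95 latency over 30 days
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[30d])))
```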

Step 3: Set Realistic SLOs

I don't aim for perfection. Instead, I ask: "What's good enough?"

My SLOs for goto.link:

| SLI | SLO Target | Why This Number |
| --- | --- | --- |
| Availability | 99.9% | ~43 min downtime/month is acceptable for a free service |
| P95 Latency | 200ms | Fast enough for good UX, achievable with current architecture |
| P99 Latency | 500ms | Handles outliers, allows for occasional slow requests |
| Error Rate | < 0.1% | Most users won't encounter errors |

Key insight: I didn't pick 99.99% availability because:

  1. It would require significant infrastructure investment

  2. My users don't need that level of reliability

  3. It would slow down feature development

Step 4: Define the Error Budget

The error budget is the difference between 100% and your SLO.

For 99.9% availability:

  • Error budget: 0.1% = 43.2 minutes of downtime per month (worked out in the snippet below)

  • That's my allowance for failures, deployments, experiments
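
Here's that arithmetic as a quick Go sanity check (assuming a 30-day month):

```go
package main

import (
	"fmt"
	"time"
)

// downtimeBudget returns the allowed downtime per window for a given SLO.
func downtimeBudget(slo float64, window time.Duration) time.Duration {
	return time.Duration((1 - slo) * float64(window))
}

func main() {
	month := 30 * 24 * time.Hour
	fmt.Println(downtimeBudget(0.999, month)) // prints 43m12s, i.e. 43.2 minutes
}
```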

Using Error Budgets to Make Decisions

Error budgets changed how I make engineering decisions. Here's how I use them:

Scenario 1: Plenty of Error Budget Remaining

Decision: I have budget to spare! I can:

  • Deploy new features more aggressively

  • Experiment with new infrastructure

  • Take calculated risks

Scenario 2: Error Budget Nearly Exhausted

Decision: Slow down! I should:

  • Freeze feature deployments

  • Focus on reliability improvements

  • Review recent incidents

  • Only deploy critical bug fixes

Error Budget Policy I Created

I documented this policy for my projects:
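
In broad strokes it looks like this (the exact thresholds here are illustrative, not a verbatim copy):

  • More than 50% of budget remaining: normal operations; ship features freely

  • 25-50% remaining: deploy carefully; review recent incidents before taking new risks

  • Less than 25% remaining: feature freeze; reliability work and critical fixes only

  • Budget exhausted: no deploys except emergency fixes until the window resets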

Tracking SLOs with Prometheus and Grafana

I built a dashboard to track my SLOs in real time.

Prometheus Recording Rules

First, I create recording rules to pre-calculate SLI values:
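
A sketch of the rules file; the sli: naming scheme is my own convention:

```yaml
groups:
  - name: slo_recording_rules
    interval: 1m
    rules:
      # 5-minute availability SLI: fraction of non-5xx requests
      - record: sli:availability:ratio_5m
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # 5-minute P95 latency SLI
      - record: sli:latency:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```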

Alerting Rules

I alert when I'm burning through error budget too quickly:
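
A sketch of a fast-burn alert against the recording rule above; the 14.4x multiplier is the standard fast-burn threshold for a 99.9% SLO, and the alert name is mine:

```yaml
groups:
  - name: slo_alerts
    rules:
      # With a 99.9% SLO the allowed error rate is 0.1%; a sustained error
      # rate of 14.4x that would exhaust a month's budget in about two days.
      - alert: ErrorBudgetFastBurn
        expr: (1 - sli:availability:ratio_5m) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Burning error budget too fast (error rate {{ $value | humanizePercentage }})"
```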

Grafana Dashboard

Here's the JSON for my SLO dashboard panel:
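
Trimmed to the essentials here (a real exported panel carries many more fields); a stat panel for 30-day availability:

```json
{
  "type": "stat",
  "title": "Availability (30d)",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{code!~\"5..\"}[30d])) / sum(rate(http_requests_total[30d]))"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "steps": [
          { "color": "red", "value": null },
          { "color": "green", "value": 0.999 }
        ]
      }
    }
  }
}
```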

Real-World Example: When I Broke My SLO

In March, I deployed a new feature to my URL shortener that cached redirect URLs in Redis. Within 2 hours, an error-budget burn-rate alert fired.

I checked my logs and found the Redis connection pool was exhausted, causing 5xx errors. I had two choices:

  1. Roll back immediately - Restore availability

  2. Debug in production - Risk consuming more error budget

Because I was burning 5% per hour and only had 40% budget remaining, I rolled back. The decision was easy because I had the data.

After the rollback:

  • Availability recovered to 99.95%

  • I debugged locally

  • Fixed the connection pool settings

  • Re-deployed with proper load testing

Total error budget used: 8% (about 3.5 minutes of downtime)
Remaining budget: 32% (still okay for the month)

Common Mistakes I Made (So You Don't Have To)

Mistake 1: Setting SLOs Too High

My first attempt: "99.99% availability!"

Result: I spent all my time on reliability and barely shipped features. I burned out.

Lesson: Choose SLOs that match your users' needs, not your ego.

Mistake 2: Measuring the Wrong Things

Initially, I measured "server availability" (is the process running?).

Result: The server was "available" but users were experiencing 10-second response times.

Lesson: Measure user experience, not server health.

Mistake 3: Not Using Error Budgets

I had SLOs but didn't track error budgets.

Result: I had no framework for deciding when to ship vs. when to focus on reliability.

Lesson: Error budgets turn reliability into a currency that guides decisions.

Mistake 4: Too Many SLIs

I tried to track 15 different SLIs.

Result: Analysis paralysis. I couldn't figure out what mattered.

Lesson: Start with 2-4 critical SLIs. You can always add more.

Choosing Your SLIs: A Decision Framework

Not sure which SLIs to track? Here's my framework:

For API Services (like my URL shortener)

Must Have:

  1. Availability: % of successful requests

  2. Latency: P95 or P99 response time

Nice to Have:

  3. Error Rate: % of requests returning errors

  4. Throughput: Requests per second (for capacity planning)

For Batch Processing Services

Must Have:

  1. Success Rate: % of jobs completing successfully

  2. Processing Time: Time to complete a job

Nice to Have:

  3. Freshness: How old is the oldest unprocessed item?

  4. Queue Depth: How many items are waiting to be processed?

For Data Pipeline Services

Must Have:

  1. Data Freshness: How old is the latest data?

  2. Completeness: % of expected data received

Nice to Have:

  3. Processing Latency: Time from data arrival to processing

  4. Error Rate: % of failed processing attempts

Key Takeaways

After implementing SLIs, SLOs, and error budgets across my Go services:

  1. SLIs must measure user experience, not server health. If users are happy but CPU is high, that's fine. If CPU is perfect but users are frustrated, that's a problem.

  2. SLOs should be realistic, not aspirational. Don't aim for 99.99% if you can't sustain it or don't need it.

  3. Error budgets are powerful because they convert reliability into a currency. "We have 30% budget remaining" is more actionable than "uptime is good."

  4. Start simple with 2-3 critical SLIs. You can always add more as you mature.

  5. Document your policies around error budgets so everyone knows what happens at different thresholds.

What's Next

Now that you have a reliability framework with SLIs, SLOs, and error budgets, the next challenge is observability - actually seeing what's happening in your systems.

In Part 3, we'll dive deep into:

  • The four golden signals of monitoring

  • Building comprehensive observability with Prometheus, logs, and traces

  • Creating dashboards that help during incidents

  • Distributed tracing with OpenTelemetry

Conclusion

Before learning about SLIs, SLOs, and SLAs, I had no framework for answering "how reliable should my service be?" Now I have:

  • Clear metrics that represent user experience (SLIs)

  • Concrete targets for reliability (SLOs)

  • A decision-making framework based on error budgets

  • Data-driven conversations about reliability vs. velocity

The framework isn't perfect, and my SLOs evolve as I learn more about my users. But having this structure transformed me from guessing about reliability to measuring and improving it systematically.

Start with one service. Pick 2-3 SLIs. Set SLOs slightly below your current performance. Track your error budget. You'll be amazed at how much clarity this brings.
