Part 2: SLIs, SLOs, and SLAs - Building a Reliability Framework

What You'll Learn: This article shares my journey from vague "99.9% uptime" promises to building a meaningful reliability framework. You'll learn how to choose the right Service Level Indicators (SLIs) for your Go applications, set realistic Service Level Objectives (SLOs), understand Service Level Agreements (SLAs), and implement error budgets that guide engineering decisions. By the end, you'll have concrete methods to measure and communicate reliability.

When "It Works on My Machine" Isn't Good Enough

Three months after launching my Go-based URL shortener service, I got a frustrated message from a friend: "Your service is so slow today!" I checked my basic monitoring dashboard - CPU usage looked fine, memory was normal, no errors in the logs. According to my metrics, everything was "working."

But clearly, it wasn't working well enough for my users.

The problem? I was measuring server health, not user experience. I had no way to answer basic questions like:

  • How fast should my API respond?

  • What error rate is acceptable?

  • When should an issue wake me up at 3 AM, and when can it wait until morning?

That week, I learned about Service Level Indicators, Objectives, and Agreements - the foundation of reliability engineering. This framework gave me a way to define "reliable" in measurable terms and make data-driven decisions about my service.

The Reliability Triangle: SLIs, SLOs, and SLAs

Let me explain these concepts through my URL shortener service, which I'll call goto.link.

Service Level Indicators (SLIs)

SLIs are the metrics that matter to your users. They're not server metrics like CPU usage - they're measurements of the service from the user's perspective.

For my URL shortener, I identified these key SLIs:

  1. Availability: Can users access the service?

  2. Latency: How fast does it respond?

  3. Error Rate: How often do requests fail?

Here's the critical insight I learned: Choose SLIs that directly impact user experience. Server CPU at 80% doesn't matter if users are getting fast responses. Response time over 500ms matters even if CPU is at 20%.

Service Level Objectives (SLOs)

SLOs are targets for your SLIs. They define what "good enough" means for your service.

For goto.link, I set these SLOs:

  1. Availability: 99.9% of requests should succeed (measured over 30 days)

  2. Latency: 95% of requests should complete within 200ms

  3. Error Rate: Less than 0.1% of requests should return 5xx errors

These weren't random numbers. I based them on:

  • Analysis of my actual traffic patterns

  • What my users told me was acceptable

  • The cost/effort required to achieve higher reliability

Service Level Agreements (SLAs)

SLAs are promises to your users with consequences if you break them. They're typically lower than your internal SLOs to give you a safety buffer.

For my free URL shortener, I don't have formal SLAs with financial consequences. But if I were running a paid service, my SLA might be:

  • 99.5% availability (lower than my 99.9% SLO)

  • If I breach this, users get service credits

The gap between SLO (99.9%) and SLA (99.5%) is my safety buffer. I can miss my internal target without breaking promises to users.

The Relationship

In short: SLIs are what you measure, SLOs are the internal targets you set on those measurements, and SLAs are the external promises you make, deliberately set below your SLOs so you can miss an internal target without breaking a contract.

Implementing SLIs in Go

Let me show you how I measure SLIs in my Go applications. I'll use my URL shortener as the example.

1. Availability SLI

Availability is the percentage of successful requests over total requests.
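
Instrumenting this starts with a counter that carries the status code as a label. A minimal sketch using prometheus/client_golang (the metric and label names are my own conventions, not anything the library prescribes):

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// httpRequestsTotal counts every request, labeled by method, route,
// and status code so successes can be separated from failures.
var httpRequestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests by method, route, and status code.",
	},
	[]string{"method", "route", "code"},
)
```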

In Prometheus, I can then calculate availability:
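
Something along these lines, assuming the counter sketched above:

```promql
# Fraction of non-5xx requests over the last 30 days
sum(rate(http_requests_total{code!~"5.."}[30d]))
  /
sum(rate(http_requests_total[30d]))
```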

2. Latency SLI

For latency, I use histograms to track the distribution of response times.
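
A sketch, extending the snippet above; the buckets are ones I picked to bracket my 200ms and 500ms targets, not library defaults:

```go
// httpRequestDuration tracks the response-time distribution per route.
// Buckets bracket the 200ms (P95) and 500ms (P99) targets.
var httpRequestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency in seconds.",
		Buckets: []float64{0.025, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2.5},
	},
	[]string{"method", "route"},
)
```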

PromQL to check if we're meeting our 95th percentile latency SLO:
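
Again assuming the histogram above:

```promql
# P95 latency over the last 5 minutes; the SLO says this should stay below 0.2
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```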

3. Complete Middleware Implementation

Here's how I integrate SLI recording into my HTTP middleware:
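
A condensed sketch of that middleware, building on the two metrics above (add net/http, strconv, and time to the imports):

```go
// statusRecorder wraps http.ResponseWriter to capture the status code,
// which the standard interface doesn't expose after the fact.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// sliMiddleware records one counter increment and one latency
// observation per request.
func sliMiddleware(route string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

		next.ServeHTTP(rec, req)

		httpRequestsTotal.
			WithLabelValues(req.Method, route, strconv.Itoa(rec.status)).
			Inc()
		httpRequestDuration.
			WithLabelValues(req.Method, route).
			Observe(time.Since(start).Seconds())
	})
}
```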

4. Application Integration
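
Wiring it together is then just wrapping each handler and exposing /metrics for Prometheus to scrape. A minimal sketch, continuing the same file (log and github.com/prometheus/client_golang/prometheus/promhttp join the imports; the redirect handler is a stand-in):

```go
func main() {
	mux := http.NewServeMux()

	// Stand-in for the real redirect handler.
	redirect := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "https://example.com", http.StatusFound)
	})

	// Every request through the middleware feeds the SLI metrics.
	mux.Handle("/", sliMiddleware("redirect", redirect))

	// Expose the metrics endpoint for Prometheus to scrape.
	mux.Handle("/metrics", promhttp.Handler())

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```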

Defining Your SLOs: The Process I Follow

When I start a new service, here's my process for defining SLOs:

Step 1: Understand Your Users

For my URL shortener, I asked:

  • Who uses this service? (Friends, family, my blog readers)

  • What do they expect? (Fast redirects, high availability)

  • What's their tolerance for downtime? (A few minutes is fine, hours is not)

Step 2: Measure Current Performance

Before setting targets, I ran my service for a month and measured actual performance:
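
The measurements came from the SLI queries above, evaluated over a 30-day window, something like:

```promql
# Baseline availability over 30 days
sum(rate(http_requests_total{code!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))

# Baseline P95 latency over 30 days
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[30d])))
```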

Step 3: Set Realistic SLOs

I don't aim for perfection. Instead, I ask: "What's good enough?"

My SLOs for goto.link:

| SLI | SLO Target | Why This Number |
| --- | --- | --- |
| Availability | 99.9% | ~43 min downtime/month is acceptable for a free service |
| P95 Latency | 200ms | Fast enough for good UX, achievable with current architecture |
| P99 Latency | 500ms | Handles outliers, allows for occasional slow requests |
| Error Rate | < 0.1% | Most users won't encounter errors |

Key insight: I didn't pick 99.99% availability because:

  1. It would require significant infrastructure investment

  2. My users don't need that level of reliability

  3. It would slow down feature development

Step 4: Define the Error Budget

The error budget is the difference between 100% and your SLO.

For 99.9% availability:

  • Error budget: 0.1% = 43.2 minutes of downtime per month (worked out in the snippet below)

  • That's my allowance for failures, deployments, experiments
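
Here's that arithmetic as a quick Go sanity check (assuming a 30-day month):

```go
package main

import (
	"fmt"
	"time"
)

// downtimeBudget returns the allowed downtime per window for a given SLO.
func downtimeBudget(slo float64, window time.Duration) time.Duration {
	return time.Duration((1 - slo) * float64(window))
}

func main() {
	month := 30 * 24 * time.Hour
	fmt.Println(downtimeBudget(0.999, month)) // prints 43m12s, i.e. 43.2 minutes
}
```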

Using Error Budgets to Make Decisions

Error budgets changed how I make engineering decisions. Here's how I use them:

Scenario 1: Plenty of Error Budget Remaining

Decision: I have budget to spare! I can:

  • Deploy new features more aggressively

  • Experiment with new infrastructure

  • Take calculated risks

Scenario 2: Error Budget Nearly Exhausted

Decision: Slow down! I should:

  • Freeze feature deployments

  • Focus on reliability improvements

  • Review recent incidents

  • Only deploy critical bug fixes

Error Budget Policy I Created

I documented this policy for my projects:
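
In broad strokes it looks like this (the exact thresholds here are illustrative, not a verbatim copy):

  • More than 50% of budget remaining: normal operations; ship features freely

  • 25-50% remaining: deploy carefully; review recent incidents before taking new risks

  • Less than 25% remaining: feature freeze; reliability work and critical fixes only

  • Budget exhausted: no deploys except emergency fixes until the window resets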

Tracking SLOs with Prometheus and Grafana

I built a dashboard to track my SLOs in real time.

Prometheus Recording Rules

First, I create recording rules to pre-calculate SLI values:
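
A sketch of the rules file; the sli: naming scheme is my own convention:

```yaml
groups:
  - name: slo_recording_rules
    interval: 1m
    rules:
      # 5-minute availability SLI: fraction of non-5xx requests
      - record: sli:availability:ratio_5m
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # 5-minute P95 latency SLI
      - record: sli:latency:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```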

Alerting Rules

I alert when I'm burning through error budget too quickly:
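
A sketch of a fast-burn alert against the recording rule above; the 14.4x multiplier is the standard fast-burn threshold for a 99.9% SLO, and the alert name is mine:

```yaml
groups:
  - name: slo_alerts
    rules:
      # With a 99.9% SLO the allowed error rate is 0.1%; a sustained error
      # rate of 14.4x that would exhaust a month's budget in about two days.
      - alert: ErrorBudgetFastBurn
        expr: (1 - sli:availability:ratio_5m) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Burning error budget too fast (error rate {{ $value | humanizePercentage }})"
```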

Grafana Dashboard

Here's the JSON for my SLO dashboard panel:
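
Trimmed to the essentials here (a real exported panel carries many more fields); a stat panel for 30-day availability:

```json
{
  "type": "stat",
  "title": "Availability (30d)",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{code!~\"5..\"}[30d])) / sum(rate(http_requests_total[30d]))"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "steps": [
          { "color": "red", "value": null },
          { "color": "green", "value": 0.999 }
        ]
      }
    }
  }
}
```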

Real-World Example: When I Broke My SLO

In March, I deployed a new feature to my URL shortener that cached redirect URLs in Redis. Within 2 hours, an error-budget burn-rate alert fired.

I checked my logs and found the Redis connection pool was exhausted, causing 5xx errors. I had two choices:

  1. Roll back immediately - Restore availability

  2. Debug in production - Risk consuming more error budget

Because I was burning 5% per hour and only had 40% budget remaining, I rolled back. The decision was easy because I had the data.

After the rollback:

  • Availability recovered to 99.95%

  • I debugged locally

  • Fixed the connection pool settings

  • Re-deployed with proper load testing

Total error budget used: 8% (about 3.5 minutes of downtime)
Remaining budget: 32% (still okay for the month)

Common Mistakes I Made (So You Don't Have To)

Mistake 1: Setting SLOs Too High

My first attempt: "99.99% availability!"

Result: I spent all my time on reliability and barely shipped features. I burned out.

Lesson: Choose SLOs that match your users' needs, not your ego.

Mistake 2: Measuring the Wrong Things

Initially, I measured "server availability" (is the process running?).

Result: The server was "available" but users were experiencing 10-second response times.

Lesson: Measure user experience, not server health.

Mistake 3: Not Using Error Budgets

I had SLOs but didn't track error budgets.

Result: I had no framework for deciding when to ship vs. when to focus on reliability.

Lesson: Error budgets turn reliability into a currency that guides decisions.

Mistake 4: Too Many SLIs

I tried to track 15 different SLIs.

Result: Analysis paralysis. I couldn't figure out what mattered.

Lesson: Start with 2-4 critical SLIs. You can always add more.

Choosing Your SLIs: A Decision Framework

Not sure which SLIs to track? Here's my framework:

For API Services (like my URL shortener)

Must Have:

  1. Availability: % of successful requests

  2. Latency: P95 or P99 response time

Nice to Have:

  3. Error Rate: % of requests returning errors

  4. Throughput: Requests per second (for capacity planning)

For Batch Processing Services

Must Have:

  1. Success Rate: % of jobs completing successfully

  2. Processing Time: Time to complete a job

Nice to Have:

  3. Freshness: How old is the oldest unprocessed item?

  4. Queue Depth: How many items are waiting to be processed?

For Data Pipeline Services

Must Have:

  1. Data Freshness: How old is the latest data?

  2. Completeness: % of expected data received

Nice to Have:

  3. Processing Latency: Time from data arrival to processing

  4. Error Rate: % of failed processing attempts

Key Takeaways

After implementing SLIs, SLOs, and error budgets across my Go services:

  1. SLIs must measure user experience, not server health. If users are happy but CPU is high, that's fine. If CPU is perfect but users are frustrated, that's a problem.

  2. SLOs should be realistic, not aspirational. Don't aim for 99.99% if you can't sustain it or don't need it.

  3. Error budgets are powerful because they convert reliability into a currency. "We have 30% budget remaining" is more actionable than "uptime is good."

  4. Start simple with 2-3 critical SLIs. You can always add more as you mature.

  5. Document your policies around error budgets so everyone knows what happens at different thresholds.

What's Next

Now that you have a reliability framework with SLIs, SLOs, and error budgets, the next challenge is observability - actually seeing what's happening in your systems.

In Part 3, we'll dive deep into:

  • The four golden signals of monitoring

  • Building comprehensive observability with Prometheus, logs, and traces

  • Creating dashboards that help during incidents

  • Distributed tracing with OpenTelemetry

Conclusion

Before learning about SLIs, SLOs, and SLAs, I had no framework for answering "how reliable should my service be?" Now I have:

  • Clear metrics that represent user experience (SLIs)

  • Concrete targets for reliability (SLOs)

  • A decision-making framework based on error budgets

  • Data-driven conversations about reliability vs. velocity

The framework isn't perfect, and my SLOs evolve as I learn more about my users. But having this structure transformed me from guessing about reliability to measuring and improving it systematically.

Start with one service. Pick 2-3 SLIs. Set SLOs slightly below your current performance. Track your error budget. You'll be amazed at how much clarity this brings.
