Part 5: SLIs, SLOs, and Error Budgets in Practice

Part of the SRE Playbook series

What You'll Learn: This article covers how I define and implement SLIs from the Prometheus metrics built in Part 4, write SLO definitions using Sloth for Go services, calculate error budgets, and configure multi-window multi-burn-rate alerts that actually wake you up for the right reasons. Everything is grounded in the GoReliable platform: real metrics, real thresholds, real alert logic.

The Alert That Woke Me Up for the Wrong Reason

Before I implemented proper SLOs, my alerting was threshold-based: "alert if error rate > 1%." One night I got paged at midnight because the error rate hit 1.2% for 90 seconds during a batch job. The impact was essentially zero: a handful of retried requests, no user complaints, no business impact.

That false page cost me an hour of sleep and bought zero reliability improvement. The problem was that I was alerting on instantaneous symptoms rather than on user experience over time.

SLOs change the framing. Instead of "error rate > 1%", the question is: "At the current rate of errors, how quickly are we burning through our error budget?" If we're burning it at 14× the normal rate, that's a genuine emergency. If we're burning it at 0.3× normal, a ticket will suffice.

For foundational SRE concepts, see SRE 101: SLIs, SLOs, and SLAs. This article focuses on the implementation.

Choosing SLIs for the GoReliable Platform

An SLI (Service Level Indicator) measures an aspect of service quality. I use the metrics from Part 4 to define four SLIs:

SLI 1: API Gateway Availability

SLI = (requests that returned non-5xx) / (total requests)

In Prometheus:

sum(rate(goreliable_api_gateway_http_requests_total{status_code!~"5.."}[5m]))
/
sum(rate(goreliable_api_gateway_http_requests_total[5m]))

SLI 2: API Gateway Latency

I chose 300ms as the latency threshold by looking at my actual latency distribution. My p95 was around 180ms under normal load; 300ms gives headroom while catching genuine degradation.
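Expressed the same way as the availability SLI, this counts the fraction of requests completing within 300ms. The histogram name below is my assumption following the Part 4 naming convention, and it relies on the histogram having a bucket boundary at 0.3s:

```promql
sum(rate(goreliable_api_gateway_http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(goreliable_api_gateway_http_request_duration_seconds_count[5m]))
```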

SLI 3: Order Creation Success Rate

Orders failing means revenue impact. This is the most critical SLI.
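As a ratio it has the same shape as the availability SLI; the counter and label names here are illustrative stand-ins for the order-service metrics from Part 4:

```promql
sum(rate(goreliable_order_service_orders_created_total{result="success"}[5m]))
/
sum(rate(goreliable_order_service_orders_created_total[5m]))
```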

SLI 4: Notification Delivery

Notifications are best-effort (as designed in Part 1), so I track them but set a more relaxed SLO.
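The delivery SLI is another success ratio, just held to a looser objective. A sketch, assuming a delivery counter with a status label (names are illustrative):

```promql
sum(rate(goreliable_notification_service_notifications_sent_total{status="delivered"}[5m]))
/
sum(rate(goreliable_notification_service_notifications_sent_total[5m]))
```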

Defining SLOs with Sloth

I use Sloth to generate the Prometheus recording rules and alerting rules from a declarative SLO definition. This is much cleaner than writing multi-window alert rules by hand.

I run Sloth in CI to generate the Prometheus rules:
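The input is the declarative spec. A trimmed sketch of mine for the gateway availability SLO, following Sloth's `prometheus/v1` spec format (the alert name and team label are illustrative; the queries mirror the availability SLI above):

```yaml
version: "prometheus/v1"
service: "api-gateway"
labels:
  team: "goreliable"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "99.9% of API gateway requests return non-5xx."
    sli:
      events:
        # Sloth substitutes {{.window}} for each window it generates rules for.
        error_query: sum(rate(goreliable_api_gateway_http_requests_total{status_code=~"5.."}[{{.window}}]))
        total_query: sum(rate(goreliable_api_gateway_http_requests_total[{{.window}}]))
    alerting:
      name: APIGatewayHighErrorRate
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket
```

CI runs this through `sloth generate -i slo.yaml -o rules.yaml` (file names here are placeholders) and the output is loaded into Prometheus alongside the hand-written rules.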

The generated output includes recording rules for 5-minute and 30-minute burn rates, and alerting rules for multiple time windows. Here's a simplified excerpt of what Sloth generates:
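A hand-trimmed sketch of those recording rules; the real output carries more windows and extra Sloth metadata labels:

```yaml
groups:
  - name: sloth-slo-sli-recordings-api-gateway
    rules:
      # The same error ratio is also recorded at 30m and longer windows.
      - record: slo:sli_error:ratio_rate5m
        expr: |
          sum(rate(goreliable_api_gateway_http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(goreliable_api_gateway_http_requests_total[5m]))
        labels:
          sloth_service: api-gateway
          sloth_slo: requests-availability
```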

Understanding Multi-Window Burn Rate

The generated alert uses two windows: 5 minutes (fast detection) and 1 hour (confirmation). The alert fires only when both windows show high burn rate. This combination reduces false positives:

  • 5m only: Would fire on brief spikes that self-resolve

  • 1h only: Takes too long to detect a real incident

  • Both together: Detects real problems fast while filtering transient spikes

The burn rate thresholds come from the error budget math:

  • Monthly error budget for 99.9% SLO: 0.001 Γ— 30 days Γ— 24h Γ— 60min = 43.2 minutes of allowable errors

  • 14Γ— burn rate page alert: At this rate, the entire 30-day budget burns in ~52 hours. That's fast enough to warrant waking someone up.

  • 2Γ— burn rate ticket alert: Budget burns in ~15 days. Not an emergency, but create a ticket.

Calculating Error Budget Remaining

I add a recording rule to track remaining error budget:
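A sketch of that rule, using the availability SLI from earlier and the 99.9% objective (so a 0.001 budget); the rule name is my own convention rather than anything Sloth emits:

```yaml
- record: slo:error_budget:remaining
  expr: |
    1 - (
      (
        sum(increase(goreliable_api_gateway_http_requests_total{status_code=~"5.."}[30d]))
        /
        sum(increase(goreliable_api_gateway_http_requests_total[30d]))
      )
      / (1 - 0.999)
    )
```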

A value of 1.0 = full budget remaining. 0.0 = budget exhausted. Negative = over budget.

I display this prominently on the Grafana dashboard. When it's below 0.2 (20% remaining), I slow down risky deployments.

Error Budget Policy

An error budget is only actionable if the team agrees ahead of time what to do when it's consumed. My personal policy for the GoReliable platform:

| Budget Remaining | Action |
| --- | --- |
| > 50% | Normal deployment cadence |
| 25%–50% | Review upcoming deployments for risk |
| 10%–25% | Defer non-critical deployments, focus on reliability work |
| < 10% | Feature freeze, all effort on reliability improvements |
| Exhausted (0%) | Stop all deploys, incident response mode |

For a single-engineer personal project, this policy is mostly personal discipline; there's no team to negotiate with. But writing it down means I don't decide in the moment under pressure. When the budget hits 10%, I already know what "the plan" is.

SLI Measurement Middleware in Go

The Go instrumentation from Part 4 already generates the raw metrics. But I add an explicit SLI recording wrapper in the order service to make the metrics semantically clear:

The SLI recorder wraps the real repository. The service layer doesn't know it's recording SLI data; the observability is transparent.

Viewing Error Budget in Practice

Here's the gauge panel definition I added to the Grafana dashboard:
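A trimmed sketch of that panel JSON. The query assumes a recording rule named `slo:error_budget:remaining` (my own naming), and the threshold colors mirror the policy table above:

```json
{
  "type": "gauge",
  "title": "Error Budget Remaining (30d)",
  "targets": [
    { "expr": "slo:error_budget:remaining", "legendFormat": "budget" }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "min": 0,
      "max": 1,
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "red", "value": null },
          { "color": "yellow", "value": 0.2 },
          { "color": "green", "value": 0.5 }
        ]
      }
    }
  }
}
```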

The gauge sitting on the dashboard tells me at a glance whether I have deployment headroom or should be working on reliability.

What I Learned Tuning These

The latency SLO threshold took multiple iterations. I initially set it at 200ms based on intuition. Checking the actual histogram data showed my p95 was 180ms under moderate load and 240ms under peak load, so 200ms was causing false budget burns. I adjusted to 300ms to reflect realistic peak behavior.

The 30-day window matters. Using a [30d] window for budget calculation means a bad hour early in the month is amortized. If I used a 7-day window, a single bad day would consume half the budget and trigger a feature freeze. 30 days better reflects the "how reliable is this service for its users" question.

In Part 6, the alerts generated by these SLOs need somewhere to go. I wire Alertmanager to Slack and PagerDuty, write runbooks stored in the GitOps repo, and build a small Go CLI for incident response.
