Part 5: SLIs, SLOs, and Error Budgets in Practice

Part of the SRE Playbook series

What You'll Learn: This article covers how I define and implement SLIs from the Prometheus metrics built in Part 4, write SLO definitions using Sloth for Go services, calculate error budgets, and configure multi-window multi-burn-rate alerts that actually wake you up for the right reasons. Everything is grounded in the GoReliable platform: real metrics, real thresholds, real alert logic.

The Alert That Woke Me Up for the Wrong Reason

Before I implemented proper SLOs, my alerting was threshold-based: "alert if error rate > 1%." One night I got paged at midnight because the error rate hit 1.2% for 90 seconds during a batch job. The impact was essentially zero: a handful of retried requests, no user complaints, no business impact.

That false page cost me an hour of sleep and bought zero reliability improvement. The problem was that I was alerting on instantaneous symptoms rather than on user experience over time.

SLOs change the framing. Instead of "error rate > 1%", the question is: "At the current rate of errors, how quickly are we burning through our error budget?" If we're burning it at 14× the normal rate, that's a genuine emergency. If we're burning it at 0.3× normal, a ticket will suffice.

For foundational SRE concepts, see SRE 101: SLIs, SLOs, and SLAs. This article focuses on the implementation.

Choosing SLIs for the GoReliable Platform

An SLI (Service Level Indicator) measures an aspect of service quality. I use the metrics from Part 4 to define four SLIs:

SLI 1: API Gateway Availability

SLI = (requests that returned non-5xx) / (total requests)

In Prometheus:

sum(rate(goreliable_api_gateway_http_requests_total{status_code!~"5.."}[5m]))
/
sum(rate(goreliable_api_gateway_http_requests_total[5m]))

SLI 2: API Gateway Latency

I chose 300ms as the latency threshold by looking at my actual latency distribution. My p95 was around 180ms under normal load; 300ms gives headroom while catching genuine degradation.
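Expressed the same way as the availability SLI, this counts the fraction of requests completing within 300ms. The histogram name below is my assumption following the Part 4 naming convention, and it relies on the histogram having a bucket boundary at 0.3s:

```promql
sum(rate(goreliable_api_gateway_http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(goreliable_api_gateway_http_request_duration_seconds_count[5m]))
```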

SLI 3: Order Creation Success Rate

Orders failing means revenue impact. This is the most critical SLI.
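As a ratio it has the same shape as the availability SLI; the counter and label names here are illustrative stand-ins for the order-service metrics from Part 4:

```promql
sum(rate(goreliable_order_service_orders_created_total{result="success"}[5m]))
/
sum(rate(goreliable_order_service_orders_created_total[5m]))
```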

SLI 4: Notification Delivery

Notifications are best-effort (as designed in Part 1), so I track them but set a more relaxed SLO.
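The delivery SLI is another success ratio, just held to a looser objective. A sketch, assuming a delivery counter with a status label (names are illustrative):

```promql
sum(rate(goreliable_notification_service_notifications_sent_total{status="delivered"}[5m]))
/
sum(rate(goreliable_notification_service_notifications_sent_total[5m]))
```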

Defining SLOs with Sloth

I use Sloth to generate the Prometheus recording rules and alerting rules from a declarative SLO definition. This is much cleaner than writing multi-window alert rules by hand.

I run Sloth in CI to generate the Prometheus rules:
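The input is the declarative spec. A trimmed sketch of mine for the gateway availability SLO, following Sloth's `prometheus/v1` spec format (the alert name and team label are illustrative; the queries mirror the availability SLI above):

```yaml
version: "prometheus/v1"
service: "api-gateway"
labels:
  team: "goreliable"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "99.9% of API gateway requests return non-5xx."
    sli:
      events:
        # Sloth substitutes {{.window}} for each window it generates rules for.
        error_query: sum(rate(goreliable_api_gateway_http_requests_total{status_code=~"5.."}[{{.window}}]))
        total_query: sum(rate(goreliable_api_gateway_http_requests_total[{{.window}}]))
    alerting:
      name: APIGatewayHighErrorRate
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket
```

CI runs this through `sloth generate -i slo.yaml -o rules.yaml` (file names here are placeholders) and the output is loaded into Prometheus alongside the hand-written rules.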

The generated output includes recording rules for 5-minute and 30-minute burn rates, and alerting rules for multiple time windows. Here's a simplified excerpt of what Sloth generates:
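A hand-trimmed sketch of those recording rules; the real output carries more windows and extra Sloth metadata labels:

```yaml
groups:
  - name: sloth-slo-sli-recordings-api-gateway
    rules:
      # The same error ratio is also recorded at 30m and longer windows.
      - record: slo:sli_error:ratio_rate5m
        expr: |
          sum(rate(goreliable_api_gateway_http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(goreliable_api_gateway_http_requests_total[5m]))
        labels:
          sloth_service: api-gateway
          sloth_slo: requests-availability
```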

Understanding Multi-Window Burn Rate

The generated alert uses two windows: 5 minutes (fast detection) and 1 hour (confirmation). The alert fires only when both windows show high burn rate. This combination reduces false positives:

  • 5m only: Would fire on brief spikes that self-resolve

  • 1h only: Takes too long to detect a real incident

  • Both together: Detects real problems fast while filtering transient spikes

The burn rate thresholds come from the error budget math:

  • Monthly error budget for 99.9% SLO: 0.001 Γ— 30 days Γ— 24h Γ— 60min = 43.2 minutes of allowable errors

  • 14Γ— burn rate page alert: At this rate, the entire 30-day budget burns in ~52 hours. That's fast enough to warrant waking someone up.

  • 2Γ— burn rate ticket alert: Budget burns in ~15 days. Not an emergency, but create a ticket.

Calculating Error Budget Remaining

I add a recording rule to track remaining error budget:
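A sketch of that rule, using the availability SLI from earlier and the 99.9% objective (so a 0.001 budget); the rule name is my own convention rather than anything Sloth emits:

```yaml
- record: slo:error_budget:remaining
  expr: |
    1 - (
      (
        sum(increase(goreliable_api_gateway_http_requests_total{status_code=~"5.."}[30d]))
        /
        sum(increase(goreliable_api_gateway_http_requests_total[30d]))
      )
      / (1 - 0.999)
    )
```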

A value of 1.0 = full budget remaining. 0.0 = budget exhausted. Negative = over budget.

I display this prominently on the Grafana dashboard. When it's below 0.2 (20% remaining), I slow down risky deployments.

Error Budget Policy

An error budget is only actionable if the team agrees ahead of time what to do when it's consumed. My personal policy for the GoReliable platform:

| Budget Remaining | Action |
| --- | --- |
| > 50% | Normal deployment cadence |
| 25%–50% | Review upcoming deployments for risk |
| 10%–25% | Defer non-critical deployments, focus on reliability work |
| < 10% | Feature freeze, all effort on reliability improvements |
| Exhausted (0%) | Stop all deploys, incident response mode |

For a single-engineer personal project, this policy is mostly personal discipline; there's no team to negotiate with. But writing it down means I don't decide in the moment under pressure. When the budget hits 10%, I already know what "the plan" is.

SLI Measurement Middleware in Go

The Go instrumentation from Part 4 already generates the raw metrics. But I add an explicit SLI recording wrapper in the order service to make the metrics semantically clear:

The SLI recorder wraps the real repository. The service layer doesn't know it's recording SLI data; the observability is transparent.

Viewing Error Budget in Practice

Here's the gauge panel definition I added to the Grafana dashboard:
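A trimmed sketch of that panel JSON. The query assumes a recording rule named `slo:error_budget:remaining` (my own naming), and the threshold colors mirror the policy table above:

```json
{
  "type": "gauge",
  "title": "Error Budget Remaining (30d)",
  "targets": [
    { "expr": "slo:error_budget:remaining", "legendFormat": "budget" }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "min": 0,
      "max": 1,
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "red", "value": null },
          { "color": "yellow", "value": 0.2 },
          { "color": "green", "value": 0.5 }
        ]
      }
    }
  }
}
```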

The gauge sitting on the dashboard tells me at a glance whether I have deployment headroom or should be working on reliability.

What I Learned Tuning These

The latency SLO threshold took multiple iterations. I initially set it at 200ms based on intuition. Checking the actual histogram data showed my p95 was 180ms under moderate load and 240ms under peak load, so 200ms was causing false budget burns. I adjusted to 300ms to reflect realistic peak behavior.

The 30-day window matters. Using a [30d] window for budget calculation means a bad hour early in the month is amortized. If I used a 7-day window, a single bad day would consume half the budget and trigger a feature freeze. 30 days better reflects the "how reliable is this service for its users" question.

In Part 6, the alerts generated by these SLOs need somewhere to go. I wire Alertmanager to Slack and PagerDuty, write runbooks stored in the GitOps repo, and build a small Go CLI for incident response.
