Part 6: Service Reliability Metrics

The Day We Learned Uptime Isn't Everything

For years, I proudly boasted about our "five nines" (99.999%) uptime. Then during a customer review meeting, a major client said: "Your system is technically up, but our checkout process fails 15% of the time. That's not reliable."

They were right. We were measuring availability but not reliability. The service was running, but it wasn't working correctly. That wake-up call led me to completely rethink how we measure and maintain service reliability.

Understanding SLIs, SLOs, and SLAs

These terms get thrown around interchangeably, but they mean different things. Let me explain how I use them.

SLI (Service Level Indicator)

An SLI is a quantitative measure of service behavior. It's what you actually measure.

Examples of SLIs I track:

  • Request success rate: (successful_requests / total_requests) * 100

  • Request latency (P50, P95, P99): Time taken to process requests

  • Error rate: (5xx_errors / total_requests) * 100

  • Availability: (uptime / total_time) * 100

  • Data freshness: Time since last successful data sync

SLO (Service Level Objective)

An SLO is your internal target for an SLI. It's a specific goal like "99.9% of requests should succeed" or "95% of requests should complete in under 200ms."

My SLOs for a payment API:

  • Availability: 99.95% of requests succeed

  • Latency (P95): 300ms or less

  • Latency (P99): 1000ms or less

  • Error rate: Less than 0.05%

SLA (Service Level Agreement)

An SLA is a promise to customers with consequences if you miss it. It's typically more lenient than your SLO (you want buffer room).

Example SLA I provide:

  • "Payment API will be available 99.9% of the time, measured monthly"

  • "If we fail to meet this, you'll receive a 10% service credit"

The relationship: the SLA (99.9%) is deliberately looser than the internal SLO (99.95%), and the SLO in turn should be looser than what the system actually achieves. Managing to the stricter SLO gives us room to detect and fix problems before customers are impacted and the SLA (with its service credits) comes into play.

Defining Meaningful SLOs

Bad SLOs are vague: "The system should be fast and reliable." Good SLOs are specific, measurable, and tied to user experience.

My SLO Selection Process

I ask these questions:

  1. What do users care about? (not what's easy to measure)

  2. What level of reliability is good enough? (perfection is impossible and expensive)

  3. What can we realistically achieve? (based on current architecture)

  4. What's the cost of improving? (diminishing returns after a point)

Example: Payment Processing Service

User expectation: "I can complete a purchase quickly and reliably"

Translation to SLOs:

  • "Quickly" becomes the latency targets: P95 at or under 300ms, P99 at or under 1000ms

  • "Reliably" becomes the availability and error targets: 99.95% of requests succeed, with an error rate below 0.05%

Implementing SLIs with Prometheus

I instrument applications to expose metrics that Prometheus scrapes.

Instrumenting a Node.js Application

Middleware to Track Requests
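Here's a rough sketch of the middleware using the prom-client library; the metric names, label set, and histogram buckets are my defaults rather than anything mandated, so adapt them to the SLIs you actually care about.

```javascript
// metrics.js - request instrumentation (sketch using prom-client)
const client = require('prom-client');

// Default Node.js process metrics (event loop lag, memory, GC, ...)
client.collectDefaultMetrics();

// Counter backing the success-rate and error-rate SLIs
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests by method, route, and status code',
  labelNames: ['method', 'route', 'status_code'],
});

// Histogram backing the latency SLIs; buckets bracket the 300ms / 1s targets
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.05, 0.1, 0.2, 0.3, 0.5, 1, 2.5, 5],
});

// Express middleware: record one observation per finished response
function requestMetrics(req, res, next) {
  const endTimer = httpRequestDuration.startTimer();
  res.on('finish', () => {
    const labels = {
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status_code: res.statusCode,
    };
    httpRequestsTotal.inc(labels);
    endTimer(labels);
  });
  next();
}

module.exports = { requestMetrics };
```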

Metrics Endpoint
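And a minimal Express endpoint that exposes the default registry for Prometheus to scrape (this assumes the metrics.js module sketched above):

```javascript
// server.js - expose metrics for scraping (sketch)
const express = require('express');
const client = require('prom-client');
const { requestMetrics } = require('./metrics');

const app = express();
app.use(requestMetrics);

// Prometheus scrapes this endpoint; the default registry holds the
// counters and histograms defined in metrics.js
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);
```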

Prometheus Configuration
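A minimal scrape configuration looks something like this; the job name, target, and intervals are placeholders to adapt:

```yaml
# prometheus.yml (sketch)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'payment-api'
    metrics_path: /metrics
    static_configs:
      - targets: ['payment-api:3000']
```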

Calculating Error Budgets

An error budget is how much unreliability you can tolerate before breaking your SLO.

Error Budget Calculation

If your SLO is 99.95% availability over 30 days:

Error budget = (100% - 99.95%) of the window = 0.05% of 43,200 minutes = 21.6 minutes of downtime per month

If you've used 15 minutes so far this month, you have 6.6 minutes remaining.
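The same arithmetic as a small throwaway helper, if you want to script it:

```javascript
// Hypothetical helper: turn an availability SLO into an error budget
function errorBudgetMinutes(slo, windowDays = 30) {
  const totalMinutes = windowDays * 24 * 60; // 30 days = 43,200 minutes
  return totalMinutes * (1 - slo);           // minutes of allowed unavailability
}

const budget = errorBudgetMinutes(0.9995); // 21.6 minutes
const used = 15;                           // downtime so far this month
console.log(`Budget: ${budget.toFixed(1)} min, remaining: ${(budget - used).toFixed(1)} min`);
// => Budget: 21.6 min, remaining: 6.6 min
```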

Error Budget Policy

I implement policies based on error budget:

Error budget remaining > 90%:

  • Team focuses on feature development

  • Aggressive deployment frequency

  • Experiment with new technologies

Error budget remaining 50-90%:

  • Balanced approach

  • Normal deployment frequency

  • Standard risk tolerance

Error budget remaining 10-50%:

  • Increased caution

  • Reduce deployment frequency

  • Focus on stability improvements

  • More extensive testing

Error budget nearly exhausted (<10% remaining):

  • Freeze new feature releases

  • Focus 100% on reliability

  • Root cause analysis of incidents

  • Pay down technical debt

  • Only critical bug fixes and reliability improvements

Tracking Error Budget

Error Budget Dashboard

I create a Grafana dashboard showing error budget status:
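The panels boil down to a couple of PromQL queries along these lines, assuming the http_requests_total counter from the instrumentation above and the 99.95% availability SLO (an allowed error ratio of 0.0005). For 30-day windows I'd normally back these with recording rules rather than raw range queries:

```promql
# 30-day availability (success ratio)
1 - (
  sum(rate(http_requests_total{job="payment-api", status_code=~"5.."}[30d]))
  /
  sum(rate(http_requests_total{job="payment-api"}[30d]))
)

# Fraction of the 30-day error budget still remaining
1 - (
  (
    sum(rate(http_requests_total{job="payment-api", status_code=~"5.."}[30d]))
    /
    sum(rate(http_requests_total{job="payment-api"}[30d]))
  )
  / 0.0005
)
```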

Uptime Practices

Beyond measuring reliability, you need practices to maintain it.

Multi-Region Deployment

I deploy critical services across multiple AWS regions:
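The details vary a lot by service, but the DNS side usually looks something like the following Route 53 failover record, shown here as a CloudFormation sketch. The hosted zone, health check, and load balancer DNS name are assumed to exist as parameters or other resources, and a matching record with Failover: SECONDARY points at the standby region:

```yaml
# Route 53 failover record (sketch): the primary region serves traffic while
# its health check passes; otherwise Route 53 answers with the SECONDARY record.
ApiPrimaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneId: !Ref HostedZoneId              # assumed parameter
    Name: api.example.com                        # placeholder domain
    Type: CNAME
    TTL: "60"
    SetIdentifier: primary-us-east-1
    Failover: PRIMARY
    HealthCheckId: !Ref PrimaryApiHealthCheck    # AWS::Route53::HealthCheck defined elsewhere
    ResourceRecords:
      - !Ref PrimaryLoadBalancerDnsName          # assumed parameter (the LB lives in another stack)
```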

Circuit Breakers

I implement circuit breakers to prevent cascading failures:
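Here's a sketch of the pattern in Node.js using the opossum library; callPaymentProvider and the thresholds are placeholders:

```javascript
// Circuit breaker around an outbound dependency (sketch using opossum)
const CircuitBreaker = require('opossum');

// Placeholder for the real call to the downstream payment provider
async function callPaymentProvider(payload) {
  // ... HTTP call to the provider ...
}

const breaker = new CircuitBreaker(callPaymentProvider, {
  timeout: 3000,                // treat calls slower than 3s as failures
  errorThresholdPercentage: 50, // open the circuit when half of recent calls fail
  resetTimeout: 30000,          // after 30s, allow a trial request (half-open)
});

// When the circuit is open, fail fast with a fallback instead of queueing requests
breaker.fallback(() => ({ status: 'queued', reason: 'payment provider unavailable' }));

breaker.on('open', () => console.warn('payment provider circuit opened'));
breaker.on('close', () => console.info('payment provider circuit closed'));

// Callers use breaker.fire() instead of calling the provider directly
async function chargeCard(payload) {
  return breaker.fire(payload);
}

module.exports = { chargeCard };
```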

Rate Limiting

I implement rate limiting to protect services from overload:
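A sketch using express-rate-limit; the window, limit, and path are placeholders, and the default in-memory store only counts per instance (behind a load balancer you'd plug in a shared store such as Redis):

```javascript
// Per-client rate limiting (sketch using express-rate-limit)
const express = require('express');
const rateLimit = require('express-rate-limit');

const app = express();

// Allow 100 requests per client IP per minute; respond 429 beyond that
const apiLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: 100,
  standardHeaders: true, // send RateLimit-* headers so clients can back off
  legacyHeaders: false,
  message: { error: 'Too many requests, please retry later' },
});

app.use('/api/', apiLimiter);
```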

Graceful Degradation

I implement graceful degradation for non-critical features:
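A sketch of the idea for a recommendations widget; the endpoint, timeout, and fallback list are placeholders, and it assumes Node 18+ for the built-in fetch:

```javascript
// Graceful degradation for a non-critical feature (sketch)
// If personalised recommendations are slow or failing, serve a cached
// "popular items" list instead of failing the page.
const POPULAR_ITEMS = ['sku-123', 'sku-456', 'sku-789']; // placeholder fallback data

async function getRecommendations(userId) {
  try {
    const res = await fetch(`http://recommendations.internal/users/${userId}`, {
      signal: AbortSignal.timeout(250), // tight timeout: not worth waiting for
    });
    if (!res.ok) throw new Error(`recommendations returned ${res.status}`);
    return await res.json();
  } catch (err) {
    // Degrade quietly: log it, serve the fallback, keep checkout working
    console.warn('recommendations degraded:', err.message);
    return POPULAR_ITEMS;
  }
}

module.exports = { getRecommendations };
```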

SLO Monitoring and Alerting

I configure alerts based on SLO burn rate: how fast we're consuming error budget.

Prometheus Alert Rules
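A sketch of a fast-burn rule, assuming the http_requests_total counter from earlier and the 99.95% SLO (allowed error ratio 0.0005). The 14.4x factor is the usual fast-burn threshold: at that rate you'd consume roughly 2% of a 30-day budget in a single hour.

```yaml
# alerts.yml (sketch)
groups:
  - name: payment-api-slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        # Fire when both the short and long windows show an error ratio
        # above 14.4x the allowed ratio (0.0005 for a 99.95% SLO)
        expr: |
          (
            sum(rate(http_requests_total{job="payment-api", status_code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="payment-api"}[5m]))
          ) > (14.4 * 0.0005)
          and
          (
            sum(rate(http_requests_total{job="payment-api", status_code=~"5.."}[1h]))
              / sum(rate(http_requests_total{job="payment-api"}[1h]))
          ) > (14.4 * 0.0005)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "payment-api is burning error budget ~14x faster than its SLO allows"
```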

Key Takeaways

  1. SLIs measure, SLOs target, SLAs promise: Know the difference and set appropriate thresholds

  2. Error budgets enable balanced risk-taking: Use them to make deployment decisions

  3. Instrument everything: You can't improve what you don't measure

  4. Multi-layered defense: Circuit breakers, rate limiting, retries, timeouts, failover

  5. Graceful degradation: Non-critical features should fail without taking down the system

  6. Alert on burn rate, not absolute values: Fast burns need immediate attention, slow burns need investigation

In the next part, we'll cover incident response and management: what to do when things go wrong despite all these precautions.

