Part 6: Service Reliability Metrics

The Day We Learned Uptime Isn't Everything

For years, I proudly boasted about our "five nines" (99.999%) uptime. Then during a customer review meeting, a major client said: "Your system is technically up, but our checkout process fails 15% of the time. That's not reliable."

They were right. We were measuring availability but not reliability. The service was running, but it wasn't working correctly. That wake-up call led me to completely rethink how we measure and maintain service reliability.

Understanding SLIs, SLOs, and SLAs

These terms get thrown around interchangeably, but they mean different things. Let me explain how I use them.

SLI (Service Level Indicator)

An SLI is a quantitative measure of service behavior. It's what you actually measure.

Examples of SLIs I track:

  • Request success rate: (successful_requests / total_requests) * 100

  • Request latency (P50, P95, P99): Time taken to process requests

  • Error rate: (5xx_errors / total_requests) * 100

  • Availability: (uptime / total_time) * 100

  • Data freshness: Time since last successful data sync

SLO (Service Level Objective)

An SLO is your internal target for an SLI. It's a specific goal like "99.9% of requests should succeed" or "95% of requests should complete in under 200ms."

My SLOs for a payment API:

  • Availability: 99.95% of requests succeed

  • Latency (P95): 300ms or less

  • Latency (P99): 1000ms or less

  • Error rate: Less than 0.05%

SLA (Service Level Agreement)

An SLA is a promise to customers with consequences if you miss it. It's typically more lenient than your SLO (you want buffer room).

Example SLA I provide:

  • "Payment API will be available 99.9% of the time, measured monthly"

  • "If we fail to meet this, you'll receive a 10% service credit"

The relationship: the SLA (99.9%) is deliberately looser than the internal SLO (99.95%), and the SLO in turn should be looser than what the system actually achieves. Managing to the stricter SLO gives us room to detect and fix problems before customers are impacted and the SLA (with its service credits) comes into play.

Defining Meaningful SLOs

Bad SLOs are vague: "The system should be fast and reliable." Good SLOs are specific, measurable, and tied to user experience.

My SLO Selection Process

I ask these questions:

  1. What do users care about? (not what's easy to measure)

  2. What level of reliability is good enough? (perfection is impossible and expensive)

  3. What can we realistically achieve? (based on current architecture)

  4. What's the cost of improving? (diminishing returns after a point)

Example: Payment Processing Service

User expectation: "I can complete a purchase quickly and reliably"

Translation to SLOs:

  • "Quickly" becomes the latency targets: P95 at or under 300ms, P99 at or under 1000ms

  • "Reliably" becomes the availability and error targets: 99.95% of requests succeed, with an error rate below 0.05%

Implementing SLIs with Prometheus

I instrument applications to expose metrics that Prometheus scrapes.

Instrumenting a Node.js Application

Middleware to Track Requests
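Here's a rough sketch of the middleware using the prom-client library; the metric names, label set, and histogram buckets are my defaults rather than anything mandated, so adapt them to the SLIs you actually care about.

```javascript
// metrics.js - request instrumentation (sketch using prom-client)
const client = require('prom-client');

// Default Node.js process metrics (event loop lag, memory, GC, ...)
client.collectDefaultMetrics();

// Counter backing the success-rate and error-rate SLIs
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests by method, route, and status code',
  labelNames: ['method', 'route', 'status_code'],
});

// Histogram backing the latency SLIs; buckets bracket the 300ms / 1s targets
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.05, 0.1, 0.2, 0.3, 0.5, 1, 2.5, 5],
});

// Express middleware: record one observation per finished response
function requestMetrics(req, res, next) {
  const endTimer = httpRequestDuration.startTimer();
  res.on('finish', () => {
    const labels = {
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status_code: res.statusCode,
    };
    httpRequestsTotal.inc(labels);
    endTimer(labels);
  });
  next();
}

module.exports = { requestMetrics };
```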

Metrics Endpoint
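And a minimal Express endpoint that exposes the default registry for Prometheus to scrape (this assumes the metrics.js module sketched above):

```javascript
// server.js - expose metrics for scraping (sketch)
const express = require('express');
const client = require('prom-client');
const { requestMetrics } = require('./metrics');

const app = express();
app.use(requestMetrics);

// Prometheus scrapes this endpoint; the default registry holds the
// counters and histograms defined in metrics.js
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);
```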

Prometheus Configuration
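A minimal scrape configuration looks something like this; the job name, target, and intervals are placeholders to adapt:

```yaml
# prometheus.yml (sketch)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'payment-api'
    metrics_path: /metrics
    static_configs:
      - targets: ['payment-api:3000']
```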

Calculating Error Budgets

An error budget is how much unreliability you can tolerate before breaking your SLO.

Error Budget Calculation

If your SLO is 99.95% availability over 30 days:

Error budget = (100% - 99.95%) of the window = 0.05% of 43,200 minutes = 21.6 minutes of downtime per month

If you've used 15 minutes so far this month, you have 6.6 minutes remaining.
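The same arithmetic as a small throwaway helper, if you want to script it:

```javascript
// Hypothetical helper: turn an availability SLO into an error budget
function errorBudgetMinutes(slo, windowDays = 30) {
  const totalMinutes = windowDays * 24 * 60; // 30 days = 43,200 minutes
  return totalMinutes * (1 - slo);           // minutes of allowed unavailability
}

const budget = errorBudgetMinutes(0.9995); // 21.6 minutes
const used = 15;                           // downtime so far this month
console.log(`Budget: ${budget.toFixed(1)} min, remaining: ${(budget - used).toFixed(1)} min`);
// => Budget: 21.6 min, remaining: 6.6 min
```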

Error Budget Policy

I implement policies based on error budget:

Error budget remaining > 90%:

  • Team focuses on feature development

  • Aggressive deployment frequency

  • Experiment with new technologies

Error budget remaining 50-90%:

  • Balanced approach

  • Normal deployment frequency

  • Standard risk tolerance

Error budget remaining 10-50%:

  • Increased caution

  • Reduce deployment frequency

  • Focus on stability improvements

  • More extensive testing

Error budget nearly exhausted (<10% remaining):

  • Freeze new feature releases

  • Focus 100% on reliability

  • Root cause analysis of incidents

  • Pay down technical debt

  • Only critical bug fixes and reliability improvements

Tracking Error Budget

Error Budget Dashboard

I create a Grafana dashboard showing error budget status:
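The panels boil down to a couple of PromQL queries along these lines, assuming the http_requests_total counter from the instrumentation above and the 99.95% availability SLO (an allowed error ratio of 0.0005). For 30-day windows I'd normally back these with recording rules rather than raw range queries:

```promql
# 30-day availability (success ratio)
1 - (
  sum(rate(http_requests_total{job="payment-api", status_code=~"5.."}[30d]))
  /
  sum(rate(http_requests_total{job="payment-api"}[30d]))
)

# Fraction of the 30-day error budget still remaining
1 - (
  (
    sum(rate(http_requests_total{job="payment-api", status_code=~"5.."}[30d]))
    /
    sum(rate(http_requests_total{job="payment-api"}[30d]))
  )
  / 0.0005
)
```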

Uptime Practices

Beyond measuring reliability, you need practices to maintain it.

Multi-Region Deployment

I deploy critical services across multiple AWS regions:
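The details vary a lot by service, but the DNS side usually looks something like the following Route 53 failover record, shown here as a CloudFormation sketch. The hosted zone, health check, and load balancer DNS name are assumed to exist as parameters or other resources, and a matching record with Failover: SECONDARY points at the standby region:

```yaml
# Route 53 failover record (sketch): the primary region serves traffic while
# its health check passes; otherwise Route 53 answers with the SECONDARY record.
ApiPrimaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneId: !Ref HostedZoneId              # assumed parameter
    Name: api.example.com                        # placeholder domain
    Type: CNAME
    TTL: "60"
    SetIdentifier: primary-us-east-1
    Failover: PRIMARY
    HealthCheckId: !Ref PrimaryApiHealthCheck    # AWS::Route53::HealthCheck defined elsewhere
    ResourceRecords:
      - !Ref PrimaryLoadBalancerDnsName          # assumed parameter (the LB lives in another stack)
```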

Circuit Breakers

I implement circuit breakers to prevent cascading failures:
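Here's a sketch of the pattern in Node.js using the opossum library; callPaymentProvider and the thresholds are placeholders:

```javascript
// Circuit breaker around an outbound dependency (sketch using opossum)
const CircuitBreaker = require('opossum');

// Placeholder for the real call to the downstream payment provider
async function callPaymentProvider(payload) {
  // ... HTTP call to the provider ...
}

const breaker = new CircuitBreaker(callPaymentProvider, {
  timeout: 3000,                // treat calls slower than 3s as failures
  errorThresholdPercentage: 50, // open the circuit when half of recent calls fail
  resetTimeout: 30000,          // after 30s, allow a trial request (half-open)
});

// When the circuit is open, fail fast with a fallback instead of queueing requests
breaker.fallback(() => ({ status: 'queued', reason: 'payment provider unavailable' }));

breaker.on('open', () => console.warn('payment provider circuit opened'));
breaker.on('close', () => console.info('payment provider circuit closed'));

// Callers use breaker.fire() instead of calling the provider directly
async function chargeCard(payload) {
  return breaker.fire(payload);
}

module.exports = { chargeCard };
```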

Rate Limiting

I implement rate limiting to protect services from overload:
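A sketch using express-rate-limit; the window, limit, and path are placeholders, and the default in-memory store only counts per instance (behind a load balancer you'd plug in a shared store such as Redis):

```javascript
// Per-client rate limiting (sketch using express-rate-limit)
const express = require('express');
const rateLimit = require('express-rate-limit');

const app = express();

// Allow 100 requests per client IP per minute; respond 429 beyond that
const apiLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: 100,
  standardHeaders: true, // send RateLimit-* headers so clients can back off
  legacyHeaders: false,
  message: { error: 'Too many requests, please retry later' },
});

app.use('/api/', apiLimiter);
```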

Graceful Degradation

I implement graceful degradation for non-critical features:
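A sketch of the idea for a recommendations widget; the endpoint, timeout, and fallback list are placeholders, and it assumes Node 18+ for the built-in fetch:

```javascript
// Graceful degradation for a non-critical feature (sketch)
// If personalised recommendations are slow or failing, serve a cached
// "popular items" list instead of failing the page.
const POPULAR_ITEMS = ['sku-123', 'sku-456', 'sku-789']; // placeholder fallback data

async function getRecommendations(userId) {
  try {
    const res = await fetch(`http://recommendations.internal/users/${userId}`, {
      signal: AbortSignal.timeout(250), // tight timeout: not worth waiting for
    });
    if (!res.ok) throw new Error(`recommendations returned ${res.status}`);
    return await res.json();
  } catch (err) {
    // Degrade quietly: log it, serve the fallback, keep checkout working
    console.warn('recommendations degraded:', err.message);
    return POPULAR_ITEMS;
  }
}

module.exports = { getRecommendations };
```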

SLO Monitoring and Alerting

I configure alerts based on SLO burn rate: how fast we're consuming error budget.

Prometheus Alert Rules
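A sketch of a fast-burn rule, assuming the http_requests_total counter from earlier and the 99.95% SLO (allowed error ratio 0.0005). The 14.4x factor is the usual fast-burn threshold: at that rate you'd consume roughly 2% of a 30-day budget in a single hour.

```yaml
# alerts.yml (sketch)
groups:
  - name: payment-api-slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        # Fire when both the short and long windows show an error ratio
        # above 14.4x the allowed ratio (0.0005 for a 99.95% SLO)
        expr: |
          (
            sum(rate(http_requests_total{job="payment-api", status_code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="payment-api"}[5m]))
          ) > (14.4 * 0.0005)
          and
          (
            sum(rate(http_requests_total{job="payment-api", status_code=~"5.."}[1h]))
              / sum(rate(http_requests_total{job="payment-api"}[1h]))
          ) > (14.4 * 0.0005)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "payment-api is burning error budget ~14x faster than its SLO allows"
```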

Key Takeaways

  1. SLIs measure, SLOs target, SLAs promise: Know the difference and set appropriate thresholds

  2. Error budgets enable balanced risk-taking: Use them to make deployment decisions

  3. Instrument everything: You can't improve what you don't measure

  4. Multi-layered defense: Circuit breakers, rate limiting, retries, timeouts, failover

  5. Graceful degradation: Non-critical features should fail without taking down the system

  6. Alert on burn rate, not absolute values: Fast burns need immediate attention, slow burns need investigation

In the next part, we'll cover incident response and management: what to do when things go wrong despite all these precautions.

