Part 7: Capacity Planning, Performance, and Chaos Engineering

Part of the SRE Playbook series

What You'll Learn: This article covers how I load test the GoReliable services with k6, profile Go binaries in production using pprof, configure the Notification Worker's HPA to scale on NATS queue depth rather than CPU, and use Chaos Mesh to intentionally break things in staging to verify the system handles failures gracefully. The goal is to find limits before users do.

Finding Limits Before Users Do

After the SLO alerting and incident tooling from Parts 5 and 6 were in place, I had good reactive capability. But I had almost no idea where my limits were. I didn't know the maximum sustained order throughput before database connections saturated, or what happened to the notification queue if the email provider went down for 30 minutes.

This article covers the proactive side: deliberately finding those limits and either expanding them or building graceful degradation around them.

Load Testing with k6

k6 is my load testing tool of choice. It's scriptable in JavaScript, produces Prometheus-compatible metrics, and integrates well with my existing Grafana dashboards.

Order Creation Load Test

// tests/load/order-creation.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

const orderErrors = new Rate('order_errors');
const orderDuration = new Trend('order_duration', true);

// Test configuration: ramp up to 100 VUs over 2 minutes, hold 5 minutes, ramp down
export const options = {
  stages: [
    { duration: '2m', target: 100 },   // Ramp up
    { duration: '5m', target: 100 },   // Steady state
    { duration: '1m', target: 0 },     // Ramp down
  ],
  thresholds: {
    http_req_failed: ['rate<0.01'],         // < 1% error rate
    http_req_duration: ['p(99)<500'],       // p99 < 500ms
    order_errors: ['rate<0.005'],           // < 0.5% order errors
  },
};

const BASE_URL = __ENV.BASE_URL || 'https://api.staging.go-reliable.dev';

export function setup() {
  // Authenticate and return a token for all VUs to share
  const res = http.post(`${BASE_URL}/auth/token`, JSON.stringify({
    client_id: __ENV.CLIENT_ID,
    client_secret: __ENV.CLIENT_SECRET,
  }), { headers: { 'Content-Type': 'application/json' } });

  check(res, { 'auth succeeded': (r) => r.status === 200 });
  return { token: res.json('access_token') };
}

export default function (data) {
  const payload = JSON.stringify({
    amount: Math.floor(Math.random() * 10000) + 100,  // Random amount: 100-10099 cents
    currency: 'USD',
  });

  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${data.token}`,
    },
  };

  const res = http.post(`${BASE_URL}/api/v1/orders`, payload, params);

  const success = check(res, {
    'status is 201': (r) => r.status === 201,
    'has order id': (r) => r.json('id') != null,  // != catches both null and undefined
    'latency < 300ms': (r) => r.timings.duration < 300,
  });

  orderErrors.add(!success);
  orderDuration.add(res.timings.duration);

  sleep(0.5);  // Think time between requests per VU
}

Running this test:
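Assuming credentials are passed in via environment variables (the variable names here are illustrative), a typical invocation looks like:

```shell
# Run the order-creation scenario against staging.
# BASE_URL falls back to the staging default baked into the script.
k6 run \
  -e BASE_URL=https://api.staging.go-reliable.dev \
  -e CLIENT_ID="$LOAD_TEST_CLIENT_ID" \
  -e CLIENT_SECRET="$LOAD_TEST_CLIENT_SECRET" \
  tests/load/order-creation.js

# Optionally stream results to Prometheus for the Grafana dashboards
# (k6 v0.42+, with K6_PROMETHEUS_RW_SERVER_URL pointing at remote write).
k6 run -o experimental-prometheus-rw tests/load/order-creation.js
```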

What the Load Test Revealed

Running this at 100 concurrent users (my first attempt), I found two bottlenecks:

Bottleneck 1: Database connection pool at 80 VUs. The Order Service hit DB_MAX_OPEN_CONNS: 25 and started queueing connection requests. Response times climbed from 150ms to 800ms. Fix: increase MaxOpenConns to 50 and verify that PostgreSQL can handle the additional connections (it could — it was sized at max_connections=200).

Bottleneck 2: API Gateway CPU throttling at 100 VUs. The gateway was hitting its CPU limit and getting throttled by the Kubernetes CFS scheduler, adding latency variance. Fix: increase resources.limits.cpu from 500m to 1000m and let HPA scale horizontally.

After both fixes, the test passed its thresholds at 150 VUs. I didn't push further — I set my "known safe" production capacity at 100 VUs and configured the HPA to scale before approaching that limit.

Profiling Go Services in Production

Go's built-in pprof profiler is safe to run in production. I expose the pprof endpoint on the metrics port (not the main HTTP port) with IP-based access control.

Capturing Profiles

I use kubectl port-forward to access pprof from my laptop without exposing it publicly:
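A typical session looks like this; the namespace, deployment name, and metrics port (9090) are assumptions from my setup:

```shell
# Forward the Order Service's metrics port to localhost.
kubectl -n go-reliable port-forward deploy/order-service 9090:9090 &

# Capture a 30-second CPU profile and open it in pprof's web UI.
go tool pprof -http=:8080 "http://localhost:9090/debug/pprof/profile?seconds=30"

# Capture an in-use heap profile, useful for spotting leaked cursors and buffers.
go tool pprof -http=:8081 "http://localhost:9090/debug/pprof/heap"
```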

What pprof Found

Running a CPU profile during the load test revealed that 23% of Order Service CPU time was in JSON marshaling/unmarshaling. I was decoding the HTTP request body, passing the struct through three layers, then re-marshaling it for the database insert.

I refactored to pass the decoded struct directly through all layers (rather than re-encoding to JSON for the event publisher). CPU utilization dropped 18%.

The heap profile revealed a subtle issue: I was calling rows.Next() in a loop but had a code path that returned early without calling rows.Close(). Under load, this held open database cursors, contributing to the connection pool pressure. Fixing it (adding defer rows.Close() immediately after rows, err := db.Query(...)) recovered approximately 5 database connections.

Custom HPA for the Notification Worker

The Notification Worker should scale based on how many messages are waiting to be processed, not CPU. A three-message queue doesn't need more workers, regardless of CPU. A 50,000-message queue needs more workers even if current CPU is low.

I expose the pending message count as a Prometheus metric (Part 4), then use the Prometheus Adapter to make it available to the Kubernetes HPA.

Prometheus Adapter Configuration
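A sketch of the adapter rule as prometheus-adapter Helm values, assuming the worker exports the queue depth as notification_pending_messages with namespace and pod labels (metric and label names come from my Part 4 instrumentation; adjust to yours):

```yaml
# prometheus-adapter values.yaml fragment (assumed metric names)
rules:
  custom:
    - seriesQuery: 'notification_pending_messages{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "notification_pending_messages"
        as: "notification_queue_depth"
      metricsQuery: 'max(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```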

HPA Using Custom Metric
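The HPA then targets an average queue depth per pod. The 500-message target below is an assumed value, chosen so a 4,000-message backlog drives the worker to its maximum of 8 replicas:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: notification-worker
  namespace: go-reliable
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: notification-worker
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: notification_queue_depth
        target:
          type: AverageValue
          averageValue: "500"   # assumed: ~500 pending messages per pod
```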

With this HPA, when a batch of 4,000 notifications arrives (e.g., a marketing campaign), the worker scales from 1 to 8 instances within a few minutes and drains the queue without manual intervention.

Chaos Engineering with Chaos Mesh

Chaos Mesh is a chaos engineering platform that runs on Kubernetes. I run chaos experiments in the staging namespace to verify my reliability assumptions.

Installing Chaos Mesh

Chaos Mesh is deployed via ArgoCD like everything else:
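A minimal Application manifest pointing at the official Helm chart (the pinned chart version is illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-mesh
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://charts.chaos-mesh.org
    chart: chaos-mesh
    targetRevision: 2.6.3   # illustrative version pin
  destination:
    server: https://kubernetes.default.svc
    namespace: chaos-mesh
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```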

Experiment 1: Order Service Pod Failure

Does the API Gateway handle Order Service pod restarts gracefully? My expectation: in-flight requests may fail, but new requests recover within 30 seconds as Kubernetes starts a replacement pod.
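Expressed as a Chaos Mesh PodChaos resource (the label selector is an assumption from my staging labels):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: order-service-pod-failure
  namespace: staging
spec:
  action: pod-failure
  mode: one          # fail one randomly selected matching pod
  duration: "30s"
  selector:
    namespaces: ["staging"]
    labelSelectors:
      app: order-service
```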

Result: Pass. During the 30-second failure window, error rate spiked to ~40% (the single pod was the only replica in staging). After pod recovery, error rate returned to 0 within 45 seconds. With 2 replicas in production and maxUnavailable: 0 in the deployment strategy, this would cause no errors at all.

Finding: The readiness probe correctly prevented the restarting pod from receiving traffic. The startup probe gave it time to complete health checks before getting traffic. The system behaved exactly as designed.

Experiment 2: Network Partition to Database

What happens to the Order Service when the database becomes unreachable?
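The partition as a NetworkChaos resource; the postgres label selector is an assumption about how the database pods are labeled:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: order-service-db-partition
  namespace: staging
spec:
  action: partition
  mode: all
  duration: "60s"
  direction: both    # drop traffic in both directions
  selector:
    namespaces: ["staging"]
    labelSelectors:
      app: order-service
  target:
    mode: all
    selector:
      namespaces: ["staging"]
      labelSelectors:
        app: postgres
```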

Result: Partial fail. Order creation failed immediately (expected — there's no write-through cache). However, I discovered that the readiness probe was also failing (correctly) within 10 seconds, which removed the Order Service from the load balancer. But the API Gateway was returning 502 errors rather than meaningful 503s to clients.

Fix: I added a circuit breaker in the API Gateway's Order Service client, returning a 503 Service Unavailable with a Retry-After: 10 header when the Order Service's health check fails, rather than propagating the connection error as a 502.

Experiment 3: CPU Stress on Notification Worker

Does the notification worker slow down message processing under CPU stress? I expected processing time to increase but not completely stall.
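The StressChaos resource for this experiment (duration and label selector are assumptions):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: notification-worker-cpu-stress
  namespace: staging
spec:
  mode: one
  duration: "5m"
  selector:
    namespaces: ["staging"]
    labelSelectors:
      app: notification-worker
  stressors:
    cpu:
      workers: 2
      load: 80   # each stress worker targets 80% of a core
```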

Result: Pass. Under 80% CPU stress, message processing throughput dropped from ~50 messages/sec to ~12 messages/sec. The pending queue grew, the HPA triggered, and within 90 seconds a second worker pod was running. The queue drained within 3 minutes after stress ended.

Finding: The custom HPA metric works correctly. In this particular scenario a CPU-based HPA would also have fired, since the stressed pod was saturated, but for the wrong reason. Queue depth is the signal that actually matters, and it also covers the more common case where the backlog grows while CPU stays low.

Capacity Planning Summary

After this phase, I had a documented capacity profile:

Service             | Max Sustained Throughput | Bottleneck             | Mitigation
API Gateway         | 500 RPS                  | CPU throttling         | HPA scales at 70% CPU
Order Service       | 200 orders/min           | DB connections         | Pool size 50, HPA at 70% CPU
Notification Worker | 50 msg/sec/pod           | I/O bound (email API)  | Custom HPA on queue depth

These numbers aren't impressive for a large system, but they're well above what I need, and more importantly — I know what they are. Knowing limits means I can plan for load events and not be surprised by them.

In Part 8, the series moves into the ML operations phase. I add a model training pipeline using Kubeflow on the same Kubernetes cluster, deploy model serving with KServe, and connect the ML Inference Gateway service to serve predictions.
