Part 7: Capacity Planning, Performance, and Chaos Engineering

Part of the SRE Playbook series

What You'll Learn: This article covers how I load test the GoReliable services with k6, profile Go binaries in production using pprof, configure the Notification Worker's HPA to scale on NATS queue depth rather than CPU, and use Chaos Mesh to intentionally break things in staging to verify the system handles failures gracefully. The goal is to find limits before users do.

Finding Limits Before Users Do

After the SLO alerting and incident tooling from Parts 5 and 6 were in place, I had good reactive capability. But I had almost no idea where my limits were. I didn't know the maximum sustained order throughput before database connections saturated, or what happened to the notification queue if the email provider went down for 30 minutes.

This article covers the proactive side: deliberately finding those limits and either expanding them or building graceful degradation around them.

Load Testing with k6

k6 is my load testing tool of choice. It's scriptable in JavaScript, produces Prometheus-compatible metrics, and integrates well with my existing Grafana dashboards.

Order Creation Load Test

// tests/load/order-creation.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

const orderErrors = new Rate('order_errors');
const orderDuration = new Trend('order_duration', true);

// Test configuration: ramp up to 100 VUs over 2 minutes, hold 5 minutes, ramp down
export const options = {
  stages: [
    { duration: '2m', target: 100 },   // Ramp up
    { duration: '5m', target: 100 },   // Steady state
    { duration: '1m', target: 0 },     // Ramp down
  ],
  thresholds: {
    http_req_failed: ['rate<0.01'],         // < 1% error rate
    http_req_duration: ['p(99)<500'],       // p99 < 500ms
    order_errors: ['rate<0.005'],           // < 0.5% order errors
  },
};

const BASE_URL = __ENV.BASE_URL || 'https://api.staging.go-reliable.dev';

export function setup() {
  // Authenticate and return a token for all VUs to share
  const res = http.post(`${BASE_URL}/auth/token`, JSON.stringify({
    client_id: __ENV.CLIENT_ID,
    client_secret: __ENV.CLIENT_SECRET,
  }), { headers: { 'Content-Type': 'application/json' } });

  check(res, { 'auth succeeded': (r) => r.status === 200 });
  return { token: res.json('access_token') };
}

export default function (data) {
  const payload = JSON.stringify({
    amount: Math.floor(Math.random() * 10000) + 100,  // Random amount: 100-10099 cents
    currency: 'USD',
  });

  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${data.token}`,
    },
  };

  const res = http.post(`${BASE_URL}/api/v1/orders`, payload, params);

  const success = check(res, {
    'status is 201': (r) => r.status === 201,
    'has order id': (r) => r.json('id') != null,  // != catches both null and undefined
    'latency < 300ms': (r) => r.timings.duration < 300,
  });

  orderErrors.add(!success);
  orderDuration.add(res.timings.duration);

  sleep(0.5);  // Think time between requests per VU
}

Running this test:
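Assuming credentials are passed in via environment variables (the variable names here are illustrative), a typical invocation looks like:

```shell
# Run the order-creation scenario against staging.
# BASE_URL falls back to the staging default baked into the script.
k6 run \
  -e BASE_URL=https://api.staging.go-reliable.dev \
  -e CLIENT_ID="$LOAD_TEST_CLIENT_ID" \
  -e CLIENT_SECRET="$LOAD_TEST_CLIENT_SECRET" \
  tests/load/order-creation.js

# Optionally stream results to Prometheus for the Grafana dashboards
# (k6 v0.42+, with K6_PROMETHEUS_RW_SERVER_URL pointing at remote write).
k6 run -o experimental-prometheus-rw tests/load/order-creation.js
```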

What the Load Test Revealed

Running this at 100 concurrent users (my first attempt), I found two bottlenecks:

Bottleneck 1: Database connection pool at 80 VUs. The Order Service hit DB_MAX_OPEN_CONNS: 25 and started queueing connection requests. Response times climbed from 150ms to 800ms. Fix: increase MaxOpenConns to 50 and verify that PostgreSQL can handle the additional connections (it could — it was sized at max_connections=200).

Bottleneck 2: API Gateway CPU throttling at 100 VUs. The gateway was hitting its CPU limit and getting throttled by the Kubernetes CFS scheduler, adding latency variance. Fix: increase resources.limits.cpu from 500m to 1000m and let HPA scale horizontally.

After both fixes, the test passed its thresholds at 150 VUs. I didn't push further — I set my "known safe" production capacity at 100 VUs and configured the HPA to scale before approaching that limit.

Profiling Go Services in Production

Go's built-in pprof profiler is safe to run in production. I expose the pprof endpoint on the metrics port (not the main HTTP port) with IP-based access control.

Capturing Profiles

I use kubectl port-forward to access pprof from my laptop without exposing it publicly:
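A typical session looks like this; the namespace, deployment name, and metrics port (9090) are assumptions from my setup:

```shell
# Forward the Order Service's metrics port to localhost.
kubectl -n go-reliable port-forward deploy/order-service 9090:9090 &

# Capture a 30-second CPU profile and open it in pprof's web UI.
go tool pprof -http=:8080 "http://localhost:9090/debug/pprof/profile?seconds=30"

# Capture an in-use heap profile, useful for spotting leaked cursors and buffers.
go tool pprof -http=:8081 "http://localhost:9090/debug/pprof/heap"
```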

What pprof Found

Running a CPU profile during the load test revealed that 23% of Order Service CPU time was in JSON marshaling/unmarshaling. I was decoding the HTTP request body, passing the struct through three layers, then re-marshaling it for the database insert.

I refactored to pass the decoded struct directly through all layers (rather than re-encoding to JSON for the event publisher). CPU utilization dropped 18%.

The heap profile revealed a subtle issue: I was calling rows.Next() in a loop but had a code path that returned early without calling rows.Close(). Under load, this held open database cursors, contributing to the connection pool pressure. Fixing it (adding defer rows.Close() immediately after rows, err := db.Query(...)) recovered approximately 5 database connections.

Custom HPA for the Notification Worker

The Notification Worker should scale based on how many messages are waiting to be processed, not CPU. A three-message queue doesn't need more workers, regardless of CPU. A 50,000-message queue needs more workers even if current CPU is low.

I expose the pending message count as a Prometheus metric (Part 4), then use the Prometheus Adapter to make it available to the Kubernetes HPA.

Prometheus Adapter Configuration
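A sketch of the adapter rule as prometheus-adapter Helm values, assuming the worker exports the queue depth as notification_pending_messages with namespace and pod labels (metric and label names come from my Part 4 instrumentation; adjust to yours):

```yaml
# prometheus-adapter values.yaml fragment (assumed metric names)
rules:
  custom:
    - seriesQuery: 'notification_pending_messages{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "notification_pending_messages"
        as: "notification_queue_depth"
      metricsQuery: 'max(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```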

HPA Using Custom Metric
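The HPA then targets an average queue depth per pod. The 500-message target below is an assumed value, chosen so a 4,000-message backlog drives the worker to its maximum of 8 replicas:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: notification-worker
  namespace: go-reliable
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: notification-worker
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: notification_queue_depth
        target:
          type: AverageValue
          averageValue: "500"   # assumed: ~500 pending messages per pod
```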

With this HPA, when a batch of 4,000 notifications arrives (e.g., a marketing campaign), the worker scales from 1 to 8 instances within a few minutes and drains the queue without manual intervention.

Chaos Engineering with Chaos Mesh

Chaos Mesh is a chaos engineering platform that runs on Kubernetes. I run chaos experiments in the staging namespace to verify my reliability assumptions.

Installing Chaos Mesh

Chaos Mesh is deployed via ArgoCD like everything else:
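A minimal Application manifest pointing at the official Helm chart (the pinned chart version is illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-mesh
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://charts.chaos-mesh.org
    chart: chaos-mesh
    targetRevision: 2.6.3   # illustrative version pin
  destination:
    server: https://kubernetes.default.svc
    namespace: chaos-mesh
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```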

Experiment 1: Order Service Pod Failure

Does the API Gateway handle Order Service pod restarts gracefully? My expectation: in-flight requests may fail, but new requests recover within 30 seconds as Kubernetes starts a replacement pod.
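Expressed as a Chaos Mesh PodChaos resource (the label selector is an assumption from my staging labels):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: order-service-pod-failure
  namespace: staging
spec:
  action: pod-failure
  mode: one          # fail one randomly selected matching pod
  duration: "30s"
  selector:
    namespaces: ["staging"]
    labelSelectors:
      app: order-service
```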

Result: Pass. During the 30-second failure window, error rate spiked to ~40% (the single pod was the only replica in staging). After pod recovery, error rate returned to 0 within 45 seconds. With 2 replicas in production and maxUnavailable: 0 in the deployment strategy, this would cause no errors at all.

Finding: The readiness probe correctly prevented the restarting pod from receiving traffic. The startup probe gave it time to complete health checks before getting traffic. The system behaved exactly as designed.

Experiment 2: Network Partition to Database

What happens to the Order Service when the database becomes unreachable?
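The partition as a NetworkChaos resource; the postgres label selector is an assumption about how the database pods are labeled:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: order-service-db-partition
  namespace: staging
spec:
  action: partition
  mode: all
  duration: "60s"
  direction: both    # drop traffic in both directions
  selector:
    namespaces: ["staging"]
    labelSelectors:
      app: order-service
  target:
    mode: all
    selector:
      namespaces: ["staging"]
      labelSelectors:
        app: postgres
```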

Result: Partial fail. Order creation failed immediately (expected — there's no write-through cache). However, I discovered that the readiness probe was also failing (correctly) within 10 seconds, which removed the Order Service from the load balancer. But the API Gateway was returning 502 errors rather than meaningful 503s to clients.

Fix: I added a circuit breaker in the API Gateway's Order Service client, returning a 503 Service Unavailable with a Retry-After: 10 header when the Order Service's health check fails, rather than propagating the connection error as a 502.

Experiment 3: CPU Stress on Notification Worker

Does the notification worker slow down message processing under CPU stress? I expected processing time to increase but not completely stall.
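The StressChaos resource for this experiment (duration and label selector are assumptions):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: notification-worker-cpu-stress
  namespace: staging
spec:
  mode: one
  duration: "5m"
  selector:
    namespaces: ["staging"]
    labelSelectors:
      app: notification-worker
  stressors:
    cpu:
      workers: 2
      load: 80   # each stress worker targets 80% of a core
```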

Result: Pass. Under 80% CPU stress, message processing throughput dropped from ~50 messages/sec to ~12 messages/sec. The pending queue grew, the HPA triggered, and within 90 seconds a second worker pod was running. The queue drained within 3 minutes after stress ended.

Finding: The custom HPA metric works correctly. In this particular scenario a CPU-based HPA would also have fired, since the stressed pod was saturated, but for the wrong reason. Queue depth is the signal that actually matters, and it also covers the more common case where the backlog grows while CPU stays low.

Capacity Planning Summary

After this phase, I had a documented capacity profile:

Service             | Max Sustained Throughput | Bottleneck             | Mitigation
API Gateway         | 500 RPS                  | CPU throttling         | HPA scales at 70% CPU
Order Service       | 200 orders/min           | DB connections         | Pool size 50, HPA at 70% CPU
Notification Worker | 50 msg/sec/pod           | I/O bound (email API)  | Custom HPA on queue depth

These numbers aren't impressive for a large system, but they're well above what I need, and more importantly — I know what they are. Knowing limits means I can plan for load events and not be surprised by them.

In Part 8, the series moves into the ML operations phase. I add a model training pipeline using Kubeflow on the same Kubernetes cluster, deploy model serving with KServe, and connect the ML Inference Gateway service to serve predictions.
