Part 2: Kubernetes Deployment with Helm Charts

Part of the SRE Playbook series

What You'll Learn: This article covers how I package the GoReliable services into Helm charts: one chart per service, a shared library chart for common templates, and per-environment values files. You'll see how I configure health probes based on how Go exposes them, set resource requests and limits derived from actual profiling data, add horizontal pod autoscaler configuration, and manage database migrations as Helm hooks. By the end, you'll have a Helm chart structure you can adapt for your own Go services.

The Configuration Problem I Kept Hitting

Before I adopted Helm for this project, I was managing Kubernetes manifests as raw YAML. That worked fine for one environment, but the moment I added a staging environment alongside production, I had duplicated YAML files drifting apart. Staging would get a fix, production wouldn't, and I'd spend 30 minutes debugging a deployment difference rather than an actual bug.

Helm solves this by making differences explicit. The base chart defines the shape of the deployment; values files define what varies per environment. When I look at values.staging.yaml, I see exactly what's different from values.production.yaml. There's no guesswork.

For background on Helm fundamentals, the Helm 101 series covers chart structure and templating in depth. This article focuses on patterns specific to Go microservices and reliability.

Chart Structure

I use one Helm chart per service, plus a shared library chart for templates used by all four services.

deployments/helm/
├── charts/
│   └── go-reliable-lib/          # Library chart (not deployed directly)
│       ├── Chart.yaml
│       └── templates/
│           ├── _deployment.tpl
│           ├── _service.tpl
│           ├── _hpa.tpl
│           └── _pdb.tpl
├── api-gateway/
│   ├── Chart.yaml
│   ├── templates/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── hpa.yaml
│   │   ├── pdb.yaml
│   │   └── configmap.yaml
│   ├── values.yaml              # defaults
│   ├── values.staging.yaml
│   └── values.production.yaml
├── order-service/
│   ├── Chart.yaml
│   ├── templates/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── hpa.yaml
│   │   ├── pdb.yaml
│   │   ├── configmap.yaml
│   │   └── migration-job.yaml   # Helm pre-upgrade hook
│   ├── values.yaml
│   ├── values.staging.yaml
│   └── values.production.yaml
├── notification-worker/
│   └── ...
└── ml-gateway/
    └── ...

The Deployment Template

The deployment template is the most important piece of each chart. There are three deliberate choices in it that I want to explain:

maxUnavailable: 0. During a rolling update, I never want fewer than the desired number of pods. This means Kubernetes brings up the new version before it brings down the old one. The trade-off is that you briefly run replicaCount + 1 pods. For my workloads, that's acceptable.

ConfigMap checksum annotation. Kubernetes doesn't restart pods when ConfigMaps change. By adding a checksum of the ConfigMap as an annotation, any change to the ConfigMap changes the annotation, which triggers a rolling restart. I discovered this behavior the hard way when an environment variable change didn't take effect and I spent an hour investigating the wrong thing.

Startup probe with high failureThreshold. The startup probe is specifically for the window between "container started" and "application ready to accept traffic". I give it 150 seconds because the Order Service sometimes takes longer to start on first deploy when it's running database migrations. Without the startup probe, the liveness probe would kill the pod before it finished starting.
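Put together in template form, the three choices look roughly like this. This is an abridged illustration, not the repo's actual template; the app.fullname helper, the port number, and the probe paths are assumptions:

```yaml
# templates/deployment.yaml (abridged sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "app.fullname" . }}
spec:
  replicas: {{ .Values.replicaCount }}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0        # never drop below the desired replica count
  template:
    metadata:
      annotations:
        # any ConfigMap change changes this hash, forcing a rolling restart
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - name: http
              containerPort: 8080
          startupProbe:
            httpGet:
              path: /healthz
              port: http
            periodSeconds: 5
            failureThreshold: 30   # 30 x 5s = 150s budget for slow first starts
          livenessProbe:
            httpGet:
              path: /healthz
              port: http
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /readyz
              port: http
            periodSeconds: 5
```

Until the startup probe succeeds, Kubernetes suppresses the liveness and readiness probes, which is what buys the migration window.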

Health Check Handler

The health check implementation in Go needs to distinguish between liveness and readiness. These are meaningfully different concepts.

The separation matters in production. If I roll out a bad database query that causes timeouts, I want the readiness probe to fail (stop sending traffic) but the liveness probe to succeed (don't restart the pod; I might need to exec into it to debug). Failing liveness on a dependency check is a common mistake that causes cascading restarts.

Resource Requests and Limits

Resource requests and limits are the most commonly misconfigured aspect of Kubernetes deployments I've seen. Either they're not set at all (cluster scheduling chaos), or they're guessed and wildly inaccurate.

I derived my values using Go's built-in profiling tools. I ran each service under realistic load using k6 (covered more in Part 7) and captured pprof profiles.

API Gateway (values.yaml, defaults for development):
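A sketch of what that block looks like (numbers are illustrative; the repo's profiled values aren't reproduced here):

```yaml
resources:
  requests:
    cpu: 50m
    memory: 64Mi
  limits:
    cpu: 200m
    memory: 128Mi
```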

API Gateway (values.production.yaml):
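The production file raises the requests and keeps the CPU limit loose relative to the request (again, illustrative numbers rather than the author's profiled values):

```yaml
resources:
  requests:
    cpu: 250m
    memory: 128Mi
  limits:
    cpu: "1"        # generous: avoid CFS throttling, scale out via HPA instead
    memory: 256Mi   # roughly 2x steady-state RSS; OOM-kill one pod, not the node
```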

A note on CPU limits: I've read arguments both ways. In Kubernetes, CPU limits create CPU throttling via CFS quotas, which can hurt latency even when the node has spare capacity. For my Go services (mostly I/O bound at the gateway, not CPU bound), I set generous CPU limits and rely on HPA to scale horizontally rather than squeezing CPU. For memory limits, I set them tightly because memory limits determine whether the container gets OOM-killed; I'd rather OOM-kill a single pod and have Kubernetes restart it than have a memory leak quietly grow until the node is pressured.

Horizontal Pod Autoscaler
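A sketch of the HPA manifest (the utilization target and replica bounds are illustrative, not the repo's values):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes of low usage before scaling down
```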

The scaleDown.stabilizationWindowSeconds: 300 setting prevents flapping. Without it, a brief traffic dip would scale down pods, then a traffic spike would scale them back up, thrashing the cluster. Five minutes is long enough to absorb normal traffic variance.

Pod Disruption Budget

PDBs prevent Kubernetes from evicting too many pods simultaneously during node drains (maintenance, upgrades). Without a PDB, a node drain could take down all my pods at once.

For production with minReplicas: 2, setting minAvailable: 1 means Kubernetes can evict at most one pod at a time. Combined with maxUnavailable: 0 in the rollout strategy, I have reasonable protection against both rolling updates and voluntary disruptions.
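A minimal PDB for that setup might look like this (the label selector is an assumption based on standard chart labels):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-gateway
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: api-gateway
```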

Database Migration Helm Hook

The Order Service needs database migrations to run before the new deployment starts. I use a Kubernetes Job as a Helm pre-upgrade hook.
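A sketch of such a hook job (the migrate entrypoint and the reuse of the service image are assumptions):

```yaml
# templates/migration-job.yaml (sketch)
apiVersion: batch/v1
kind: Job
metadata:
  # Revision suffix keeps each upgrade's job name unique
  name: {{ .Release.Name }}-migrate-{{ .Release.Revision }}
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "0"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          args: ["migrate", "up"]   # assumed migration entrypoint
```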

The hook-delete-policy: before-hook-creation means Helm automatically deletes the old migration job before creating a new one. The hook-succeeded part means a successful job is also cleaned up. This keeps the namespace tidy.

I append .Release.Revision to the job name because job names must be unique; without it, Helm can't create a new job if an old one with the same name still exists.

Per-Environment Values

Here's the general shape of how values differ across environments:
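A sketch of the two override files (contents illustrative, not the repo's actual values):

```yaml
# values.staging.yaml
replicaCount: 1
autoscaling:
  enabled: false
logLevel: debug

# values.production.yaml
replicaCount: 2
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
logLevel: info
```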

The values files don't override the image tag; that's set by the CI pipeline at deploy time via helm upgrade --set image.tag=$GIT_SHA. This keeps image tags out of git (they'd change on every commit and create noisy diffs).

Deploying Manually vs GitOps

At this stage, I can deploy manually:
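For example, from the repo root (the namespace name is an assumption):

```bash
helm upgrade --install order-service deployments/helm/order-service \
  --namespace production \
  -f deployments/helm/order-service/values.production.yaml \
  --set image.tag=$GIT_SHA
```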

But manual deploys don't scale and don't provide the audit trail I want. In Part 3, I wire this up to ArgoCD so every commit to the GitOps repository automatically drives the cluster state, with no manual commands needed.

What I Got Wrong Initially

First attempt: I set memory.limit equal to memory.request for all services. This caused periodic OOM kills when traffic spiked and Go's garbage collector needed temporary headroom. The fix was setting limits to 2-3× the steady-state RSS.

Second attempt: I enabled HPA on the Notification Worker. The worker scales based on message queue depth, not CPU; CPU autoscaling caused it to scale down while there were still thousands of unprocessed messages. In Part 7, I show how to configure HPA with a custom Prometheus metric (NATS consumer pending count) instead.

Third attempt: I initially didn't use a startup probe, just a readiness probe with initialDelaySeconds: 60. This caused issues when Order Service migration took longer than expected on a fresh cluster. The startup probe with high failureThreshold handles variable startup times cleanly.

Where We Are

The services are packaged as Helm charts with:

  • Health probes correctly configured for Go services

  • Resources derived from profiling, not guessing

  • HPA and PDB for production reliability

  • Database migrations as Helm hooks

  • Per-environment values files

In Part 3, I set up ArgoCD to deploy these charts automatically from a GitOps repository, wire up a CI pipeline that builds images and updates the Helm values, and configure secret management with External Secrets Operator.
