Part 1: Building a Production-Ready Go Microservices Platform

Part of the SRE Playbook series

What You'll Learn: This article walks through how I structured a Go microservices project from scratch with production reliability in mind. You'll see the project layout, how I built an API gateway with chi, a PostgreSQL-backed order service, an async notification worker using NATS JetStream, and how I containerize everything with multi-stage Docker builds. This is the foundation that every subsequent article in this series runs on.

Why I Started This Project

I'd been working with Go for a few years and had accumulated a mental checklist of patterns I wished I'd applied from day one on earlier projects. Things like: proper graceful shutdown, context propagation through every function signature, structured logging from the start, and a Makefile that actually documents how to run things.

So I built the GoReliable platform as a personal reference implementation. It's not a toy — it runs real workloads — but it's also sized for a single engineer to fully understand and operate. That's an intentional constraint: many reliability problems come from systems that grew beyond anyone's ability to reason about them.

The platform has four services:

  • API Gateway — the edge; handles auth, rate limiting, and routing

  • Order Service — domain logic, talking to PostgreSQL

  • Notification Worker — async processing via NATS JetStream

  • ML Inference Gateway — proxies requests to model-serving endpoints

This article focuses on the Go code and project structure. Deployment comes in Parts 2 and 3.

Project Structure

One of the first decisions I make on any Go project is the directory layout. I follow the standard cmd/, internal/, pkg/ convention, but I'm deliberate about what goes where.

go-reliable/
├── cmd/
│   ├── api-gateway/
│   │   └── main.go
│   ├── order-service/
│   │   └── main.go
│   ├── notification-worker/
│   │   └── main.go
│   └── ml-gateway/
│       └── main.go
├── internal/
│   ├── gateway/
│   │   ├── handler.go
│   │   ├── middleware/
│   │   │   ├── auth.go
│   │   │   ├── ratelimit.go
│   │   │   └── requestid.go
│   │   └── router.go
│   ├── order/
│   │   ├── handler.go
│   │   ├── repository.go
│   │   └── service.go
│   ├── notification/
│   │   ├── consumer.go
│   │   └── sender.go
│   └── mlgateway/
│       ├── client.go
│       └── handler.go
├── pkg/
│   ├── config/
│   │   └── config.go
│   ├── database/
│   │   └── postgres.go
│   ├── health/
│   │   └── health.go
│   ├── logger/
│   │   └── logger.go
│   └── telemetry/
│       └── otel.go
├── deployments/
│   ├── helm/
│   └── argocd/
├── Makefile
├── go.mod
└── go.sum

The separation between internal/ and pkg/ is intentional. internal/ is service-specific code that I never plan to reuse. pkg/ is the cross-cutting infrastructure — config loading, database helpers, the logger — that all four services share.

Shared Infrastructure (pkg/)

Config Loading

I use environment variables for all configuration, loaded via envconfig. I avoid YAML config files for services because they make secrets management harder and create a parallel source of truth alongside Kubernetes ConfigMaps.

Structured Logger

Every service uses the same logger setup from pkg/logger. I chose zerolog because it's zero-allocation, fast, and produces JSON by default — which Loki and any log aggregator can parse without configuration.

One pattern I rely on heavily: passing the logger through context so that every function downstream can emit a log event that carries the same request ID and trace ID without passing the logger explicitly.

Database Connection

Pinging the database at startup is deliberate. I want the service to fail its readiness probe and stay out of the load balancer if the database isn't reachable. It surfaces configuration problems immediately.

The API Gateway

The gateway is the entry point for all external traffic. Its responsibilities are intentionally narrow: authentication, rate limiting, request ID injection, and proxying to the right downstream service.

Router Setup

Request ID Middleware

Request IDs are the minimum viable correlation mechanism. Every log event, every trace span, and every downstream call carries the same ID. This makes following a single request through the logs feasible without a full distributed tracing setup (though we add that in Part 4).

Rate Limiting Middleware

I use a token bucket rate limiter with per-client IP limiting. The limiter state is intentionally in-memory — for a single-instance gateway this is fine; for multi-instance deployments you'd use Redis, but I deliberately kept this simple for the personal project.

The Order Service

The Order Service owns the order domain. I use the repository pattern to isolate database access from business logic — this matters for testing and for the reliability patterns in Part 7.

A decision worth explaining: I publish the notification event in a goroutine and don't fail the order if publishing fails. Notifications are best-effort. I'd rather complete the order and miss a notification than fail the order because NATS is temporarily unavailable. This is a conscious reliability trade-off.

The Notification Worker

The worker consumes from NATS JetStream. JetStream gives me at-least-once delivery semantics and durable subscriptions — if the worker restarts, it picks up where it left off.

The retry configuration is not arbitrary. I picked 5 retries with exponential backoff based on the expected recovery time of an email provider outage — short enough to deliver notifications promptly, long enough not to flood a degraded downstream during recovery.

Graceful Shutdown

Every service needs to handle SIGTERM cleanly. Kubernetes sends SIGTERM before killing the pod, and I have a 30-second termination grace period. If the service doesn't handle it, in-flight requests get dropped.

Multi-Stage Docker Build

The Dockerfile for each service follows the same pattern: the build stage compiles a static binary; the runtime stage contains little more than that binary.

I use distroless/static rather than alpine for the runtime image. There's no shell, which reduces the attack surface. The trade-off is that debugging requires kubectl debug with an ephemeral container — acceptable for a personal project, and good practice to get used to.
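A sketch of that two-stage Dockerfile; the Go version, service name, and paths are illustrative:

```dockerfile
# Build stage: full Go toolchain, with module downloads cached as a layer.
FROM golang:1.22 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# CGO disabled so the binary is fully static and runs on distroless/static.
RUN CGO_ENABLED=0 go build -o /out/order-service ./cmd/order-service

# Runtime stage: no shell, no package manager, just the binary as non-root.
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=build /out/order-service /order-service
ENTRYPOINT ["/order-service"]
```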

The Makefile

A Makefile serves as the project's documentation for "how to do things". New contributors (or future me) shouldn't need to read docs to run tests or build images.

The help target using grep is a pattern I use on every project. It self-documents the Makefile without maintaining a separate README section.
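The help target, with a couple of illustrative targets to show the `##` comment convention it scrapes:

```make
.PHONY: help
help: ## Show this help
	@grep -E '^[a-zA-Z_-]+:.*## ' $(MAKEFILE_LIST) | \
		awk 'BEGIN {FS = ":.*## "}; {printf "  %-20s %s\n", $$1, $$2}'

.PHONY: test
test: ## Run unit tests with the race detector
	go test -race ./...
```

Any target annotated with a `## description` comment shows up in `make help` automatically; undocumented targets stay hidden.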

What's Not in This Article

I deliberately left out three things that each get their own dedicated treatment:

  1. Health check handler implementation — pkg/health/health.go — covered in Part 2, where I explain how Kubernetes uses liveness, readiness, and startup probes differently

  2. Prometheus instrumentation — the customMiddleware.Metrics reference in the router — covered in Part 4

  3. OpenTelemetry setup — pkg/telemetry/otel.go — covered in Part 4

Where We Are

At this point I have four Go services that compile, have tests, and run locally against Docker Compose dependencies. The code is structured so that:

  • Every external dependency is behind an interface (testable without an actual database or NATS)

  • Every function that does I/O takes a context.Context as its first argument

  • Shutdown is handled cleanly

  • Logging emits structured JSON from the start

In Part 2, I package these services into Helm charts and deploy them to Kubernetes with proper health checks, resource limits, and multi-environment values.
