Part 1: Building a Production-Ready Go Microservices Platform

Part of the SRE Playbook series

What You'll Learn: This article walks through how I structured a Go microservices project from scratch with production reliability in mind. You'll see the project layout, how I built an API gateway with chi, a PostgreSQL-backed order service, an async notification worker using NATS JetStream, and how I containerize everything with multi-stage Docker builds. This is the foundation that every subsequent article in this series runs on.

Why I Started This Project

I'd been working with Go for a few years and had accumulated a mental checklist of patterns I wished I'd applied from day one on earlier projects. Things like: proper graceful shutdown, context propagation through every function signature, structured logging from the start, and a Makefile that actually documents how to run things.

So I built the GoReliable platform as a personal reference implementation. It's not a toy — it runs real workloads — but it's also sized for a single engineer to fully understand and operate. That's an intentional constraint: many reliability problems come from systems that grew beyond anyone's ability to reason about them.

The platform has four services:

  • API Gateway — the edge; handles auth, rate limiting, and routing

  • Order Service — domain logic, talking to PostgreSQL

  • Notification Worker — async processing via NATS JetStream

  • ML Inference Gateway — proxies requests to model-serving endpoints

This article focuses on the Go code and project structure. Deployment comes in Parts 2 and 3.

Project Structure

One of the first decisions I make on any Go project is the directory layout. I follow the standard cmd/, internal/, pkg/ convention, but I'm deliberate about what goes where.

go-reliable/
├── cmd/
│   ├── api-gateway/
│   │   └── main.go
│   ├── order-service/
│   │   └── main.go
│   ├── notification-worker/
│   │   └── main.go
│   └── ml-gateway/
│       └── main.go
├── internal/
│   ├── gateway/
│   │   ├── handler.go
│   │   ├── middleware/
│   │   │   ├── auth.go
│   │   │   ├── ratelimit.go
│   │   │   └── requestid.go
│   │   └── router.go
│   ├── order/
│   │   ├── handler.go
│   │   ├── repository.go
│   │   └── service.go
│   ├── notification/
│   │   ├── consumer.go
│   │   └── sender.go
│   └── mlgateway/
│       ├── client.go
│       └── handler.go
├── pkg/
│   ├── config/
│   │   └── config.go
│   ├── database/
│   │   └── postgres.go
│   ├── health/
│   │   └── health.go
│   ├── logger/
│   │   └── logger.go
│   └── telemetry/
│       └── otel.go
├── deployments/
│   ├── helm/
│   └── argocd/
├── Makefile
├── go.mod
└── go.sum

The separation between internal/ and pkg/ is intentional. internal/ is service-specific code that I never plan to reuse. pkg/ is the cross-cutting infrastructure — config loading, database helpers, the logger — that all four services share.

Shared Infrastructure (pkg/)

Config Loading

I use environment variables for all configuration, loaded via envconfig. I avoid YAML config files for services because they make secrets management harder and create a parallel source of truth alongside Kubernetes ConfigMaps.

Structured Logger

Every service uses the same logger setup from pkg/logger. I chose zerolog because it's zero-allocation, fast, and produces JSON by default — which Loki and any log aggregator can parse without configuration.

One pattern I rely on heavily: passing the logger through context so that every function downstream can emit a log event that carries the same request ID and trace ID without passing the logger explicitly.

Database Connection

Pinging the database at startup is deliberate. I want the service to fail its readiness probe and stay out of the load balancer if the database isn't reachable. It surfaces configuration problems immediately.

The API Gateway

The gateway is the entry point for all external traffic. Its responsibilities are intentionally narrow: authentication, rate limiting, request ID injection, and proxying to the right downstream service.

Router Setup

Request ID Middleware

Request IDs are the minimum viable correlation mechanism. Every log event, every trace span, and every downstream call carries the same ID. This makes following a single request through the logs feasible without a full distributed tracing setup (though we add that in Part 4).

Rate Limiting Middleware

I use a token bucket rate limiter with per-client IP limiting. The limiter state is intentionally in-memory — for a single-instance gateway this is fine; for multi-instance deployments you'd use Redis, but I deliberately kept this simple for the personal project.

The Order Service

The Order Service owns the order domain. I use the repository pattern to isolate database access from business logic — this matters for testing and for the reliability patterns in Part 7.

A decision worth explaining: I publish the notification event in a goroutine and don't fail the order if publishing fails. Notifications are best-effort. I'd rather complete the order and miss a notification than fail the order because NATS is temporarily unavailable. This is a conscious reliability trade-off.

The Notification Worker

The worker consumes from NATS JetStream. JetStream gives me at-least-once delivery semantics and durable subscriptions — if the worker restarts, it picks up where it left off.

The retry configuration is not arbitrary. I picked 5 retries with exponential backoff based on the expected recovery time of an email provider outage — short enough to deliver notifications promptly, long enough not to flood a degraded downstream during recovery.

Graceful Shutdown

Every service needs to handle SIGTERM cleanly. Kubernetes sends SIGTERM before killing the pod, and I have a 30-second termination grace period. If the service doesn't handle it, in-flight requests get dropped.

Multi-Stage Docker Build

The Dockerfile for each service follows the same pattern: the build stage compiles a static binary; the runtime stage contains little more than that binary.

I use distroless/static rather than alpine for the runtime image. There's no shell, which reduces the attack surface. The trade-off is that debugging requires kubectl debug with an ephemeral container — acceptable for a personal project, and good practice to get used to.
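A sketch of that two-stage Dockerfile; the Go version, service name, and paths are illustrative:

```dockerfile
# Build stage: full Go toolchain, with module downloads cached as a layer.
FROM golang:1.22 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# CGO disabled so the binary is fully static and runs on distroless/static.
RUN CGO_ENABLED=0 go build -o /out/order-service ./cmd/order-service

# Runtime stage: no shell, no package manager, just the binary as non-root.
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=build /out/order-service /order-service
ENTRYPOINT ["/order-service"]
```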

The Makefile

A Makefile serves as the project's documentation for "how to do things". New contributors (or future me) shouldn't need to read docs to run tests or build images.

The help target using grep is a pattern I use on every project. It self-documents the Makefile without maintaining a separate README section.
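The help target, with a couple of illustrative targets to show the `##` comment convention it scrapes:

```make
.PHONY: help
help: ## Show this help
	@grep -E '^[a-zA-Z_-]+:.*## ' $(MAKEFILE_LIST) | \
		awk 'BEGIN {FS = ":.*## "}; {printf "  %-20s %s\n", $$1, $$2}'

.PHONY: test
test: ## Run unit tests with the race detector
	go test -race ./...
```

Any target annotated with a `## description` comment shows up in `make help` automatically; undocumented targets stay hidden.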

What's Not in This Article

I deliberately left out three things that each get their own dedicated treatment:

  1. Health check handler implementation — pkg/health/health.go — covered in Part 2, where I explain how Kubernetes uses liveness, readiness, and startup probes differently

  2. Prometheus instrumentation — the customMiddleware.Metrics reference in the router — covered in Part 4

  3. OpenTelemetry setup — pkg/telemetry/otel.go — covered in Part 4

Where We Are

At this point I have four Go services that compile, have tests, and run locally against Docker Compose dependencies. The code is structured so that:

  • Every external dependency is behind an interface (testable without an actual database or NATS)

  • Every function that does I/O takes a context.Context as its first argument

  • Shutdown is handled cleanly

  • Logging emits structured JSON from the start

In Part 2, I package these services into Helm charts and deploy them to Kubernetes with proper health checks, resource limits, and multi-environment values.
