Part 7: Programming for Reliability - Building Systems That Don't Break

What You'll Learn: This article shares my journey from writing code that "just works" to building applications designed for reliability from the ground up. You'll learn how to design Go services for failure, implement resilient API patterns, build observable applications, create self-documenting code, design databases for SRE, and adopt testing strategies that prevent production incidents. By the end, you'll know how to write code that operates itself and survives real-world chaos.

The Feature That Took Down Production

I'll never forget deploying what I thought was a simple feature to my Go-based notification service. All it did was send email notifications when users completed transactions. I tested it locally, it worked perfectly, and I shipped it to production on a Friday afternoon.

By Saturday morning, the entire service was down.

What happened? My "simple" feature had a subtle bug:

// The code that killed production
func (s *NotificationService) SendTransactionEmail(ctx context.Context, txn Transaction) error {
    user, err := s.userClient.GetUser(ctx, txn.UserID)
    if err != nil {
        return err  // ❌ Blocks transaction processing if user service is down
    }
    
    email := s.buildEmail(user, txn)
    
    // No timeout!
    err = s.emailClient.Send(email)  // ❌ Can hang forever
    if err != nil {
        return err  // ❌ Fails entire transaction if email fails
    }
    
    return nil
}

The issues:

Tight coupling: Transaction processing failed if the user service was down
No timeouts: Email sending could hang forever
Synchronous: Blocked critical path with non-critical operation
No fallback: All-or-nothing approach
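
The missing-timeout problem can be sketched in isolation. The helper below, `callWithTimeout`, is a hypothetical name and not part of the original service; it assumes the email client exposes only a blocking call with no context parameter. `sendSlowEmail` and `timedOut` exist purely to simulate a slow dependency.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callWithTimeout runs fn in its own goroutine and returns either fn's result
// or the context error, whichever comes first. Hypothetical helper for clients
// that do not accept a context themselves.
func callWithTimeout(ctx context.Context, timeout time.Duration, fn func() error) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	done := make(chan error, 1) // buffered so the goroutine can finish after a timeout
	go func() { done <- fn() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// sendSlowEmail simulates an email client that takes d to respond.
func sendSlowEmail(d time.Duration) func() error {
	return func() error {
		time.Sleep(d)
		return nil
	}
}

// timedOut reports whether a send with the given latency exceeds the budget.
func timedOut(latency, budget time.Duration) bool {
	err := callWithTimeout(context.Background(), budget, sendSlowEmail(latency))
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("fast send timed out:", timedOut(time.Millisecond, 200*time.Millisecond))
	fmt.Println("slow send timed out:", timedOut(500*time.Millisecond, 20*time.Millisecond))
}
```

Note that the goroutine keeps running until the blocking call returns, so this only bounds how long the caller waits. A client that accepts a `context.Context` directly is the better fix when one is available.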

This incident taught me: reliability isn't something you add later - it must be designed into the code from the start.

Designing for Failure: The SRE Mindset

After that production disaster, I adopted a new programming philosophy: assume everything will fail, and design accordingly.

Principle 1: Embrace Failure as Normal

Traditional programming:

// Assumes success
func ProcessOrder(orderID string) {
    order := database.GetOrder(orderID)  // What if DB is down?
    payment := paymentAPI.Charge(order)  // What if payment API is slow?
    email.Send(order.Email)              // What if email service fails?
}

SRE programming:

// Expects failure, handles gracefully
// (OrderService is an illustrative receiver holding orderStore and emailQueue)
func (s *OrderService) ProcessOrder(ctx context.Context, orderID string) error {
    // Bound the whole operation with a timeout
    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()
    
    order, err := s.orderStore.Get(ctx, orderID)
    if err != nil {
        return fmt.Errorf("failed to fetch order: %w", err)
    }
    
    // Critical path: charge payment with retries
    if _, err := s.chargeWithRetry(ctx, order); err != nil {
        return fmt.Errorf("payment failed: %w", err)
    }
    
    // Non-critical: send email asynchronously
    s.emailQueue.Publish(EmailJob{
        OrderID: orderID,
        Email:   order.Email,
    })
    
    return nil
}
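
The `chargeWithRetry` call above is not shown in the article. As a rough sketch of what such a helper might do, the code below retries an operation with exponential backoff and stops early when the context is cancelled. The names `retry`, `flakyOp`, and `attemptsNeeded` are assumptions made for illustration, and the failing operation is simulated.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry calls op up to maxAttempts times, doubling the delay after each
// failure. It stops early when ctx is cancelled.
func retry(ctx context.Context, maxAttempts int, baseDelay time.Duration, op func(context.Context) error) error {
	var lastErr error
	delay := baseDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = op(ctx); lastErr == nil {
			return nil
		}
		if attempt == maxAttempts {
			break // no point sleeping after the final attempt
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return fmt.Errorf("retry cancelled after %d attempts: %w", attempt, ctx.Err())
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

// flakyOp returns an operation that fails `failures` times before succeeding,
// plus a pointer to the number of calls made so far.
func flakyOp(failures int) (func(context.Context) error, *int) {
	calls := 0
	return func(context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("temporary failure")
		}
		return nil
	}, &calls
}

// attemptsNeeded runs a flaky operation under retry and reports how many calls
// were made and whether the operation eventually succeeded.
func attemptsNeeded(failures, maxAttempts int) (int, bool) {
	op, calls := flakyOp(failures)
	err := retry(context.Background(), maxAttempts, time.Millisecond, op)
	return *calls, err == nil
}

func main() {
	calls, ok := attemptsNeeded(2, 3)
	fmt.Println(calls, ok) // 3 true
	calls, ok = attemptsNeeded(5, 3)
	fmt.Println(calls, ok) // 3 false
}
```

Retrying a payment is only safe when the charge itself is idempotent, which is where the idempotency keys discussed later come in. A production version would likely also add jitter and skip retries for errors that are clearly permanent.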

Principle 2: Fail Fast and Explicitly

Don't let errors propagate silently. Make failures loud and actionable.

// BAD: Silent failure
func (s *Service) UpdateUser(ctx context.Context, userID string, data UserData) {
    user, err := s.db.GetUser(ctx, userID)
    if err != nil {
        log.Printf("error: %v", err)  // ❌ Logs, but the caller never learns the update failed
        return
    }
    // ... rest of code never executes
}

// GOOD: Explicit error handling
func (s *Service) UpdateUser(ctx context.Context, userID string, data UserData) error {
    // Confirm the user exists; the record itself is not needed here
    _, err := s.db.GetUser(ctx, userID)
    if err != nil {
        metrics.RecordError("user_service", "database_fetch_failed")
        return fmt.Errorf("failed to get user %s: %w", userID, err)
    }
    
    if err := s.validateUserData(data); err != nil {
        metrics.RecordError("user_service", "validation_failed")
        return fmt.Errorf("invalid user data: %w", err)
    }
    
    if err := s.db.UpdateUser(ctx, userID, data); err != nil {
        metrics.RecordError("user_service", "database_update_failed")
        return fmt.Errorf("failed to update user: %w", err)
    }
    
    metrics.RecordSuccess("user_service", "update_user")
    return nil
}

Principle 3: Decouple Critical from Non-Critical

Not all operations are equally important. Separate them.

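// My notification service refactored
type NotificationService struct {
    userClient UserClient
    txnStore   TransactionStore // Persists transactions (type name is illustrative)
    emailQueue Queue            // Changed from direct emailClient
    dlq        Queue            // Dead letter queue for failed notifications
    metrics    *metrics.Client
}

// Critical path: process transaction
func (s *NotificationService) ProcessTransaction(ctx context.Context, txn Transaction) error {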
    // Critical: save transaction
    if err := s.txnStore.Save(ctx, txn); err != nil {
        return fmt.Errorf("failed to save transaction: %w", err)
    }
    
    // Non-critical: queue notification for async processing
    s.emailQueue.Publish(NotificationJob{
        TransactionID: txn.ID,
        UserID:        txn.UserID,
        Type:          "transaction_complete",
    })
    
    return nil
}

// Separate worker processes notifications until ctx is cancelled
func (s *NotificationService) ProcessNotificationWorker(ctx context.Context) {
    for {
        job, err := s.emailQueue.Consume(ctx)
        if err != nil {
            if ctx.Err() != nil {
                return // Shutting down: stop instead of retrying forever
            }
            log.Error().Err(err).Msg("Failed to consume from queue")
            time.Sleep(5 * time.Second)
            continue
        }
        
        // Send email with retries and circuit breaker
        if err := s.sendEmailWithRetry(ctx, job); err != nil {
            log.Error().
                Err(err).
                Str("job_id", job.ID).
                Msg("Failed to send notification after retries")
            
            // Move to dead letter queue for manual review
            s.dlq.Publish(job)
        }
    }
}
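
The key property of the queue on the critical path is that `Publish` must never block transaction processing. The article does not show its `Queue` implementation, so the following is a minimal in-process sketch using a buffered channel. `BoundedQueue` and `ErrQueueFull` are hypothetical names, and a real deployment would more likely use a durable message broker.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrQueueFull is returned when the buffer has no room; callers can log it
// and carry on with the critical path.
var ErrQueueFull = errors.New("queue full")

// NotificationJob mirrors the job published in the example above.
type NotificationJob struct {
	TransactionID string
	UserID        string
	Type          string
}

// BoundedQueue is a hypothetical in-process stand-in for a message broker.
type BoundedQueue struct {
	jobs chan NotificationJob
}

func NewBoundedQueue(capacity int) *BoundedQueue {
	return &BoundedQueue{jobs: make(chan NotificationJob, capacity)}
}

// Publish never blocks: it either enqueues the job or reports ErrQueueFull.
func (q *BoundedQueue) Publish(job NotificationJob) error {
	select {
	case q.jobs <- job:
		return nil
	default:
		return ErrQueueFull
	}
}

// Len reports how many jobs are waiting to be consumed.
func (q *BoundedQueue) Len() int { return len(q.jobs) }

func main() {
	q := NewBoundedQueue(2)
	for i := 1; i <= 3; i++ {
		err := q.Publish(NotificationJob{
			TransactionID: fmt.Sprintf("txn-%d", i),
			Type:          "transaction_complete",
		})
		fmt.Printf("publish %d: err=%v queued=%d\n", i, err, q.Len())
	}
}
```

With a capacity of two, the third publish returns `ErrQueueFull` immediately instead of stalling the caller. Returning an error also gives the caller something to log or count, which the fire-and-forget `Publish` calls above do not.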

Building Resilient APIs in Go

Since rebuilding my services with reliability in mind, I have used the following API design pattern for all my Go services.

Complete Resilient HTTP Handler Pattern

// internal/handlers/user_handler.go
package handlers

import (
    "context"
    "encoding/json"
    "errors"
    "fmt"
    "net/http"
    "time"

    "github.com/go-playground/validator/v10"
    "github.com/rs/zerolog"
    
    "github.com/yourusername/myapp/internal/domain"
    "github.com/yourusername/myapp/pkg/circuitbreaker"
    "github.com/yourusername/myapp/pkg/metrics"
)

type UserHandler struct {
    userService domain.UserService
    validator   *validator.Validate
    cb          *circuitbreaker.CircuitBreaker
    logger      zerolog.Logger
}

type CreateUserRequest struct {
    Email    string `json:"email" validate:"required,email"`
    Name     string `json:"name" validate:"required,min=2,max=100"`
    Age      int    `json:"age" validate:"required,min=18,max=120"`
}

type CreateUserResponse struct {
    ID        string    `json:"id"`
    Email     string    `json:"email"`
    Name      string    `json:"name"`
    CreatedAt time.Time `json:"created_at"`
}

type ErrorResponse struct {
    Error   string            `json:"error"`
    Code    string            `json:"code"`
    Details map[string]string `json:"details,omitempty"`
}

func (h *UserHandler) CreateUser(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    ctx := r.Context()
    
    // Extract request ID for tracing
    requestID := r.Header.Get("X-Request-ID")
    logger := h.logger.With().Str("request_id", requestID).Logger()
    
    // 1. Decode and validate request
    var req CreateUserRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        logger.Warn().Err(err).Msg("Invalid request body")
        h.respondError(w, "Invalid request body", "INVALID_JSON", http.StatusBadRequest, nil)
        metrics.RecordHTTPRequest("POST", "/users", http.StatusBadRequest, time.Since(start))
        return
    }
    
    // 2. Validate input
    if err := h.validator.Struct(req); err != nil {
        validationErrors := h.formatValidationErrors(err)
        logger.Warn().Interface("errors", validationErrors).Msg("Validation failed")
        h.respondError(w, "Validation failed", "VALIDATION_ERROR", http.StatusBadRequest, validationErrors)
        metrics.RecordHTTPRequest("POST", "/users", http.StatusBadRequest, time.Since(start))
        return
    }
    
    // 3. Add timeout to prevent hanging
    ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
    defer cancel()
    
    // 4. Execute with circuit breaker
    var user *domain.User
    err := h.cb.Call(ctx, func() error {
        var err error
        user, err = h.userService.CreateUser(ctx, domain.CreateUserInput{
            Email: req.Email,
            Name:  req.Name,
            Age:   req.Age,
        })
        return err
    })
    
    // 5. Handle different error types
    if err != nil {
        status, code := h.classifyError(err)
        logger.Error().
            Err(err).
            Int("status", status).
            Str("code", code).
            Msg("Failed to create user")
        
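        // Echo the error text only for client errors; 5xx responses get a generic
        // message so internal details are not leaked to callers
        message := http.StatusText(status)
        if status < http.StatusInternalServerError {
            message = err.Error()
        }
        h.respondError(w, message, code, status, nil)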
        metrics.RecordHTTPRequest("POST", "/users", status, time.Since(start))
        metrics.RecordError("user_handler", "create_user_failed", code)
        return
    }
    
    // 6. Success response
    response := CreateUserResponse{
        ID:        user.ID,
        Email:     user.Email,
        Name:      user.Name,
        CreatedAt: user.CreatedAt,
    }
    
    h.respondJSON(w, response, http.StatusCreated)
    
    logger.Info().
        Str("user_id", user.ID).
        Dur("duration_ms", time.Since(start)).
        Msg("User created successfully")
    
    metrics.RecordHTTPRequest("POST", "/users", http.StatusCreated, time.Since(start))
}

func (h *UserHandler) classifyError(err error) (status int, code string) {
    // Map domain errors to HTTP status codes
    var validationErr *domain.ValidationError
    var notFoundErr *domain.NotFoundError
    var conflictErr *domain.ConflictError
    
    switch {
    case errors.As(err, &validationErr):
        return http.StatusBadRequest, "VALIDATION_ERROR"
    case errors.As(err, &notFoundErr):
        return http.StatusNotFound, "NOT_FOUND"
    case errors.As(err, &conflictErr):
        return http.StatusConflict, "CONFLICT"
    case errors.Is(err, context.DeadlineExceeded):
        return http.StatusGatewayTimeout, "TIMEOUT"
    case errors.Is(err, circuitbreaker.ErrCircuitOpen):
        return http.StatusServiceUnavailable, "SERVICE_UNAVAILABLE"
    default:
        return http.StatusInternalServerError, "INTERNAL_ERROR"
    }
}

func (h *UserHandler) formatValidationErrors(err error) map[string]string {
    validationErrors := make(map[string]string)
    
    if ve, ok := err.(validator.ValidationErrors); ok {
        for _, fe := range ve {
            field := fe.Field()
            tag := fe.Tag()
            
            switch tag {
            case "required":
                validationErrors[field] = "This field is required"
            case "email":
                validationErrors[field] = "Must be a valid email address"
            case "min":
                validationErrors[field] = fmt.Sprintf("Must be at least %s", fe.Param())
            case "max":
                validationErrors[field] = fmt.Sprintf("Must be at most %s", fe.Param())
            default:
                validationErrors[field] = "Invalid value"
            }
        }
    }
    
    return validationErrors
}

func (h *UserHandler) respondJSON(w http.ResponseWriter, data interface{}, status int) {
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(status)
    json.NewEncoder(w).Encode(data)
}

func (h *UserHandler) respondError(w http.ResponseWriter, message, code string, status int, details map[string]string) {
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(status)
    json.NewEncoder(w).Encode(ErrorResponse{
        Error:   message,
        Code:    code,
        Details: details,
    })
}
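
The handler relies on `circuitbreaker.Call` and `circuitbreaker.ErrCircuitOpen`, but the package itself is not shown. The sketch below is one simple way such a breaker could work, assuming a consecutive-failure threshold and a fixed cooldown. It is not the author's implementation, and `tripBreaker` is a demo helper added only to exercise it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrCircuitOpen is returned without calling fn while the breaker is open.
var ErrCircuitOpen = errors.New("circuit breaker is open")

// CircuitBreaker opens after maxFailures consecutive failures and lets calls
// through again once cooldown has elapsed.
type CircuitBreaker struct {
	mu          sync.Mutex
	maxFailures int
	cooldown    time.Duration
	failures    int
	openedAt    time.Time
}

func NewCircuitBreaker(maxFailures int, cooldown time.Duration) *CircuitBreaker {
	return &CircuitBreaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the breaker is open or ctx is already done.
func (cb *CircuitBreaker) Call(ctx context.Context, fn func() error) error {
	if err := ctx.Err(); err != nil {
		return err
	}

	cb.mu.Lock()
	open := cb.failures >= cb.maxFailures && time.Since(cb.openedAt) < cb.cooldown
	cb.mu.Unlock()
	if open {
		return ErrCircuitOpen
	}

	err := fn() // run outside the lock so slow calls do not block other callers

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.failures >= cb.maxFailures {
			cb.openedAt = time.Now() // (re)start the cooldown window
		}
		return err
	}
	cb.failures = 0 // any success closes the breaker
	return nil
}

// tripBreaker sends `calls` failing requests through a fresh breaker and
// reports how many reached the dependency and how many were rejected.
func tripBreaker(maxFailures, calls int) (executed, rejected int) {
	cb := NewCircuitBreaker(maxFailures, time.Minute)
	for i := 0; i < calls; i++ {
		err := cb.Call(context.Background(), func() error {
			executed++
			return errors.New("dependency down")
		})
		if errors.Is(err, ErrCircuitOpen) {
			rejected++
		}
	}
	return executed, rejected
}

func main() {
	executed, rejected := tripBreaker(3, 5)
	fmt.Printf("executed=%d rejected=%d\n", executed, rejected) // executed=3 rejected=2
}
```

This version lets every caller through once the cooldown expires, whereas fuller implementations usually allow a single trial request in a half-open state. The trade-off is simplicity against a brief burst of traffic to a dependency that may still be unhealthy.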

Observability-Driven Development

I learned to build observability into my code as I write it, not as an afterthought.

Pattern: Structured Context Logging

// pkg/logger/context.go
package logger

import (
    "context"

    "github.com/rs/zerolog"
)

type contextKey string

const loggerKey contextKey = "logger"

// WithLogger adds logger to context
func WithLogger(ctx context.Context, logger zerolog.Logger) context.Context {
    return context.WithValue(ctx, loggerKey, logger)
}

// FromContext extracts logger from context
func FromContext(ctx context.Context) zerolog.Logger {
    if logger, ok := ctx.Value(loggerKey).(zerolog.Logger); ok {
        return logger
    }
    return zerolog.Nop()  // No-op logger if not found
}

// WithFields adds fields to context logger
func WithFields(ctx context.Context, fields map[string]interface{}) context.Context {
    logger := FromContext(ctx)
    
    logCtx := logger.With()
    for k, v := range fields {
        logCtx = logCtx.Interface(k, v)
    }
    
    return WithLogger(ctx, logCtx.Logger())
}

Usage in service layer:

// internal/service/user_service.go
func (s *UserService) CreateUser(ctx context.Context, input CreateUserInput) (*User, error) {
    // Add context to logger
    ctx = logger.WithFields(ctx, map[string]interface{}{
        "email": input.Email,
        "operation": "create_user",
    })
    
    log := logger.FromContext(ctx)
    log.Info().Msg("Creating user")
    
    // Check if user exists
    existing, err := s.repo.FindByEmail(ctx, input.Email)
    if err != nil && !errors.Is(err, ErrNotFound) {
        log.Error().Err(err).Msg("Failed to check existing user")
        return nil, fmt.Errorf("database error: %w", err)
    }
    
    if existing != nil {
        log.Warn().Str("existing_user_id", existing.ID).Msg("User already exists")
        return nil, &ConflictError{Message: "User with this email already exists"}
    }
    
    // Create user
    user := &User{
        ID:        generateID(),
        Email:     input.Email,
        Name:      input.Name,
        CreatedAt: time.Now(),
    }
    
    if err := s.repo.Create(ctx, user); err != nil {
        log.Error().Err(err).Msg("Failed to create user in database")
        return nil, fmt.Errorf("failed to create user: %w", err)
    }
    
    log.Info().Str("user_id", user.ID).Msg("User created successfully")
    
    return user, nil
}
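
Context fields only help with correlation if every request carries an ID, and the handler shown earlier reads `X-Request-ID` without checking that it is set. A possible companion middleware, sketched here with only the standard library, generates an ID when the client did not send one. `RequestID`, `newRequestID`, and `echoID` are assumed names, not part of the article's codebase.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newRequestID returns 16 random bytes encoded as 32 hex characters.
func newRequestID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "unknown" // extremely unlikely; still lets the request proceed
	}
	return hex.EncodeToString(b)
}

// RequestID ensures every request has an X-Request-ID header and echoes it
// back on the response so clients can quote it in bug reports.
func RequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-ID")
		if id == "" {
			id = newRequestID()
			r.Header.Set("X-Request-ID", id)
		}
		w.Header().Set("X-Request-ID", id)
		next.ServeHTTP(w, r)
	})
}

// echoID sends one request through the middleware and returns the ID the
// inner handler saw and the ID echoed to the client.
func echoID(incoming string) (seen, echoed string) {
	h := RequestID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen = r.Header.Get("X-Request-ID")
	}))
	req := httptest.NewRequest(http.MethodPost, "/users", nil)
	if incoming != "" {
		req.Header.Set("X-Request-ID", incoming)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return seen, rec.Header().Get("X-Request-ID")
}

func main() {
	seen, echoed := echoID("")
	fmt.Println("generated and echoed:", seen == echoed && len(seen) == 32)
	seen, echoed = echoID("req-123")
	fmt.Println("propagated:", seen, echoed)
}
```

Placed ahead of the logging middleware, this would guarantee that the `request_id` field is never empty in the structured logs.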

Pattern: Trace Every Request

// pkg/tracing/middleware.go
package tracing

import (
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

func TraceMiddleware(serviceName string) func(http.Handler) http.Handler {
    tracer := otel.Tracer(serviceName)
    
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ctx, span := tracer.Start(r.Context(), r.Method+" "+r.URL.Path,
                trace.WithAttributes(
                    attribute.String("http.method", r.Method),
                    attribute.String("http.url", r.URL.String()),
                    attribute.String("http.user_agent", r.UserAgent()),
                ),
            )
            defer span.End()
            
            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }
}

Service layer with tracing:

func (s *UserService) CreateUser(ctx context.Context, input CreateUserInput) (*User, error) {
    ctx, span := s.tracer.Start(ctx, "UserService.CreateUser")
    defer span.End()
    
    span.SetAttributes(
        attribute.String("user.email", input.Email),
        attribute.String("user.name", input.Name),
    )
    
    // ... business logic
    
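    if err != nil {
        // RecordError only attaches an event; SetStatus marks the span as failed
        // (codes is go.opentelemetry.io/otel/codes)
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, err
    }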
    
    span.SetAttributes(attribute.String("user.id", user.ID))
    return user, nil
}

Database Design for SRE

How you design your database impacts reliability as much as your application code.

Pattern 1: Idempotency Keys

Prevent duplicate operations from network retries:

type Transaction struct {
    ID              string
    IdempotencyKey  string `db:"idempotency_key"` // Unique constraint
    UserID          string
    Amount          float64 // Simplified for the example; prefer integer cents or a decimal type for money
    Status          string
    CreatedAt       time.Time
}

func (s *TransactionService) CreateTransaction(ctx context.Context, req CreateTransactionRequest) (*Transaction, error) {
    // Check if already processed
    existing, err := s.repo.FindByIdempotencyKey(ctx, req.IdempotencyKey)
    if err != nil && !errors.Is(err, ErrNotFound) {
        return nil, fmt.Errorf("failed to check idempotency: %w", err)
    }
    
    if existing != nil {
        // Already processed, return existing result
        log.Info().
            Str("idempotency_key", req.IdempotencyKey).
            Str("transaction_id", existing.ID).
            Msg("Returning existing transaction (idempotent)")
        return existing, nil
    }
    
    // Create new transaction
    txn := &Transaction{
        ID:             generateID(),
        IdempotencyKey: req.IdempotencyKey,
        UserID:         req.UserID,
        Amount:         req.Amount,
        Status:         "pending",
        CreatedAt:      time.Now(),
    }
    
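    if err := s.repo.Create(ctx, txn); err != nil {
        // A concurrent request with the same key may have inserted between the
        // lookup above and this insert; the UNIQUE constraint then rejects ours.
        // Re-reading by key returns the winner's result in that case.
        if existing, findErr := s.repo.FindByIdempotencyKey(ctx, req.IdempotencyKey); findErr == nil && existing != nil {
            return existing, nil
        }
        return nil, fmt.Errorf("failed to create transaction: %w", err)
    }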
    
    return txn, nil
}

Database schema:

CREATE TABLE transactions (
    id VARCHAR(255) PRIMARY KEY,
    idempotency_key VARCHAR(255) NOT NULL UNIQUE,  -- Prevents duplicates; UNIQUE also creates an index
    user_id VARCHAR(255) NOT NULL,
    amount DECIMAL(10, 2) NOT NULL,
    status VARCHAR(50) NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Separate statement: PostgreSQL does not accept inline INDEX clauses
CREATE INDEX idx_transactions_user_id ON transactions (user_id);
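
To make the guarantee concrete, here is a small in-memory sketch of the same idea. A mutex-protected map stands in for the table and its unique constraint, so the check and the insert happen atomically. `Store` and `CreateOnce` are hypothetical names, and the amount is held in integer cents for this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// Transaction is a trimmed-down version of the struct above.
type Transaction struct {
	ID             string
	IdempotencyKey string
	AmountCents    int64
}

// Store is an in-memory stand-in for the transactions table. The map key
// plays the role of the UNIQUE constraint on idempotency_key.
type Store struct {
	mu     sync.Mutex
	byKey  map[string]*Transaction
	nextID int
}

func NewStore() *Store {
	return &Store{byKey: make(map[string]*Transaction)}
}

// CreateOnce returns the existing transaction for key, or creates one.
// created reports whether a new row was written.
func (s *Store) CreateOnce(key string, amountCents int64) (txn *Transaction, created bool) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if existing, ok := s.byKey[key]; ok {
		return existing, false // retry of an already-processed request
	}
	s.nextID++
	txn = &Transaction{
		ID:             fmt.Sprintf("txn-%d", s.nextID),
		IdempotencyKey: key,
		AmountCents:    amountCents,
	}
	s.byKey[key] = txn
	return txn, true
}

// Count reports how many distinct transactions have been stored.
func (s *Store) Count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.byKey)
}

func main() {
	store := NewStore()

	// Ten concurrent retries of the same logical request.
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			store.CreateOnce("key-abc", 1999)
		}()
	}
	wg.Wait()

	fmt.Println("rows:", store.Count()) // rows: 1
}
```

In a real database the atomicity comes from the unique constraint rather than an application-level lock, which is why the constraint matters more than the lookup that precedes the insert.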

Pattern 2: Soft Deletes for Auditability

Never actually delete data - mark it as deleted:

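// db tags map fields to snake_case columns for GetContext below
type User struct {
    ID        string     `db:"id"`
    Email     string     `db:"email"`
    Name      string     `db:"name"`
    CreatedAt time.Time  `db:"created_at"`
    UpdatedAt time.Time  `db:"updated_at"`
    DeletedAt *time.Time `db:"deleted_at"`  // NULL means active
}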

func (r *UserRepository) Delete(ctx context.Context, userID string) error {
    query := `
        UPDATE users 
        SET deleted_at = $1, updated_at = $1
        WHERE id = $2 AND deleted_at IS NULL
    `
    
    result, err := r.db.ExecContext(ctx, query, time.Now(), userID)
    if err != nil {
        return fmt.Errorf("failed to soft delete user: %w", err)
    }
    
    rows, _ := result.RowsAffected()
    if rows == 0 {
        return ErrNotFound
    }
    
    return nil
}

func (r *UserRepository) FindByID(ctx context.Context, userID string) (*User, error) {
    query := `
        SELECT id, email, name, created_at, updated_at, deleted_at
        FROM users
        WHERE id = $1 AND deleted_at IS NULL  -- Only active users
    `
    
    var user User
    err := r.db.GetContext(ctx, &user, query, userID)
    if err != nil {
        if errors.Is(err, sql.ErrNoRows) {
            return nil, ErrNotFound
        }
        return nil, fmt.Errorf("failed to find user: %w", err)
    }
    
    return &user, nil
}

Pattern 3: Optimistic Locking for Concurrency

Prevent lost updates in concurrent environments:

type Account struct {
    ID      string
    Balance float64
    Version int  // Incremented on every update
}

func (s *AccountService) Withdraw(ctx context.Context, accountID string, amount float64) error {
    // Retry loop for optimistic locking conflicts
    maxRetries := 3
    for attempt := 0; attempt < maxRetries; attempt++ {
        // Get current version
        account, err := s.repo.FindByID(ctx, accountID)
        if err != nil {
            return fmt.Errorf("failed to get account: %w", err)
        }
        
        if account.Balance < amount {
            return &ValidationError{Message: "Insufficient balance"}
        }
        
        newBalance := account.Balance - amount
        newVersion := account.Version + 1
        
        // Update with version check
        query := `
            UPDATE accounts 
            SET balance = $1, version = $2, updated_at = $3
            WHERE id = $4 AND version = $5  -- Only update if version matches
        `
        
        result, err := s.db.ExecContext(ctx, query, 
            newBalance, newVersion, time.Now(), accountID, account.Version)
        if err != nil {
            return fmt.Errorf("failed to update account: %w", err)
        }
        
        rows, _ := result.RowsAffected()
        if rows == 0 {
            // Version mismatch - another update happened, retry
            log.Warn().
                Str("account_id", accountID).
                Int("attempt", attempt+1).
                Msg("Optimistic lock conflict, retrying")
            
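            // Back off before retrying, but stop early if the caller gave up
            select {
            case <-time.After(time.Duration(attempt+1) * 100 * time.Millisecond):
            case <-ctx.Done():
                return ctx.Err()
            }
            continue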
        }
        
        // Success
        return nil
    }
    
    return fmt.Errorf("failed after %d retries due to concurrent updates", maxRetries)
}
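
The version check can be demonstrated without a database. In this sketch, `UpdateIfVersion` plays the role of the `UPDATE ... WHERE version = $5` statement: it applies the write only if nobody else has changed the row since it was read. `AccountStore`, `UpdateIfVersion`, and `ErrVersionConflict` are assumed names, and balances are kept in integer cents here.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	ErrVersionConflict   = errors.New("version conflict")
	ErrInsufficientFunds = errors.New("insufficient balance")
)

// Account holds a balance in cents and a version bumped on every update.
type Account struct {
	BalanceCents int64
	Version      int
}

// AccountStore is an in-memory stand-in for a single accounts row.
type AccountStore struct {
	mu      sync.Mutex
	account Account
}

// Load returns a snapshot of the current row.
func (s *AccountStore) Load() Account {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.account
}

// UpdateIfVersion writes newBalance only if the stored version still equals
// expectedVersion, mirroring the conditional UPDATE statement.
func (s *AccountStore) UpdateIfVersion(newBalance int64, expectedVersion int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.account.Version != expectedVersion {
		return ErrVersionConflict
	}
	s.account.BalanceCents = newBalance
	s.account.Version++
	return nil
}

// Withdraw repeats the read-modify-write cycle until the write lands or the
// retry budget is spent.
func Withdraw(s *AccountStore, amountCents int64, maxRetries int) error {
	for attempt := 0; attempt < maxRetries; attempt++ {
		acct := s.Load()
		if acct.BalanceCents < amountCents {
			return ErrInsufficientFunds
		}
		if err := s.UpdateIfVersion(acct.BalanceCents-amountCents, acct.Version); err == nil {
			return nil
		}
		// Another writer got there first: reload and try again.
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxRetries, ErrVersionConflict)
}

func main() {
	store := &AccountStore{account: Account{BalanceCents: 10000}}

	// Five concurrent withdrawals of 1000 cents each.
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = Withdraw(store, 1000, 100)
		}()
	}
	wg.Wait()

	fmt.Println(store.Load()) // {5000 5}
}
```

Every conflict means some other withdrawal succeeded, so no update is lost: the final balance reflects all five withdrawals and the version counts them.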

Testing for Reliability

I learned to write tests that actually prevent production incidents.

Test Pattern 1: Table-Driven Tests

func TestUserService_CreateUser(t *testing.T) {
    tests := []struct {
        name      string
        input     CreateUserInput
        setupMock func(*MockUserRepository)
        checkErr  func(t *testing.T, err error) // nil means success is expected
    }{
        {
            name: "successful creation",
            input: CreateUserInput{
                Email: "test@example.com",
                Name:  "Test User",
            },
            setupMock: func(m *MockUserRepository) {
                // A typed nil is safe even if the mock type-asserts its first return value
                m.On("FindByEmail", mock.Anything, "test@example.com").
                    Return((*User)(nil), ErrNotFound)
                // AnythingOfType needs the package-qualified name; "service" is assumed here
                m.On("Create", mock.Anything, mock.AnythingOfType("*service.User")).
                    Return(nil)
            },
        },
        {
            name: "duplicate email",
            input: CreateUserInput{
                Email: "existing@example.com",
                Name:  "Test User",
            },
            setupMock: func(m *MockUserRepository) {
                m.On("FindByEmail", mock.Anything, "existing@example.com").
                    Return(&User{ID: "existing-id"}, nil)
            },
            checkErr: func(t *testing.T, err error) {
                var conflictErr *ConflictError
                assert.ErrorAs(t, err, &conflictErr)
            },
        },
        {
            name: "database error on lookup",
            input: CreateUserInput{
                Email: "test@example.com",
                Name:  "Test User",
            },
            setupMock: func(m *MockUserRepository) {
                m.On("FindByEmail", mock.Anything, "test@example.com").
                    Return((*User)(nil), errors.New("database connection failed"))
            },
            checkErr: func(t *testing.T, err error) {
                // The service wraps the cause, so match on the message, not the type
                assert.ErrorContains(t, err, "database error")
            },
        },
    }
    
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            mockRepo := new(MockUserRepository)
            tt.setupMock(mockRepo)
            
            service := NewUserService(mockRepo)
            
            user, err := service.CreateUser(context.Background(), tt.input)
            
            if tt.checkErr != nil {
                require.Error(t, err)
                tt.checkErr(t, err)
            } else {
                require.NoError(t, err)
                require.NotNil(t, user)
                assert.NotEmpty(t, user.ID)
            }
            
            mockRepo.AssertExpectations(t)
        })
    }
}

Test Pattern 2: Integration Tests with Real Database

func TestUserRepository_Integration(t *testing.T) {
    if testing.Short() {
        t.Skip("Skipping integration test in short mode")
    }
    
    // Start test database (using testcontainers)
    ctx := context.Background()
    
    pgContainer, err := postgres.RunContainer(ctx,
        testcontainers.WithImage("postgres:15"),
        postgres.WithDatabase("testdb"),
        postgres.WithUsername("test"),
        postgres.WithPassword("test"),
    )
    require.NoError(t, err)
    defer pgContainer.Terminate(ctx)
    
    // Get connection string (the test container does not serve TLS)
    connStr, err := pgContainer.ConnectionString(ctx, "sslmode=disable")
    require.NoError(t, err)
    
    // Run migrations
    db, err := sql.Open("postgres", connStr)
    require.NoError(t, err)
    defer db.Close()
    
    err = runMigrations(db)
    require.NoError(t, err)
    
    // Test repository
    repo := NewUserRepository(db)
    
    t.Run("create and find user", func(t *testing.T) {
        user := &User{
            ID:    "test-id",
            Email: "test@example.com",
            Name:  "Test User",
        }
        
        err := repo.Create(ctx, user)
        require.NoError(t, err)
        
        found, err := repo.FindByEmail(ctx, "test@example.com")
        require.NoError(t, err)
        assert.Equal(t, user.Email, found.Email)
    })
    
    t.Run("prevent duplicate emails", func(t *testing.T) {
        // Use an address no earlier subtest has inserted
        user := &User{
            ID:    "test-id-2",
            Email: "duplicate@example.com",
            Name:  "Test User",
        }
        
        err := repo.Create(ctx, user)
        require.NoError(t, err)
        
        // Try to create duplicate
        duplicate := &User{
            ID:    "test-id-3",
            Email: "duplicate@example.com",
            Name:  "Another User",
        }
        
        err = repo.Create(ctx, duplicate)
        require.Error(t, err)
        assert.Contains(t, err.Error(), "unique constraint")
    })
}

Test Pattern 3: Chaos Testing

Test failure scenarios:

func TestUserService_ChaosScenarios(t *testing.T) {
    t.Run("handles database timeout", func(t *testing.T) {
        mockRepo := new(MockUserRepository)
        mockRepo.On("FindByEmail", mock.Anything, mock.Anything).
            Run(func(args mock.Arguments) {
                ctx := args.Get(0).(context.Context)
                <-ctx.Done()  // Wait for context timeout
            }).
            Return((*User)(nil), context.DeadlineExceeded)
        
        service := NewUserService(mockRepo)
        
        ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
        defer cancel()
        
        _, err := service.CreateUser(ctx, CreateUserInput{
            Email: "test@example.com",
            Name:  "Test",
        })
        
        assert.ErrorIs(t, err, context.DeadlineExceeded)
    })
    
    t.Run("handles intermittent failures", func(t *testing.T) {
        mockRepo := new(MockUserRepository)
        // The lookup must be mocked too, or the unexpected call fails the test
        mockRepo.On("FindByEmail", mock.Anything, mock.Anything).
            Return((*User)(nil), ErrNotFound)
        // Fail the first 2 attempts with an error (not a panic), then succeed
        mockRepo.On("Create", mock.Anything, mock.Anything).
            Return(errors.New("database connection lost")).Times(2)
        mockRepo.On("Create", mock.Anything, mock.Anything).
            Return(nil).Once()
        
        service := NewUserServiceWithRetry(mockRepo, 3)
        
        user, err := service.CreateUser(context.Background(), CreateUserInput{
            Email: "test@example.com",
            Name:  "Test",
        })
        
        require.NoError(t, err)
        assert.NotNil(t, user)
        mockRepo.AssertNumberOfCalls(t, "Create", 3)  // Succeeded on 3rd attempt
    })
}

Self-Documenting Code

Code that documents its reliability characteristics:

// Service defines user management operations.
// All methods are safe for concurrent use.
// All methods implement timeouts via context.
// All methods return structured errors for proper error handling.
type Service interface {
    // CreateUser creates a new user account.
    // Returns ConflictError if email already exists.
    // Returns ValidationError if input is invalid.
    // Operation is idempotent via idempotency_key.
    // Timeout: 5 seconds (database operations)
    CreateUser(ctx context.Context, input CreateUserInput) (*User, error)
    
    // GetUser retrieves a user by ID.
    // Returns NotFoundError if user doesn't exist or is deleted.
    // This operation is read-only and safe to retry.
    // Timeout: 2 seconds (database query)
    GetUser(ctx context.Context, userID string) (*User, error)
    
    // UpdateUser updates user information.
    // Uses optimistic locking to prevent concurrent update conflicts.
    // Returns NotFoundError if user doesn't exist.
    // Returns ConflictError if version mismatch (concurrent update).
    // Timeout: 5 seconds (database operations)
    UpdateUser(ctx context.Context, userID string, input UpdateUserInput) (*User, error)
}
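
The interface comments promise specific error types, and the handler's `classifyError` depends on them, but their definitions are not shown. A minimal sketch of how they might look follows. `ConflictError` and `ValidationError` use the `Message` field seen earlier, while the fields on `NotFoundError` and the helpers `classify` and `wrap` are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// NotFoundError reports a missing resource. Its fields are assumed here.
type NotFoundError struct {
	Resource string
	ID       string
}

func (e *NotFoundError) Error() string {
	return fmt.Sprintf("%s %q not found", e.Resource, e.ID)
}

// ConflictError reports a uniqueness or version conflict.
type ConflictError struct{ Message string }

func (e *ConflictError) Error() string { return e.Message }

// ValidationError reports invalid input.
type ValidationError struct{ Message string }

func (e *ValidationError) Error() string { return e.Message }

// classify maps an error, wrapped or not, to a stable machine-readable code.
func classify(err error) string {
	var (
		validationErr *ValidationError
		notFoundErr   *NotFoundError
		conflictErr   *ConflictError
	)
	switch {
	case err == nil:
		return "OK"
	case errors.As(err, &validationErr):
		return "VALIDATION_ERROR"
	case errors.As(err, &notFoundErr):
		return "NOT_FOUND"
	case errors.As(err, &conflictErr):
		return "CONFLICT"
	default:
		return "INTERNAL_ERROR"
	}
}

// wrap adds context with %w so the typed cause stays inspectable.
func wrap(op string, err error) error {
	return fmt.Errorf("%s: %w", op, err)
}

func main() {
	err := wrap("create user", &ConflictError{Message: "email already exists"})
	fmt.Println(err)           // create user: email already exists
	fmt.Println(classify(err)) // CONFLICT
}
```

Because each layer wraps with `%w`, callers several layers up can still recover the typed cause with `errors.As`, which is what lets documented error contracts survive added context.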

Key Takeaways

Design for failure from the start. Don't bolt on reliability later - bake it into your code architecture.
Decouple critical from non-critical. Use async patterns (queues, workers) for non-critical operations.
Build observability in as you code. Add logging, metrics, and tracing as first-class concerns, not afterthoughts.
Database design impacts reliability. Use idempotency keys, soft deletes, and optimistic locking.
Test failure scenarios. Write tests for timeouts, retries, and concurrent access - not just happy paths.
Make errors explicit and actionable. Return structured errors, log with context, emit metrics.

What You've Learned in This Series

Through this 7-part SRE 101 series, we've covered:

Part 1: SRE fundamentals and treating operations as software
Part 2: Measuring reliability with SLIs, SLOs, and error budgets
Part 3: Building observability with metrics, logs, and traces
Part 4: Managing incidents with processes and blameless culture
Part 5: Planning capacity and optimizing performance
Part 6: Eliminating toil through systematic automation
Part 7: Programming for reliability from the ground up

The journey from reactive firefighting to proactive reliability engineering is complete when your code operates itself, observes itself, and heals itself.

Conclusion

That Friday afternoon deploy that killed production taught me the most important lesson of my SRE journey: reliability is not a checklist item - it's a programming discipline.

Now when I write code, I think:

What if this dependency is down?
What if this operation times out?
How will I debug this in production?
Can this operation be retried safely?
What metrics do I need to understand this?

These questions have transformed my code from "works on my laptop" to "operates reliably in production."

Start with one service. Apply these patterns. Build reliability in from line one. Your future on-call self will thank you.

The end of SRE 101 - but the beginning of your reliability engineering journey.
