Part 1: Introduction to SRE - My Journey from Developer to SRE Mindset
What You'll Learn: This article shares my personal journey into Site Reliability Engineering after a production outage taught me that treating operations as a software problem changes everything. You'll learn the core SRE principles from Google's practices, how SRE differs from traditional DevOps, and how to set up your first Go service with reliability in mind. By the end, you'll understand why SRE isn't just about keeping systems running - it's about building reliability into the software itself.
The 2 AM Wake-Up Call
It was 2:17 AM on a Tuesday when my phone started buzzing incessantly. Half-asleep, I grabbed it to see a flood of Slack notifications: "API is down," "Users can't login," "Payment processing failed." My personal project - a simple Go-based task management API that I'd been running for friends and family - had completely crashed.
I stumbled to my laptop, SSH'd into my single DigitalOcean droplet, and found the Go process had died with an out-of-memory error. I restarted it, watched it crash again 10 minutes later, then spent the next three hours debugging. The root cause? A memory leak in my session handling code that only manifested under sustained load.
As I finally crawled back to bed at 5 AM, I realized something fundamental: I was treating operations as an afterthought. I'd built features, written tests, and deployed code - but I hadn't built reliability into the system. That night, I started researching "how Google keeps systems reliable," which led me to discover Site Reliability Engineering.
That painful experience changed how I think about building software.
What is Site Reliability Engineering (SRE)?
After that incident, I dove deep into Google's SRE book and resources. Here's what I learned: SRE is what happens when you treat operations as if it's a software problem.
Traditional operations teams manually manage infrastructure, respond to alerts, and keep systems running through heroic effort. SRE teams write software to automate operations, make systems self-healing, and engineer reliability into the product itself.
The Core SRE Principles I Adopted
Based on Google's SRE practices and my own experience, here are the fundamental principles I now follow:
1. Embracing Risk
Your system doesn't need to be 100% reliable. In fact, aiming for 100% reliability often means you're moving too slowly. I learned to:
Accept that failures will happen
Define acceptable levels of unreliability (error budgets)
Use that budget to make informed decisions about feature velocity vs stability
After my incident, I set a target of 99.9% uptime for my task API. That means I could tolerate ~43 minutes of downtime per month. This freed me to ship features faster while still maintaining good reliability.
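That 43-minute figure is just the availability target applied to a month of wall-clock time. Here's a quick sketch in Go that does the arithmetic, assuming a 30-day month:

```go
// errorbudget.go - downtime budget implied by an availability target.
package main

import (
	"fmt"
	"time"
)

func main() {
	const slo = 0.999            // 99.9% availability target
	month := 30 * 24 * time.Hour // assume a 30-day month

	// The error budget is simply the fraction of time you're allowed to be down.
	budget := time.Duration(float64(month) * (1 - slo))
	fmt.Printf("Monthly error budget at %.1f%%: %v\n", slo*100, budget.Round(time.Minute))
	// Prints: Monthly error budget at 99.9%: 43m0s
}
```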
2. Eliminating Toil
Toil is repetitive, manual work that doesn't provide lasting value. When I first started, I was manually deploying my Go application via SSH, restarting services by hand, and checking logs manually. All toil.
I started measuring my time:
Manual deployments: ~15 minutes per deploy, 3-4 times per week = 1 hour/week
Responding to known issues: ~30 minutes per incident
Checking system health: ~20 minutes per day = 2.3 hours/week
That was over 3 hours per week on repetitive tasks! I automated all of it using GitHub Actions and Docker.
3. Monitoring Distributed Systems
You can't rely on a system you can't observe. Before my incident, I had basic logging but no metrics, no alerts, and no visibility into what was actually happening.
I started monitoring the four golden signals for my Go API:
Latency: How long does it take to handle requests?
Traffic: How many requests per second am I serving?
Errors: What's my error rate?
Saturation: How full are my resources (CPU, memory, connections)?
4. Automation Over Manual Intervention
Manual operations don't scale, and they're error-prone (especially at 2 AM). I learned to automate:
Deployments via CI/CD pipelines
Health checks and automatic restarts
Capacity scaling based on metrics
Alert routing and escalation
5. Blameless Post-Mortems
After my incident, I wrote my first post-mortem. Not to blame myself, but to learn:
What happened?
What was the impact?
What was the root cause?
What can I do to prevent this?
This practice transformed how I approach failures. Instead of feeling ashamed, I started treating them as learning opportunities.
SRE vs DevOps: What's the Difference?
When I first learned about SRE, I thought it was just another name for DevOps. It's not. Here's how I understand the difference now:
Philosophy: DevOps is a cultural movement about collaboration; SRE is a prescriptive way of doing operations.
Focus: DevOps is about breaking down silos between Dev and Ops; SRE treats reliability as a first-class feature.
Approach: DevOps offers principles and practices; SRE offers a concrete implementation with metrics.
Key Metric: DevOps tracks deployment frequency and lead time; SRE tracks error budgets, SLOs, and MTTR.
Who Does What: in DevOps, developers own more of operations; in SRE, SREs write software to run operations.
DevOps says: "Developers and operations should work together." SRE says: "Here's exactly how to work together, measured by these metrics."
I think of SRE as a specific implementation of DevOps philosophy with an emphasis on reliability engineering and concrete practices.
Building Your First Go Service with SRE Principles
Let me show you how I rebuilt my task management API with SRE principles from the start. This is a simplified version of what I run in production.
Project Structure
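The exact layout matters less than keeping operational concerns (health, metrics, logging) as first-class packages. Here's roughly how I organize it; the directory names are illustrative, not prescriptive:

```
task-api/
├── cmd/
│   └── server/
│       └── main.go        # wiring, startup, graceful shutdown
├── internal/
│   ├── handlers/          # task and health endpoints
│   ├── middleware/        # metrics and logging middleware
│   └── tasks/             # business logic and storage
├── Dockerfile
├── go.mod
└── go.sum
```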
1. Health Checks from Day One
Every service I build now starts with health endpoints. This was missing from my original API.
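Here's a minimal sketch of what that looks like: a cheap liveness endpoint plus a readiness endpoint gated on dependencies. The endpoint names and port are my own choices, not a standard.

```go
// Health endpoints sketch: liveness ("is the process alive?") and
// readiness ("can this instance serve traffic right now?").
package main

import (
	"encoding/json"
	"net/http"
	"sync/atomic"
)

// ready flips to true once dependencies (DB, cache) are confirmed reachable.
var ready atomic.Bool

func main() {
	mux := http.NewServeMux()

	// Liveness: never touch external dependencies here; keep it cheap.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
	})

	// Readiness: return 503 until the service can actually do useful work.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(map[string]string{"status": "not ready"})
			return
		}
		json.NewEncoder(w).Encode(map[string]string{"status": "ready"})
	})

	ready.Store(true) // in a real service, set this only after init succeeds
	http.ListenAndServe(":8080", mux)
}
```

The distinction matters: a load balancer or orchestrator can stop routing traffic to an instance that isn't ready without restarting a process that is still alive.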
2. Instrumentation with Prometheus Metrics
I instrument every service with Prometheus from the start. This gives me visibility into the four golden signals.
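A minimal version of that instrumentation, using the prometheus/client_golang library; the metric names and labels are my own convention:

```go
// metrics.go - Prometheus instruments covering the four golden signals.
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Latency: request duration, bucketed so percentiles can be queried.
	httpDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Duration of HTTP requests.",
		Buckets: prometheus.DefBuckets,
	}, []string{"path", "method"})

	// Traffic and errors: total requests labeled by status, so the error
	// rate is just the ratio of 5xx responses to all responses.
	httpRequests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total number of HTTP requests.",
	}, []string{"path", "method", "status"})

	// Saturation (one proxy for it): requests currently being handled.
	httpInFlight = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "http_requests_in_flight",
		Help: "Number of HTTP requests currently being served.",
	})
)
```

The metrics are exposed by mounting promhttp.Handler() on a /metrics route, which Prometheus then scrapes.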
3. Metrics Middleware
I wrap all HTTP handlers with middleware that automatically records metrics.
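Here's a sketch of that middleware, assuming the httpDuration, httpRequests, and httpInFlight instruments from the previous snippet:

```go
// middleware.go - records golden-signal metrics for every request.
package main

import (
	"net/http"
	"strconv"
	"time"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// metricsMiddleware wraps any http.Handler and records latency, traffic,
// errors (via the status label), and in-flight requests.
func metricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		httpInFlight.Inc()
		defer httpInFlight.Dec()

		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, r)

		// In production, prefer a route pattern over the raw path to keep
		// label cardinality bounded.
		httpDuration.WithLabelValues(r.URL.Path, r.Method).Observe(time.Since(start).Seconds())
		httpRequests.WithLabelValues(r.URL.Path, r.Method, strconv.Itoa(rec.status)).Inc()
	})
}
```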
4. Structured Logging
I use zerolog for structured logging. JSON logs are easier to parse and query.
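A minimal zerolog setup looks like this; the service name and fields are simply what I happen to log:

```go
// logging.go - structured JSON logging with zerolog (sketch).
package main

import (
	"net/http"
	"os"
	"time"

	"github.com/rs/zerolog"
)

// One JSON line per event, with a timestamp and a service field on everything.
var logger = zerolog.New(os.Stdout).With().Timestamp().Str("service", "task-api").Logger()

// logMiddleware emits one structured log line per request.
func logMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		logger.Info().
			Str("method", r.Method).
			Str("path", r.URL.Path).
			Dur("duration", time.Since(start)).
			Msg("request handled")
	})
}
```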
5. Graceful Shutdown
The application should shut down gracefully, finishing in-flight requests.
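Go's http.Server supports this directly via Shutdown. Here's the pattern I use, with a 10-second drain window (the timeout is a judgment call, not a magic number):

```go
// main.go - graceful shutdown: stop accepting new connections, then give
// in-flight requests time to finish before exiting.
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Wait for SIGINT/SIGTERM (Ctrl+C, `docker stop`, orchestrator eviction).
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
	<-stop

	// Allow up to 10 seconds for in-flight requests to complete.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
	log.Println("server stopped cleanly")
}
```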
6. Dockerfile with Health Checks
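The container image itself can advertise health. Here's an illustrative multi-stage Dockerfile with a HEALTHCHECK that probes the liveness endpoint from the earlier snippet; the base image versions and paths are assumptions, not requirements:

```dockerfile
# Build stage: compile a static binary.
FROM golang:1.22 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /bin/task-api ./cmd/server

# Runtime stage: small image plus curl for the health probe.
FROM alpine:3.20
RUN apk add --no-cache curl ca-certificates
COPY --from=build /bin/task-api /usr/local/bin/task-api
EXPOSE 8080

# Mark the container unhealthy if the liveness endpoint stops responding.
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl -fsS http://localhost:8080/healthz || exit 1

ENTRYPOINT ["/usr/local/bin/task-api"]
```

Note that plain Docker and Compose act on HEALTHCHECK (marking the container unhealthy, gating dependent services), while Kubernetes ignores it in favor of its own liveness and readiness probes pointed at the same endpoints.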
Key Lessons from My SRE Journey
After rebuilding my systems with SRE principles, here's what changed:
Incidents became learning opportunities - Instead of dreading failures, I started treating them as data points that improve the system.
Monitoring is not optional - You can't improve what you don't measure. Metrics, logging, and tracing are foundational.
Automate ruthlessly - Every manual task I eliminated freed up time to build better systems.
Reliability is a feature - I now design reliability into my applications from day one, not as an afterthought.
Error budgets changed everything - Having a quantitative measure of acceptable unreliability helped me make better trade-offs between features and stability.
What's Next
This is just the beginning of your SRE journey. In the next parts of this series, we'll dive deep into:
Part 2: Defining and measuring SLIs, SLOs, and SLAs for your Go applications
Part 3: Building comprehensive observability with Prometheus, structured logs, and distributed tracing
Part 4: Managing incidents effectively and writing blameless post-mortems
Part 5: Capacity planning and performance optimization
Part 6: Identifying and eliminating toil through automation
Resources
Based on my learning journey, here are the resources I found most valuable:
Google's SRE Book - The foundational text that started it all
Google's SRE Workbook - Practical exercises and examples
Prometheus Documentation - Essential for metrics
The Art of SLOs - Deep dive into service level objectives
Conclusion
Site Reliability Engineering transformed how I build and operate systems. That 2 AM incident was painful, but it taught me that reliability isn't about heroic effort - it's about engineering principles, automation, and treating operations as a software problem.
Start small: add health checks, instrument one service with metrics, write a post-mortem for your next incident. Each step makes your systems more reliable and your life easier.
In the next article, we'll dive into SLIs, SLOs, and SLAs - the metrics that define what "reliable" actually means for your service.