Part 7: Programming for Reliability - Building Systems That Don't Break
What You'll Learn: This article shares my journey from writing code that "just works" to building applications designed for reliability from the ground up. You'll learn how to design Go services for failure, implement resilient API patterns, build observable applications, create self-documenting code, design databases for SRE, and adopt testing strategies that prevent production incidents. By the end, you'll know how to write code that operates itself and survives real-world chaos.
The Feature That Took Down Production
I'll never forget deploying what I thought was a simple feature to my Go-based notification service. All it did was send email notifications when users completed transactions. I tested it locally, it worked perfectly, and I shipped it to production on a Friday afternoon.
By Saturday morning, the entire service was down.
What happened? My "simple" feature had a subtle bug:
// The code that killed production
func (s *NotificationService) SendTransactionEmail(ctx context.Context, txn Transaction) error {
	user, err := s.userClient.GetUser(ctx, txn.UserID)
	if err != nil {
		return err // Problem: blocks transaction processing if user service is down
	}

	email := s.buildEmail(user, txn)

	// No timeout!
	err = s.emailClient.Send(email) // Problem: can hang forever
	if err != nil {
		return err // Problem: fails entire transaction if email fails
	}

	return nil
}
The issues:
Tight coupling: Transaction processing failed if the user service was down
No timeouts: Email sending could hang forever
Synchronous: Blocked critical path with non-critical operation
No fallback: All-or-nothing approach
This incident taught me: reliability isn't something you add later - it must be designed into the code from the start.
Designing for Failure: The SRE Mindset
After that production disaster, I adopted a new programming philosophy: assume everything will fail, and design accordingly.
Principle 1: Embrace Failure as Normal
Traditional programming:
SRE programming:
Principle 2: Fail Fast and Explicitly
Don't let errors propagate silently. Make failures loud and actionable.
Principle 3: Decouple Critical from Non-Critical
Not all operations are equally important. Separate them.
Building Resilient APIs in Go
After rebuilding my services with reliability in mind, I settled on one API design pattern that I now use for all my Go services.
Complete Resilient HTTP Handler Pattern
Observability-Driven Development
I learned to build observability into my code as I write it, not as an afterthought.
Pattern: Structured Context Logging
Usage in service layer:
Pattern: Trace Every Request
Service layer with tracing:
Database Design for SRE
How you design your database impacts reliability as much as your application code.
Pattern 1: Idempotency Keys
Prevent duplicate operations from network retries:
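A minimal sketch of the behavior an idempotency key buys you. The in-memory map stands in for a database table with a unique constraint on the key; Transaction, Store, and Create are hypothetical names, and amounts are held in integer cents as an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// Transaction and Store are hypothetical; the map plays the role of a
// table with a UNIQUE constraint on idempotency_key.
type Transaction struct {
	ID             string
	IdempotencyKey string
	UserID         string
	AmountCents    int64
}

type Store struct {
	mu    sync.Mutex
	byKey map[string]Transaction
	next  int
}

func NewStore() *Store {
	return &Store{byKey: make(map[string]Transaction)}
}

// Create records a transaction once per idempotency key. A retry with the
// same key returns the original transaction, and created is false.
func (s *Store) Create(key, userID string, amountCents int64) (txn Transaction, created bool) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if existing, ok := s.byKey[key]; ok {
		return existing, false
	}
	s.next++
	txn = Transaction{
		ID:             fmt.Sprintf("txn-%d", s.next),
		IdempotencyKey: key,
		UserID:         userID,
		AmountCents:    amountCents,
	}
	s.byKey[key] = txn
	return txn, true
}

func main() {
	s := NewStore()

	first, created := s.Create("key-abc", "u-1", 5000)
	fmt.Println(first.ID, created) // txn-1 true

	// The client timed out and retried with the same key: no second charge.
	retry, created := s.Create("key-abc", "u-1", 5000)
	fmt.Println(retry.ID, created) // txn-1 false
}
```

The client generates the key once per logical operation and resends it on every retry, which is what lets the server tell a retry from a new request.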
Database schema:
Pattern 2: Soft Deletes for Auditability
Never actually delete data - mark it as deleted:
Pattern 3: Optimistic Locking for Concurrency
Prevent lost updates in concurrent environments:
Testing for Reliability
I learned to write tests that actually prevent production incidents.
Test Pattern 1: Table-Driven Tests
Test Pattern 2: Integration Tests with Real Database
Test Pattern 3: Chaos Testing
Test failure scenarios:
Self-Documenting Code
Code that documents its reliability characteristics:
Key Takeaways
Design for failure from the start. Don't bolt on reliability later - bake it into your code architecture.
Decouple critical from non-critical. Use async patterns (queues, workers) for non-critical operations.
Build observability in as you code. Add logging, metrics, and tracing as first-class concerns, not afterthoughts.
Database design impacts reliability. Use idempotency keys, soft deletes, and optimistic locking.
Test failure scenarios. Write tests for timeouts, retries, and concurrent access - not just happy paths.
Make errors explicit and actionable. Return structured errors, log with context, emit metrics.
What You've Learned in This Series
Through this 7-part SRE 101 series, we've covered:
Part 1: SRE fundamentals and treating operations as software
Part 2: Measuring reliability with SLIs, SLOs, and error budgets
Part 3: Building observability with metrics, logs, and traces
Part 4: Managing incidents with processes and blameless culture
Part 5: Planning capacity and optimizing performance
Part 6: Eliminating toil through systematic automation
Part 7: Programming for reliability from the ground up
The journey from reactive firefighting to proactive reliability engineering is complete when your code operates itself, observes itself, and heals itself.
That Friday afternoon deploy that killed production taught me the most important lesson of my SRE journey: reliability is not a checklist item - it's a programming discipline.
Now when I write code, I think:
What if this dependency is down?
What if this operation times out?
How will I debug this in production?
Can this operation be retried safely?
What metrics do I need to understand this?
These questions have transformed my code from "works on my laptop" to "operates reliably in production."
Start with one service. Apply these patterns. Build reliability in from line one. Your future on-call self will thank you.
The end of SRE 101 - but the beginning of your reliability engineering journey.
// Traditional programming (Principle 1): assumes success
func ProcessOrder(orderID string) {
	order := database.GetOrder(orderID) // What if DB is down?
	paymentAPI.Charge(order)            // What if payment API is slow? (result ignored)
	email.Send(order.Email)             // What if email service fails?
}
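-- Database schema for Pattern 1 (idempotency keys).
-- The inline INDEX clauses are MySQL syntax; PostgreSQL needs separate CREATE INDEX statements.
CREATE TABLE transactions (
    id              VARCHAR(255)   PRIMARY KEY,
    idempotency_key VARCHAR(255)   NOT NULL UNIQUE,  -- Prevents duplicates
    user_id         VARCHAR(255)   NOT NULL,
    amount          DECIMAL(10, 2) NOT NULL,
    status          VARCHAR(50)    NOT NULL,
    created_at      TIMESTAMP      NOT NULL DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_user_id (user_id),
    INDEX idx_idempotency_key (idempotency_key)      -- Redundant: UNIQUE above already creates an index
);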
// Pattern 2: soft deletes.
// The db tags matter: sqlx maps untagged fields by lowercased name
// ("createdat"), which would not match the created_at column.
type User struct {
	ID        string     `db:"id"`
	Email     string     `db:"email"`
	Name      string     `db:"name"`
	CreatedAt time.Time  `db:"created_at"`
	UpdatedAt time.Time  `db:"updated_at"`
	DeletedAt *time.Time `db:"deleted_at"` // NULL means active
}

// Delete marks the user as deleted instead of removing the row.
func (r *UserRepository) Delete(ctx context.Context, userID string) error {
	query := `
		UPDATE users
		SET deleted_at = $1, updated_at = $1
		WHERE id = $2 AND deleted_at IS NULL
	`
	result, err := r.db.ExecContext(ctx, query, time.Now(), userID)
	if err != nil {
		return fmt.Errorf("failed to soft delete user: %w", err)
	}
	rows, err := result.RowsAffected()
	if err != nil {
		return fmt.Errorf("failed to read rows affected: %w", err)
	}
	if rows == 0 {
		return ErrNotFound // No such user, or already deleted
	}
	return nil
}

// FindByID returns an active user; soft-deleted users are treated as missing.
func (r *UserRepository) FindByID(ctx context.Context, userID string) (*User, error) {
	query := `
		SELECT id, email, name, created_at, updated_at, deleted_at
		FROM users
		WHERE id = $1 AND deleted_at IS NULL -- Only active users
	`
	var user User
	err := r.db.GetContext(ctx, &user, query, userID)
	if err != nil {
		if errors.Is(err, sql.ErrNoRows) {
			return nil, ErrNotFound
		}
		return nil, fmt.Errorf("failed to find user: %w", err)
	}
	return &user, nil
}
// Pattern 3: optimistic locking.
type Account struct {
	ID      string
	Balance float64 // Kept as in the original; integer cents or a decimal type is safer for money
	Version int     // Incremented on every update
}

func (s *AccountService) Withdraw(ctx context.Context, accountID string, amount float64) error {
	// Retry loop for optimistic locking conflicts
	const maxRetries = 3
	for attempt := 0; attempt < maxRetries; attempt++ {
		// Get current version
		account, err := s.repo.FindByID(ctx, accountID)
		if err != nil {
			return fmt.Errorf("failed to get account: %w", err)
		}
		if account.Balance < amount {
			return &ValidationError{Message: "Insufficient balance"}
		}
		newBalance := account.Balance - amount
		newVersion := account.Version + 1

		// Update with version check
		query := `
			UPDATE accounts
			SET balance = $1, version = $2, updated_at = $3
			WHERE id = $4 AND version = $5 -- Only update if version matches
		`
		result, err := s.db.ExecContext(ctx, query,
			newBalance, newVersion, time.Now(), accountID, account.Version)
		if err != nil {
			return fmt.Errorf("failed to update account: %w", err)
		}
		rows, err := result.RowsAffected()
		if err != nil {
			return fmt.Errorf("failed to read rows affected: %w", err)
		}
		if rows == 0 {
			// Version mismatch - another update happened, retry
			log.Warn().
				Str("account_id", accountID).
				Int("attempt", attempt+1).
				Msg("Optimistic lock conflict, retrying")
			// Back off, but stop waiting if the caller's context is cancelled
			select {
			case <-time.After(time.Duration(attempt+1) * 100 * time.Millisecond):
			case <-ctx.Done():
				return ctx.Err()
			}
			continue
		}
		// Success
		return nil
	}
	return fmt.Errorf("failed after %d retries due to concurrent updates", maxRetries)
}
// Service defines user management operations.
// All methods are safe for concurrent use.
// All methods implement timeouts via context.
// All methods return structured errors for proper error handling.
type Service interface {
	// CreateUser creates a new user account.
	// Returns ConflictError if email already exists.
	// Returns ValidationError if input is invalid.
	// Operation is idempotent via idempotency_key.
	// Timeout: 5 seconds (database operations)
	CreateUser(ctx context.Context, input CreateUserInput) (*User, error)

	// GetUser retrieves a user by ID.
	// Returns NotFoundError if user doesn't exist or is deleted.
	// This operation is read-only and safe to retry.
	// Timeout: 2 seconds (database query)
	GetUser(ctx context.Context, userID string) (*User, error)

	// UpdateUser updates user information.
	// Uses optimistic locking to prevent concurrent update conflicts.
	// Returns NotFoundError if user doesn't exist.
	// Returns ConflictError if version mismatch (concurrent update).
	// Timeout: 5 seconds (database operations)
	UpdateUser(ctx context.Context, userID string, input UpdateUserInput) (*User, error)
}