Part 2: Deployment Strategies

When a Simple Deploy Nearly Took Down My System

Early in my career, I deployed a "minor" API change directly to all production servers simultaneously. Within minutes, our monitoring exploded with alerts—the new code had a subtle bug that only appeared under real production load. We scrambled to roll back, but by then thousands of API calls had failed, and customers were already calling support.

That painful experience taught me that how you deploy is just as important as what you deploy. Since then, I've implemented various deployment strategies that minimize risk and enable quick recovery. This part covers the strategies I use daily: rolling updates, blue/green deployments, canary releases, and effective rollback procedures.

Rolling Updates: The Kubernetes Default

Rolling updates gradually replace old versions with new versions, never taking down the entire service. This is Kubernetes' default strategy, and it's what I recommend for most services.

How Rolling Updates Work

Kubernetes uses a ReplicaSet to manage multiple instances (pods) of your application. During a rolling update:

  1. One or more new pods start with the new version

  2. Once new pods pass health checks, one or more old pods terminate

  3. This process repeats until all pods run the new version

If new pods fail health checks, the rollout pauses automatically—preventing a bad deploy from taking down your service.

My Production Rolling Update Configuration

Here's a deployment configuration I use for a production API service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: production
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # Can have 2 extra pods during update
      maxUnavailable: 1  # At most 1 pod can be unavailable
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
        version: v2.5.1
    spec:
      containers:
      - name: api
        image: myregistry/api-service:v2.5.1
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3

Key configuration choices:

  • maxSurge: 2: Allows temporarily having 8 pods (6 + 2) to speed up deployment

  • maxUnavailable: 1: Ensures at least 5 pods (6 - 1) remain available during updates

  • Separate readinessProbe and livenessProbe: New pods must pass readiness before receiving traffic

When I Use Rolling Updates

Rolling updates work well for:

  • Stateless services that scale horizontally

  • Services where gradual traffic shift is acceptable

  • Deployments where instant rollback isn't critical

  • Most API services and web applications

I use rolling updates for about 80% of my deployments. They're simple, built into Kubernetes, and require no additional infrastructure.

Watching the Rollout

I monitor rollouts with these commands:
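
The exact commands depend on your setup, but for the deployment above a typical set looks like this:

# Watch the rollout until it completes (or stalls on failing health checks)
kubectl rollout status deployment/api-service -n production

# Watch old and new pods being cycled in real time
kubectl get pods -n production -l app=api-service -w

# Review revision history, which you'll need for rollbacks later
kubectl rollout history deployment/api-service -n production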

Blue/Green Deployments: Zero-Downtime Switching

Blue/green deployment maintains two identical production environments (blue and green). Only one serves live traffic at a time. When deploying, you update the inactive environment, test it, then switch traffic instantly.

My First Blue/Green Success

I implemented blue/green for a critical payment service that couldn't tolerate any downtime or gradual traffic shifting. The business requirement was clear: "Either the old version works or the new version works—no in-between."

Blue/green solved this perfectly. We could fully test the new version with production infrastructure before a single customer saw it.

Implementing Blue/Green with Kubernetes Services

Here's how I implement blue/green using Kubernetes Services and labels:

Step 1: Blue deployment (current production)
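
A minimal sketch of the blue Deployment (the payment-service name, labels, and versions are illustrative, not exact production manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service-blue
  namespace: production
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payment-service
      color: blue
  template:
    metadata:
      labels:
        app: payment-service
        color: blue
    spec:
      containers:
      - name: payment
        image: myregistry/payment-service:v1.4.0   # current production version
        ports:
        - containerPort: 8080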

Step 2: Service routing traffic to blue
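
The Service selects on the color label, so it alone decides which environment receives live traffic:

apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: production
spec:
  selector:
    app: payment-service
    color: blue        # all live traffic goes to the blue pods
  ports:
  - port: 80
    targetPort: 8080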

Step 3: Deploy green (new version)
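
The green Deployment is identical to blue except for its name, its color label, and the new image tag. It runs alongside blue but receives no live traffic yet, so it can be tested directly (for example via port-forward or a temporary test Service):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service-green
  namespace: production
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payment-service
      color: green
  template:
    metadata:
      labels:
        app: payment-service
        color: green
    spec:
      containers:
      - name: payment
        image: myregistry/payment-service:v1.5.0   # new version under test
        ports:
        - containerPort: 8080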

Step 4: Switch traffic to green
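
Once green checks out, repointing the Service selector switches all traffic at once; flipping it back is the rollback:

kubectl patch service payment-service -n production \
  -p '{"spec":{"selector":{"app":"payment-service","color":"green"}}}'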

When I Use Blue/Green

Blue/green is my choice for:

  • Critical services requiring zero downtime

  • Deployments needing pre-production validation with real infrastructure

  • Services with complex dependencies that benefit from full-stack testing

  • Situations where instant rollback is critical

Tradeoffs I consider:

  • Pro: Instant rollback (just flip the switch back)

  • Pro: Test new version with production infrastructure before traffic hits it

  • Con: Requires 2x infrastructure during deployment

  • Con: More complex than rolling updates

  • Con: Database migrations require careful backward compatibility

Canary Deployments: Testing with Real Traffic

Canary deployment releases new versions to a small subset of users first. If metrics look good, you gradually increase the percentage. If problems appear, you roll back before most users are affected.

The name comes from "canary in a coal mine"—miners used canaries to detect toxic gases. If the canary died, miners evacuated before they got sick. Similarly, if your canary deployment shows issues, you stop before rolling out to everyone.

The Canary That Saved Us

I once deployed a change that passed all automated tests but had a subtle performance regression under specific query patterns. Our canary caught this—we noticed the canary pods had 30% higher latency than stable pods. We rolled back before 95% of users ever saw the new version.

Without canary deployment, we would have deployed to all users, impacted everyone, and spent hours investigating and rolling back under pressure.

Implementing Canaries with Argo Rollouts and Kubernetes

I use Argo Rollouts for canary deployments. It provides fine-grained control over traffic splitting and metric-based automation.

Step 1: Install Argo Rollouts
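
At the time of writing, the Argo Rollouts docs install the controller and its kubectl plugin like this:

# Install the controller into its own namespace
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

# Install the kubectl plugin (Homebrew shown; other methods are in the docs)
brew install argoproj/tap/kubectl-argo-rollouts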

Step 2: Create a Rollout resource
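
A Rollout replaces the Deployment resource. Here's a sketch with a stepped canary; the weights, pause durations, and Service names are illustrative choices, not the only sensible ones:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
      - name: api
        image: myregistry/api-service:v2.5.1
        ports:
        - containerPort: 8080
  strategy:
    canary:
      canaryService: api-service-canary   # receives only the canary share of traffic
      stableService: api-service-stable   # receives the rest
      steps:
      - setWeight: 5      # start with 5% of traffic on the new version
      - pause: {duration: 10m}
      - setWeight: 25
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {}         # wait here for manual promotion before going to 100%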

Step 3: Service and traffic management
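
The stable and canary Services are ordinary Services selecting the app label; the controller adds a pod-template-hash to each selector so they pin to the right pods. For exact percentage-based splitting you also add a trafficRouting block (Istio, NGINX Ingress, and others are supported); without one, weights are approximated by scaling replica counts. A sketch of the two Services:

apiVersion: v1
kind: Service
metadata:
  name: api-service-stable
  namespace: production
spec:
  selector:
    app: api-service    # Argo Rollouts pins this to the stable pods
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: api-service-canary
  namespace: production
spec:
  selector:
    app: api-service    # pinned to the canary pods during a rollout
  ports:
  - port: 80
    targetPort: 8080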

Automated Canary with Prometheus Metrics

The real power of canaries is automated decision-making based on metrics:

If error rate exceeds 1% or P95 latency exceeds 500ms, the rollout automatically aborts and rolls back.
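
One way to express that is an AnalysisTemplate backed by Prometheus, referenced from the Rollout either as an analysis step or as a background analysis. The Prometheus address and metric names below are assumptions about your monitoring setup:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: api-service-health
  namespace: production
spec:
  metrics:
  - name: error-rate
    interval: 1m
    failureLimit: 3
    successCondition: result[0] < 0.01   # abort if the error rate exceeds 1%
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{app="api-service",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{app="api-service"}[5m]))
  - name: p95-latency
    interval: 1m
    failureLimit: 3
    successCondition: result[0] < 0.5    # abort if P95 latency exceeds 500ms
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{app="api-service"}[5m])) by (le))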

Monitoring Canary Progress
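
With the kubectl plugin from the install step, the rollout's current weight, step, and analysis runs can be watched live:

# Live view of canary weight, analysis runs, and pod status
kubectl argo rollouts get rollout api-service -n production --watch

# List rollouts and their current step
kubectl argo rollouts list rollouts -n production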

When I Use Canary Deployments

Canary deployments are my go-to for:

  • High-traffic user-facing services where we can tolerate gradual rollout

  • Changes with uncertainty about production behavior

  • Performance changes that need validation under real load

  • A/B testing new features

I use canaries for about 15% of my deployments—particularly for risky changes to critical services.

Rollback Strategies: When Things Go Wrong

Despite all precautions, bad deployments happen. The key is having a fast, reliable rollback strategy.

Kubernetes Built-in Rollback

For deployments using rolling updates:
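
The built-in commands (the revision number is illustrative):

# Roll back to the previous revision
kubectl rollout undo deployment/api-service -n production

# Or roll back to a specific revision from `kubectl rollout history`
kubectl rollout undo deployment/api-service -n production --to-revision=12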

This works, but it's slow (follows the rolling update pattern) and requires kubectl access.

GitOps Rollback with ArgoCD

With GitOps, rollback is a Git revert:
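
A sketch, assuming the manifests live on a main branch that Argo CD watches:

# Revert the commit that introduced the bad image tag or manifest change
git revert <bad-commit-sha>
git push origin main

# Argo CD syncs the cluster back to the reverted state on its next poll;
# to trigger it immediately:
argocd app sync api-service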

This is my preferred method because:

  • Rollback is version controlled

  • No kubectl access needed

  • Clear audit trail

  • Same process as normal deployments

Emergency Rollback Procedure I Use

When production is down and every second counts:

For Blue/Green deployments:
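
Using the illustrative names from the blue/green example above, flip the Service selector straight back:

kubectl patch service payment-service -n production \
  -p '{"spec":{"selector":{"app":"payment-service","color":"blue"}}}'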

For Canary with Argo Rollouts:
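
Aborting sends all traffic back to the stable version immediately:

kubectl argo rollouts abort api-service -n production

# If the bad version was already fully promoted, roll back to the previous revision
kubectl argo rollouts undo api-service -n production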

For standard Deployments:
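
The same built-in undo as above:

kubectl rollout undo deployment/api-service -n production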

Database Rollbacks: The Hard Part

Application rollbacks are easy—database rollbacks are hard. My strategy:

  1. Make migrations backward-compatible: New code should work with old schema, and old code should work with new schema

  2. Separate data changes from code changes: Deploy schema changes separately from application code

  3. Use feature flags: Deploy code with new features disabled, enable after migration completes

  4. Test rollback scenarios: Include migration rollback in testing

Example backward-compatible migration:
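
One way this can look in SQL (PostgreSQL-flavored; the table and column names are made up for illustration), spread across two releases:

-- Release N: add the new column as nullable, so old code that never writes it keeps working
ALTER TABLE orders ADD COLUMN customer_email VARCHAR(255) NULL;

-- Backfill in the background while both versions run
UPDATE orders SET customer_email = legacy_email WHERE customer_email IS NULL;

-- Release N+1, only after no old code remains: tighten constraints and drop the old column
ALTER TABLE orders ALTER COLUMN customer_email SET NOT NULL;
ALTER TABLE orders DROP COLUMN legacy_email;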

Comparison: Which Strategy When?

Strategy          Rollback Speed         Risk Level    Infrastructure Cost   Best For
Rolling Update    Medium (1-5 min)       Low-Medium    1x                    Most services
Blue/Green        Instant (<10 sec)      Very Low      2x during deploy      Critical services, complex deps
Canary            Medium (abort fast)    Very Low      1.1-1.3x              High-risk changes, performance testing

My decision tree:

  • Critical service that needs zero downtime and instant rollback? Use blue/green.

  • Risky or performance-sensitive change that needs validation under real traffic? Use a canary.

  • Everything else gets a rolling update, which covers roughly 80% of my deployments.

Lessons Learned from Failed Deployments

Lesson 1: Health Checks Are Critical

I once deployed without proper readiness probes. Kubernetes routed traffic to pods before they finished initialization, causing 30 seconds of 503 errors for users. Now I always configure both probes:

  • Liveness probe: Is the app alive? (restart if not)

  • Readiness probe: Is the app ready for traffic? (remove from load balancer if not)

Lesson 2: Resource Limits Prevent Cascading Failures

A memory leak in a new version caused pods to consume all node memory. Now I always set resource requests and limits:
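
The same pattern as in the manifest earlier; the exact values depend on the service:

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"   # the pod gets OOM-killed here instead of starving the whole node
    cpu: "500m"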

Lesson 3: Monitor During and After Deployment

I set up automatic alerts that fire if error rates or latencies spike within 10 minutes of a deployment. This has caught several issues that slipped past the metric-based canary analysis.

Lesson 4: Practice Rollbacks

We run rollback drills quarterly. It paid off: the first time we actually needed an emergency rollback, muscle memory kicked in and we recovered in under 2 minutes.

Key Takeaways

  1. Rolling updates are your default strategy—simple, built-in, and reliable for most cases

  2. Blue/green provides instant rollback for critical services at the cost of double infrastructure

  3. Canary minimizes blast radius for risky changes using gradual traffic shifting

  4. Automate rollbacks and practice them regularly—when production is down, you want muscle memory, not documentation

  5. Database migrations require special care—make them backward-compatible whenever possible

In the next part, we'll build robust CI/CD pipelines with automated testing gates, quality checks, and environment promotion flows that prevent bad code from reaching production.


