Part 2: Deployment Strategies

When a Simple Deploy Nearly Took Down My System

Early in my career, I deployed a "minor" API change directly to all production servers simultaneously. Within minutes, our monitoring exploded with alerts—the new code had a subtle bug that only appeared under real production load. We scrambled to roll back, but by then thousands of API calls had failed, and customers were already calling support.

That painful experience taught me that how you deploy is just as important as what you deploy. Since then, I've implemented various deployment strategies that minimize risk and enable quick recovery. This part covers the strategies I use daily: rolling updates, blue/green deployments, canary releases, and effective rollback procedures.

Rolling Updates: The Kubernetes Default

Rolling updates gradually replace old versions with new versions, never taking down the entire service. This is Kubernetes' default strategy, and it's what I recommend for most services.

How Rolling Updates Work

Kubernetes uses a ReplicaSet to manage multiple instances (pods) of your application. During a rolling update:

  1. One or more new pods start with the new version

  2. Once new pods pass health checks, one or more old pods terminate

  3. This process repeats until all pods run the new version

If new pods fail health checks, the rollout pauses automatically—preventing a bad deploy from taking down your service.

My Production Rolling Update Configuration

Here's a deployment configuration I use for a production API service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: production
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # Can have 2 extra pods during update
      maxUnavailable: 1  # At most 1 pod can be unavailable
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
        version: v2.5.1
    spec:
      containers:
      - name: api
        image: myregistry/api-service:v2.5.1
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3

Key configuration choices:

  • maxSurge: 2: Allows temporarily having 8 pods (6 + 2) to speed up deployment

  • maxUnavailable: 1: Ensures at least 5 pods (6 - 1) remain available during updates

  • Separate readinessProbe and livenessProbe: New pods must pass readiness before receiving traffic

When I Use Rolling Updates

Rolling updates work well for:

  • Stateless services that scale horizontally

  • Services where gradual traffic shift is acceptable

  • Deployments where instant rollback isn't critical

  • Most API services and web applications

I use rolling updates for about 80% of my deployments. They're simple, built into Kubernetes, and require no additional infrastructure.

Watching the Rollout

I monitor rollouts with these commands:
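
The exact commands depend on your setup, but for the deployment above a typical set looks like this:

# Watch the rollout until it completes (or stalls on failing health checks)
kubectl rollout status deployment/api-service -n production

# Watch old and new pods being cycled in real time
kubectl get pods -n production -l app=api-service -w

# Review revision history, which you'll need for rollbacks later
kubectl rollout history deployment/api-service -n production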

Blue/Green Deployments: Zero-Downtime Switching

Blue/green deployment maintains two identical production environments (blue and green). Only one serves live traffic at a time. When deploying, you update the inactive environment, test it, then switch traffic instantly.

My First Blue/Green Success

I implemented blue/green for a critical payment service that couldn't tolerate any downtime or gradual traffic shifting. The business requirement was clear: "Either the old version works or the new version works—no in-between."

Blue/green solved this perfectly. We could fully test the new version with production infrastructure before a single customer saw it.

Implementing Blue/Green with Kubernetes Services

Here's how I implement blue/green using Kubernetes Services and labels:

Step 1: Blue deployment (current production)
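
A minimal sketch of the blue Deployment (the payment-service name, labels, and versions are illustrative, not exact production manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service-blue
  namespace: production
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payment-service
      color: blue
  template:
    metadata:
      labels:
        app: payment-service
        color: blue
    spec:
      containers:
      - name: payment
        image: myregistry/payment-service:v1.4.0   # current production version
        ports:
        - containerPort: 8080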

Step 2: Service routing traffic to blue
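
The Service selects on the color label, so it alone decides which environment receives live traffic:

apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: production
spec:
  selector:
    app: payment-service
    color: blue        # all live traffic goes to the blue pods
  ports:
  - port: 80
    targetPort: 8080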

Step 3: Deploy green (new version)
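
The green Deployment is identical to blue except for its name, its color label, and the new image tag. It runs alongside blue but receives no live traffic yet, so it can be tested directly (for example via port-forward or a temporary test Service):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service-green
  namespace: production
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payment-service
      color: green
  template:
    metadata:
      labels:
        app: payment-service
        color: green
    spec:
      containers:
      - name: payment
        image: myregistry/payment-service:v1.5.0   # new version under test
        ports:
        - containerPort: 8080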

Step 4: Switch traffic to green
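
Once green checks out, repointing the Service selector switches all traffic at once; flipping it back is the rollback:

kubectl patch service payment-service -n production \
  -p '{"spec":{"selector":{"app":"payment-service","color":"green"}}}'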

When I Use Blue/Green

Blue/green is my choice for:

  • Critical services requiring zero downtime

  • Deployments needing pre-production validation with real infrastructure

  • Services with complex dependencies that benefit from full-stack testing

  • Situations where instant rollback is critical

Tradeoffs I consider:

  • Pro: Instant rollback (just flip the switch back)

  • Pro: Test new version with production infrastructure before traffic hits it

  • Con: Requires 2x infrastructure during deployment

  • Con: More complex than rolling updates

  • Con: Database migrations require careful backward compatibility

Canary Deployments: Testing with Real Traffic

Canary deployment releases new versions to a small subset of users first. If metrics look good, you gradually increase the percentage. If problems appear, you roll back before most users are affected.

The name comes from "canary in a coal mine"—miners used canaries to detect toxic gases. If the canary died, miners evacuated before they got sick. Similarly, if your canary deployment shows issues, you stop before rolling out to everyone.

The Canary That Saved Us

I once deployed a change that passed all automated tests but had a subtle performance regression under specific query patterns. Our canary caught this—we noticed the canary pods had 30% higher latency than stable pods. We rolled back before 95% of users ever saw the new version.

Without canary deployment, we would have deployed to all users, impacted everyone, and spent hours investigating and rolling back under pressure.

Implementing Canaries with Argo Rollouts and Kubernetes

I use Argo Rollouts for canary deployments. It provides fine-grained control over traffic splitting and metric-based automation.

Step 1: Install Argo Rollouts
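
At the time of writing, the Argo Rollouts docs install the controller and its kubectl plugin like this:

# Install the controller into its own namespace
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

# Install the kubectl plugin (Homebrew shown; other methods are in the docs)
brew install argoproj/tap/kubectl-argo-rollouts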

Step 2: Create a Rollout resource
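
A Rollout replaces the Deployment resource. Here's a sketch with a stepped canary; the weights, pause durations, and Service names are illustrative choices, not the only sensible ones:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
      - name: api
        image: myregistry/api-service:v2.5.1
        ports:
        - containerPort: 8080
  strategy:
    canary:
      canaryService: api-service-canary   # receives only the canary share of traffic
      stableService: api-service-stable   # receives the rest
      steps:
      - setWeight: 5      # start with 5% of traffic on the new version
      - pause: {duration: 10m}
      - setWeight: 25
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {}         # wait here for manual promotion before going to 100%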

Step 3: Service and traffic management
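
The stable and canary Services are ordinary Services selecting the app label; the controller adds a pod-template-hash to each selector so they pin to the right pods. For exact percentage-based splitting you also add a trafficRouting block (Istio, NGINX Ingress, and others are supported); without one, weights are approximated by scaling replica counts. A sketch of the two Services:

apiVersion: v1
kind: Service
metadata:
  name: api-service-stable
  namespace: production
spec:
  selector:
    app: api-service    # Argo Rollouts pins this to the stable pods
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: api-service-canary
  namespace: production
spec:
  selector:
    app: api-service    # pinned to the canary pods during a rollout
  ports:
  - port: 80
    targetPort: 8080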

Automated Canary with Prometheus Metrics

The real power of canaries is automated decision-making based on metrics:

If error rate exceeds 1% or P95 latency exceeds 500ms, the rollout automatically aborts and rolls back.
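
One way to express that is an AnalysisTemplate backed by Prometheus, referenced from the Rollout either as an analysis step or as a background analysis. The Prometheus address and metric names below are assumptions about your monitoring setup:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: api-service-health
  namespace: production
spec:
  metrics:
  - name: error-rate
    interval: 1m
    failureLimit: 3
    successCondition: result[0] < 0.01   # abort if the error rate exceeds 1%
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{app="api-service",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{app="api-service"}[5m]))
  - name: p95-latency
    interval: 1m
    failureLimit: 3
    successCondition: result[0] < 0.5    # abort if P95 latency exceeds 500ms
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{app="api-service"}[5m])) by (le))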

Monitoring Canary Progress
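
With the kubectl plugin from the install step, the rollout's current weight, step, and analysis runs can be watched live:

# Live view of canary weight, analysis runs, and pod status
kubectl argo rollouts get rollout api-service -n production --watch

# List rollouts and their current step
kubectl argo rollouts list rollouts -n production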

When I Use Canary Deployments

Canary deployments are my go-to for:

  • High-traffic user-facing services where we can tolerate gradual rollout

  • Changes with uncertainty about production behavior

  • Performance changes that need validation under real load

  • A/B testing new features

I use canaries for about 15% of my deployments—particularly for risky changes to critical services.

Rollback Strategies: When Things Go Wrong

Despite all precautions, bad deployments happen. The key is having a fast, reliable rollback strategy.

Kubernetes Built-in Rollback

For deployments using rolling updates:
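
The built-in commands (the revision number is illustrative):

# Roll back to the previous revision
kubectl rollout undo deployment/api-service -n production

# Or roll back to a specific revision from `kubectl rollout history`
kubectl rollout undo deployment/api-service -n production --to-revision=12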

This works, but it's slow (follows the rolling update pattern) and requires kubectl access.

GitOps Rollback with ArgoCD

With GitOps, rollback is a Git revert:
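
A sketch, assuming the manifests live on a main branch that Argo CD watches:

# Revert the commit that introduced the bad image tag or manifest change
git revert <bad-commit-sha>
git push origin main

# Argo CD syncs the cluster back to the reverted state on its next poll;
# to trigger it immediately:
argocd app sync api-service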

This is my preferred method because:

  • Rollback is version controlled

  • No kubectl access needed

  • Clear audit trail

  • Same process as normal deployments

Emergency Rollback Procedure I Use

When production is down and every second counts:

For Blue/Green deployments:
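
Using the illustrative names from the blue/green example above, flip the Service selector straight back:

kubectl patch service payment-service -n production \
  -p '{"spec":{"selector":{"app":"payment-service","color":"blue"}}}'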

For Canary with Argo Rollouts:
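
Aborting sends all traffic back to the stable version immediately:

kubectl argo rollouts abort api-service -n production

# If the bad version was already fully promoted, roll back to the previous revision
kubectl argo rollouts undo api-service -n production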

For standard Deployments:
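
The same built-in undo as above:

kubectl rollout undo deployment/api-service -n production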

Database Rollbacks: The Hard Part

Application rollbacks are easy—database rollbacks are hard. My strategy:

  1. Make migrations backward-compatible: New code should work with old schema, and old code should work with new schema

  2. Separate data changes from code changes: Deploy schema changes separately from application code

  3. Use feature flags: Deploy code with new features disabled, enable after migration completes

  4. Test rollback scenarios: Include migration rollback in testing

Example backward-compatible migration:
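
One way this can look in SQL (PostgreSQL-flavored; the table and column names are made up for illustration), spread across two releases:

-- Release N: add the new column as nullable, so old code that never writes it keeps working
ALTER TABLE orders ADD COLUMN customer_email VARCHAR(255) NULL;

-- Backfill in the background while both versions run
UPDATE orders SET customer_email = legacy_email WHERE customer_email IS NULL;

-- Release N+1, only after no old code remains: tighten constraints and drop the old column
ALTER TABLE orders ALTER COLUMN customer_email SET NOT NULL;
ALTER TABLE orders DROP COLUMN legacy_email;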

Comparison: Which Strategy When?

Strategy          Rollback Speed         Risk Level    Infrastructure Cost   Best For
Rolling Update    Medium (1-5 min)       Low-Medium    1x                    Most services
Blue/Green        Instant (<10 sec)      Very Low      2x during deploy      Critical services, complex deps
Canary            Medium (abort fast)    Very Low      1.1-1.3x              High-risk changes, performance testing

My decision tree:

  • Critical service that needs zero downtime and instant rollback? Use blue/green.

  • Risky or performance-sensitive change that needs validation under real traffic? Use a canary.

  • Everything else gets a rolling update, which covers roughly 80% of my deployments.

Lessons Learned from Failed Deployments

Lesson 1: Health Checks Are Critical

I once deployed without proper readiness probes. Kubernetes routed traffic to pods before they finished initialization, causing 30 seconds of 503 errors for users. Now I always configure both probes:

  • Liveness probe: Is the app alive? (restart if not)

  • Readiness probe: Is the app ready for traffic? (remove from load balancer if not)

Lesson 2: Resource Limits Prevent Cascading Failures

A memory leak in a new version caused pods to consume all node memory. Now I always set resource requests and limits:
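
The same pattern as in the manifest earlier; the exact values depend on the service:

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"   # the pod gets OOM-killed here instead of starving the whole node
    cpu: "500m"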

Lesson 3: Monitor During and After Deployment

I set up automatic alerts that fire if error rates or latencies spike within 10 minutes of a deployment. This has caught several issues that slipped past the metric-based canary analysis.

Lesson 4: Practice Rollbacks

We run rollback drills quarterly. It paid off: the first time we actually needed an emergency rollback, muscle memory kicked in and we recovered in under 2 minutes.

Key Takeaways

  1. Rolling updates are your default strategy—simple, built-in, and reliable for most cases

  2. Blue/green provides instant rollback for critical services at the cost of double infrastructure

  3. Canary minimizes blast radius for risky changes using gradual traffic shifting

  4. Automate rollbacks and practice them regularly—when production is down, you want muscle memory, not documentation

  5. Database migrations require special care—make them backward-compatible whenever possible

In the next part, we'll build robust CI/CD pipelines with automated testing gates, quality checks, and environment promotion flows that prevent bad code from reaching production.


