Progressive Delivery in Kubernetes with Argo Rollouts and ArgoCD

The Deployment That Went Wrong in 30 Seconds

I deployed a new version of an API service on a Friday afternoon. Green build, passed all tests, reviewed by two engineers.

Within 30 seconds of it hitting production:

```shell
kubectl get pods -n production
# NAME                     READY   STATUS             RESTARTS
# api-7f9c4d-xk2lp         0/1     CrashLoopBackOff   3
# api-7f9c4d-m9p3z         0/1     CrashLoopBackOff   3
# api-7f9c4d-q1n8r         0/1     CrashLoopBackOff   3
# (old pods already terminated)

kubectl logs api-7f9c4d-xk2lp
# Error: Cannot read properties of undefined (reading 'config')
# at ServiceBootstrap.init (/app/bootstrap.js:42:23)
```

A missing environment variable. Every pod was down. 100% of traffic hit dead pods.

The rollback was instant – but the 4 minutes of full outage already fired alerts, paged on-call engineers, and triggered customer-visible errors.

What I needed wasn't faster rollback. I needed a way to discover the problem before 100% of traffic hit it.

That's progressive delivery.

What is Progressive Delivery?

Progressive delivery is a deployment strategy that controls how much of your traffic sees a new version, using automated analysis to decide whether to proceed or roll back – before the problem affects all users.

It builds on top of Continuous Delivery, adding traffic control and observability checkpoints.

The core idea: release progressively, validate continuously.

Why Kubernetes Deployments Aren't Enough

Native Kubernetes Deployment does support rolling updates:
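For reference, a standard Deployment with a RollingUpdate strategy (image name and port are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # one extra pod during the update
      maxUnavailable: 0    # never drop below desired capacity
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: registry.example.com/api:v2
        ports:
        - containerPort: 8080
```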

But there's a critical limitation: traffic is routed purely based on pod count. If you have 10 pods and update 1, roughly 10% of requests hit the new version – but you have no automated way to stop the rollout if error rates spike. Kubernetes will keep rolling out regardless.

The other problem: Kubernetes RollingUpdate doesn't support proper blue-green deployments or traffic-percentage-based canaries. It's a best-effort pod count approximation.

For real progressive delivery in Kubernetes, you need Argo Rollouts.

The Ecosystem: How the Tools Fit Together


Responsibilities:

  • ArgoCD – Git sync. Applies your Rollout manifests to the cluster.

  • Argo Rollouts Controller – Manages canary/blue-green logic, traffic shifting.

  • Prometheus / Datadog – Supplies metrics for automated promotion/rollback decisions.

  • Istio / NGINX – Performs the actual traffic split at the network layer.

ArgoCD and Argo Rollouts are complementary, not competing. ArgoCD handles reconciliation; Argo Rollouts handles the delivery strategy.

Installing Argo Rollouts

Install the Controller
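The controller installs from the official release manifests into a dedicated namespace (in production you'd pin a specific version rather than latest):

```shell
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
```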

Verify the controller is running:
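Assuming the default argo-rollouts namespace from the install step:

```shell
kubectl get pods -n argo-rollouts
# expect a single argo-rollouts controller pod in Running state
```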

Install the kubectl Plugin

The kubectl argo rollouts plugin gives you a live-updating dashboard in the terminal:
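One way to install it – Homebrew on macOS, or the release binary on Linux (the version-less latest URL is a convenience; pin a release if you prefer):

```shell
# macOS
brew install argoproj/tap/kubectl-argo-rollouts

# Linux
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
```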

Test it:
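A quick sanity check that the plugin is on your PATH:

```shell
kubectl argo rollouts version
```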

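The kubectl plugin covers the terminal; the ArgoCD web UI needs a separate rollout extension installed into argocd-server. A sketch of the init-container pattern – the installer image, extension URL, and versions here are assumptions, so check the argoproj-labs/rollout-extension README for current values:

```yaml
# Patch for the argocd-server Deployment (names/URLs are assumptions)
spec:
  template:
    spec:
      initContainers:
      - name: rollout-extension
        image: quay.io/argoprojlabs/argocd-extension-installer:v0.0.1
        env:
        - name: EXTENSION_URL
          value: https://github.com/argoproj-labs/rollout-extension/releases/download/v0.3.0/extension.tar
        volumeMounts:
        - name: extensions
          mountPath: /tmp/extensions/
      containers:
      - name: argocd-server
        volumeMounts:
        - name: extensions
          mountPath: /tmp/extensions/
      volumes:
      - name: extensions
        emptyDir: {}
```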
This enables the Argo Rollouts UI panel inside ArgoCD's web interface.

Strategy 1: Canary Deployments

A canary deployment sends a small percentage of traffic to the new version, gradually increases it, and validates at each step.

The name comes from "canary in a coal mine" – if the canary dies, you know there's danger before it reaches everyone.

The Rollout Resource

Rollout is an Argo Rollouts CRD that replaces the standard Deployment. The spec is nearly identical – you switch kind: Deployment to kind: Rollout and add a strategy section.
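A sketch for the api service from the opening incident – the registry path is hypothetical and the step weights/timings are starting points you'd tune:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: registry.example.com/api:v2   # hypothetical image
        ports:
        - containerPort: 8080
  strategy:
    canary:
      steps:
      - setWeight: 5          # 5% of traffic to the canary
      - pause: {}             # wait indefinitely for manual promotion
      - setWeight: 25
      - pause: {duration: 2m}
      - setWeight: 50
      - pause: {duration: 2m}
```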

When you update the image tag, Argo Rollouts:

  1. Creates a new ReplicaSet (canary)

  2. Shifts 5% of traffic to it

  3. Pauses and waits

  4. Continues (or rolls back on failure)

Traffic Shifting with Istio

Pod-count-based traffic splitting is approximate. For precise percentages, wire Argo Rollouts to Istio:
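A VirtualService with a stable and a canary destination the controller can rewrite – the Service names api-stable and api-canary are assumptions:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api-vsvc
spec:
  hosts:
  - api
  http:
  - name: primary            # route the Rollout will manage
    route:
    - destination:
        host: api-stable     # Service for stable pods
      weight: 100
    - destination:
        host: api-canary     # Service for canary pods
      weight: 0
```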

Reference the VirtualService in your Rollout:
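A sketch of the strategy block, assuming the api-stable/api-canary Services and the api-vsvc VirtualService named above exist:

```yaml
strategy:
  canary:
    canaryService: api-canary     # selects canary pods
    stableService: api-stable     # selects stable pods
    trafficRouting:
      istio:
        virtualService:
          name: api-vsvc
          routes:
          - primary               # the named http route to rewrite
    steps:
    - setWeight: 5
    - pause: {}
```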

Now when the controller sets weight: 5, Istio routes exactly 5% – regardless of pod count.

Watching a Canary Rollout Live
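The plugin's watch mode follows the rollout as it progresses (namespace assumed from earlier):

```shell
kubectl argo rollouts get rollout api -n production --watch
```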

Output:
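Abridged and representative – exact fields and status icons vary by plugin version; image tags are the hypothetical ones from the Rollout sketch:

```
Name:            api
Namespace:       production
Status:          Paused
Message:         CanaryPauseStep
Strategy:        Canary
  Step:          1/6
  SetWeight:     5
  ActualWeight:  5
Images:          registry.example.com/api:v2 (canary)
                 registry.example.com/api:v1 (stable)
Replicas:
  Desired:       10
  Current:       11
  Updated:       1
  Ready:         11
  Available:     11
```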

Promoting Manually

If you want to skip a pause and promote immediately:
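Both forms below assume the api Rollout in the production namespace:

```shell
# advance past the current pause step
kubectl argo rollouts promote api -n production

# skip all remaining steps and go straight to 100%
kubectl argo rollouts promote api -n production --full
```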

Aborting (Rolling Back)
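If the canary is misbehaving, abort it (Rollout name and namespace assumed):

```shell
kubectl argo rollouts abort api -n production
```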

The controller routes 100% traffic back to the stable ReplicaSet and marks the rollout as Degraded. You can then retry after fixing the issue.

Strategy 2: Blue-Green Deployments

Blue-green keeps two full environments alive: blue (current stable) and green (new version). You switch all traffic at once, but green is pre-warmed and validated before the switch.

The advantage over canary: no partial state. Users are never split between two versions. Critical for database schema changes where you can't have two different app versions talking to the same DB simultaneously.

Blue-Green Rollout
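A minimal blue-green Rollout for a hypothetical payment-service – image, replica count, and Service names are assumptions; the delay and promotion settings match the production lessons later in this post:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
      - name: payment-service
        image: registry.example.com/payment-service:v2   # hypothetical
        ports:
        - containerPort: 8080
  strategy:
    blueGreen:
      activeService: payment-service-active
      previewService: payment-service-preview
      autoPromotionEnabled: false      # promote manually after smoke tests
      scaleDownDelaySeconds: 300       # keep blue alive 5 min after cutover
```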

Two Services are required:
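Both select the same app label; the controller narrows each Service to the right ReplicaSet by injecting a pod-template-hash selector. Names and ports are assumptions:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-service-active    # receives production traffic
spec:
  selector:
    app: payment-service
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: payment-service-preview   # receives smoke-test traffic only
spec:
  selector:
    app: payment-service
  ports:
  - port: 80
    targetPort: 8080
```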

Blue-Green Flow

  1. The new (green) ReplicaSet is created and scaled to full size alongside blue.

  2. The preview Service points at green; the active Service still points at blue.

  3. Smoke tests run against the preview Service.

  4. On promotion, the active Service switches to green in one step.

  5. Blue scales down after scaleDownDelaySeconds.

The key step: you run your smoke tests against payment-service-preview before promoting. If anything fails, you just don't promote – blue is still serving 100% of traffic.

Promoting the Blue-Green Rollout
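Promotion flips the active Service to the new ReplicaSet (Rollout name and namespace assumed):

```shell
kubectl argo rollouts promote payment-service -n production
```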

Automated Promotion and Rollback with AnalysisRun

Manual promotion works, but the real power is automated analysis: let Prometheus metrics decide whether to proceed or roll back.

AnalysisTemplate

An AnalysisTemplate defines what metrics to query and what constitutes a passing or failing result:
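A sketch measuring HTTP success rate – the Prometheus address and metric labels are assumptions about your setup; the 95% threshold and failureLimit: 1 match the guidance later in this post:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 30s
    count: 4
    successCondition: result[0] >= 0.95
    failureLimit: 1                # tolerate one transient failure
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090  # assumed
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[2m]))
          /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```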

Wiring Analysis into a Canary Rollout
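A sketch of inline analysis steps, assuming an AnalysisTemplate named success-rate exists:

```yaml
strategy:
  canary:
    steps:
    - setWeight: 5
    - pause: {duration: 2m}
    - analysis:                    # gate between 5% and 25%
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: api
    - setWeight: 25
    - pause: {duration: 2m}
    - setWeight: 50
```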

Now if the success rate drops below 95% during the pause, Argo Rollouts automatically aborts and routes traffic back to stable – without anyone having to notice or intervene.

Background Analysis

You can also run analysis continuously throughout a canary, not just at pause steps:
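Instead of an analysis step, the analysis block moves to the canary level (same assumed success-rate template):

```yaml
strategy:
  canary:
    analysis:                      # runs alongside the whole canary
      templates:
      - templateName: success-rate
      args:
      - name: service-name
        value: api
    steps:
    - setWeight: 5
    - pause: {duration: 2m}
    - setWeight: 25
```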

Background analysis runs from the first step and monitors continuously. A failure at any point triggers automatic rollback.

Datadog as an Analysis Provider

If you use Datadog instead of Prometheus:
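The template shape stays the same; only the provider changes. The Datadog metric names here are assumptions (standard APM trace metrics), and the controller reads credentials from a Secret named datadog (address, api-key, app-key) rather than from the template:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
  - name: error-rate
    interval: 1m
    failureLimit: 1
    successCondition: default(result, 0) < 0.05
    provider:
      datadog:
        interval: 5m               # lookback window for the query
        query: |
          sum:trace.http.request.errors{service:api}.as_count() /
          sum:trace.http.request.hits{service:api}.as_count()
```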

Argo Rollouts supports Prometheus, Datadog, New Relic, CloudWatch, Graphite, and custom webhooks.

End-to-End GitOps Flow with Progressive Delivery

Putting it all together with ArgoCD managing the lifecycle:

Repository Structure
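One possible layout – file and directory names are purely illustrative:

```
config-repo/
├── apps/
│   └── api/
│       ├── rollout.yaml
│       ├── services.yaml
│       ├── analysis-template.yaml
│       └── kustomization.yaml
└── argocd/
    └── api-app.yaml
```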

ArgoCD Application Pointing at the Rollout
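A sketch of the Application – the repo URL, path, and namespaces are assumptions matching the layout above:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/config-repo.git  # hypothetical
    targetRevision: main
    path: apps/api
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```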

The Complete Deployment Flow

  1. CI builds and pushes the new container image.

  2. CI commits the new image tag to the config repo.

  3. ArgoCD detects the Git change and applies the updated Rollout.

  4. Argo Rollouts shifts canary traffic step by step.

  5. AnalysisRuns promote automatically, or abort and route traffic back to stable.

The CI Piece: Updating the Image Tag

The CI pipeline needs to push the new image tag into the config repo. Here's a GitHub Actions job that does it:
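A sketch of such a job – the config repo, secret name, registry path, and use of kustomize are all assumptions about your setup:

```yaml
jobs:
  update-config:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: example-org/config-repo        # hypothetical config repo
          token: ${{ secrets.CONFIG_REPO_TOKEN }}    # PAT with push access
      - name: Bump image tag
        run: |
          cd apps/api
          kustomize edit set image registry.example.com/api:${{ github.sha }}
      - name: Commit and push
        run: |
          git config user.name "ci-bot"
          git config user.email "ci-bot@users.noreply.github.com"
          git commit -am "deploy: api ${{ github.sha }}"
          git push
```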

ArgoCD picks up the config change, applies the updated Rollout, and Argo Rollouts takes control from there.

ArgoCD UI Integration

Once you install the ArgoCD UI plugin for Argo Rollouts, you see the rollout status directly in ArgoCD's application view:

  • Canary weight percentage

  • Current step

  • AnalysisRun status (running / passed / failed)

  • ReplicaSet breakdown (stable vs canary)

  • Pause/promote/abort buttons

You can promote or abort directly from the UI without touching the terminal – useful for engineers who aren't deep in kubectl.

Checking Status and Debugging

Getting Rollout Status
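Two commands cover most of it (Rollout name and namespace assumed):

```shell
# one-line status; blocks until Healthy/Degraded, useful in CI
kubectl argo rollouts status api -n production

# full breakdown: steps, weights, ReplicaSets, AnalysisRuns
kubectl argo rollouts get rollout api -n production
```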

When a Rollout Gets Stuck

The most common issue: an analysis failure that isn't obvious from the rollout status.
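The AnalysisRun objects hold the details the rollout status hides:

```shell
kubectl get analysisruns -n production
# then inspect the failed one for measurement values and errors
kubectl describe analysisrun <name> -n production
```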

Common causes of stuck rollouts:

  • Prometheus query returns NaN when a service has zero requests (divide-by-zero). Fix: add or vector(1) fallback.

  • Analysis template references wrong service label.

  • failureLimit: 0 with any transient network error causing analysis to fail. Set failureLimit: 1 as a minimum.

Aborting and Retrying
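Abort routes everything back to stable; retry restarts the rollout from the beginning once you've pushed a fix (names assumed):

```shell
kubectl argo rollouts abort api -n production
# after fixing the issue:
kubectl argo rollouts retry rollout api -n production
```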

Converting Existing Deployments

If you have existing Deployment resources, the migration is straightforward:
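The diff, sketched on the api example – selector and pod template carry over unchanged:

```yaml
apiVersion: argoproj.io/v1alpha1   # was: apps/v1
kind: Rollout                      # was: Deployment
metadata:
  name: api
spec:
  replicas: 10
  # selector and template: copied unchanged from the Deployment
  strategy:                        # replaces spec.strategy.rollingUpdate
    canary:
      steps:
      - setWeight: 5
      - pause: {duration: 2m}
```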

Apply the Rollout first, wait for its pods to become ready, then delete the old Deployment. The pods are recreated, so sequence it that way and do it during low traffic to keep the migration non-disruptive.

What Canary Can't Protect You From

Progressive delivery isn't a silver bullet. Things it doesn't catch:

  • Data migration failures – If your migration breaks halfway through, canary routing doesn't help. Handle with blue-green + pre-migration snapshots.

  • External dependency issues – If a third-party API your service depends on goes down after promotion, that's not something canary traffic analysis will predict.

  • Metrics lag – Some errors only surface under load that canary percentages don't generate. Consider a longer pause period or a dedicated load test stage before canary.

  • One-time initialization failures – Some bugs only hit on the first request to a fresh pod. Adjust your readinessProbe so unhealthy pods never enter the canary pool.

What I Learned Running This in Production

Start simple, then add analysis. My first Argo Rollouts setup was just weighted steps with manual pauses – no automated analysis. That alone was a huge improvement over raw kubectl rollout. I added Prometheus analysis only after I had a reliable metrics setup.

The readiness probe is your first gate. A pod that can't pass its readiness probe never joins the canary pool. Put real business-logic checks in your /health endpoint – not just "port is open."

Keep scaleDownDelaySeconds generous on blue-green. The default is 30 seconds. I bumped it to 300 (5 minutes). If you promote and immediately notice something wrong, you can abort within the window and blue is still alive and serving traffic.

Use autoPromotionEnabled: false when starting out. Manual promotion on blue-green gives you a forcing function to run smoke tests against the preview environment. Once your automated tests cover enough surface area, switch to auto.

The analysis query must handle zero-traffic cases. When a service has just started, the request rate is near zero. A Prometheus rate query over 2 minutes might return NaN. Add a default:
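With the hypothetical http_requests_total metric from earlier, the guarded query looks like this (PromQL's or binds looser than /, so the fallback applies to the whole ratio):

```
sum(rate(http_requests_total{service="api",status!~"5.."}[2m]))
/
sum(rate(http_requests_total{service="api"}[2m]))
or vector(1)
```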

or vector(1) returns 1.0 (100% success) when no data is available, which lets the rollout proceed without false failures during warmup.

Summary

Progressive delivery addresses the fundamental problem with traditional deployments: the blast radius is always 100%.

With Argo Rollouts and ArgoCD:

  • Canary – limit exposure to a percentage of real traffic, analyze, proceed or roll back automatically.

  • Blue-green – run two full environments, test before switching traffic, instant cutover.

  • AnalysisRun – tie Prometheus/Datadog metrics to automatic promotion/rollback decisions.

  • ArgoCD – manage all of it declaratively from Git, with full visibility in the UI.

The deployment that crashed my API in 30 seconds? With a 5% canary and a 2-minute analysis window, it would have been caught at a single canary pod taking 5% of traffic instead of all 10 pods – with an automatic rollback before I even got the alert.

That's the goal: deploy with confidence, not prayers.

