Progressive Delivery in Kubernetes with Argo Rollouts and ArgoCD

The Deployment That Went Wrong in 30 Seconds

I deployed a new version of an API service on a Friday afternoon. Green build, passed all tests, reviewed by two engineers.

Within 30 seconds of it hitting production:

```shell
kubectl get pods -n production
# NAME                     READY   STATUS             RESTARTS
# api-7f9c4d-xk2lp         0/1     CrashLoopBackOff   3
# api-7f9c4d-m9p3z         0/1     CrashLoopBackOff   3
# api-7f9c4d-q1n8r         0/1     CrashLoopBackOff   3
# (old pods already terminated)

kubectl logs api-7f9c4d-xk2lp
# Error: Cannot read properties of undefined (reading 'config')
# at ServiceBootstrap.init (/app/bootstrap.js:42:23)
```

A missing environment variable. Every pod was down. 100% of traffic hit dead pods.

The rollback was instant – but the 4 minutes of full outage already fired alerts, paged on-call engineers, and triggered customer-visible errors.

What I needed wasn't faster rollback. I needed a way to discover the problem before 100% of traffic hit it.

That's progressive delivery.

What is Progressive Delivery?

Progressive delivery is a deployment strategy that controls how much of your traffic sees a new version, using automated analysis to decide whether to proceed or roll back – before the problem affects all users.

It builds on top of Continuous Delivery, adding traffic control and observability checkpoints.

The core idea: release progressively, validate continuously.

Why Kubernetes Deployments Aren't Enough

Native Kubernetes Deployment does support rolling updates:
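For reference, a standard Deployment with a RollingUpdate strategy (image name and port are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # one extra pod during the update
      maxUnavailable: 0    # never drop below desired capacity
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: registry.example.com/api:v2
        ports:
        - containerPort: 8080
```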

But there's a critical limitation: traffic is routed purely based on pod count. If you have 10 pods and update 1, roughly 10% of requests hit the new version – but you have no automated way to stop the rollout if error rates spike. Kubernetes will keep rolling out regardless.

The other problem: Kubernetes RollingUpdate doesn't support proper blue-green deployments or traffic-percentage-based canaries. It's a best-effort pod count approximation.

For real progressive delivery in Kubernetes, you need Argo Rollouts.

The Ecosystem: How the Tools Fit Together


Responsibilities:

  • ArgoCD – Git sync. Applies your Rollout manifests to the cluster.

  • Argo Rollouts Controller – Manages canary/blue-green logic, traffic shifting.

  • Prometheus / Datadog – Supplies metrics for automated promotion/rollback decisions.

  • Istio / NGINX – Performs the actual traffic split at the network layer.

ArgoCD and Argo Rollouts are complementary, not competing. ArgoCD handles reconciliation; Argo Rollouts handles the delivery strategy.

Installing Argo Rollouts

Install the Controller
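The controller installs from the official release manifests into a dedicated namespace (in production you'd pin a specific version rather than latest):

```shell
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
```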

Verify the controller is running:
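Assuming the default argo-rollouts namespace from the install step:

```shell
kubectl get pods -n argo-rollouts
# expect a single argo-rollouts controller pod in Running state
```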

Install the kubectl Plugin

The kubectl argo rollouts plugin gives you a live-updating dashboard in the terminal:
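One way to install it – Homebrew on macOS, or the release binary on Linux (the version-less latest URL is a convenience; pin a release if you prefer):

```shell
# macOS
brew install argoproj/tap/kubectl-argo-rollouts

# Linux
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
```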

Test it:
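A quick sanity check that the plugin is on your PATH:

```shell
kubectl argo rollouts version
```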

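The kubectl plugin covers the terminal; the ArgoCD web UI needs a separate rollout extension installed into argocd-server. A sketch of the init-container pattern – the installer image, extension URL, and versions here are assumptions, so check the argoproj-labs/rollout-extension README for current values:

```yaml
# Patch for the argocd-server Deployment (names/URLs are assumptions)
spec:
  template:
    spec:
      initContainers:
      - name: rollout-extension
        image: quay.io/argoprojlabs/argocd-extension-installer:v0.0.1
        env:
        - name: EXTENSION_URL
          value: https://github.com/argoproj-labs/rollout-extension/releases/download/v0.3.0/extension.tar
        volumeMounts:
        - name: extensions
          mountPath: /tmp/extensions/
      containers:
      - name: argocd-server
        volumeMounts:
        - name: extensions
          mountPath: /tmp/extensions/
      volumes:
      - name: extensions
        emptyDir: {}
```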
This enables the Argo Rollouts UI panel inside ArgoCD's web interface.

Strategy 1: Canary Deployments

A canary deployment sends a small percentage of traffic to the new version, gradually increases it, and validates at each step.

The name comes from "canary in a coal mine" – if the canary dies, you know there's danger before it reaches everyone.

The Rollout Resource

Rollout is an Argo Rollouts CRD that replaces the standard Deployment. The spec is nearly identical – you switch kind: Deployment to kind: Rollout and add a strategy section.
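A sketch for the api service from the opening incident – the registry path is hypothetical and the step weights/timings are starting points you'd tune:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: registry.example.com/api:v2   # hypothetical image
        ports:
        - containerPort: 8080
  strategy:
    canary:
      steps:
      - setWeight: 5          # 5% of traffic to the canary
      - pause: {}             # wait indefinitely for manual promotion
      - setWeight: 25
      - pause: {duration: 2m}
      - setWeight: 50
      - pause: {duration: 2m}
```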

When you update the image tag, Argo Rollouts:

  1. Creates a new ReplicaSet (canary)

  2. Shifts 5% of traffic to it

  3. Pauses and waits

  4. Continues (or rolls back on failure)

Traffic Shifting with Istio

Pod-count-based traffic splitting is approximate. For precise percentages, wire Argo Rollouts to Istio:
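A VirtualService with a stable and a canary destination the controller can rewrite – the Service names api-stable and api-canary are assumptions:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api-vsvc
spec:
  hosts:
  - api
  http:
  - name: primary            # route the Rollout will manage
    route:
    - destination:
        host: api-stable     # Service for stable pods
      weight: 100
    - destination:
        host: api-canary     # Service for canary pods
      weight: 0
```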

Reference the VirtualService in your Rollout:
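A sketch of the strategy block, assuming the api-stable/api-canary Services and the api-vsvc VirtualService named above exist:

```yaml
strategy:
  canary:
    canaryService: api-canary     # selects canary pods
    stableService: api-stable     # selects stable pods
    trafficRouting:
      istio:
        virtualService:
          name: api-vsvc
          routes:
          - primary               # the named http route to rewrite
    steps:
    - setWeight: 5
    - pause: {}
```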

Now when the controller sets weight: 5, Istio routes exactly 5% – regardless of pod count.

Watching a Canary Rollout Live
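The plugin's watch mode follows the rollout as it progresses (namespace assumed from earlier):

```shell
kubectl argo rollouts get rollout api -n production --watch
```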

Output:
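Abridged and representative – exact fields and status icons vary by plugin version; image tags are the hypothetical ones from the Rollout sketch:

```
Name:            api
Namespace:       production
Status:          Paused
Message:         CanaryPauseStep
Strategy:        Canary
  Step:          1/6
  SetWeight:     5
  ActualWeight:  5
Images:          registry.example.com/api:v2 (canary)
                 registry.example.com/api:v1 (stable)
Replicas:
  Desired:       10
  Current:       11
  Updated:       1
  Ready:         11
  Available:     11
```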

Promoting Manually

If you want to skip a pause and promote immediately:
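Both forms below assume the api Rollout in the production namespace:

```shell
# advance past the current pause step
kubectl argo rollouts promote api -n production

# skip all remaining steps and go straight to 100%
kubectl argo rollouts promote api -n production --full
```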

Aborting (Rolling Back)
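If the canary is misbehaving, abort it (Rollout name and namespace assumed):

```shell
kubectl argo rollouts abort api -n production
```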

The controller routes 100% traffic back to the stable ReplicaSet and marks the rollout as Degraded. You can then retry after fixing the issue.

Strategy 2: Blue-Green Deployments

Blue-green keeps two full environments alive: blue (current stable) and green (new version). You switch all traffic at once, but green is pre-warmed and validated before the switch.

The advantage over canary: no partial state. Users are never split between two versions. Critical for database schema changes where you can't have two different app versions talking to the same DB simultaneously.

Blue-Green Rollout
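A minimal blue-green Rollout for a hypothetical payment-service – image, replica count, and Service names are assumptions; the delay and promotion settings match the production lessons later in this post:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
      - name: payment-service
        image: registry.example.com/payment-service:v2   # hypothetical
        ports:
        - containerPort: 8080
  strategy:
    blueGreen:
      activeService: payment-service-active
      previewService: payment-service-preview
      autoPromotionEnabled: false      # promote manually after smoke tests
      scaleDownDelaySeconds: 300       # keep blue alive 5 min after cutover
```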

Two Services are required:
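Both select the same app label; the controller narrows each Service to the right ReplicaSet by injecting a pod-template-hash selector. Names and ports are assumptions:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-service-active    # receives production traffic
spec:
  selector:
    app: payment-service
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: payment-service-preview   # receives smoke-test traffic only
spec:
  selector:
    app: payment-service
  ports:
  - port: 80
    targetPort: 8080
```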

Blue-Green Flow

  1. The new (green) ReplicaSet is created and scaled to full size alongside blue.

  2. The preview Service points at green; the active Service still points at blue.

  3. Smoke tests run against the preview Service.

  4. On promotion, the active Service switches to green in one step.

  5. Blue scales down after scaleDownDelaySeconds.

The key step: you run your smoke tests against payment-service-preview before promoting. If anything fails, you just don't promote – blue is still serving 100% of traffic.

Promoting the Blue-Green Rollout
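Promotion flips the active Service to the new ReplicaSet (Rollout name and namespace assumed):

```shell
kubectl argo rollouts promote payment-service -n production
```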

Automated Promotion and Rollback with AnalysisRun

Manual promotion works, but the real power is automated analysis: let Prometheus metrics decide whether to proceed or roll back.

AnalysisTemplate

An AnalysisTemplate defines what metrics to query and what constitutes a passing or failing result:
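A sketch measuring HTTP success rate – the Prometheus address and metric labels are assumptions about your setup; the 95% threshold and failureLimit: 1 match the guidance later in this post:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 30s
    count: 4
    successCondition: result[0] >= 0.95
    failureLimit: 1                # tolerate one transient failure
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090  # assumed
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[2m]))
          /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```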

Wiring Analysis into a Canary Rollout
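A sketch of inline analysis steps, assuming an AnalysisTemplate named success-rate exists:

```yaml
strategy:
  canary:
    steps:
    - setWeight: 5
    - pause: {duration: 2m}
    - analysis:                    # gate between 5% and 25%
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: api
    - setWeight: 25
    - pause: {duration: 2m}
    - setWeight: 50
```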

Now if the success rate drops below 95% during the pause, Argo Rollouts automatically aborts and routes traffic back to stable – without anyone having to notice or intervene.

Background Analysis

You can also run analysis continuously throughout a canary, not just at pause steps:
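Instead of an analysis step, the analysis block moves to the canary level (same assumed success-rate template):

```yaml
strategy:
  canary:
    analysis:                      # runs alongside the whole canary
      templates:
      - templateName: success-rate
      args:
      - name: service-name
        value: api
    steps:
    - setWeight: 5
    - pause: {duration: 2m}
    - setWeight: 25
```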

Background analysis runs from the first step and monitors continuously. A failure at any point triggers automatic rollback.

Datadog as an Analysis Provider

If you use Datadog instead of Prometheus:
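The template shape stays the same; only the provider changes. The Datadog metric names here are assumptions (standard APM trace metrics), and the controller reads credentials from a Secret named datadog (address, api-key, app-key) rather than from the template:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
  - name: error-rate
    interval: 1m
    failureLimit: 1
    successCondition: default(result, 0) < 0.05
    provider:
      datadog:
        interval: 5m               # lookback window for the query
        query: |
          sum:trace.http.request.errors{service:api}.as_count() /
          sum:trace.http.request.hits{service:api}.as_count()
```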

Argo Rollouts supports Prometheus, Datadog, New Relic, CloudWatch, Graphite, and custom webhooks.

End-to-End GitOps Flow with Progressive Delivery

Putting it all together with ArgoCD managing the lifecycle:

Repository Structure
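One possible layout – file and directory names are purely illustrative:

```
config-repo/
├── apps/
│   └── api/
│       ├── rollout.yaml
│       ├── services.yaml
│       ├── analysis-template.yaml
│       └── kustomization.yaml
└── argocd/
    └── api-app.yaml
```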

ArgoCD Application Pointing at the Rollout
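A sketch of the Application – the repo URL, path, and namespaces are assumptions matching the layout above:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/config-repo.git  # hypothetical
    targetRevision: main
    path: apps/api
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```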

The Complete Deployment Flow

  1. CI builds and pushes the new container image.

  2. CI commits the new image tag to the config repo.

  3. ArgoCD detects the Git change and applies the updated Rollout.

  4. Argo Rollouts shifts canary traffic step by step.

  5. AnalysisRuns promote automatically, or abort and route traffic back to stable.

The CI Piece: Updating the Image Tag

The CI pipeline needs to push the new image tag into the config repo. Here's a GitHub Actions job that does it:
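A sketch of such a job – the config repo, secret name, registry path, and use of kustomize are all assumptions about your setup:

```yaml
jobs:
  update-config:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: example-org/config-repo        # hypothetical config repo
          token: ${{ secrets.CONFIG_REPO_TOKEN }}    # PAT with push access
      - name: Bump image tag
        run: |
          cd apps/api
          kustomize edit set image registry.example.com/api:${{ github.sha }}
      - name: Commit and push
        run: |
          git config user.name "ci-bot"
          git config user.email "ci-bot@users.noreply.github.com"
          git commit -am "deploy: api ${{ github.sha }}"
          git push
```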

ArgoCD picks up the config change, applies the updated Rollout, and Argo Rollouts takes control from there.

ArgoCD UI Integration

Once you install the ArgoCD UI plugin for Argo Rollouts, you see the rollout status directly in ArgoCD's application view:

  • Canary weight percentage

  • Current step

  • AnalysisRun status (running / passed / failed)

  • ReplicaSet breakdown (stable vs canary)

  • Pause/promote/abort buttons

You can promote or abort directly from the UI without touching the terminal – useful for engineers who aren't deep in kubectl.

Checking Status and Debugging

Getting Rollout Status
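Two commands cover most of it (Rollout name and namespace assumed):

```shell
# one-line status; blocks until Healthy/Degraded, useful in CI
kubectl argo rollouts status api -n production

# full breakdown: steps, weights, ReplicaSets, AnalysisRuns
kubectl argo rollouts get rollout api -n production
```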

When a Rollout Gets Stuck

The most common issue: an analysis failure that isn't obvious from the rollout status.
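The AnalysisRun objects hold the details the rollout status hides:

```shell
kubectl get analysisruns -n production
# then inspect the failed one for measurement values and errors
kubectl describe analysisrun <name> -n production
```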

Common causes of stuck rollouts:

  • Prometheus query returns NaN when a service has zero requests (divide-by-zero). Fix: add or vector(1) fallback.

  • Analysis template references wrong service label.

  • failureLimit: 0 with any transient network error causing analysis to fail. Set failureLimit: 1 as a minimum.

Aborting and Retrying
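Abort routes everything back to stable; retry restarts the rollout from the beginning once you've pushed a fix (names assumed):

```shell
kubectl argo rollouts abort api -n production
# after fixing the issue:
kubectl argo rollouts retry rollout api -n production
```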

Converting Existing Deployments

If you have existing Deployment resources, the migration is straightforward:
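The diff, sketched on the api example – selector and pod template carry over unchanged:

```yaml
apiVersion: argoproj.io/v1alpha1   # was: apps/v1
kind: Rollout                      # was: Deployment
metadata:
  name: api
spec:
  replicas: 10
  # selector and template: copied unchanged from the Deployment
  strategy:                        # replaces spec.strategy.rollingUpdate
    canary:
      steps:
      - setWeight: 5
      - pause: {duration: 2m}
```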

Apply the Rollout first, wait for its pods to become ready, then delete the old Deployment. The pods are recreated, so sequence it that way and do it during low traffic to keep the migration non-disruptive.

What Canary Can't Protect You From

Progressive delivery isn't a silver bullet. Things it doesn't catch:

  • Data migration failures – If your migration breaks halfway through, canary routing doesn't help. Handle with blue-green + pre-migration snapshots.

  • External dependency issues – If a third-party API your service depends on goes down after promotion, that's not something canary traffic analysis will predict.

  • Metrics lag – Some errors only surface under load that canary percentages don't generate. Consider a longer pause period or a dedicated load test stage before canary.

  • One-time initialization failures – Some bugs only hit on the first request to a fresh pod. Adjust your readinessProbe so unhealthy pods never enter the canary pool.

What I Learned Running This in Production

Start simple, then add analysis. My first Argo Rollouts setup was just weighted steps with manual pauses – no automated analysis. That alone was a huge improvement over raw kubectl rollout. I added Prometheus analysis only after I had a reliable metrics setup.

The readiness probe is your first gate. A pod that can't pass its readiness probe never joins the canary pool. Put real business-logic checks in your /health endpoint – not just "port is open."

Keep scaleDownDelaySeconds generous on blue-green. The default is 30 seconds. I bumped it to 300 (5 minutes). If you promote and immediately notice something wrong, you can abort within the window and blue is still alive and serving traffic.

Use autoPromotionEnabled: false when starting out. Manual promotion on blue-green gives you a forcing function to run smoke tests against the preview environment. Once your automated tests cover enough surface area, switch to auto.

The analysis query must handle zero-traffic cases. When a service has just started, the request rate is near zero. A Prometheus rate query over 2 minutes might return NaN. Add a default:
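With the hypothetical http_requests_total metric from earlier, the guarded query looks like this (PromQL's or binds looser than /, so the fallback applies to the whole ratio):

```
sum(rate(http_requests_total{service="api",status!~"5.."}[2m]))
/
sum(rate(http_requests_total{service="api"}[2m]))
or vector(1)
```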

or vector(1) returns 1.0 (100% success) when no data is available, which lets the rollout proceed without false failures during warmup.

Summary

Progressive delivery addresses the fundamental problem with traditional deployments: the blast radius is always 100%.

With Argo Rollouts and ArgoCD:

  • Canary – limit exposure to a percentage of real traffic, analyze, proceed or roll back automatically.

  • Blue-green – run two full environments, test before switching traffic, instant cutover.

  • AnalysisRun – tie Prometheus/Datadog metrics to automatic promotion/rollback decisions.

  • ArgoCD – manage all of it declaratively from Git, with full visibility in the UI.

The deployment that crashed my API in 30 seconds? With a 5% canary and a 2-minute analysis window, it would have been caught at a single canary pod taking 5% of traffic instead of all 10 pods – with an automatic rollback before I even got the alert.

That's the goal: deploy with confidence, not prayers.

