Understanding GitOps Core Concepts

The Day Configuration Drift Cost Us $12,000

It was a normal Wednesday until our AWS bill arrived: $12,000 for the month. Usually $3,000.

I started investigating:

# Check production cluster
kubectl get deployments -n production
# api-service: 3 replicas ✓
# worker-service: 50 replicas ← WHAT?!

kubectl describe deployment worker-service -n production
# Replicas: 50
# Last scaled: 2 weeks ago by John

I called John. "Did you scale worker-service to 50 replicas?"

"Oh yeah, we had a spike two weeks ago. I forgot to scale it back down."

The manifests in Git said 5 replicas. The cluster was running 50. Configuration drift for two weeks.

This wouldn't happen with proper GitOps. Let me explain why.

Declarative Infrastructure

Declarative = Describing the desired end state, not the steps to get there.

Imperative Approach (The Problem)

# Steps to create desired state (HOW)
kubectl create deployment api --image=api:v1.0
kubectl scale deployment api --replicas=3
kubectl expose deployment api --port=8080
kubectl set image deployment/api api=api:v1.1
kubectl annotate deployment api version=v1.1

# If you run these commands again = errors
# If someone else ran different commands = inconsistent state
# If you want to recreate = need to remember all steps

Problems:

Order matters
Can't re-run safely
No single source of truth
Hard to reproduce

Declarative Approach (The Solution)

# Desired end state (WHAT)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  annotations:
    version: v1.1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: api:v1.1
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
  ports:
  - port: 8080
    targetPort: 8080

Benefits:

Can apply multiple times safely (idempotent)
Order rarely matters (only prerequisites such as namespaces and CRDs must exist first)
Clear desired state
Easy to reproduce
Diff-able

# Apply once
kubectl apply -f api.yaml

# Apply again = no changes
kubectl apply -f api.yaml
# deployment.apps/api unchanged
# service/api unchanged

# Change replicas in YAML and apply
kubectl apply -f api.yaml
# deployment.apps/api configured ✓
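
To make the idempotency point concrete, here is a minimal Python sketch that contrasts the two styles. A plain dictionary stands in for the cluster, and `create` and `apply` are hypothetical helpers I made up for illustration; they are not how kubectl is implemented.

```python
# Toy model: a dict stands in for the cluster's object store.
# create() and apply() are hypothetical helpers, not real kubectl internals.

def create(cluster: dict, name: str, spec: dict) -> str:
    """Imperative style: fails if the object already exists."""
    if name in cluster:
        raise RuntimeError(f'deployment "{name}" already exists')
    cluster[name] = dict(spec)
    return "created"


def apply(cluster: dict, name: str, spec: dict) -> str:
    """Declarative style: converge on the given spec, whatever the current state."""
    if name not in cluster:
        cluster[name] = dict(spec)
        return "created"
    if cluster[name] == spec:
        return "unchanged"
    cluster[name] = dict(spec)
    return "configured"


cluster = {}
spec = {"image": "api:v1.1", "replicas": 3}

results = [
    apply(cluster, "api", spec),                     # first apply
    apply(cluster, "api", spec),                     # same spec again
    apply(cluster, "api", {**spec, "replicas": 5}),  # edited spec
]
print(results)

try:
    create(cluster, "api", spec)  # re-running an imperative create fails
except RuntimeError as err:
    print("create failed:", err)
```

The declarative helper never needs to know what happened before. It only looks at where things are now and where they should be.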

Desired State vs Actual State

This is the heart of GitOps.

Desired State = What SHOULD be running (defined in Git)

Actual State = What IS running (in the cluster)

Example: Desired vs Actual

Desired State (Git):

# manifests/api-deployment.yaml (commit abc123)
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: api
        image: api:v2.1.0
        env:
        - name: LOG_LEVEL
          value: "info"

Actual State (Cluster):

kubectl get deployment api -o yaml
# spec:
#   replicas: 5          ← Different!
#   template:
#     spec:
#       containers:
#       - name: api
#         image: api:v2.0.0  ← Different!
#         env:
#         - name: LOG_LEVEL
#           value: "debug"   ← Different!

State comparison:

Desired ≠ Actual = OUT OF SYNC
├─ replicas: 3 vs 5
├─ image: v2.1.0 vs v2.0.0
└─ LOG_LEVEL: info vs debug

GitOps agent action: Sync actual → desired
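
The comparison itself is easy to picture in code. Below is a rough Python sketch that walks two nested dictionaries and reports every field that differs. The `diff` function is my own illustration, not ArgoCD's diffing engine, and I flattened the manifest (for example `env` as a simple mapping) to keep it short.

```python
def diff(desired, actual, path=""):
    """Return (path, desired_value, actual_value) for every leaf that differs."""
    if isinstance(desired, dict) and isinstance(actual, dict):
        changes = []
        for key in sorted(set(desired) | set(actual)):
            changes += diff(desired.get(key), actual.get(key), f"{path}/{key}")
        return changes
    return [] if desired == actual else [(path, desired, actual)]


# Simplified versions of the Git and cluster states from the example above.
desired = {"replicas": 3, "image": "api:v2.1.0", "env": {"LOG_LEVEL": "info"}}
actual = {"replicas": 5, "image": "api:v2.0.0", "env": {"LOG_LEVEL": "debug"}}

changes = diff(desired, actual)
status = "OUT OF SYNC" if changes else "SYNCED"
print(status)
for path, want, have in changes:
    print(f"{path}: {want} vs {have}")
```

An empty result means the two states agree. Anything else is drift that the agent has to deal with.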

Reconciliation Loops

Reconciliation = Continuously ensuring actual state matches desired state.

This is how GitOps prevents configuration drift.

The Reconciliation Process

Key points:

Continuous loop (ArgoCD polls Git roughly every 3 minutes by default)
Always comparing desired (Git) vs actual (cluster)
Auto-corrects drift
Self-healing
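
A single pass of such a loop can be sketched in a few lines of Python. This is a simplified model under my own assumptions: both states are dictionaries keyed by resource name, and the `reconcile` function and its `prune` flag are hypothetical names, not ArgoCD code.

```python
def reconcile(desired: dict, actual: dict, prune: bool = True) -> list:
    """Make `actual` match `desired` in place and return the actions taken."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = dict(spec)
            actions.append(f"create {name}")
        elif actual[name] != spec:
            actual[name] = dict(spec)
            actions.append(f"update {name}")
    if prune:
        # Anything running that Git does not know about gets removed.
        for name in sorted(set(actual) - set(desired)):
            del actual[name]
            actions.append(f"prune {name}")
    return actions


# Desired state as it might be read from Git.
git = {
    "api": {"replicas": 3},
    "worker": {"replicas": 5},
}
# Live state: api was scaled by hand, worker is missing, debug-pod is not in Git.
cluster = {
    "api": {"replicas": 10},
    "debug-pod": {"replicas": 1},
}

first = reconcile(git, cluster)
second = reconcile(git, cluster)  # nothing left to do
print(first)
print(second)
```

Running the pass a second time does nothing, which is exactly the property that makes it safe to repeat forever.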

Real Example: Auto-Healing

Let's say someone manually scales a deployment:

# Developer manually scales
kubectl scale deployment api --replicas=10
# deployment.apps/api scaled

# Check current state
kubectl get deployment api
# NAME   READY   UP-TO-DATE   AVAILABLE
# api    10/10   10           10

What happens next:

Time 0:00 - Manual scale to 10 replicas
Time 0:00 - Actual state: 10 replicas
Time 0:00 - Desired state (Git): 3 replicas
Time 0:00 - Status: OUT OF SYNC

Time 3:00 - ArgoCD reconciliation loop runs
Time 3:00 - ArgoCD: Desired = 3, Actual = 10
Time 3:00 - ArgoCD: Syncing to desired state...
Time 3:01 - ArgoCD scales deployment to 3 replicas
Time 3:01 - Status: SYNCED ✓

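The manual change was automatically reverted. This is GitOps preventing drift. Because ArgoCD also watches live cluster resources, self-heal often reacts sooner than the full polling interval shown here.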

Reconciliation Configuration

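# ArgoCD Application with automated sync
# (repoURL, path and namespaces are example values)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/my-app-manifests.git  # placeholder
    targetRevision: main
    path: manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # Delete resources not in Git
      selfHeal: true   # Auto-correct drift
    syncOptions:
    - CreateNamespace=true
# The polling interval is not set per Application.
# It is configured globally via timeout.reconciliation in the
# argocd-cm ConfigMap (default: about 3 minutes).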

Git Workflows for GitOps

Git is the source of truth, so how you use Git matters.

Workflow 1: Branch Per Environment

my-app-manifests/
├── .git/
└── manifests/
    ├── deployment.yaml
    ├── service.yaml
    └── configmap.yaml

Branches:
├── dev         → Dev cluster
├── staging     → Staging cluster
└── main        → Production cluster

Flow:

Develop on dev branch
Merge dev → staging (PR)
Merge staging → main (PR)
Each branch triggers deployment to its environment

Pros:

Simple
Clear environment separation
Easy to promote changes

Cons:

Branch management overhead
Merge conflicts
Hard to compare environments

Workflow 2: Directory Per Environment

my-app-manifests/ (main branch)
├── environments/
│   ├── dev/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── configmap.yaml
│   ├── staging/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── configmap.yaml
│   └── production/
│       ├── deployment.yaml
│       ├── service.yaml
│       └── configmap.yaml

Flow:

All environments on main branch
Different directories for different environments
Change dev → test → merge to main → auto-deploy to dev
Manually sync staging/production via ArgoCD UI

Pros:

Single branch
Easy to compare environments
Clear structure

Cons:

All environments mixed
Need careful ArgoCD app configuration

Workflow 3: Kustomize Overlays (Best Practice)

my-app-manifests/ (main branch)
├── base/
│   ├── deployment.yaml      # Shared base
│   ├── service.yaml
│   └── kustomization.yaml
└── overlays/
    ├── dev/
    │   ├── kustomization.yaml    # Dev customizations
    │   └── replicas-patch.yaml
    ├── staging/
    │   ├── kustomization.yaml    # Staging customizations
    │   └── replicas-patch.yaml
    └── production/
        ├── kustomization.yaml    # Production customizations
        └── replicas-patch.yaml

base/deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 1  # Default
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: api:latest

overlays/production/replicas-patch.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 5  # Override for production
  template:
    spec:
      containers:
      - name: api
        image: api:v2.1.0  # Specific version
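
If the overlay idea feels abstract, this Python sketch shows roughly what happens when a patch is laid over the base. The `merge` function is a deliberately simplified stand-in that I wrote for illustration: dictionaries merge key by key and list items are matched on their `name` field. Real Kustomize has richer merge rules than this.

```python
import copy


def merge(base, patch):
    """Simplified strategic merge: dicts merge by key, named list items by name."""
    if isinstance(base, dict) and isinstance(patch, dict):
        merged = copy.deepcopy(base)
        for key, value in patch.items():
            merged[key] = merge(base[key], value) if key in base else copy.deepcopy(value)
        return merged
    if _is_named_list(base) and _is_named_list(patch):
        by_name = {item["name"]: copy.deepcopy(item) for item in base}
        for item in patch:
            by_name[item["name"]] = merge(by_name.get(item["name"], {}), item)
        return list(by_name.values())
    return copy.deepcopy(patch)  # scalars and other lists: the patch wins


def _is_named_list(value):
    return isinstance(value, list) and all(
        isinstance(item, dict) and "name" in item for item in value
    )


# Trimmed-down versions of base/deployment.yaml and the production patch.
base = {
    "metadata": {"name": "api"},
    "spec": {
        "replicas": 1,
        "template": {"spec": {"containers": [{"name": "api", "image": "api:latest"}]}},
    },
}
production_patch = {
    "metadata": {"name": "api"},
    "spec": {
        "replicas": 5,
        "template": {"spec": {"containers": [{"name": "api", "image": "api:v2.1.0"}]}},
    },
}

production = merge(base, production_patch)
container = production["spec"]["template"]["spec"]["containers"][0]
print(production["spec"]["replicas"], container["image"])
```

The base is never modified, so every environment starts from the same definition and only states what it changes.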

Pros:

DRY (Don't Repeat Yourself)
Shared base, environment-specific overrides
Easy to manage differences
Built into kubectl (kubectl apply -k)

Cons:

Learning curve
More complex structure

Workflow 4: App Repo + Config Repo (Separation)

Repo 1: my-app (application code)
├── src/
├── Dockerfile
└── .github/workflows/
    └── build.yml

Repo 2: my-app-config (manifests)
└── manifests/
    ├── deployment.yaml
    └── service.yaml

Flow:

Developer pushes code to my-app
GitHub Actions builds Docker image: my-app:abc123
GitHub Actions commits to my-app-config: updates image tag
ArgoCD syncs my-app-config → cluster
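
The "updates image tag" step is usually a tiny script in the CI job. Here is one possible sketch in Python, assuming the manifest holds a line of the form `image: my-app:<tag>`. The `set_image_tag` helper is hypothetical, and many teams use a dedicated tool such as `kustomize edit set image` instead of a regex.

```python
import re


def set_image_tag(manifest: str, image: str, tag: str) -> str:
    """Replace the tag on every `image: <image>:<old-tag>` line in a manifest."""
    pattern = re.compile(
        rf"^([ \t]*(?:-[ \t]+)?image:[ \t]*{re.escape(image)}):\S+$",
        re.MULTILINE,
    )
    # A function replacement avoids backreference surprises with odd tag values.
    return pattern.sub(lambda m: f"{m.group(1)}:{tag}", manifest)


manifest = """\
spec:
  template:
    spec:
      containers:
      - name: my-app
        image: my-app:v1.0.0
"""

updated = set_image_tag(manifest, "my-app", "abc123")
print(updated)
```

CI would then commit and push the changed file to my-app-config, and that commit is what ArgoCD picks up.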

Pros:

Separation of concerns
Application developers don't touch manifests
Platform team manages manifests
Clean audit trail

Cons:

Two repos to manage
CI needs write access to config repo
More complex setup

Pull vs Push Deployments

This is a critical difference between traditional CD and GitOps.

Push-Based Deployment (Traditional CI/CD)

Flow:

Developer pushes code
CI server builds image
CI server pushes deployment to cluster
CI server has cluster credentials

Problems:

CI server needs cluster access
Cluster credentials stored in CI
Security risk
Hard to audit
No drift detection
Push model = CI controls when

Pull-Based Deployment (GitOps)

Flow:

Developer pushes code
CI server builds image
CI updates image tag in Git
ArgoCD pulls changes from Git
ArgoCD applies to cluster

Benefits:

No cluster credentials in CI
ArgoCD lives in cluster
Pull model = cluster controls when
Continuous drift detection
Better security
Self-healing

Security Comparison

Push model:

# GitHub Actions needs these secrets
KUBE_CONFIG: |
  apiVersion: v1
  kind: Config
  clusters:
  - cluster:
      server: https://prod-cluster.example.com
  users:
  - name: github-actions
    user:
      token: super-secret-token

# If GitHub Actions is compromised = cluster is compromised
# If secret leaks = cluster access leaked

Pull model:

# ArgoCD running in cluster
# No cluster credentials stored outside the cluster
# CI only updates Git
# Git access ≠ cluster access

# If CI is compromised:
# - Attacker can update Git
# - Cannot directly access cluster
# - Changes are visible in Git (audit trail)
# - Can be reverted

Drift Detection and Remediation

Configuration drift is when actual state diverges from desired state.

How Drift Happens

# Scenario 1: Manual kubectl edit
kubectl edit deployment api
# Change replicas 3 → 10
# Drift created ✗

# Scenario 2: Helm upgrade with --set
helm upgrade api ./api-chart --set replicas=10
# Drift created ✗

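# Scenario 3: Direct API call ($TOKEN is a placeholder for a valid bearer token)
curl -X PATCH https://k8s-api/apis/apps/v1/namespaces/default/deployments/api \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/merge-patch+json" \
  -d '{"spec":{"replicas":10}}'
# Drift created ✗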

# Scenario 4: Auto-scaler
# HPA scales deployment 3 → 10 based on load
# This is OKAY (managed drift)

ArgoCD Drift Detection

# ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  syncPolicy:
    automated:
      prune: true       # Delete resources not in Git
      selfHeal: true    # Auto-correct drift
      allowEmpty: false
  # What to do about drift?
  # selfHeal: true  → Auto-correct (recommended)
  # selfHeal: false → Detect but don't fix (manual approval)

With selfHeal: true:

Manual change → ArgoCD detects → Auto-corrects → Back to desired state
Time: typically within ~3 minutes, often much sooner

With selfHeal: false:

Manual change → ArgoCD detects → Status: OUT OF SYNC → Wait for manual sync
Time: Forever (until you click "Sync")

Handling Legitimate Drift

Some drift is okay:

# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Tell ArgoCD to ignore HPA-managed replicas:

# ArgoCD Application
spec:
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas  # Ignore replica drift (HPA manages this)
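
Conceptually, an ignore rule removes a field from both sides before they are compared. The Python sketch below illustrates that idea with two hypothetical helpers, `drop_pointer` and `in_sync`. It only handles simple object paths (no array indexes or escaped characters), so treat it as a model of the behavior rather than ArgoCD's implementation.

```python
import copy


def drop_pointer(doc: dict, pointer: str) -> dict:
    """Return a copy of `doc` without the field addressed by a JSON pointer."""
    result = copy.deepcopy(doc)
    *parents, leaf = pointer.lstrip("/").split("/")
    node = result
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return result  # path does not exist, nothing to drop
    node.pop(leaf, None)
    return result


def in_sync(desired: dict, live: dict, ignore: list) -> bool:
    """Compare two manifests after removing every ignored field from both."""
    for pointer in ignore:
        desired = drop_pointer(desired, pointer)
        live = drop_pointer(live, pointer)
    return desired == live


desired = {"spec": {"replicas": 3, "template": {"image": "api:v2.1.0"}}}
scaled_by_hpa = {"spec": {"replicas": 7, "template": {"image": "api:v2.1.0"}}}
wrong_image = {"spec": {"replicas": 7, "template": {"image": "api:v2.0.0"}}}

print(in_sync(desired, scaled_by_hpa, []))                  # replicas count as drift
print(in_sync(desired, scaled_by_hpa, ["/spec/replicas"]))  # replicas ignored
print(in_sync(desired, wrong_image, ["/spec/replicas"]))    # image drift still caught
```

Only the listed field is exempt. Every other difference still marks the application as out of sync.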

The GitOps Control Loop

Putting it all together:

The loop:

Developer commits to Git
ArgoCD polls Git (every 3 min)
ArgoCD compares desired (Git) vs actual (cluster)
If different: sync cluster to match Git
Update status: Synced/Healthy
Repeat forever

This continuous loop ensures:

Whatever is in Git is what gets deployed
Manual changes are reverted
Cluster matches Git
Zero drift tolerance
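
To see the loop play out over time, here is a small Python simulation. Everything in it is an assumption made for illustration: each tick represents one polling cycle, `commits` maps a tick to the state committed to Git at that moment, and `manual_changes` maps a tick to an out-of-band change made directly on the cluster.

```python
def control_loop(commits: dict, manual_changes: dict, ticks: int):
    """Simulate polling cycles and return the final cluster state plus a log."""
    desired, cluster, log = {}, {}, []
    for tick in range(ticks):
        if tick in commits:            # a developer committed to Git
            desired = commits[tick]
        if tick in manual_changes:     # someone changed the cluster by hand
            cluster = {**cluster, **manual_changes[tick]}
        # Poll Git and compare desired with actual.
        if cluster == desired:
            log.append((tick, "Synced"))
            continue
        # They differ: sync the cluster to match Git.
        cluster = dict(desired)
        log.append((tick, "OutOfSync -> Synced"))
    return cluster, log


commits = {0: {"replicas": 3}, 2: {"replicas": 4}}
manual_changes = {1: {"replicas": 10}}

final_state, log = control_loop(commits, manual_changes, ticks=4)
for tick, status in log:
    print(tick, status)
print(final_state)
```

A new commit and a manual edit are handled by the same code path. The loop does not care why the two states differ, only that they do.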

Key Takeaways

Declarative infrastructure
- Describe WHAT, not HOW
- Idempotent operations
- YAML manifests = desired state
Desired state (Git) vs Actual state (cluster)
- Git = source of truth
- Cluster = current reality
- Goal: Keep them in sync
Reconciliation loops
- Continuous comparison
- Auto-healing
- Prevents drift
- Self-correcting system
Git workflows
- Branch per environment
- Directory per environment
- Kustomize overlays (recommended)
- Separate app + config repos
Pull vs Push
- Pull-based = more secure
- Cluster credentials stay in cluster
- ArgoCD pulls from Git
- CI only builds and updates Git
Drift detection
- Manual changes detected
- Auto-corrected (if selfHeal enabled)
- Some drift is okay (HPA, VPA)
- Configure ignoreDifferences

In the next article, we'll dive deep into ArgoCD architecture: how it's built, what components it has, and how they work together to make GitOps magic happen.

Previous: Introduction to GitOps Next: ArgoCD Architecture and Components


Last updated 1 month ago
