Part 12: GitOps at Scale — ArgoCD Orchestrating the Full Platform

Part of the SRE Playbook series

What You'll Learn: This article shows the complete GitOps repository structure that manages the entire GoReliable platform — microservices, observability, ML infrastructure, LLM serving, governance jobs, and secrets. It covers ArgoCD ApplicationSets at scale, multi-cluster patterns using a management cluster, Argo Rollouts for progressive delivery across all services, and the operational discipline that keeps GitOps manageable when the platform grows beyond a handful of apps.

The GitOps Repo Structure

After building out through Part 11, the go-reliable-gitops repository has grown into a structured hierarchy that manages everything in the platform:

go-reliable-gitops/
├── argocd/
│   ├── apps/                     # Individual ArgoCD Application manifests
│   │   ├── infrastructure/
│   │   │   ├── cert-manager.yaml
│   │   │   ├── external-secrets.yaml
│   │   │   ├── ingress-nginx.yaml
│   │   │   ├── kube-prometheus-stack.yaml
│   │   │   ├── loki.yaml
│   │   │   ├── tempo.yaml
│   │   │   └── otel-collector.yaml
│   │   ├── kubeflow/
│   │   │   ├── cert-manager.yaml
│   │   │   ├── istio.yaml
│   │   │   ├── kubeflow-pipelines.yaml
│   │   │   ├── katib.yaml
│   │   │   └── kserve.yaml
│   │   └── mlops/
│   │       ├── mlflow.yaml
│   │       └── prometheus-pushgateway.yaml
│   ├── appsets/                  # ApplicationSets (parameterized multi-app)
│   │   ├── microservices.yaml    # Covers all 4 services × 2 envs
│   │   ├── kubeflow.yaml         # KubeFlow wave-ordered deployment
│   │   └── governance-jobs.yaml  # Drift detector, model evaluator CronJobs
│   └── root-app.yaml             # The app-of-apps
├── environments/
│   ├── staging/
│   │   ├── api-gateway/values.yaml
│   │   ├── order-service/values.yaml
│   │   ├── notification-worker/values.yaml
│   │   └── ml-inference-gateway/values.yaml
│   └── production/
│       ├── api-gateway/values.yaml
│       ├── order-service/values.yaml
│       ├── notification-worker/values.yaml
│       └── ml-inference-gateway/values.yaml
├── infrastructure/
│   ├── cert-manager/values.yaml
│   ├── ingress-nginx/values.yaml
│   ├── kube-prometheus-stack/
│   │   ├── values.yaml
│   │   └── dashboards/           # Grafana dashboard JSON files as ConfigMaps
│   ├── loki/values.yaml
│   ├── tempo/values.yaml
│   ├── otel-collector/
│   │   ├── values.yaml
│   │   └── config.yaml
│   ├── mlflow/values.yaml
│   ├── vllm/
│   │   ├── deployment.yaml
│   │   ├── rollout.yaml
│   │   └── service.yaml
│   ├── kserve/
│   │   └── recommendation-model.yaml
│   └── drift-detector/
│       └── cronjob.yaml
├── runbooks/                     # Incident runbooks (referenced by alerts)
│   ├── high-error-rate.md
│   ├── order-service-down.md
│   ├── notification-backlog.md
│   └── model-drift.md
└── external-secrets/             # ExternalSecret CRDs (not values themselves)
    ├── production/
    │   ├── app-secrets.yaml
    │   └── mlflow-credentials.yaml
    └── staging/
        └── app-secrets.yaml

The repository contains no secret values — only ExternalSecret CRDs that reference AWS Secrets Manager paths. No scanning-based secret detection is needed, because no secrets are ever committed in the first place.

The Root App-of-Apps

The root app is the single ArgoCD application I manually created once. Everything else is managed by this app:
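
The root Application looks along these lines (a sketch reconstructed from the repo layout above — the repo URL, project, and sync policy details are assumptions, not taken verbatim from the series):

```yaml
# root-app.yaml — the app-of-apps; points at the argocd/apps/ directory.
# Repo URL is a placeholder.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/go-reliable-gitops.git  # assumed URL
    targetRevision: main
    path: argocd/apps
    directory:
      recurse: true          # descend into infrastructure/, kubeflow/, mlops/
      include: '*.yaml'      # glob: every manifest becomes a child Application
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true            # deleting a file deletes the child app
      selfHeal: true
```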

With recurse: true and the *.yaml glob, ArgoCD automatically picks up any new manifest added under argocd/apps/ — no UI work, no re-registration. New infrastructure apps are one file commit away.

Multi-Environment ApplicationSet

The microservices ApplicationSet from Part 3 manages 8 applications. This is the stable core:
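
A matrix generator over the four services and two environments produces those 8 applications. This is a sketch under assumed names — chart paths, repo URL, and namespaces are placeholders; the ignoreDifferences block is the piece discussed below:

```yaml
# appsets/microservices.yaml — 4 services × 2 envs; URLs and paths assumed
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: microservices
  namespace: argocd
spec:
  generators:
    - matrix:
        generators:
          - list:
              elements:
                - service: api-gateway
                - service: order-service
                - service: notification-worker
                - service: ml-inference-gateway
          - list:
              elements:
                - env: staging
                - env: production
  template:
    metadata:
      name: '{{service}}-{{env}}'
    spec:
      project: '{{env}}'
      source:
        repoURL: https://github.com/example/go-reliable-gitops.git  # assumed
        targetRevision: main
        path: 'charts/{{service}}'    # chart location is an assumption
        helm:
          valueFiles:
            # per-environment overrides, relative to the chart path
            - '../../environments/{{env}}/{{service}}/values.yaml'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{service}}-{{env}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
      ignoreDifferences:
        - group: apps
          kind: Deployment
          jsonPointers:
            - /spec/replicas   # let the HPA own the replica count
```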

The ignoreDifferences block for replicas is critical. Without it, ArgoCD would continuously try to revert the replica count that HPA has scaled up, causing a reconciliation fight.

Governance Jobs ApplicationSet

This ApplicationSet manages the scheduled jobs that run operational tasks:
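
A git directory generator keeps this hands-off: each directory under a jobs path becomes its own Application. The directory paths, repo URL, and namespace below are assumptions about the layout, not exact values from the series:

```yaml
# appsets/governance-jobs.yaml — one Application per job directory (sketch)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: governance-jobs
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/example/go-reliable-gitops.git  # assumed
        revision: main
        directories:
          # each matched directory (e.g. drift-detector) becomes an app
          - path: 'infrastructure/drift-detector'
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: production
      source:
        repoURL: https://github.com/example/go-reliable-gitops.git  # assumed
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: governance           # assumed namespace
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```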

Adding a new scheduled job is one YAML file commit. No ArgoCD UI changes needed.

Argo Rollouts Across All Services

I've progressively adopted Argo Rollouts for all four microservices. The canary strategy is the same for each, with different success conditions based on the service's SLIs:
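
The shared canary shape looks roughly like this (a sketch — the step weights, pause durations, and template name are illustrative assumptions; each service plugs in its own analysis arguments):

```yaml
# Rollout strategy fragment — shared canary shape, values assumed
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10            # 10% of traffic to the canary
        - pause: {duration: 2m}
        - analysis:                # gate on the shared AnalysisTemplate
            templates:
              - templateName: success-rate   # assumed template name
            args:
              - name: service-name
                value: order-service
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 100           # full cutover if analysis passed
```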

The analysis template is shared across all services via a library chart:
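
A minimal sketch of that shared AnalysisTemplate, parameterized by service name — the Prometheus address, metric names, and 99% threshold are assumptions standing in for each service's real SLI conditions:

```yaml
# success-rate AnalysisTemplate — query, address, and threshold assumed
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      successCondition: result[0] >= 0.99   # per-service SLI threshold
      failureLimit: 1                       # one bad sample aborts the rollout
      provider:
        prometheus:
          address: http://kube-prometheus-stack-prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```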

If a deploy fails the success rate check at any step, Argo Rollouts automatically aborts and routes 100% of traffic back to the stable version. No human intervention needed.

ArgoCD Project-Level Access Control

As the platform grows, I use ArgoCD Projects to enforce separation between environments:
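
The production AppProject pins allowed sources and destinations — a sketch with placeholder repo URLs and a namespace convention that is an assumption:

```yaml
# AppProject: production — repo URLs and namespace pattern assumed
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  description: Production workloads only
  sourceRepos:
    - https://github.com/example/go-reliable-gitops.git  # own repos only
    - https://charts.example.com/*                       # approved chart repos
  destinations:
    - server: https://kubernetes.default.svc
      namespace: 'production-*'      # apps may only deploy into these
  clusterResourceWhitelist: []       # no cluster-scoped resources from apps
```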

The source repo restriction means applications in this project can only be deployed from my own repos and approved Helm chart repositories. This prevents dependency confusion attacks, where a malicious package in a public registry is substituted for an internal one.

The One Discipline That Made GitOps Work

Looking back at Parts 1–11, the most important operational discipline wasn't technical:

Everything that changes in production goes through a pull request in the GitOps repo.

  • Image tag updates: CI creates a commit, ArgoCD deploys.

  • Config changes: PR, review, merge, ArgoCD deploys.

  • Model promotions: PR with accuracy comparison, review, merge, ArgoCD deploys.

  • Infrastructure version bumps: PR updating the Helm chart version in the values file, review, merge.

The PR is not bureaucracy — it's the audit trail. git log --oneline on the GitOps repo is the history of every change that has run on the platform.

The one exception I made was emergency rollbacks. When something is burning, I use argocd app rollback to revert immediately, then clean up the Git history afterward with a revert commit. Speed matters in incidents; audit accuracy matters after.

In Part 13, I put together the full reference architecture — network policies, RBAC, multi-layer reliability, disaster recovery testing, and cost observability — for the complete platform.