Part 12: GitOps at Scale — ArgoCD Orchestrating the Full Platform

Part of the SRE Playbook series

What You'll Learn: This article shows the complete GitOps repository structure that manages the entire GoReliable platform — microservices, observability, ML infrastructure, LLM serving, governance jobs, and secrets. It covers ArgoCD ApplicationSets at scale, multi-cluster patterns using a management cluster, Argo Rollouts for progressive delivery across all services, and the operational discipline that keeps GitOps manageable when the platform grows beyond a handful of apps.

The GitOps Repo Structure

After building out through Part 11, the go-reliable-gitops repository has grown into a structured hierarchy that manages everything in the platform:

go-reliable-gitops/
├── argocd/
│   ├── apps/                     # Individual ArgoCD Application manifests
│   │   ├── infrastructure/
│   │   │   ├── cert-manager.yaml
│   │   │   ├── external-secrets.yaml
│   │   │   ├── ingress-nginx.yaml
│   │   │   ├── kube-prometheus-stack.yaml
│   │   │   ├── loki.yaml
│   │   │   ├── tempo.yaml
│   │   │   └── otel-collector.yaml
│   │   ├── kubeflow/
│   │   │   ├── cert-manager.yaml
│   │   │   ├── istio.yaml
│   │   │   ├── kubeflow-pipelines.yaml
│   │   │   ├── katib.yaml
│   │   │   └── kserve.yaml
│   │   └── mlops/
│   │       ├── mlflow.yaml
│   │       └── prometheus-pushgateway.yaml
│   ├── appsets/                  # ApplicationSets (parameterized multi-app)
│   │   ├── microservices.yaml    # Covers all 4 services × 2 envs
│   │   ├── kubeflow.yaml         # KubeFlow wave-ordered deployment
│   │   └── governance-jobs.yaml  # Drift detector, model evaluator CronJobs
│   └── root-app.yaml             # The app-of-apps
├── environments/
│   ├── staging/
│   │   ├── api-gateway/values.yaml
│   │   ├── order-service/values.yaml
│   │   ├── notification-worker/values.yaml
│   │   └── ml-inference-gateway/values.yaml
│   └── production/
│       ├── api-gateway/values.yaml
│       ├── order-service/values.yaml
│       ├── notification-worker/values.yaml
│       └── ml-inference-gateway/values.yaml
├── infrastructure/
│   ├── cert-manager/values.yaml
│   ├── ingress-nginx/values.yaml
│   ├── kube-prometheus-stack/
│   │   ├── values.yaml
│   │   └── dashboards/           # Grafana dashboard JSON files as ConfigMaps
│   ├── loki/values.yaml
│   ├── tempo/values.yaml
│   ├── otel-collector/
│   │   ├── values.yaml
│   │   └── config.yaml
│   ├── mlflow/values.yaml
│   ├── vllm/
│   │   ├── deployment.yaml
│   │   ├── rollout.yaml
│   │   └── service.yaml
│   ├── kserve/
│   │   └── recommendation-model.yaml
│   └── drift-detector/
│       └── cronjob.yaml
├── runbooks/                     # Incident runbooks (referenced by alerts)
│   ├── high-error-rate.md
│   ├── order-service-down.md
│   ├── notification-backlog.md
│   └── model-drift.md
└── external-secrets/             # ExternalSecret CRDs (not values themselves)
    ├── production/
    │   ├── app-secrets.yaml
    │   └── mlflow-credentials.yaml
    └── staging/
        └── app-secrets.yaml

The repository contains no secret values — only ExternalSecret CRDs that reference AWS Secrets Manager paths. No scanning-based secret detection is needed, because no secrets are ever committed in the first place.

The Root App-of-Apps

The root app is the single ArgoCD application I manually created once. Everything else is managed by this app:
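
The root Application looks along these lines (a sketch reconstructed from the repo layout above — the repo URL, project, and sync policy details are assumptions, not taken verbatim from the series):

```yaml
# root-app.yaml — the app-of-apps; points at the argocd/apps/ directory.
# Repo URL is a placeholder.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/go-reliable-gitops.git  # assumed URL
    targetRevision: main
    path: argocd/apps
    directory:
      recurse: true          # descend into infrastructure/, kubeflow/, mlops/
      include: '*.yaml'      # glob: every manifest becomes a child Application
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true            # deleting a file deletes the child app
      selfHeal: true
```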

With recurse: true and the *.yaml glob, ArgoCD automatically picks up any new manifest added under argocd/apps/ — no UI work, no re-registration. New infrastructure apps are one file commit away.

Multi-Environment ApplicationSet

The microservices ApplicationSet from Part 3 manages 8 applications. This is the stable core:
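
A matrix generator over the four services and two environments produces those 8 applications. This is a sketch under assumed names — chart paths, repo URL, and namespaces are placeholders; the ignoreDifferences block is the piece discussed below:

```yaml
# appsets/microservices.yaml — 4 services × 2 envs; URLs and paths assumed
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: microservices
  namespace: argocd
spec:
  generators:
    - matrix:
        generators:
          - list:
              elements:
                - service: api-gateway
                - service: order-service
                - service: notification-worker
                - service: ml-inference-gateway
          - list:
              elements:
                - env: staging
                - env: production
  template:
    metadata:
      name: '{{service}}-{{env}}'
    spec:
      project: '{{env}}'
      source:
        repoURL: https://github.com/example/go-reliable-gitops.git  # assumed
        targetRevision: main
        path: 'charts/{{service}}'    # chart location is an assumption
        helm:
          valueFiles:
            # per-environment overrides, relative to the chart path
            - '../../environments/{{env}}/{{service}}/values.yaml'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{service}}-{{env}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
      ignoreDifferences:
        - group: apps
          kind: Deployment
          jsonPointers:
            - /spec/replicas   # let the HPA own the replica count
```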

The ignoreDifferences block for replicas is critical. Without it, ArgoCD would continuously try to revert the replica count that HPA has scaled up, causing a reconciliation fight.

Governance Jobs ApplicationSet

This ApplicationSet manages the scheduled jobs that run operational tasks:
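
A git directory generator keeps this hands-off: each directory under a jobs path becomes its own Application. The directory paths, repo URL, and namespace below are assumptions about the layout, not exact values from the series:

```yaml
# appsets/governance-jobs.yaml — one Application per job directory (sketch)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: governance-jobs
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/example/go-reliable-gitops.git  # assumed
        revision: main
        directories:
          # each matched directory (e.g. drift-detector) becomes an app
          - path: 'infrastructure/drift-detector'
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: production
      source:
        repoURL: https://github.com/example/go-reliable-gitops.git  # assumed
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: governance           # assumed namespace
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```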

Adding a new scheduled job is one YAML file commit. No ArgoCD UI changes needed.

Argo Rollouts Across All Services

I've progressively adopted Argo Rollouts for all four microservices. The canary strategy is the same for each, with different success conditions based on the service's SLIs:
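
The shared canary shape looks roughly like this (a sketch — the step weights, pause durations, and template name are illustrative assumptions; each service plugs in its own analysis arguments):

```yaml
# Rollout strategy fragment — shared canary shape, values assumed
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10            # 10% of traffic to the canary
        - pause: {duration: 2m}
        - analysis:                # gate on the shared AnalysisTemplate
            templates:
              - templateName: success-rate   # assumed template name
            args:
              - name: service-name
                value: order-service
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 100           # full cutover if analysis passed
```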

The analysis template is shared across all services via a library chart:
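
A minimal sketch of that shared AnalysisTemplate, parameterized by service name — the Prometheus address, metric names, and 99% threshold are assumptions standing in for each service's real SLI conditions:

```yaml
# success-rate AnalysisTemplate — query, address, and threshold assumed
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      successCondition: result[0] >= 0.99   # per-service SLI threshold
      failureLimit: 1                       # one bad sample aborts the rollout
      provider:
        prometheus:
          address: http://kube-prometheus-stack-prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```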

If a deploy fails the success rate check at any step, Argo Rollouts automatically aborts and routes 100% of traffic back to the stable version. No human intervention needed.

ArgoCD Project-Level Access Control

As the platform grows, I use ArgoCD Projects to enforce separation between environments:
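
The production AppProject pins allowed sources and destinations — a sketch with placeholder repo URLs and a namespace convention that is an assumption:

```yaml
# AppProject: production — repo URLs and namespace pattern assumed
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  description: Production workloads only
  sourceRepos:
    - https://github.com/example/go-reliable-gitops.git  # own repos only
    - https://charts.example.com/*                       # approved chart repos
  destinations:
    - server: https://kubernetes.default.svc
      namespace: 'production-*'      # apps may only deploy into these
  clusterResourceWhitelist: []       # no cluster-scoped resources from apps
```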

The source repo restriction means applications in this project can only be deployed from my own repos and approved Helm chart repositories. This prevents dependency confusion attacks, where a malicious package in a public registry is substituted for an internal one.

The One Discipline That Made GitOps Work

Looking back at Parts 1–11, the most important operational discipline wasn't technical:

Everything that changes in production goes through a pull request in the GitOps repo.

  • Image tag updates: CI creates a commit, ArgoCD deploys.

  • Config changes: PR, review, merge, ArgoCD deploys.

  • Model promotions: PR with accuracy comparison, review, merge, ArgoCD deploys.

  • Infrastructure version bumps: PR updating the Helm chart version in the values file, review, merge.

The PR is not bureaucracy — it's the audit trail. git log --oneline on the GitOps repo is the history of every change that has run on the platform.

The one exception I made was emergency rollbacks. When something is burning, I use argocd app rollback to revert immediately, then clean up the Git history afterward with a revert commit. Speed matters in incidents; audit accuracy matters after.

In Part 13, I put together the full reference architecture — network policies, RBAC, multi-layer reliability, disaster recovery testing, and cost observability — for the complete platform.