Introduction to Kubernetes Operators

Introduction

I managed a handful of microservices deployed to Kubernetes. The deployment model was consistent enough for a Helm chart, but the operational logic wasn't. Rolling back on health degradation, creating a Service only when a feature flag was set, adjusting HPA thresholds per environment — these decisions couldn't live in YAML. They lived in documentation, runbooks, and Slack DMs.

The answer most people reach for is: add more Helm templating, or wrap everything in a shell script that runs in CI. That works until it doesn't. The Operator pattern exists precisely for this problem: operational knowledge codified as a Kubernetes controller.

This article explains what operators are, how the controller pattern works, and when building one is actually worth it.


The Problem Operators Solve

Kubernetes is excellent at managing stateless resources. A Deployment with a defined replica count reaches steady state reliably. But the real world is full of resources that have a lifecycle, dependencies, and operational logic:

  • "After the database migration job succeeds, start the API server"

  • "If pod restart count exceeds N, alert and scale down"

  • "When a new tenant is created, provision a namespace, RBAC, and quota"

  • "Keep all three components (Deployment, Service, HPA) in sync as a unit"

None of these are expressible in native Kubernetes YAML. You can bolt them on with Helm hooks, GitOps pipelines, or CronJobs — but you're pulling operational logic out of the cluster and into external systems that don't have full visibility into cluster state.

Operators move operational logic back into the cluster where it belongs.


What Is a Kubernetes Operator?

A Kubernetes Operator is:

A custom controller that manages a custom resource (CRD), encoding domain-specific operational knowledge into a control loop.

Three things make up every operator:

  • CRD (CustomResourceDefinition): extends the Kubernetes API with a new resource type

  • Custom Resource (CR): an instance of the CRD — the desired state you declare

  • Controller: a process, typically written in Go, that watches CRs and reconciles current state toward desired state

The controller is the heart of it. It runs inside the cluster, watches resources, and continuously drives current state toward the declared desired state. This is the same mechanism Kubernetes itself uses for Deployments, ReplicaSets, and Services.


The Controller Pattern

Every Kubernetes reconciler operates on the same fundamental loop:
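A minimal, runnable toy model of that loop (real controllers use controller-runtime's workqueue and cached client; here a plain struct stands in for cluster state):

```go
package main

import "fmt"

// state is a toy stand-in for "declared spec" vs "observed cluster state".
type state struct {
	desiredReplicas int
	currentReplicas int
}

// reconcile re-reads the full state on every call and moves current
// state one step toward desired state. It is idempotent: once the two
// match, further calls are no-ops.
func reconcile(s *state) (requeue bool) {
	switch {
	case s.currentReplicas < s.desiredReplicas:
		s.currentReplicas++ // "create" one replica
		return true         // not converged yet: requeue
	case s.currentReplicas > s.desiredReplicas:
		s.currentReplicas-- // "delete" one replica
		return true
	default:
		return false // steady state: nothing to do
	}
}

func main() {
	s := &state{desiredReplicas: 3}
	for reconcile(s) {
		// each iteration is one reconciliation pass
	}
	fmt.Println("converged at", s.currentReplicas, "replicas")
}
```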

The key properties of this model:

Idempotent: Reconcile() can be called many times with the same input and it must produce the same result. You never assume the resource doesn't exist; you always check and create-or-update.

Level-triggered, not edge-triggered: The reconciler doesn't receive "what changed". It receives "reconcile this object". It must re-read the full state each time. This makes it resilient to missed events.

Eventual consistency: The controller doesn't need to achieve desired state in a single reconciliation. It can return a Result{Requeue: true} to schedule another pass.
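In controller-runtime terms, that looks like returning a non-empty result from Reconcile (a fragment; the `converged` check is illustrative):

```go
// Not converged yet: ask controller-runtime to run Reconcile again later.
if !converged {
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
// Desired state reached; nothing to do until the next watch event.
return ctrl.Result{}, nil
```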

The Reconcile Function Signature

In Go, using controller-runtime:
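The skeleton looks roughly like this — the signature is controller-runtime's; the AppStackReconciler type and embedded fields follow the usual kubebuilder scaffold, and the numbered comments are illustrative:

```go
// AppStackReconciler reconciles AppStack objects.
type AppStackReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

func (r *AppStackReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the current object by req.NamespacedName.
	// 2. Compare the declared spec against observed cluster state.
	// 3. Create or update owned resources; write status.
	return ctrl.Result{}, nil
}
```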

ctrl.Request carries only the namespaced name of the triggering object β€” not the object itself, and not what event occurred. The reconciler fetches the current state itself at the start of every call.


Operators vs Alternatives

Before building an operator, it's worth understanding what you're replacing:

Helm Charts

Helm manages templated YAML. It excels at packaging and versioning static resource sets. It falls short when:

  • Resources have inter-dependencies that need runtime decisions

  • You need to watch the cluster state and react to it

  • You want rollback logic based on pod health, not just manifest diffs

Operators and Helm are complementary. The operator manages the lifecycle; Helm can be used to install the operator itself.

GitOps (ArgoCD / Flux)

GitOps syncs a Git repo to a cluster. It's a deployment mechanism, not a resource manager. GitOps doesn't know about your application's health or lifecycle — it just applies YAML.

Operators live inside the cluster and can react to runtime events. GitOps manages what gets applied at the cluster level.

Shell Scripts / CronJobs

Scripts work. They're fine for simple one-way actions. They fall short as operational models because:

  • No retry semantics built in

  • No awareness of cluster state — they're external

  • Difficult to test in isolation

  • Don't participate in the Kubernetes RBAC and audit model

Operators

Operators are the right tool when:

  • You need stateful lifecycle management (create → update → teardown with specific behavior at each phase)

  • You're encoding operational runbooks as code

  • You want Kubernetes-native status visibility (kubectl describe, kubectl get events)


When to Build an Operator

Build an operator when:

1. You're managing stateful lifecycle across multiple resources. You need to create 3+ resources together, and they have ordering or dependency constraints.

2. You encode operational knowledge that isn't in YAML. "Scale down if error rate exceeds 5% for 10 minutes" — this is logic, not configuration.

3. You want Kubernetes-native observability. Status conditions accessible via kubectl describe, events in kubectl get events.

4. You're building a platform abstraction. Platform teams build operators to expose simplified primitives to application teams — AppDeployment, DatabaseInstance, TenantEnvironment.

5. Day-2 operations require automation. Certificate rotation, database backup scheduling, secret rotation.


When NOT to Build an Operator

Operators carry real complexity cost. Don't build one when:

Simple YAML suffices. If Helm or Kustomize covers your deployment model cleanly, don't add an operator.

You don't own the operational knowledge. Operators encode decisions. If you'd be encoding someone else's undocumented runbook, the operator becomes a liability.

The team doesn't know Go. Controllers are Go programs. If your team doesn't have Go familiarity, the maintenance burden is high.

You need it deployed in 2 days. An operator is a software project. Plan accordingly.

A useful heuristic: if you'd write a CronJob or a post-deploy script to handle the operational logic, consider whether that logic belongs in a controller instead.


The Operator Framework Ecosystem

Three Go projects are fundamental to building Kubernetes operators:

controller-runtime

sigs.k8s.io/controller-runtime is the Go library most operators are built on. It provides:

  • Manager: starts controllers, handles leader election, serves health/metrics endpoints

  • Client: typed Kubernetes API client with caching

  • Reconciler interface: implement Reconcile() to define your logic

  • Status subresource handling

  • envtest: in-memory Kubernetes API server for testing
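Wiring these together in main.go looks roughly like this (error handling trimmed; AppStackReconciler and SetupWithManager follow kubebuilder's scaffold conventions and are illustrative here):

```go
func main() {
	// The Manager owns the shared cache, client, and lifecycle.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	// Register the reconciler with the manager.
	if err := (&AppStackReconciler{Client: mgr.GetClient()}).SetupWithManager(mgr); err != nil {
		panic(err)
	}
	// Blocks until the process receives SIGTERM/SIGINT.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```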

kubebuilder

kubebuilder is the CLI scaffolding tool from the Kubernetes SIGs team. It:

  • Initializes the project with go.mod, main.go, Makefile

  • Scaffolds API types (api/v1alpha1/appstack_types.go)

  • Scaffolds controllers (internal/controller/appstack_controller.go)

  • Generates CRDs from Go struct markers: make manifests

  • Generates deep copy methods: make generate
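A typical scaffolding session looks like this (the domain and repo path are placeholders):

```shell
kubebuilder init --domain example.com --repo github.com/example/appstack-operator
kubebuilder create api --group apps --version v1alpha1 --kind AppStack
make manifests   # regenerate CRD YAML from struct markers
make generate    # regenerate DeepCopy methods
```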

Operator SDK (optional)

Operator SDK wraps kubebuilder with additional tooling for OLM (Operator Lifecycle Manager) packaging. If you're publishing to OperatorHub or need OLM integration, it adds value. For internal operators, kubebuilder is sufficient.


The appstack-operator Project

Throughout this series, I build appstack-operator: an operator that introduces the AppStack CRD.

The problem it solves: every microservice I deploy needs a Deployment, a Service, and often an HPA. Keeping these three resources consistent — using the same image, the same port, the same labels — requires either copy-paste YAML or Helm templating that becomes opaque. I wanted to write:
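Something in the spirit of the following — the field names are illustrative; the actual schema is designed later in the series:

```yaml
apiVersion: apps.example.com/v1alpha1
kind: AppStack
metadata:
  name: orders-api
spec:
  image: ghcr.io/example/orders-api:1.4.2
  port: 8080
  replicas: 2
  autoscaling:
    minReplicas: 2
    maxReplicas: 10
```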

And have the cluster manage the rest. That's what the operator does.

This is intentionally simple. Learning the controller pattern on a complex domain is harder. AppStack is simple enough to fit in your head while still covering every important operator concept: CRD design, reconcile logic, status conditions, owner references, testing, and deployment.


What I Learned Building My First Operator

A few things I had wrong before I wrote my first operator:

1. "The Reconcile function is called on every change" Not quite. It's called with the name of an object. The function must fetch the object's current state. It's also called on requeue, not just on changes.

2. "I can trust that the cluster state matches what I set last time" No — resources can be deleted, labels changed, or the operator can restart mid-reconcile. Always re-fetch and compare. Never assume.

3. "I only need to watch my CRD" You also need to watch owned resources. If someone deletes the Service your operator created, you want the controller to recreate it. This requires setting owner references and configuring watches on owned types.

4. "Status update is just another patch" Status has its own subresource. Updates to .spec and .status require separate API calls. Getting this wrong leads to silent failures or status that never updates.

5. "Testing an operator requires a full cluster" envtest spins up a real API server and etcd binary locally. Your controller integration tests run against real API machinery, no cluster needed.
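Points 3 and 4 in controller-runtime terms (type and package names here are illustrative):

```go
// Watch the CRD *and* the resources it owns, so deleting an owned
// Service triggers a reconcile that recreates it. Owns() requires that
// created resources carry an owner reference back to the AppStack.
func (r *AppStackReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1alpha1.AppStack{}).
		Owns(&appsv1.Deployment{}).
		Owns(&corev1.Service{}).
		Complete(r)
}

// Status lives in its own subresource:
//   r.Update(ctx, obj)           // writes .spec and metadata
//   r.Status().Update(ctx, obj)  // writes .status
```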

These are covered in detail in the articles that follow.


Next: Project Setup with kubebuilder →
