Introduction to Kubernetes Operators

Introduction

I managed a handful of microservices deployed to Kubernetes. The deployment model was consistent enough for a Helm chart, but the operational logic wasn't. Rolling back on health degradation, creating a Service only when a feature flag was set, adjusting HPA thresholds per environment — these decisions couldn't live in YAML. They lived in documentation, runbooks, and Slack DMs.

The answer most people reach for is: add more Helm templating, or wrap everything in a shell script that runs in CI. That works until it doesn't. The Operator pattern exists precisely for this problem: operational knowledge codified as a Kubernetes controller.

This article explains what operators are, how the controller pattern works, and when building one is actually worth it.


The Problem Operators Solve

Kubernetes is excellent at managing stateless resources. A Deployment with a defined replica count reaches steady state reliably. But the real world is full of resources that have a lifecycle, dependencies, and operational logic:

  • "After the database migration job succeeds, start the API server"

  • "If pod restart count exceeds N, alert and scale down"

  • "When a new tenant is created, provision a namespace, RBAC, and quota"

  • "Keep all three components (Deployment, Service, HPA) in sync as a unit"

None of these are expressible in native Kubernetes YAML. You can bolt them on with Helm hooks, GitOps pipelines, or CronJobs — but you're pulling operational logic out of the cluster and into external systems that don't have full visibility into cluster state.

Operators move operational logic back into the cluster where it belongs.


What Is a Kubernetes Operator?

A Kubernetes Operator is:

A custom controller that manages a custom resource (CRD), encoding domain-specific operational knowledge into a control loop.

Three things make up every operator:

  • CRD (CustomResourceDefinition): extends the Kubernetes API with a new resource type

  • Custom Resource (CR): an instance of the CRD — the desired state you declare

  • Controller: a process, typically written in Go, that watches CRs and reconciles current state toward desired state

The controller is the heart of it. It runs inside the cluster, watches resources, and continuously drives current state toward the declared desired state. This is the same mechanism Kubernetes itself uses for Deployments, ReplicaSets, and Services.


The Controller Pattern

Every Kubernetes reconciler operates on the same fundamental loop:
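A minimal, runnable toy model of that loop (real controllers use controller-runtime's workqueue and cached client; here a plain struct stands in for cluster state):

```go
package main

import "fmt"

// state is a toy stand-in for "declared spec" vs "observed cluster state".
type state struct {
	desiredReplicas int
	currentReplicas int
}

// reconcile re-reads the full state on every call and moves current
// state one step toward desired state. It is idempotent: once the two
// match, further calls are no-ops.
func reconcile(s *state) (requeue bool) {
	switch {
	case s.currentReplicas < s.desiredReplicas:
		s.currentReplicas++ // "create" one replica
		return true         // not converged yet: requeue
	case s.currentReplicas > s.desiredReplicas:
		s.currentReplicas-- // "delete" one replica
		return true
	default:
		return false // steady state: nothing to do
	}
}

func main() {
	s := &state{desiredReplicas: 3}
	for reconcile(s) {
		// each iteration is one reconciliation pass
	}
	fmt.Println("converged at", s.currentReplicas, "replicas")
}
```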

The key properties of this model:

Idempotent: Reconcile() can be called many times with the same input and it must produce the same result. You never assume the resource doesn't exist; you always check and create-or-update.

Level-triggered, not edge-triggered: The reconciler doesn't receive "what changed". It receives "reconcile this object". It must re-read the full state each time. This makes it resilient to missed events.

Eventual consistency: The controller doesn't need to achieve desired state in a single reconciliation. It can return a Result{Requeue: true} to schedule another pass.
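In controller-runtime terms, that looks like returning a non-empty result from Reconcile (a fragment; the `converged` check is illustrative):

```go
// Not converged yet: ask controller-runtime to run Reconcile again later.
if !converged {
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
// Desired state reached; nothing to do until the next watch event.
return ctrl.Result{}, nil
```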

The Reconcile Function Signature

In Go, using controller-runtime:
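The skeleton looks roughly like this — the signature is controller-runtime's; the AppStackReconciler type and embedded fields follow the usual kubebuilder scaffold, and the numbered comments are illustrative:

```go
// AppStackReconciler reconciles AppStack objects.
type AppStackReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

func (r *AppStackReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the current object by req.NamespacedName.
	// 2. Compare the declared spec against observed cluster state.
	// 3. Create or update owned resources; write status.
	return ctrl.Result{}, nil
}
```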

ctrl.Request carries only the namespaced name of the triggering object β€” not the object itself, and not what event occurred. The reconciler fetches the current state itself at the start of every call.


Operators vs Alternatives

Before building an operator, it's worth understanding what you're replacing:

Helm Charts

Helm manages templated YAML. It excels at packaging and versioning static resource sets. It falls short when:

  • Resources have inter-dependencies that need runtime decisions

  • You need to watch the cluster state and react to it

  • You want rollback logic based on pod health, not just manifest diffs

Operators and Helm are complementary. The operator manages the lifecycle; Helm can be used to install the operator itself.

GitOps (ArgoCD / Flux)

GitOps syncs a Git repo to a cluster. It's a deployment mechanism, not a resource manager. GitOps doesn't know about your application's health or lifecycle — it just applies YAML.

Operators live inside the cluster and can react to runtime events. GitOps manages what gets applied at the cluster level.

Shell Scripts / CronJobs

Scripts work. They're fine for simple one-way actions. They fall short as operational models because:

  • No retry semantics built in

  • No awareness of cluster state — they're external

  • Difficult to test in isolation

  • Don't participate in the Kubernetes RBAC and audit model

Operators

Operators are the right tool when:

  • You need stateful lifecycle management (create → update → teardown with specific behavior at each phase)

  • You're encoding operational runbooks as code

  • You want Kubernetes-native status visibility (kubectl describe, kubectl get events)


When to Build an Operator

Build an operator when:

1. You're managing stateful lifecycle across multiple resources. You need to create 3+ resources together, and they have ordering or dependency constraints.

2. You encode operational knowledge that isn't in YAML. "Scale down if error rate exceeds 5% for 10 minutes" — this is logic, not configuration.

3. You want Kubernetes-native observability. Status conditions accessible via kubectl describe, events in kubectl get events.

4. You're building a platform abstraction. Platform teams build operators to expose simplified primitives to application teams — AppDeployment, DatabaseInstance, TenantEnvironment.

5. Day-2 operations require automation. Certificate rotation, database backup scheduling, secret rotation.


When NOT to Build an Operator

Operators carry real complexity cost. Don't build one when:

Simple YAML suffices. If Helm or Kustomize covers your deployment model cleanly, don't add an operator.

You don't own the operational knowledge. Operators encode decisions. If you'd be encoding someone else's undocumented runbook, the operator becomes a liability.

The team doesn't know Go. Controllers are Go programs. If your team doesn't have Go familiarity, the maintenance burden is high.

You need it deployed in 2 days. An operator is a software project. Plan accordingly.

A useful heuristic: if you'd write a CronJob or a post-deploy script to handle the operational logic, consider whether that logic belongs in a controller instead.


The Operator Framework Ecosystem

Three Go projects are fundamental to building Kubernetes operators:

controller-runtime

sigs.k8s.io/controller-runtime is the Go library most operators are built on. It provides:

  • Manager: starts controllers, handles leader election, serves health/metrics endpoints

  • Client: typed Kubernetes API client with caching

  • Reconciler interface: implement Reconcile() to define your logic

  • Status subresource handling

  • envtest: in-memory Kubernetes API server for testing
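Wiring these together in main.go looks roughly like this (error handling trimmed; AppStackReconciler and SetupWithManager follow kubebuilder's scaffold conventions and are illustrative here):

```go
func main() {
	// The Manager owns the shared cache, client, and lifecycle.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	// Register the reconciler with the manager.
	if err := (&AppStackReconciler{Client: mgr.GetClient()}).SetupWithManager(mgr); err != nil {
		panic(err)
	}
	// Blocks until the process receives SIGTERM/SIGINT.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```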

kubebuilder

kubebuilder is the CLI scaffolding tool from the Kubernetes SIGs team. It:

  • Initializes the project with go.mod, main.go, Makefile

  • Scaffolds API types (api/v1alpha1/appstack_types.go)

  • Scaffolds controllers (internal/controller/appstack_controller.go)

  • Generates CRDs from Go struct markers: make manifests

  • Generates deep copy methods: make generate
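A typical scaffolding session looks like this (the domain and repo path are placeholders):

```shell
kubebuilder init --domain example.com --repo github.com/example/appstack-operator
kubebuilder create api --group apps --version v1alpha1 --kind AppStack
make manifests   # regenerate CRD YAML from struct markers
make generate    # regenerate DeepCopy methods
```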

Operator SDK (optional)

Operator SDK wraps kubebuilder with additional tooling for OLM (Operator Lifecycle Manager) packaging. If you're publishing to OperatorHub or need OLM integration, it adds value. For internal operators, kubebuilder is sufficient.


The appstack-operator Project

Throughout this series, I build appstack-operator: an operator that introduces the AppStack CRD.

The problem it solves: every microservice I deploy needs a Deployment, a Service, and often an HPA. Keeping these three resources consistent — using the same image, the same port, the same labels — requires either copy-paste YAML or Helm templating that becomes opaque. I wanted to write:
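Something in the spirit of the following — the field names are illustrative; the actual schema is designed later in the series:

```yaml
apiVersion: apps.example.com/v1alpha1
kind: AppStack
metadata:
  name: orders-api
spec:
  image: ghcr.io/example/orders-api:1.4.2
  port: 8080
  replicas: 2
  autoscaling:
    minReplicas: 2
    maxReplicas: 10
```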

And have the cluster manage the rest. That's what the operator does.

This is intentionally simple. Learning the controller pattern on a complex domain is harder. AppStack is simple enough to fit in your head while still covering every important operator concept: CRD design, reconcile logic, status conditions, owner references, testing, and deployment.


What I Learned Building My First Operator

A few things I had wrong before I wrote my first operator:

1. "The Reconcile function is called on every change" Not quite. It's called with the name of an object. The function must fetch the object's current state. It's also called on requeue, not just on changes.

2. "I can trust that the cluster state matches what I set last time" No — resources can be deleted, labels changed, or the operator can restart mid-reconcile. Always re-fetch and compare. Never assume.

3. "I only need to watch my CRD" You also need to watch owned resources. If someone deletes the Service your operator created, you want the controller to recreate it. This requires setting owner references and configuring watches on owned types.

4. "Status update is just another patch" Status has its own subresource. Updates to .spec and .status require separate API calls. Getting this wrong leads to silent failures or status that never updates.

5. "Testing an operator requires a full cluster" envtest spins up a real API server and etcd binary locally. Your controller integration tests run against real API machinery, no cluster needed.
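Points 3 and 4 in controller-runtime terms (type and package names here are illustrative):

```go
// Watch the CRD *and* the resources it owns, so deleting an owned
// Service triggers a reconcile that recreates it. Owns() requires that
// created resources carry an owner reference back to the AppStack.
func (r *AppStackReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1alpha1.AppStack{}).
		Owns(&appsv1.Deployment{}).
		Owns(&corev1.Service{}).
		Complete(r)
}

// Status lives in its own subresource:
//   r.Update(ctx, obj)           // writes .spec and metadata
//   r.Status().Update(ctx, obj)  // writes .status
```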

These are covered in detail in the articles that follow.


Next: Project Setup with kubebuilder →
