RBAC, Deployment, and Production Hardening

Table of Contents

  • Introduction

  • RBAC Markers and Generated Roles

  • Deploying to the Cluster

  • The Operator Container Image

  • Leader Election for High Availability

  • Resource Limits and Security Context

  • Health Probes

  • Webhook Validation (Optional)

  • Production Checklist

Introduction

Getting an operator reconciling correctly in a kind cluster is the halfway point. Running it in production means: the right RBAC permissions, a minimal container image, leader election to avoid split-brain in multi-replica deployments, and proper hardening.

This article covers deploying appstack-operator to a real cluster and the changes that make it safe to run in production.


RBAC Markers and Generated Roles

The // +kubebuilder:rbac: markers in your controller file are how the operator declares the permissions it needs. make manifests reads them and generates config/rbac/role.yaml.

All RBAC markers in appstack_controller.go:
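A representative set, assuming the operator's API group is apps.example.com and the controller manages Deployments and Services on behalf of the CR (adjust groups, resources, and verbs to your project):

```go
// Permissions for the AppStack CRD itself, its status, and its finalizers.
// +kubebuilder:rbac:groups=apps.example.com,resources=appstacks,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=apps.example.com,resources=appstacks/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=apps.example.com,resources=appstacks/finalizers,verbs=update

// Permissions for the child resources the controller creates and owns.
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=services,verbs=get;list;watch;create;update;patch;delete
```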

After make manifests, config/rbac/role.yaml contains:
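An abbreviated sketch of the generated file (exact rules depend on your markers; controller-gen alphabetizes the verbs):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: manager-role
rules:
- apiGroups: ["apps.example.com"]
  resources: ["appstacks"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
- apiGroups: ["apps.example.com"]
  resources: ["appstacks/status"]
  verbs: ["get", "patch", "update"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
```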

ClusterRole vs Role: The scaffold generates a ClusterRole by default because controllers watch resources across all namespaces. If your operator is namespace-scoped (it only watches resources in one namespace), you can restrict it to a Role, but most operators use a ClusterRole bound with a ClusterRoleBinding.

Principle of Least Privilege

Only request the verbs you actually use:

  • If the controller never deletes a resource directly (relying on owner-reference GC instead), remove delete from that resource

  • Never request * (all verbs); wildcard grants make auditing impossible

  • Avoid secrets access unless absolutely necessary; prefer ConfigMap for non-sensitive config
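For example, if Deployment cleanup is left to owner-reference garbage collection, the marker can drop the delete verb (an illustrative narrowing, not the scaffold default):

```go
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch
```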


Deploying to the Cluster

Build the Container Image

The generated Dockerfile is production-ready:
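Abbreviated from the kubebuilder scaffold; the Go version and the list of copied directories will differ per project:

```dockerfile
# Build the manager binary
FROM golang:1.22 AS builder
WORKDIR /workspace
COPY go.mod go.sum ./
RUN go mod download
COPY cmd/ cmd/
COPY api/ api/
COPY internal/ internal/
RUN CGO_ENABLED=0 GOOS=linux go build -a -o manager cmd/main.go

# Minimal runtime image: no shell, no package manager
FROM gcr.io/distroless/static:nonroot
WORKDIR /
COPY --from=builder /workspace/manager .
USER 65532:65532
ENTRYPOINT ["/manager"]
```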

The distroless base image matters for security:

  • No shell or package manager, so there is nothing for an attacker to exec into and far less attack surface

  • USER 65532:65532 (nonroot): the controller process runs as a non-root user

Build and push:
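Using the scaffold's Make targets (registry and tag are placeholders):

```shell
make docker-build docker-push IMG=registry.example.com/appstack-operator:v0.1.0
```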

For multi-arch builds (ARM64 + AMD64):
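The scaffold's docker-buildx target drives docker buildx (the PLATFORMS default varies by kubebuilder version):

```shell
make docker-buildx IMG=registry.example.com/appstack-operator:v0.1.0 PLATFORMS=linux/amd64,linux/arm64
```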

Deploy with make deploy

This runs kustomize build config/default | kubectl apply -f -, which applies:

  • CRDs

  • Namespace (appstack-system)

  • ServiceAccount

  • ClusterRole + ClusterRoleBinding

  • Manager Deployment

Verify:
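Something like the following, assuming the scaffold's default namespace:

```shell
kubectl -n appstack-system get deploy,pods
kubectl get crd | grep appstacks
```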

Check the Controller Logs

Apply a test CR:
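A minimal CR; the group/version and every spec field here are illustrative, so match them to your API:

```yaml
apiVersion: apps.example.com/v1alpha1
kind: AppStack
metadata:
  name: appstack-sample
spec:
  replicas: 2          # hypothetical field
  image: nginx:1.27    # hypothetical field
```

Then follow the manager logs as it reconciles (the Deployment name follows the scaffold's <project>-controller-manager convention; check kubectl get deploy):

```shell
kubectl -n appstack-system logs deploy/appstack-operator-controller-manager -f
```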


The Operator Container Image

Version Tagging

Don't use latest for operator images in production. Use immutable semver tags:
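For example (illustrative registry):

```
registry.example.com/appstack-operator:v0.3.1   # good: immutable semver
registry.example.com/appstack-operator:latest   # avoid: mutable, unreproducible
```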

The Deployment in config/manager/manager.yaml references the image:
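A sketch of the relevant fragment; make deploy IMG=... rewrites the image via kustomize, so the committed value is just a default:

```yaml
containers:
- name: manager
  image: registry.example.com/appstack-operator:v0.3.1
  imagePullPolicy: IfNotPresent
```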

Kubernetes defaults imagePullPolicy to Always when the tag is latest. For versioned tags the default is IfNotPresent; set it explicitly to make the intent clear and avoid unnecessary pulls.

Signing Images

For production, sign your images with cosign:
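A key-pair workflow (cosign also supports keyless OIDC signing):

```shell
cosign generate-key-pair
cosign sign --key cosign.key registry.example.com/appstack-operator:v0.3.1
cosign verify --key cosign.pub registry.example.com/appstack-operator:v0.3.1
```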


Leader Election for High Availability

Running a single operator replica is a single point of failure. A crashed pod means no reconciliation until it restarts. Running multiple replicas without coordination causes split-brain: two controllers reconciling the same resource simultaneously and overwriting each other's writes.

Leader election solves this. Only the leader pod actively reconciles. Follower pods watch the lease but don't act. If the leader dies, a follower acquires the lease within seconds.

Enable it in cmd/main.go:
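The scaffold already wires a flag into the manager options; the LeaderElectionID shown here is a placeholder (kubebuilder generates a random one per project):

```go
var enableLeaderElection bool
flag.BoolVar(&enableLeaderElection, "leader-elect", false,
	"Enable leader election for controller manager. "+
		"Enabling this will ensure there is only one active controller manager.")
flag.Parse()

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:           scheme,
	LeaderElection:   enableLeaderElection,
	LeaderElectionID: "1f8c3a2b.example.com", // placeholder; must be unique per operator
})
```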

And pass --leader-elect=true to the manager binary (set in the Deployment args):
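In config/manager/manager.yaml:

```yaml
containers:
- name: manager
  args:
  - --leader-elect=true
  - --health-probe-bind-address=:8081
```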

The lease object is stored in a Lease resource in the operator namespace:
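Inspect it with:

```shell
kubectl -n appstack-system get leases.coordination.k8s.io
# the HOLDER column names the pod that currently leads
```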

With leader election, you can run 2+ replicas:
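In the manager Deployment:

```yaml
spec:
  replicas: 2
```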

Replicas beyond two add little (the lease still has only one holder at a time); two replicas is enough for fast failover.

RBAC for leader election: The controller needs permission to manage Lease objects. The scaffold includes this:
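In the scaffold this lives in config/rbac/leader_election_role.yaml as a namespaced Role (excerpt):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: leader-election-role
rules:
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch"]
```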


Resource Limits and Security Context

The generated Deployment has placeholder resource limits. Set them based on actual usage observed during development:
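A starting point; tune the requests from observed steady-state usage:

```yaml
resources:
  limits:
    cpu: 500m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 64Mi
```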

For a controller with a small number of watched objects (hundreds, not thousands), 500m CPU and 128Mi memory is generous. Controller-runtime keeps an efficient cache; memory use is proportional to the number of cached objects.

Security Context

The Dockerfile already runs as nonroot. Mirror this in the pod spec:
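A sketch that satisfies the restricted Pod Security Standard (container-level settings shown; runAsNonRoot can also be set at the pod level):

```yaml
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
```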

Set this in config/manager/manager.yaml. These settings align with the Pod Security Standards restricted policy, which is enforced in most hardened clusters.


Health Probes

The manager exposes health endpoints at :8081:

  • GET /healthz: liveness probe (returns 200 if the manager goroutine is alive)

  • GET /readyz: readiness probe (returns 200 once the cache has synced and the manager is ready to reconcile)

These are already registered in main.go by the scaffold:
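From the scaffolded cmd/main.go:

```go
if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
	setupLog.Error(err, "unable to set up health check")
	os.Exit(1)
}
if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
	setupLog.Error(err, "unable to set up ready check")
	os.Exit(1)
}
```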

The Deployment configures the probes:
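The scaffolded values (the timings are defaults you can tune):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
```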

The readiness probe is critical. When the pod starts, the controller-runtime cache must sync all watched resources before reconciliation begins. The readiness probe keeps the pod out of the Ready state, and rollouts from completing, until the cache is warm.


Webhook Validation (Optional)

For stricter validation than kubebuilder markers allow, implement a validating webhook. This lets you write Go code that runs when a CR is applied and rejects invalid resources before they reach the controller.

Scaffold a webhook:
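Assuming the group and version used in this series (flag names per kubebuilder v3+):

```shell
kubebuilder create webhook --group apps --version v1alpha1 --kind AppStack --programmatic-validation
```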

This generates api/v1alpha1/appstack_webhook.go. Implement the ValidateCreate, ValidateUpdate, and ValidateDelete methods:
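A sketch assuming a Spec.Replicas field; the admission.Warnings return type matches recent controller-runtime versions (older scaffolds return only error):

```go
var _ webhook.Validator = &AppStack{}

func (r *AppStack) ValidateCreate() (admission.Warnings, error) {
	return nil, r.validateSpec()
}

func (r *AppStack) ValidateUpdate(old runtime.Object) (admission.Warnings, error) {
	return nil, r.validateSpec()
}

func (r *AppStack) ValidateDelete() (admission.Warnings, error) {
	return nil, nil // nothing to validate on delete
}

// validateSpec holds the shared rules; Spec.Replicas is an illustrative field.
func (r *AppStack) validateSpec() error {
	if r.Spec.Replicas < 1 {
		return fmt.Errorf("spec.replicas must be >= 1, got %d", r.Spec.Replicas)
	}
	return nil
}
```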

Webhooks require TLS certificates. In production, use cert-manager:
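Install cert-manager first; pin a release you have tested (version here is illustrative):

```shell
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml
```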

Uncomment the cert-manager integration in config/default/kustomization.yaml. This wires up certificate generation and injection automatically.


Production Checklist

Before running appstack-operator in a production cluster:

RBAC

  • Verbs trimmed to what the controller actually uses (no *, no unused delete)

  • No secrets access unless strictly necessary

Container

  • Distroless base image, running as non-root

  • Immutable semver tag (no latest); image signed

Deployment

  • Leader election enabled with 2 replicas

  • Resource requests and limits set from observed usage

  • Security context aligned with the Pod Security Standards restricted policy

Observability

  • Liveness and readiness probes wired to /healthz and /readyz

  • Controller logs verified against a test CR

Operations

  • make deploy manifests (CRDs, RBAC, manager Deployment) reviewed and version-controlled

  • Rollback path confirmed: the previous image tag is still pullable


← Previous: Testing with envtest | Series start: README
