RBAC, Deployment, and Production Hardening
Introduction
Getting an operator reconciling correctly in a kind cluster is the halfway point. Running it in production means: the right RBAC permissions, a minimal container image, leader election to avoid split-brain in multi-replica deployments, and proper hardening.
This article covers deploying appstack-operator to a real cluster and the changes that make it safe to run in production.
RBAC Markers and Generated Roles
The // +kubebuilder:rbac: markers in your controller file are how the operator declares the permissions it needs. make manifests reads them and generates config/rbac/role.yaml.
All RBAC markers in appstack_controller.go:
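A representative set of markers for a controller that manages Deployments and Services on behalf of an AppStack might look like the following sketch (the `apps.example.com` API group and the exact list of managed resources are assumptions):

```go
// Permissions on the custom resource, its status subresource, and its finalizers.
// +kubebuilder:rbac:groups=apps.example.com,resources=appstacks,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=apps.example.com,resources=appstacks/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=apps.example.com,resources=appstacks/finalizers,verbs=update
// Permissions on the resources the controller creates and owns.
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch
// +kubebuilder:rbac:groups="",resources=services,verbs=get;list;watch;create;update;patch
```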
After make manifests, config/rbac/role.yaml contains:
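Assuming markers that declare access to appstacks, Deployments, and Services, the generated ClusterRole looks roughly like this (API group and resource names are assumptions):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: manager-role
rules:
- apiGroups: ["apps.example.com"]
  resources: ["appstacks"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
- apiGroups: ["apps.example.com"]
  resources: ["appstacks/status"]
  verbs: ["get", "patch", "update"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["create", "get", "list", "patch", "update", "watch"]
```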
ClusterRole vs Role: The scaffold generates a ClusterRole by default because controllers watch resources across all namespaces. If your operator is namespace-scoped (it only watches resources in one namespace), you can restrict it to a Role, but most operators use a ClusterRole bound with a ClusterRoleBinding.
Principle of Least Privilege
Only request the verbs you actually use:

- If the controller never deletes a resource directly (relying on owner-reference GC instead), remove `delete` from that resource
- Never request `*` (all verbs); it makes auditing impossible
- Avoid `secrets` access unless absolutely necessary; prefer `ConfigMap` for non-sensitive config
Deploying to the Cluster
Build the Container Image
The generated Dockerfile is production-ready:
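A sketch of the kubebuilder-style multi-stage Dockerfile (the Go version and source layout are assumptions; check your scaffold's actual file):

```dockerfile
# Build stage: compile a static binary
FROM golang:1.22 AS builder
WORKDIR /workspace
COPY go.mod go.sum ./
RUN go mod download
COPY cmd/ cmd/
COPY api/ api/
COPY internal/ internal/
RUN CGO_ENABLED=0 GOOS=linux go build -a -o manager cmd/main.go

# Runtime stage: distroless, no shell, no package manager
FROM gcr.io/distroless/static:nonroot
WORKDIR /
COPY --from=builder /workspace/manager .
USER 65532:65532
ENTRYPOINT ["/manager"]
```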
The distroless base image is important:
- No shell or package manager means there is nothing for an attacker to exec into or inject commands through
- `USER 65532:65532` (nonroot): the controller process runs as a non-root user
Build and push:
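A sketch using the scaffold's Makefile targets (the registry name is a placeholder):

```shell
make docker-build docker-push IMG=registry.example.com/appstack-operator:v0.1.0
```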
For multi-arch builds (ARM64 + AMD64):
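One way to do this, if your Makefile includes the buildx target (registry is a placeholder):

```shell
make docker-buildx IMG=registry.example.com/appstack-operator:v0.1.0

# Or with docker buildx directly
docker buildx build --platform linux/amd64,linux/arm64 \
  --push -t registry.example.com/appstack-operator:v0.1.0 .
```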
Deploy with make deploy
This runs kustomize build config/default | kubectl apply -f -, which applies:
- CRDs
- Namespace (`appstack-system`)
- ServiceAccount
- ClusterRole + ClusterRoleBinding
- Manager Deployment
Verify:
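A quick check that the manager pod is running and the CRD is installed (the CRD name is an assumption based on the group used earlier):

```shell
kubectl get pods -n appstack-system
kubectl get crd appstacks.apps.example.com
```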
Check the Controller Logs
Apply a test CR:
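A minimal test CR and a log check, assuming the `apps.example.com` group and a hypothetical `spec.replicas` field (substitute your CRD's actual spec):

```shell
kubectl apply -f - <<EOF
apiVersion: apps.example.com/v1alpha1
kind: AppStack
metadata:
  name: demo
spec:
  replicas: 2   # hypothetical field
EOF

kubectl logs -n appstack-system deployment/appstack-operator-controller-manager -f
```

You should see the reconciler pick up the new object in the log stream.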
The Operator Container Image
Version Tagging
Don't use latest for operator images in production. Use immutable semver tags:
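One common pattern is to derive the tag from a git release so every image is traceable to a commit (registry is a placeholder):

```shell
VERSION=$(git describe --tags)   # e.g. v1.2.3
make docker-build docker-push IMG=registry.example.com/appstack-operator:${VERSION}
```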
The Deployment in config/manager/manager.yaml references the image:
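The relevant fragment, with placeholder registry and tag:

```yaml
containers:
- name: manager
  image: registry.example.com/appstack-operator:v1.2.3
  imagePullPolicy: IfNotPresent
```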
Kubernetes defaults to imagePullPolicy: Always for the latest tag. For versioned tags, set imagePullPolicy: IfNotPresent to avoid unnecessary pulls.
Signing Images
For production, sign your images with cosign:
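A sketch of both signing modes (image reference is a placeholder):

```shell
# Keyless signing: your OIDC identity is recorded in the Rekor transparency log
cosign sign registry.example.com/appstack-operator:v1.2.3

# Or with a generated key pair
cosign generate-key-pair
cosign sign --key cosign.key registry.example.com/appstack-operator:v1.2.3
cosign verify --key cosign.pub registry.example.com/appstack-operator:v1.2.3
```

Pair this with an admission policy that rejects unsigned images.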
Leader Election for High Availability
Running a single operator replica is a single point of failure. A crashed pod means no reconciliation until it restarts. Running multiple replicas without coordination causes split-brain: two controllers reconciling the same resource simultaneously and overwriting each other's writes.
Leader election solves this. Only the leader pod actively reconciles. Follower pods watch the lease but don't act. If the leader dies, a follower acquires the lease within seconds.
Enable it in cmd/main.go:
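A sketch of the relevant manager options; the lease ID shown is a placeholder (the scaffold generates a random prefix):

```go
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:                 scheme,
	HealthProbeBindAddress: probeAddr,
	LeaderElection:         enableLeaderElection,
	// Unique ID for the Lease object all replicas compete for.
	LeaderElectionID: "a1b2c3d4.example.com",
})
if err != nil {
	setupLog.Error(err, "unable to start manager")
	os.Exit(1)
}
```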
And pass --leader-elect=true to the manager binary (set in the Deployment args):
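The container args fragment would look roughly like:

```yaml
containers:
- name: manager
  args:
  - --leader-elect=true
  - --health-probe-bind-address=:8081
```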
The lease object is stored in a Lease resource in the operator namespace:
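You can inspect it directly; the HOLDER column shows which pod is currently the leader:

```shell
kubectl get lease -n appstack-system
kubectl describe lease -n appstack-system
```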
With leader election, you can run 2+ replicas:
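For example (the Deployment name follows the scaffold's convention and is an assumption):

```shell
kubectl scale deployment appstack-operator-controller-manager \
  -n appstack-system --replicas=2
```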
Replicas beyond two add little redundancy (the lease still has only one holder at a time); two replicas already provide failover.
RBAC for leader election: The controller needs permission to manage Lease objects. The scaffold includes this:
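A sketch of the namespaced Role the scaffold generates for this (typically config/rbac/leader_election_role.yaml; exact contents vary by kubebuilder version):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: leader-election-role
rules:
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch"]
```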
Resource Limits and Security Context
The generated Deployment has placeholder resource limits. Set them based on actual usage observed during development:
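A starting point consistent with the numbers discussed below (tune requests to your observed baseline):

```yaml
resources:
  requests:
    cpu: 10m
    memory: 64Mi
  limits:
    cpu: 500m
    memory: 128Mi
```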
For a controller with a small number of watched objects (hundreds, not thousands), 500m CPU and 128Mi memory is generous. Controller-runtime has an efficient cache β memory use is proportional to the number of cached objects.
Security Context
The Dockerfile already runs as nonroot. Mirror this in the pod spec:
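A sketch of the pod- and container-level settings:

```yaml
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: manager
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
```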
Set this in config/manager/manager.yaml. These settings align with the Pod Security Standards restricted policy, which is enforced in most hardened clusters.
Health Probes
The manager exposes health endpoints at :8081:
- `GET /healthz`: liveness probe (returns 200 if the manager goroutine is alive)
- `GET /readyz`: readiness probe (returns 200 once the cache has synced and the manager is ready to reconcile)
These are already registered in main.go by the scaffold:
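The scaffold's registration looks roughly like this, using controller-runtime's built-in `healthz.Ping` checker:

```go
if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
	setupLog.Error(err, "unable to set up health check")
	os.Exit(1)
}
if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
	setupLog.Error(err, "unable to set up ready check")
	os.Exit(1)
}
```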
The Deployment configures the probes:
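A sketch matching the scaffold's defaults (delay and period values may differ in your version):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
```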
The readiness probe is critical. When the pod starts, the controller-runtime cache needs to sync all watched resources before the controller starts reconciling. The readiness probe ensures traffic doesn't route to the pod until the cache is warm.
Webhook Validation (Optional)
For stricter validation than kubebuilder markers allow, implement a validating webhook. This lets you write Go code that runs when a CR is applied and rejects invalid resources before they reach the controller.
Scaffold a webhook:
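The command looks roughly like this (the group name is an assumption; use the group from your `kubebuilder create api` invocation):

```shell
kubebuilder create webhook --group apps --version v1alpha1 --kind AppStack \
  --programmatic-validation
```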
This generates api/v1alpha1/appstack_webhook.go. Implement the ValidateCreate, ValidateUpdate, and ValidateDelete methods:
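A sketch of ValidateCreate; `Spec.Replicas` is a hypothetical field, and the exact method signature varies by kubebuilder version (older scaffolds return only `error`):

```go
// ValidateCreate rejects invalid AppStacks before they reach the controller.
func (r *AppStack) ValidateCreate() (admission.Warnings, error) {
	var allErrs field.ErrorList
	if r.Spec.Replicas < 1 { // hypothetical field
		allErrs = append(allErrs, field.Invalid(
			field.NewPath("spec", "replicas"), r.Spec.Replicas, "must be at least 1"))
	}
	if len(allErrs) == 0 {
		return nil, nil
	}
	return nil, apierrors.NewInvalid(
		GroupVersion.WithKind("AppStack").GroupKind(), r.Name, allErrs)
}
```

ValidateUpdate and ValidateDelete follow the same shape.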
Webhooks require TLS certificates. In production, use cert-manager:
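Installation is a single apply against the upstream release manifest (substitute a concrete version):

```shell
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/<version>/cert-manager.yaml
```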
Uncomment the cert-manager integration in config/default/kustomization.yaml. This wires up certificate generation and injection automatically.
Production Checklist
Before running appstack-operator in a production cluster:
RBAC

- Generated role follows least privilege: no `*` verbs, no unused `delete`, no unnecessary `secrets` access

Container

- Distroless base image, runs as non-root (`USER 65532:65532`)
- Immutable semver tag (never `latest`); image signed with cosign

Deployment

- Leader election enabled and two replicas running
- Resource requests and limits set from observed usage
- Security context meets the Pod Security Standards restricted policy

Observability

- Liveness (`/healthz`) and readiness (`/readyz`) probes configured
- Controller logs verified against a test CR

Operations

- Upgrades roll out via new immutable image tags with imagePullPolicy: IfNotPresent
- Webhook TLS certificates managed by cert-manager (if webhooks are enabled)
Previous: Testing with envtest | Series start: README