Article 3: The Watch-Loop — Background Cluster Polling
Introduction
The watch-loop is the heartbeat of the AIOps engine. Everything else — the rule engine, playbooks, RCA — is reactive: it processes events fed to it. The watch-loop is the source of those events. It runs forever, independently of user interaction, polling the Kubernetes API at a configurable interval and converting cluster state into ClusterEvent objects that the rest of the system can act on.
This article covers the implementation in src/monitoring/watchloop.py and src/k8s/client.py from simple-ai-agent.
Why a Poll-Based Watch-Loop?
Kubernetes has a watch API (kubectl get pods --watch) that streams events as they happen. This is more efficient than polling — instead of asking "what's the state now?" every 30 seconds, you receive changes immediately.
I chose polling for a few reasons specific to my use case:
Homelab reliability. My cluster runs on hardware that occasionally has network hiccups. A long-running watch stream that drops silently leaves me with no events and no error. A polling loop with error logging fails visibly.
Simplicity. The watch stream requires reconnection logic, bookmark handling, and careful error handling to avoid missing events. A polling loop is a while True: with a sleep. In a personal project, I'll take the simpler implementation.
Idempotency. Each poll observes the current state. If I missed a CrashLoopBackOff event because the watch connection dropped, the pod is still in CrashLoopBackOff on the next poll and I'll catch it then. With a push-based watch, missed events are missed permanently.
For a large production cluster with thousands of pods, the polling overhead would be meaningful. For a homelab with tens of pods, a 30-second poll against the local Kubernetes API costs essentially nothing.
ClusterEvent: The Common Data Model
Before writing any detection logic, I defined the data model that everything downstream consumes. This is the contract between the watch-loop and the rule engine.
The metadata and raw_data fields carry enough context for the RCA engine to work with. A CRASH_LOOP event includes the restart count, the last termination reason, and recent log lines. An OOM_KILLED event includes the memory limit, peak usage, and restart history.
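A minimal sketch of what that contract could look like. Only the metadata and raw_data fields and the four event classes come from this article; the other field names, the enum values, and the timestamp field are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any


class EventType(Enum):
    # Hypothetical values mirroring the four event classes described below.
    CRASH_LOOP = "CRASH_LOOP"
    OOM_KILLED = "OOM_KILLED"
    NODE_NOT_READY = "NODE_NOT_READY"
    ZERO_REPLICAS = "ZERO_REPLICAS"


@dataclass
class ClusterEvent:
    event_type: EventType
    namespace: str
    resource_name: str
    # Structured context for the RCA engine (restart count, limits, log lines).
    metadata: dict[str, Any] = field(default_factory=dict)
    # Raw Kubernetes API payload, persisted alongside the event.
    raw_data: dict[str, Any] = field(default_factory=dict)
    observed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```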
Every ClusterEvent is persisted to PostgreSQL so the RCA engine can query event history when building its analysis.
What We Watch For
The current watch-loop detects four event classes from the Kubernetes API:
1. CrashLoopBackOff
This is the most common operational event in my clusters. It usually means the application crashed on startup — misconfigured environment variable, failed database migration, missing dependency. The watch-loop detects it within one polling interval (max 30 seconds).
2. OOMKilled
OOMKilled is different from CrashLoopBackOff — the kernel killed the container because it exceeded its memory limit. The fix is different (increase the limit or fix the memory leak) and the RCA approach is different (look at memory usage trends, not application crash logs).
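The kernel's verdict shows up in the container's last termination state, which is where a detector can look for it. This is a sketch, not the project's actual function: pods are assumed to be API objects already converted to dicts (kubernetes-asyncio models expose .to_dict()), and events are returned as plain dicts to keep the example self-contained.

```python
def detect_oom_killed(pods: list[dict]) -> list[dict]:
    """Flag pods whose containers were last terminated by the OOM killer."""
    events = []
    for pod in pods:
        for cs in pod.get("status", {}).get("container_statuses") or []:
            terminated = (cs.get("last_state") or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                events.append({
                    "event_type": "OOM_KILLED",
                    "namespace": pod["metadata"]["namespace"],
                    "resource_name": pod["metadata"]["name"],
                    "metadata": {
                        "container": cs.get("name"),
                        "restart_count": cs.get("restart_count", 0),
                    },
                })
    return events
```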
3. NotReady Nodes
Node issues are more severe than pod issues — they affect all workloads running on the node. The grace period prevents false positives from brief maintenance operations or node restarts.
4. Zero-Replica Deployments
This catches deployments where all pods are either pending or terminating — often a resource pressure situation or a node affinity/taint issue.
The Watch-Loop Implementation
The watch-loop is an async function that runs as a background task from src/main.py. Here's the structure:
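In outline, it looks something like this. The method names on client and rule_engine, the detect_events aggregator, and the interval parameter are stand-ins for the real code, but the error-containment shape (described in detail below) follows the article.

```python
import asyncio
import logging

logger = logging.getLogger("watchloop")


def detect_events(pods: list) -> list:
    # Placeholder: the real code fans out to the four detection functions.
    return []


async def watch_loop(client, rule_engine, interval: float = 30.0) -> None:
    """Poll forever; every failure mode is contained inside one iteration."""
    while True:
        try:
            pods = await client.list_pods()
            for event in detect_events(pods):
                try:
                    await rule_engine.process(event)
                except Exception:
                    # One bad event must not stop the rest of the batch.
                    logger.exception("watchloop.event_failed")
        except asyncio.CancelledError:
            logger.info("watchloop.cancelled")
            raise  # re-raise so shutdown can cancel the task cleanly
        except Exception:
            # API down, timeout, etc.: log it, skip this cycle, keep running.
            logger.exception("watchloop.poll_failed")
        await asyncio.sleep(interval)
```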
Detection Functions
Each detection function is a pure function: given a list of Kubernetes API objects, return a list of ClusterEvent objects. They're independently testable.
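For instance, a CrashLoopBackOff detector might look like this: dicts in, dicts out, no I/O, so a unit test needs nothing but a literal. As above, this is a sketch that assumes pods have been converted to dicts; the project's real function and field selection may differ.

```python
def detect_crash_loops(pods: list[dict]) -> list[dict]:
    """Pure function: find containers stuck waiting in CrashLoopBackOff."""
    events = []
    for pod in pods:
        for cs in pod.get("status", {}).get("container_statuses") or []:
            waiting = (cs.get("state") or {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                events.append({
                    "event_type": "CRASH_LOOP",
                    "namespace": pod["metadata"]["namespace"],
                    "resource_name": pod["metadata"]["name"],
                    "metadata": {"restart_count": cs.get("restart_count", 0)},
                })
    return events
```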
Deduplication
Naively, the same CrashLoopBackOff pod would fire a ClusterEvent every 30 seconds indefinitely. The rule engine handles deduplication with a cooldown window — but the watch-loop also tracks which events it has already seen using a simple in-memory set of (event_type, namespace, resource_name) tuples, reset periodically.
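One way to implement that tracking, sketched here with per-key timestamps rather than a periodic full reset (a small variation on what the article describes); the 5-minute default is borrowed from the dedup-window figure mentioned later in this article.

```python
import time


class EventDeduper:
    """In-memory dedup keyed on (event_type, namespace, resource_name)."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self._seen: dict[tuple, float] = {}

    def should_emit(self, event_type: str, namespace: str, name: str) -> bool:
        key = (event_type, namespace, name)
        now = time.monotonic()
        # Expire old entries so the map doesn't grow unbounded.
        self._seen = {k: t for k, t in self._seen.items()
                      if now - t < self.window}
        if key in self._seen:
            return False  # already fired within the window
        self._seen[key] = now
        return True
```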
Kubernetes API Client
src/k8s/client.py wraps the kubernetes-asyncio library:
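The config-selection logic at the heart of the wrapper looks roughly like this. In the real client the two loaders would be kubernetes-asyncio's config.load_incluster_config() and await config.load_kube_config(); they're injected as callables here so the sketch runs without the library installed.

```python
import asyncio
from typing import Awaitable, Callable


async def load_k8s_config(
    load_incluster: Callable[[], None],
    load_kubeconfig: Callable[[], Awaitable[None]],
) -> str:
    """Try in-cluster config first; fall back to kubeconfig on failure."""
    try:
        load_incluster()  # only succeeds when running inside a pod
        return "in-cluster"
    except Exception:
        # Development / Docker Compose: use the mounted kubeconfig.
        await load_kubeconfig()
        return "kubeconfig"
```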
The client tries in-cluster config first (for when the agent runs inside Kubernetes) and falls back to kubeconfig (for development or Docker Compose). In my homelab, the agent runs in Docker Compose with the kubeconfig mounted at ./data/kube/config.
Error Containment and Graceful Degradation
The watch-loop has three error scenarios to handle correctly:
1. Kubernetes API Unavailable
If the Kubernetes API is down (cluster restart, network issue), the poll will fail. The watch-loop logs the error with full context, increments the error counter, and sleeps until the next interval. It does not crash. It does not stop.
2. Rule Engine Failure
If the rule engine fails to process an event (bug in rule matching, unexpected event shape), the watch-loop catches the exception per-event and continues processing the remaining events.
3. Application Shutdown
When FastAPI shuts down, the lifespan context manager cancels the watch-loop task. The asyncio.CancelledError is re-raised (not swallowed), allowing the task to exit cleanly. The watch-loop logs watchloop.cancelled on shutdown.
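The cancellation path can be demonstrated with plain asyncio, independent of FastAPI. The stub below shows only the part that matters: catch CancelledError to log, then re-raise so the task actually ends up cancelled.

```python
import asyncio


async def watch_loop() -> None:
    """Stub loop showing only the cancellation path."""
    try:
        while True:
            await asyncio.sleep(30)
    except asyncio.CancelledError:
        # In the real code this emits a structured watchloop.cancelled log.
        raise  # re-raise: swallowing it would leave shutdown hanging


async def shutdown_demo() -> bool:
    """What the lifespan shutdown does: cancel the task, await it, exit."""
    task = asyncio.create_task(watch_loop())
    await asyncio.sleep(0)  # let the loop reach its first await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return task.cancelled()
```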
Configuration
The notification channel format is <channel_type>:<id> — e.g., telegram:123456789 or slack:C123456789. The watch-loop uses this to route AIOps notifications to the right chat destination.
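Parsing that format is a one-liner; the helper name below is hypothetical, not taken from the project.

```python
def parse_notification_channel(value: str) -> tuple[str, str]:
    """Split the <channel_type>:<id> format, e.g. 'telegram:123456789'."""
    channel_type, sep, channel_id = value.partition(":")
    if not sep or not channel_type or not channel_id:
        raise ValueError(f"expected <channel_type>:<id>, got {value!r}")
    return channel_type, channel_id
```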
Setting AUTO_REMEDIATION_ENABLED=true only skips approvals for LOW risk playbook steps. MEDIUM and HIGH risk steps always require explicit approval regardless.
What I Learned from Running This
Poll interval matters less than you think for most events. A 30-second detection delay is usually fine. By the time Kubernetes marks a pod as CrashLoopBackOff, it has already gone through its restart backoff. Another 30 seconds doesn't meaningfully change the incident timeline.
The dedup window needs to survive RCA time. The RCA engine takes 5–15 seconds to run (LLM call). If the dedup window resets while RCA is running, the same event can fire twice before the first remediation completes. I learned this the hard way by getting duplicate approval requests for the same pod. The 5-minute dedup window is the fix.
Kubernetes API calls fail more than you'd expect. Even in a healthy cluster, the API server occasionally returns 429s or 503s under load. kubernetes-asyncio has retry logic, but the watch-loop needs its own outer retry/continue behavior. A single failed poll should not disrupt the loop — just log it and try again.
Don't watch every namespace by default. In a cluster with many namespaces including system namespaces, watching everything generates noise. Consider filtering to application namespaces:
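A minimal version of such a filter; the setting name and namespace values are illustrative, since the project's actual config key isn't shown here.

```python
# Assumed config: the namespaces the watch-loop should poll.
WATCH_NAMESPACES = {"default", "media", "apps"}


def in_watched_namespace(pod: dict,
                         allowed: set[str] = WATCH_NAMESPACES) -> bool:
    """True if the pod lives in an explicitly watched app namespace."""
    return pod["metadata"]["namespace"] in allowed
```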
Next: Article 4 — The Rule Engine: Turning Events into Actionable Alerts