Article 8: Observability — Making the Agent Itself Observable
Introduction
The agent monitors your cluster. But who monitors the agent?
This is not a rhetorical question. If the watch-loop stops polling, you won't know. If the RCA engine starts failing silently, alerts will go out without analyses and you won't notice until you check. If Redis goes down and cooldowns stop working, you'll get notification floods. If the approval timeout logic has a bug, approvals silently expire and playbooks never run.
An AIOps agent that isn't instrumented is a black box running next to your infrastructure. I treated observability as a first-class requirement for simple-ai-agent — not something to add later. This article covers src/monitoring/prometheus.py, structured logging with structlog, and the /health endpoint design.
Table of Contents
Observability Goals
Structured Logging with structlog
Prometheus Metrics
The /health Endpoint
Grafana Dashboard Overview
Debugging Common Failure Modes
What I Learned
Wrapping Up the Series
Observability Goals
There are four questions I need to be able to answer about the agent at any time:
Is it alive? Is the process running, are background tasks alive, can it reach its dependencies?
Is it working? Is the watch-loop polling? Are events being processed? Are notifications being sent?
Is it performing? How long do RCA calls take? Are there approval request backlogs?
Is it accurate? How often does it fire per rule? How often are approvals rejected vs. accepted?
Those four questions map directly to the three observability signals I use: structured logs (what happened), metrics (counts and timing), and health endpoints (current state).
Structured Logging with structlog
All logging in simple-ai-agent uses structlog with JSON output. Every log line is a JSON object with standard fields:
Log Fields Convention
Every log call names the event with a dotted path that identifies the component and action:
This naming convention makes grep and log query tools useful:
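As a toy illustration with invented sample lines: because every line is JSON with a dotted event field, filtering by component is a prefix match — the same idea applies whether you use grep, jq, or a log query language:

```python
import json

# Invented sample log lines for the example
lines = [
    '{"event": "watchloop.poll.start", "level": "info"}',
    '{"event": "approval.requested", "level": "info", "approval_id": "abc123"}',
    '{"event": "approval.resolved", "level": "info", "decision": "approved"}',
]

# All approval-related events, regardless of which component logged them
approval_events = [
    rec for rec in map(json.loads, lines)
    if rec["event"].startswith("approval.")
]
print([r["event"] for r in approval_events])
# -> ['approval.requested', 'approval.resolved']
```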
Context Binding
The RCA engine binds resource context to every log call within a request scope:
Prometheus Metrics
All metrics are defined in src/monitoring/prometheus.py and registered at application startup.
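A sketch of the kinds of definitions that module contains. The metric names match the ones referenced elsewhere in this article; the labels and histogram buckets are illustrative. A dedicated registry keeps the example isolated from the global default:

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

registry = CollectorRegistry()

# prometheus_client appends the _total suffix to counters in the
# exposition format, so these appear as watchloop_polls_total etc.
WATCHLOOP_POLLS = Counter(
    "watchloop_polls", "Watch-loop poll iterations", registry=registry
)
RCA_ERRORS = Counter(
    "rca_errors", "RCA failures by reason", ["reason"], registry=registry
)
APPROVALS_RESOLVED = Counter(
    "approvals_resolved", "Approval outcomes", ["decision"], registry=registry
)
RCA_DURATION = Histogram(
    "rca_duration_seconds", "End-to-end RCA latency",
    buckets=(1, 5, 10, 15, 30, 60), registry=registry
)

WATCHLOOP_POLLS.inc()
RCA_ERRORS.labels(reason="timeout").inc()
APPROVALS_RESOLVED.labels(decision="timeout").inc()
RCA_DURATION.observe(8.2)
```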
Metrics Endpoint
FastAPI exposes metrics at /metrics using prometheus_client:
The /health Endpoint
The /health endpoint does more than return {"status": "ok"}. It checks the actual state of each subsystem and returns a structured response:
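A sketch of the aggregation logic, with the subsystem checks injected so the shape is clear — the real endpoint's check names and wiring may differ:

```python
import asyncio

async def health_snapshot(watchloop_task, redis_ping, k8s_ping) -> tuple[int, dict]:
    """Aggregate subsystem checks into an HTTP status code and body.

    watchloop_task: the asyncio.Task running the watch-loop
    redis_ping / k8s_ping: async callables returning True when reachable
    """
    checks = {
        "watchloop": not watchloop_task.done(),  # a finished task means it died
        "redis": await redis_ping(),
        "kubernetes": await k8s_ping(),
    }
    status = 200 if all(checks.values()) else 503
    body = {"status": "ok" if status == 200 else "degraded", "checks": checks}
    return status, body
```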
Example /health Response
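An illustrative degraded response — the exact field names in the project may differ:

```json
{
  "status": "degraded",
  "checks": {
    "watchloop": true,
    "redis": true,
    "kubernetes": false
  }
}
```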
A degraded state (HTTP 503) with the Kubernetes check failing means the cluster API is unreachable — the watch-loop keeps running but logs errors on every poll. A failing Redis check means cooldowns and approvals are non-functional. Neither condition stops the process, but both should generate their own alerts — for example, by having Prometheus probe the /health endpoint (via the blackbox exporter) and routing the resulting alert through Alertmanager.
Grafana Dashboard Overview
The Grafana dashboard at config/grafana/dashboards/aiops-agent.json has four panels:
Panel 1: Watch-Loop Activity
This shows whether the agent is seeing events. If this drops to 0 for more than 30 minutes, either the cluster is unusually healthy or the watch-loop has stalled.
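As a sketch, the panel query can be a simple rate over the poll counter (a separate events-seen counter, if the agent exposes one, would be queried the same way):

```promql
rate(watchloop_polls_total[5m])
```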
Panel 2: Rule Engine Firing Rate
Useful for identifying which rules are most active. If crash-loop-production is firing constantly, that's a signal to look at the application, not the agent.
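A hypothetical query for this panel — the metric name rules_fired_total and its rule label are assumptions, not names confirmed by the project:

```promql
sum by (rule) (rate(rules_fired_total[15m]))
```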
Panel 3: RCA Latency
p95 above 15 seconds usually means the LLM API is under load. p95 above 30 seconds may indicate a timeout issue.
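Assuming RCA latency is recorded in a histogram named rca_duration_seconds, the p95 panel would use histogram_quantile over the bucket rates:

```promql
histogram_quantile(0.95, sum by (le) (rate(rca_duration_seconds_bucket[5m])))
```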
Panel 4: Approval Funnel
High timeout rate means the approval window is too short or notifications aren't reaching the operator. High rejection rate is worth reviewing to understand whether the playbooks are triggering inappropriately.
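Using the approvals_resolved_total counter referenced later in this article, broken out by decision (approved / rejected / timeout):

```promql
sum by (decision) (increase(approvals_resolved_total[24h]))
```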
Debugging Common Failure Modes
Watch-loop stopped polling
Symptom: watchloop_polls_total counter is flat for >5 minutes.
Diagnosis:
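One way to confirm the symptom, sketched against invented sample scrapes — in practice you would fetch the agent's /metrics endpoint twice, a few minutes apart:

```python
def counter_value(metrics_text: str, name: str) -> float:
    """Pull a single unlabeled counter value out of Prometheus text format."""
    for line in metrics_text.splitlines():
        if line.startswith(name + " "):
            return float(line.split()[1])
    raise KeyError(name)

# Two scrapes taken a few minutes apart (sample text for the example)
scrape_t0 = "watchloop_polls_total 1042.0"
scrape_t1 = "watchloop_polls_total 1042.0"

stalled = counter_value(scrape_t1, "watchloop_polls_total") == counter_value(
    scrape_t0, "watchloop_polls_total"
)
print("watch-loop stalled?", stalled)  # -> True for these samples
```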
Most common cause: Unhandled exception that escaped the try/except in run_watchloop(). Should not happen with proper error containment, but the guard is there.
RCA calls never complete
Symptom: rca_errors_total{reason="timeout"} increasing.
Diagnosis:
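A small sanity check worth running first. The sk-ant- key prefix is Anthropic's current key format, but treat this as a heuristic; rate limiting can only be confirmed in the console:

```python
import os

def check_anthropic_key(env=os.environ) -> list[str]:
    """Flag obviously broken Anthropic credentials (missing / wrong prefix)."""
    key = env.get("ANTHROPIC_API_KEY", "")
    problems = []
    if not key:
        problems.append("ANTHROPIC_API_KEY is not set")
    elif not key.startswith("sk-ant-"):
        problems.append("key lacks the sk-ant- prefix")
    return problems
```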
Most common cause: ANTHROPIC_API_KEY is invalid or rate-limited. Check the Anthropic console usage page at console.anthropic.com.
Notification floods (cooldown not working)
Symptom: Same alert firing every 30 seconds in Telegram.
Diagnosis:
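A quick connectivity probe, written against any redis-py-style client that exposes ping():

```python
def cooldowns_operational(redis_client) -> bool:
    """Return True when Redis answers PING.

    Cooldown writes silently no-op when it does not, so this check
    distinguishes 'cooldowns broken' from 'rules genuinely re-firing'.
    """
    try:
        return bool(redis_client.ping())
    except Exception:
        return False
```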
Most common cause: Redis connection failure. Cooldown keys silently fail to write when Redis is down.
Approval requests timing out silently
Symptom: approvals_resolved_total{decision="timeout"} increasing, notifications sent but no follow-up.
Diagnosis:
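One way to inspect pending approvals directly in Redis — the approval: key prefix is a guess at the project's key scheme, and any redis-py-style client with scan_iter() and ttl() will do:

```python
def pending_approvals(redis_client, prefix: str = "approval:") -> dict[str, int]:
    """Map each pending approval key to its remaining TTL in seconds.

    An empty result while notifications are going out suggests the request
    was never stored; keys counting down to expiry with no operator action
    match the silent-timeout symptom.
    """
    return {
        key: redis_client.ttl(key)
        for key in redis_client.scan_iter(match=prefix + "*")
    }
```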
Most common cause: The approval notification was sent but not seen (wrong channel config, muted conversation). Or the approval ID appeared but the operator didn't recognize the format.
What I Learned
Instrument the negative paths. I instrumented the happy path first: rule_matched, rca_complete, approval_approved. The gaps appeared in incidents: rule_engine_cooldowns_total was not instrumented initially, so I had no way to know whether events were being skipped because of cooldowns or because they weren't matching any rule.
The /health endpoint should reflect actual functionality, not just process uptime. A process can be running with a dead watch-loop. I've had asyncio.CancelledError bubble up and silently kill the background watch-loop task while the HTTP server kept serving requests. Checking watchloop_task.done() in /health caught this.
Structlog's context binding makes distributed debugging tractable. Being able to grep '"namespace":"production","resource":"api-server"' and see every log line from every component that touched that resource in the order it happened is genuinely useful. Without the consistent field naming, I'd be guessing.
Approval metrics revealed a real misconfiguration. After running for two weeks, I noticed approvals_resolved_total{decision="timeout"} was 100% of all approvals. Every approval timed out. The reason: I had the Telegram bot's webhook set to a domain that was unreachable from the internet, so approval notifications were never delivered. The metrics caught this; without them I would have assumed the system was working because the process was running.
Wrapping Up the Series
This completes the AIOps 101 series. The eight articles cover the complete lifecycle of how simple-ai-agent detects, classifies, diagnoses, and responds to Kubernetes cluster anomalies:
What is AIOps? — The gap between alerts and resolution
Architecture — Layered design and stack decisions
The Watch-Loop — Background cluster polling
The Rule Engine — YAML rules and severity routing
Playbooks and Human-in-the-Loop Approvals — Risk gating and Redis TTL approvals
LLM-Powered RCA — Evidence collection and SRE prompt design
Alertmanager Webhook Integration — Bridging Prometheus alerts into the agent pipeline
Observability (this article) — Structlog, Prometheus metrics, and /health
The full project is at github.com/Htunn/simple-ai-agent.