Article 8: Observability — Making the Agent Itself Observable
Introduction
The agent monitors your cluster. But who monitors the agent?
This is not a rhetorical question. If the watch-loop stops polling, you won't know. If the RCA engine starts failing silently, alerts will go out without analyses and you won't notice until you check. If Redis goes down and cooldowns stop working, you'll get notification floods. If the approval timeout logic has a bug, approvals silently expire and playbooks never run.
An AIOps agent that isn't instrumented is a black box running next to your infrastructure. I treated observability as a first-class requirement for simple-ai-agent — not something to add later. This article covers src/monitoring/prometheus.py, structured logging with structlog, and the /health endpoint design.
Table of Contents
Observability Goals
Structured Logging with structlog
Prometheus Metrics
The /health Endpoint
Grafana Dashboard Overview
Debugging Common Failure Modes
What I Learned
Wrapping Up the Series
Observability Goals
There are four questions I need to be able to answer about the agent at any time:
Is it alive? Is the process running, are background tasks alive, can it reach its dependencies?
Is it working? Is the watch-loop polling? Are events being processed? Are notifications being sent?
Is it performing? How long do RCA calls take? Are there approval request backlogs?
Is it accurate? How often does it fire per rule? How often are approvals rejected vs. accepted?
Those four questions map directly to the three observability signals I use: structured logs (what happened), metrics (counts and timing), and health endpoints (current state).
Structured Logging with structlog
All logging in simple-ai-agent uses structlog with JSON output. Every log line is a JSON object with standard fields:
Log Fields Convention
Every log call names the event with a dotted path that identifies the component and action:
This naming convention makes grep and log query tools useful:
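As a toy illustration with invented sample lines: because every line is JSON with a dotted event field, filtering by component is a prefix match — the same idea applies whether you use grep, jq, or a log query language:

```python
import json

# Invented sample log lines for the example
lines = [
    '{"event": "watchloop.poll.start", "level": "info"}',
    '{"event": "approval.requested", "level": "info", "approval_id": "abc123"}',
    '{"event": "approval.resolved", "level": "info", "decision": "approved"}',
]

# All approval-related events, regardless of which component logged them
approval_events = [
    rec for rec in map(json.loads, lines)
    if rec["event"].startswith("approval.")
]
print([r["event"] for r in approval_events])
# -> ['approval.requested', 'approval.resolved']
```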
Context Binding
The RCA engine binds resource context to every log call within a request scope:
Prometheus Metrics
All metrics are defined in src/monitoring/prometheus.py and registered at application startup.
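A sketch of the kinds of definitions that module contains. The metric names match the ones referenced elsewhere in this article; the labels and histogram buckets are illustrative. A dedicated registry keeps the example isolated from the global default:

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

registry = CollectorRegistry()

# prometheus_client appends the _total suffix to counters in the
# exposition format, so these appear as watchloop_polls_total etc.
WATCHLOOP_POLLS = Counter(
    "watchloop_polls", "Watch-loop poll iterations", registry=registry
)
RCA_ERRORS = Counter(
    "rca_errors", "RCA failures by reason", ["reason"], registry=registry
)
APPROVALS_RESOLVED = Counter(
    "approvals_resolved", "Approval outcomes", ["decision"], registry=registry
)
RCA_DURATION = Histogram(
    "rca_duration_seconds", "End-to-end RCA latency",
    buckets=(1, 5, 10, 15, 30, 60), registry=registry
)

WATCHLOOP_POLLS.inc()
RCA_ERRORS.labels(reason="timeout").inc()
APPROVALS_RESOLVED.labels(decision="timeout").inc()
RCA_DURATION.observe(8.2)
```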
Metrics Endpoint
FastAPI exposes metrics at /metrics using prometheus_client:
The /health Endpoint
The /health endpoint does more than return {"status": "ok"}. It checks the actual state of each subsystem and returns a structured response:
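A sketch of the aggregation logic, with the subsystem checks injected so the shape is clear — the real endpoint's check names and wiring may differ:

```python
import asyncio

async def health_snapshot(watchloop_task, redis_ping, k8s_ping) -> tuple[int, dict]:
    """Aggregate subsystem checks into an HTTP status code and body.

    watchloop_task: the asyncio.Task running the watch-loop
    redis_ping / k8s_ping: async callables returning True when reachable
    """
    checks = {
        "watchloop": not watchloop_task.done(),  # a finished task means it died
        "redis": await redis_ping(),
        "kubernetes": await k8s_ping(),
    }
    status = 200 if all(checks.values()) else 503
    body = {"status": "ok" if status == 200 else "degraded", "checks": checks}
    return status, body
```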
Example /health Response
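An illustrative degraded response — the exact field names in the project may differ:

```json
{
  "status": "degraded",
  "checks": {
    "watchloop": true,
    "redis": true,
    "kubernetes": false
  }
}
```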
A degraded state (HTTP 503) with the Kubernetes check failing means the cluster API is unreachable — the watch-loop keeps running but logs errors on every poll. A failing Redis check means cooldowns and approvals are non-functional. Neither condition stops the process, but both should generate their own alerts — for example, by having Prometheus probe the /health endpoint (via the blackbox exporter) and routing the resulting alert through Alertmanager.
Grafana Dashboard Overview
The Grafana dashboard at config/grafana/dashboards/aiops-agent.json has four panels:
Panel 1: Watch-Loop Activity
This shows whether the agent is seeing events. If this drops to 0 for more than 30 minutes, either the cluster is unusually healthy or the watch-loop has stalled.
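As a sketch, the panel query can be a simple rate over the poll counter (a separate events-seen counter, if the agent exposes one, would be queried the same way):

```promql
rate(watchloop_polls_total[5m])
```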
Panel 2: Rule Engine Firing Rate
Useful for identifying which rules are most active. If crash-loop-production is firing constantly, that's a signal to look at the application, not the agent.
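A hypothetical query for this panel — the metric name rules_fired_total and its rule label are assumptions, not names confirmed by the project:

```promql
sum by (rule) (rate(rules_fired_total[15m]))
```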
Panel 3: RCA Latency
p95 above 15 seconds usually means the LLM API is under load. p95 above 30 seconds may indicate a timeout issue.
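Assuming RCA latency is recorded in a histogram named rca_duration_seconds, the p95 panel would use histogram_quantile over the bucket rates:

```promql
histogram_quantile(0.95, sum by (le) (rate(rca_duration_seconds_bucket[5m])))
```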
Panel 4: Approval Funnel
High timeout rate means the approval window is too short or notifications aren't reaching the operator. High rejection rate is worth reviewing to understand whether the playbooks are triggering inappropriately.
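Using the approvals_resolved_total counter referenced later in this article, broken out by decision (approved / rejected / timeout):

```promql
sum by (decision) (increase(approvals_resolved_total[24h]))
```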
Debugging Common Failure Modes
Watch-loop stopped polling
Symptom: watchloop_polls_total counter is flat for >5 minutes.
Diagnosis:
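One way to confirm the symptom, sketched against invented sample scrapes — in practice you would fetch the agent's /metrics endpoint twice, a few minutes apart:

```python
def counter_value(metrics_text: str, name: str) -> float:
    """Pull a single unlabeled counter value out of Prometheus text format."""
    for line in metrics_text.splitlines():
        if line.startswith(name + " "):
            return float(line.split()[1])
    raise KeyError(name)

# Two scrapes taken a few minutes apart (sample text for the example)
scrape_t0 = "watchloop_polls_total 1042.0"
scrape_t1 = "watchloop_polls_total 1042.0"

stalled = counter_value(scrape_t1, "watchloop_polls_total") == counter_value(
    scrape_t0, "watchloop_polls_total"
)
print("watch-loop stalled?", stalled)  # -> True for these samples
```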
Most common cause: Unhandled exception that escaped the try/except in run_watchloop(). Should not happen with proper error containment, but the guard is there.
RCA calls never complete
Symptom: rca_errors_total{reason="timeout"} increasing.
Diagnosis:
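A small sanity check worth running first. The sk-ant- key prefix is Anthropic's current key format, but treat this as a heuristic; rate limiting can only be confirmed in the console:

```python
import os

def check_anthropic_key(env=os.environ) -> list[str]:
    """Flag obviously broken Anthropic credentials (missing / wrong prefix)."""
    key = env.get("ANTHROPIC_API_KEY", "")
    problems = []
    if not key:
        problems.append("ANTHROPIC_API_KEY is not set")
    elif not key.startswith("sk-ant-"):
        problems.append("key lacks the sk-ant- prefix")
    return problems
```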
Most common cause: ANTHROPIC_API_KEY is invalid or rate-limited. Check the Anthropic console usage page at console.anthropic.com.
Notification floods (cooldown not working)
Symptom: Same alert firing every 30 seconds in Telegram.
Diagnosis:
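A quick connectivity probe, written against any redis-py-style client that exposes ping():

```python
def cooldowns_operational(redis_client) -> bool:
    """Return True when Redis answers PING.

    Cooldown writes silently no-op when it does not, so this check
    distinguishes 'cooldowns broken' from 'rules genuinely re-firing'.
    """
    try:
        return bool(redis_client.ping())
    except Exception:
        return False
```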
Most common cause: Redis connection failure. Cooldown keys silently fail to write when Redis is down.
Approval requests timing out silently
Symptom: approvals_resolved_total{decision="timeout"} increasing, notifications sent but no follow-up.
Diagnosis:
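One way to inspect pending approvals directly in Redis — the approval: key prefix is a guess at the project's key scheme, and any redis-py-style client with scan_iter() and ttl() will do:

```python
def pending_approvals(redis_client, prefix: str = "approval:") -> dict[str, int]:
    """Map each pending approval key to its remaining TTL in seconds.

    An empty result while notifications are going out suggests the request
    was never stored; keys counting down to expiry with no operator action
    match the silent-timeout symptom.
    """
    return {
        key: redis_client.ttl(key)
        for key in redis_client.scan_iter(match=prefix + "*")
    }
```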
Most common cause: The approval notification was sent but not seen (wrong channel config, muted conversation). Or the approval ID appeared but the operator didn't recognize the format.
What I Learned
Instrument the negative paths. I instrumented the happy path first: rule_matched, rca_complete, approval_approved. The gaps appeared in incidents: rule_engine_cooldowns_total was not instrumented initially, so I had no way to know whether events were being skipped because of cooldowns or because they weren't matching any rule.
The /health endpoint should reflect actual functionality, not just process uptime. A process can be running with a dead watch-loop. I've had asyncio.CancelledError bubble up and silently kill the background watch-loop task while the HTTP server kept serving requests. Checking watchloop_task.done() in /health caught this.
Structlog's context binding makes distributed debugging tractable. Being able to grep '"namespace":"production","resource":"api-server"' and see every log line from every component that touched that resource in the order it happened is genuinely useful. Without the consistent field naming, I'd be guessing.
Approval metrics revealed a real misconfiguration. After running for two weeks, I noticed approvals_resolved_total{decision="timeout"} was 100% of all approvals. Every approval timed out. The reason: I had the Telegram bot's webhook set to a domain that was unreachable from the internet, so approval notifications were never delivered. The metrics caught this; without them I would have assumed the system was working because the process was running.
Wrapping Up the Series
This completes the AIOps 101 series. The eight articles cover the complete lifecycle of how simple-ai-agent detects, classifies, diagnoses, and responds to Kubernetes cluster anomalies:
What is AIOps? — The gap between alerts and resolution
Architecture — Layered design and stack decisions
The Watch-Loop — Background cluster polling
The Rule Engine — YAML rules and severity routing
Playbooks and Human-in-the-Loop Approvals — Risk gating and Redis TTL approvals
LLM-Powered RCA — Evidence collection and SRE prompt design
Alertmanager Webhook Integration — Bridging Prometheus alerts into the agent pipeline
Observability (this article) — Structlog, Prometheus metrics, and /health
The full project is at github.com/Htunn/simple-ai-agent.