# Article 8: Observability — Making the Agent Itself Observable

## Introduction

The agent monitors your cluster. But who monitors the agent?

This is not a rhetorical question. If the watch-loop stops polling, you won't know. If the RCA engine starts failing silently, alerts will go out without analyses and you won't notice until you check. If Redis goes down and cooldowns stop working, you'll get notification floods. If the approval timeout logic has a bug, approvals silently expire and playbooks never run.

An AIOps agent that isn't instrumented is a black box running next to your infrastructure. I treated observability as a first-class requirement for `simple-ai-agent` — not something to add later. This article covers `src/monitoring/prometheus.py`, structured logging with `structlog`, and the `/health` endpoint design.

***

## Table of Contents

1. [Observability Goals for the Agent](#observability-goals)
2. [Structured Logging with structlog](#structured-logging)
3. [Prometheus Metrics](#prometheus-metrics)
4. [The /health Endpoint](#the-health-endpoint)
5. [Grafana Dashboard Overview](#grafana-dashboard-overview)
6. [Debugging Common Failure Modes](#debugging-common-failure-modes)
7. [What I Learned](#what-i-learned)

***

## Observability Goals

There are four questions I need to be able to answer about the agent at any time:

1. **Is it alive?** Is the process running, are background tasks alive, can it reach its dependencies?
2. **Is it working?** Is the watch-loop polling? Are events being processed? Are notifications being sent?
3. **Is it performing?** How long do RCA calls take? Are there approval request backlogs?
4. **Is it accurate?** How often does it fire per rule? How often are approvals rejected vs. accepted?

Those four questions map directly to the three observability signals I use: structured logs (what happened), metrics (counts and timing), and health endpoints (current state).

***

## Structured Logging with structlog

All logging in `simple-ai-agent` uses `structlog` with JSON output. Every log line is a JSON object with standard fields:

```python
# src/core/logging.py
import logging
import sys

import structlog

def configure_logging(log_level: str = "INFO") -> None:
    structlog.configure(
        processors=[
            structlog.stdlib.add_log_level,
            structlog.stdlib.add_logger_name,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer(),
        ],
        wrapper_class=structlog.BoundLogger,
        context_class=dict,
        logger_factory=structlog.PrintLoggerFactory(),
    )
    
    # Also configure stdlib logging to go through structlog
    logging.basicConfig(
        format="%(message)s",
        stream=sys.stdout,
        level=getattr(logging, log_level.upper()),
    )
```

### Log Fields Convention

Every log call names the event with a dotted path that identifies the component and action:

```python
# Component prefix identifies the subsystem
log.info("watchloop.poll.start", interval=30, namespaces=["production", "staging"])
log.info("watchloop.event.detected", event_type="crash_loop", namespace="production", resource="api-server-abc")
log.info("rule_engine.matched", rule="crash-loop-production", severity="critical")
log.info("approval.requested", approval_id="a3f7", playbook="restart-crashed-pod")
log.info("approval.resolved", approval_id="a3f7", decision="approved", elapsed_s=47.3)
log.info("rca.complete", confidence="medium", duration_ms=8423, model="claude-3-5-sonnet-20241022")
log.info("notifier.sent", channel="telegram", chars=312, message_id=1842)
```

This naming convention makes `grep` and log query tools useful:

```bash
# All RCA calls in the last hour (-o cat strips the journald prefix so each line is raw JSON)
journalctl -u simple-ai-agent -o cat --since "1 hour ago" | grep '"event":"rca.'

# All critical events in production
journalctl -u simple-ai-agent -o cat | jq 'select(.severity=="critical" and .namespace=="production")'

# Average RCA duration
journalctl -u simple-ai-agent -o cat | jq 'select(.event=="rca.complete") | .duration_ms' | awk '{sum+=$1; count++} END {print sum/count}'
```

### Context Binding

The RCA engine binds resource context to every log call within a request scope:

```python
import time

async def analyze(self, event: ClusterEvent) -> RCAResult:
    # Bind context once; all child log calls inherit it
    log = structlog.get_logger().bind(
        namespace=event.namespace,
        resource=event.resource_name,
        event_type=event.event_type,
    )

    start = time.monotonic()
    log.info("rca.start")                    # includes namespace/resource/event_type
    # ... evidence collection ...
    log.info("rca.evidence.collected", items=len(evidence.pod_logs))
    log.info("rca.complete", confidence=result.confidence,
             duration_ms=int((time.monotonic() - start) * 1000))
```

***

## Prometheus Metrics

All metrics are defined in `src/monitoring/prometheus.py` and registered at application startup.

```python
# src/monitoring/prometheus.py
from prometheus_client import Counter, Gauge, Histogram

# Watch-loop metrics
WATCHLOOP_POLL_TOTAL     = Counter("watchloop_polls_total", "Total watch-loop poll iterations")
WATCHLOOP_POLL_DURATION  = Histogram("watchloop_poll_duration_seconds",
                                     "Duration of each watch-loop poll",
                                     buckets=[0.5, 1, 2, 5, 10, 30])
WATCHLOOP_EVENTS_DETECTED = Counter("watchloop_events_detected_total",
                                    "Events detected by watch-loop",
                                    labelnames=["event_type", "namespace"])

# Rule engine metrics  
RULE_MATCHES             = Counter("rule_engine_matches_total",
                                    "Rule matches",
                                    labelnames=["rule", "severity"])
RULE_COOLDOWNS           = Counter("rule_engine_cooldowns_total",
                                    "Events skipped due to cooldown",
                                    labelnames=["rule"])

# RCA metrics
RCA_DURATION             = Histogram("rca_duration_milliseconds",
                                      "RCA call duration in milliseconds",
                                      buckets=[1000, 3000, 5000, 10000, 20000, 30000])
RCA_CONFIDENCE           = Counter("rca_confidence_total",
                                    "RCA results by confidence level",
                                    labelnames=["level"])
RCA_ERRORS               = Counter("rca_errors_total",
                                    "RCA failures",
                                    labelnames=["reason"])

# Approval metrics
APPROVALS_REQUESTED      = Counter("approvals_requested_total", "Approval requests created")
APPROVALS_RESOLVED       = Counter("approvals_resolved_total",
                                    "Approval requests resolved",
                                    labelnames=["decision"])  # approved / rejected / timeout
APPROVAL_WAIT_TIME       = Histogram("approval_wait_seconds",
                                      "Time from approval request to decision",
                                      buckets=[30, 60, 120, 300, 600])

# Notification metrics
NOTIFICATIONS_SENT       = Counter("notifications_sent_total",
                                    "Notifications sent",
                                    labelnames=["channel", "severity"])
NOTIFICATION_ERRORS      = Counter("notification_errors_total",
                                    "Notification delivery failures",
                                    labelnames=["channel"])

# Dependency health gauges
REDIS_CONNECTED          = Gauge("redis_connected", "Whether Redis is reachable (1=yes, 0=no)")
KUBERNETES_CONNECTED     = Gauge("kubernetes_connected", "Whether K8s API is reachable (1=yes, 0=no)")
LLM_API_CONNECTED        = Gauge("llm_api_connected", "Whether LLM API is reachable (1=yes, 0=no)")
```
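
To show how these land at a call site, here's an illustrative helper (not the project's actual watch-loop code) that records one poll iteration; the metric definitions are repeated so the sketch runs standalone:

```python
# Illustrative call-site sketch: how a watch-loop poll might record metrics.
# The metric objects normally come from src/monitoring/prometheus; they are
# redefined here so the example is self-contained.
from prometheus_client import Counter, Histogram

WATCHLOOP_POLL_TOTAL = Counter(
    "watchloop_polls_total", "Total watch-loop poll iterations")
WATCHLOOP_POLL_DURATION = Histogram(
    "watchloop_poll_duration_seconds", "Duration of each watch-loop poll",
    buckets=[0.5, 1, 2, 5, 10, 30])
WATCHLOOP_EVENTS_DETECTED = Counter(
    "watchloop_events_detected_total", "Events detected by watch-loop",
    labelnames=["event_type", "namespace"])

def record_poll(events: list[dict]) -> None:
    """Count the iteration, time it, and label each detected event."""
    WATCHLOOP_POLL_TOTAL.inc()
    with WATCHLOOP_POLL_DURATION.time():  # observes elapsed seconds on exit
        for ev in events:
            WATCHLOOP_EVENTS_DETECTED.labels(
                event_type=ev["event_type"], namespace=ev["namespace"]
            ).inc()

record_poll([{"event_type": "crash_loop", "namespace": "production"}])
```

One naming note: `prometheus_client` appends `_total` to counter samples, and when the name already ends in `_total` it strips and re-adds the suffix rather than doubling it, so the exposed sample names match the ones used in the PromQL queries below.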

### Metrics Endpoint

FastAPI exposes metrics at `/metrics` using `prometheus_client`:

```python
# src/main.py
from prometheus_client import make_asgi_app

# Mount the Prometheus ASGI app at /metrics
app.mount("/metrics", make_asgi_app())
```

***

## The /health Endpoint

The `/health` endpoint does more than return `{"status": "ok"}`. It checks the actual state of each subsystem and returns a structured response:

```python
# src/api/health.py
import os

from fastapi import APIRouter, Request
from fastapi.responses import JSONResponse

from src.monitoring.prometheus import KUBERNETES_CONNECTED, REDIS_CONNECTED

router = APIRouter()

@router.get("/health")
async def health_check(request: Request) -> JSONResponse:
    state = request.app.state
    checks = {}
    overall_ok = True
    
    # Check Redis
    try:
        await state.redis.ping()
        checks["redis"] = {"status": "ok"}
        REDIS_CONNECTED.set(1)
    except Exception as e:
        checks["redis"] = {"status": "error", "detail": str(e)}
        REDIS_CONNECTED.set(0)
        overall_ok = False
    
    # Check Kubernetes API
    try:
        await state.k8s_client.ping()
        checks["kubernetes"] = {"status": "ok"}
        KUBERNETES_CONNECTED.set(1)
    except Exception as e:
        checks["kubernetes"] = {"status": "error", "detail": str(e)}
        KUBERNETES_CONNECTED.set(0)
        overall_ok = False
    
    # Check watch-loop task
    watchloop_task = state.watchloop_task
    if watchloop_task and not watchloop_task.done():
        checks["watchloop"] = {"status": "running"}
    else:
        checks["watchloop"] = {"status": "stopped"}
        overall_ok = False
    
    # Check LLM API (lightweight — just check if token is configured)
    if os.environ.get("ANTHROPIC_API_KEY"):
        checks["llm_api"] = {"status": "configured"}
    else:
        checks["llm_api"] = {"status": "no_token"}
        # Not fatal — RCA degrades gracefully without LLM
    
    # Pending approvals count
    pending_keys = await state.redis.keys("approval:*")
    checks["pending_approvals"] = {"count": len(pending_keys)}
    
    status_code = 200 if overall_ok else 503
    return JSONResponse(
        {
            "status": "ok" if overall_ok else "degraded",
            "checks": checks,
        },
        status_code=status_code,
    )
```

### Example /health Response

```json
{
  "status": "ok",
  "checks": {
    "redis":            {"status": "ok"},
    "kubernetes":       {"status": "ok"},
    "watchloop":        {"status": "running"},
    "llm_api":          {"status": "configured"},
    "pending_approvals": {"count": 1}
  }
}
```

A degraded state (503) can mean more than one thing. If Kubernetes is unreachable, the watch-loop keeps running but logs errors on every poll. If Redis is down, cooldowns and approvals are non-functional. A stopped watch-loop task also degrades the check. None of these stops the process, but each should generate its own alert via Alertmanager monitoring the `/health` HTTP endpoint.
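
The blackbox_exporter is the usual Prometheus-native way to probe an HTTP endpoint, but a minimal standalone probe works too for cron or a sidecar check. This sketch assumes the agent listens on `localhost:8000` (the URL is an assumption):

```python
# Minimal standalone /health probe for cron or a sidecar check.
# Assumes the agent listens on localhost:8000; the URL is an assumption.
import json
import urllib.error
import urllib.request

def probe(url: str = "http://localhost:8000/health") -> bool:
    """Return True only when the agent answers 200 with status "ok"."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            body = json.load(resp)
            return resp.status == 200 and body.get("status") == "ok"
    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError):
        # A degraded 503 raises HTTPError, a URLError subclass, so it lands here.
        return False

# In a cron wrapper: sys.exit(0 if probe() else 1)
```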

***

## Grafana Dashboard Overview

The Grafana dashboard at `config/grafana/dashboards/aiops-agent.json` has four panels:

### Panel 1: Watch-Loop Activity

```
# Query: events detected per minute by type
sum by (event_type) (rate(watchloop_events_detected_total[5m]))
```

This shows whether the agent is seeing events. If this drops to 0 for more than 30 minutes, either the cluster is unusually healthy or the watch-loop has stalled.

### Panel 2: Rule Engine Firing Rate

```
# Query: rule matches per minute by rule name
sum by (rule) (rate(rule_engine_matches_total[5m]))
```

Useful for identifying which rules are most active. If `crash-loop-production` is firing constantly, that's a signal to look at the application, not the agent.

### Panel 3: RCA Latency

```
# Query: 50th and 95th percentile RCA duration
histogram_quantile(0.95, sum by (le) (rate(rca_duration_milliseconds_bucket[10m])))
histogram_quantile(0.50, sum by (le) (rate(rca_duration_milliseconds_bucket[10m])))
```

p95 above 15 seconds usually means the LLM API is under load. p95 above 30 seconds may indicate a timeout issue.

### Panel 4: Approval Funnel

```
# Query: approvals requested vs resolved by decision
increase(approvals_requested_total[24h])
increase(approvals_resolved_total{decision="approved"}[24h])
increase(approvals_resolved_total{decision="rejected"}[24h])
increase(approvals_resolved_total{decision="timeout"}[24h])
```

High timeout rate means the approval window is too short or notifications aren't reaching the operator. High rejection rate is worth reviewing to understand whether the playbooks are triggering inappropriately.

***

## Debugging Common Failure Modes

### Watch-loop stopped polling

**Symptom**: `watchloop_polls_total` counter is flat for >5 minutes.

**Diagnosis**:

```bash
# Check if the task is alive
curl -s http://localhost:8000/health | jq '.checks.watchloop'
# Expected: {"status": "running"}

# If stopped, check recent error logs
journalctl -u simple-ai-agent | grep '"event":"watchloop' | tail -20
```

**Most common cause**: Unhandled exception that escaped the `try/except` in `run_watchloop()`. Should not happen with proper error containment, but the guard is there.
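
The containment pattern that guard refers to looks roughly like this — a simplified stand-in for the project's actual `run_watchloop()`, with the per-iteration work passed in as a callable:

```python
# Simplified sketch of the error-containment guard around the watch-loop.
# poll_once stands in for the real per-iteration work.
import asyncio

async def run_watchloop_guarded(poll_once, interval: float = 30) -> None:
    while True:
        try:
            await poll_once()
        except asyncio.CancelledError:
            raise  # let shutdown propagate instead of swallowing it
        except Exception as exc:
            # Contain everything else: log and keep polling
            print(f"watchloop.poll.error: {exc!r}")
        await asyncio.sleep(interval)
```

On Python 3.8+ `CancelledError` is a `BaseException`, so the bare `except Exception` wouldn't swallow it anyway; the explicit branch documents the intent, because silently swallowing cancellation is exactly how a background task ends up "stopped" while the HTTP server keeps serving.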

***

### RCA calls never complete

**Symptom**: `rca_errors_total{reason="timeout"}` increasing.

**Diagnosis**:

```bash
# Check LLM API reachability
curl -s http://localhost:8000/health | jq '.checks.llm_api'

# Check recent RCA durations
journalctl -u simple-ai-agent -o cat | jq 'select(.event=="rca.complete") | .duration_ms' | tail -10
```

**Most common cause**: `ANTHROPIC_API_KEY` is invalid or rate-limited. Check the Anthropic console usage page at console.anthropic.com.

***

### Notification floods (cooldown not working)

**Symptom**: Same alert firing every 30 seconds in Telegram.

**Diagnosis**:

```bash
# Check Redis connectivity
curl -s http://localhost:8000/health | jq '.checks.redis'

# Check if cooldown keys are being set
redis-cli KEYS "cooldown:*"
# If empty, cooldowns aren't being applied
```

**Most common cause**: Redis connection failure. Cooldown keys fail to write silently when Redis is down.

***

### Approval requests timing out silently

**Symptom**: `approvals_resolved_total{decision="timeout"}` increasing, notifications sent but no followup.

**Diagnosis**:

```bash
# List active pending approvals
redis-cli KEYS "approval:*"
redis-cli HGETALL "approval:<id>"

# Check notification delivery
journalctl -u simple-ai-agent | grep '"event":"notifier'
```

**Most common cause**: The approval notification was sent but never seen (wrong channel config, muted conversation), or it arrived but the operator didn't recognize the approval ID format and never replied.

***

## What I Learned

**Instrument the negative paths.** I instrumented the happy path first: `rule_matched`, `rca_complete`, `approval_approved`. The gaps appeared in incidents: `rule_engine_cooldowns_total` was not instrumented initially, so I had no way to know whether events were being skipped because of cooldowns or because they weren't matching any rule.

**The /health endpoint should reflect actual functionality, not just process uptime.** A process can be running with a dead watch-loop. I've had `asyncio.CancelledError` bubble up and silently kill the background watch-loop task while the HTTP server kept serving requests. Checking `watchloop_task.done()` in `/health` caught this.

**Structlog's context binding makes distributed debugging tractable.** Being able to `grep '"namespace":"production","resource":"api-server"'` and see every log line from every component that touched that resource in the order it happened is genuinely useful. Without the consistent field naming, I'd be guessing.

**Approval metrics revealed a real misconfiguration.** After running for two weeks, I noticed `approvals_resolved_total{decision="timeout"}` was 100% of all approvals. Every approval timed out. The reason: I had the Telegram bot's webhook set to a domain that was unreachable from the internet, so approval notifications were never delivered. The metrics caught this; without them I would have assumed the system was working because the process was running.

***

## Wrapping Up the Series

This completes the AIOps 101 series. The eight articles cover the complete lifecycle of how `simple-ai-agent` detects, classifies, diagnoses, and responds to Kubernetes cluster anomalies:

1. [What is AIOps?](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-what-is-aiops) — The gap between alerts and resolution
2. [Architecture](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-architecture) — Layered design and stack decisions
3. [The Watch-Loop](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-watch-loop) — Background cluster polling
4. [The Rule Engine](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-rule-engine) — YAML rules and severity routing
5. [Playbooks and Human-in-the-Loop Approvals](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-playbooks-human-in-the-loop) — Risk gating and Redis TTL approvals
6. [LLM-Powered RCA](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-rca-engine) — Evidence collection and SRE prompt design
7. [Alertmanager Webhook Integration](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-alertmanager-integration) — Bridging Prometheus alerts into the agent pipeline
8. **Observability** (this article) — Structlog, Prometheus metrics, and /health

The full project is at [github.com/Htunn/simple-ai-agent](https://github.com/Htunn/simple-ai-agent).
