# Article 7: Alertmanager Webhook Integration

## Introduction

The watch-loop is pull-based: the agent polls the Kubernetes API on a fixed interval. But Prometheus and Alertmanager have already built up years of alert rules that fire on real signals — CPU throttling, out-of-disk, response latency, deployment age, certificate expiry. Ignoring those signals and reinventing them in the watch-loop would be duplicate work.

The Alertmanager webhook integration is the bridge: Alertmanager pushes fired alerts to the agent as HTTP POST requests, and the agent translates them into `ClusterEvent` objects that flow through the same rule engine, playbook system, and RCA pipeline as watch-loop events.

This article covers `src/api/webhooks.py` and `config/alertmanager.yml` from [simple-ai-agent](https://github.com/Htunn/simple-ai-agent).

***

## Table of Contents

1. [Why Both Sources Coexist](#why-both-sources-coexist)
2. [Alertmanager Webhook Payload Format](#alertmanager-webhook-payload-format)
3. [Translating Alerts to ClusterEvents](#translating-alerts-to-clusterevents)
4. [Webhook Receiver Implementation](#webhook-receiver-implementation)
5. [Alertmanager Configuration](#alertmanager-configuration)
6. [Deduplication Across Sources](#deduplication-across-sources)
7. [Handling Resolved Alerts](#handling-resolved-alerts)
8. [Real Alert Rules from the Project](#real-alert-rules-from-the-project)
9. [What I Learned](#what-i-learned)

***

## Why Both Sources Coexist

The watch-loop and the Alertmanager webhook detect different things:

| Source       | What it detects                                                                        |
| ------------ | -------------------------------------------------------------------------------------- |
| Watch-loop   | Pod-level state (CrashLoopBackOff, OOMKilled, NotReady) detected by direct API polling |
| Alertmanager | Any metric-based threshold — CPU, memory, network, latency, disk, custom app metrics   |

A pod can be running and ready but consuming 95% of its CPU limit. The watch-loop won't see that. Alertmanager's `KubeCPUThrottling` rule will. Conversely, Alertmanager's alert evaluation interval (often 1-5 minutes) means it may miss a crash-and-recover cycle that the watch-loop would catch.

Running both gives better coverage without requiring either source to do more than it's designed for.

***

## Alertmanager Webhook Payload Format

When Alertmanager fires an alert group to a webhook receiver, it sends a JSON POST with this structure:

```json
{
  "receiver": "simple-ai-agent",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "KubePodCrashLooping",
        "namespace": "production",
        "pod": "api-server-abc123-7f9b",
        "container": "api-server",
        "severity": "critical"
      },
      "annotations": {
        "summary": "Pod is crash looping",
        "description": "Pod production/api-server-abc123-7f9b has restarted more than 2 times in the last 15 minutes."
      },
      "startsAt": "2024-01-15T03:22:00.000Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus:9090/graph?g0.expr=...",
      "fingerprint": "a3b4c5d6e7f8a1b2"
    }
  ],
  "groupLabels": {
    "namespace": "production"
  },
  "commonLabels": {
    "severity": "critical"
  },
  "version": "4",
  "externalURL": "http://alertmanager:9093"
}
```

Key fields:

* `alerts[].labels.alertname` — the Prometheus alert rule name
* `alerts[].labels.namespace` / `pod` / `node` — resource identifiers
* `alerts[].labels.severity` — from the Prometheus rule's `labels` block
* `alerts[].status` — `"firing"` or `"resolved"`
* `alerts[].fingerprint` — Alertmanager's dedup identifier

***

## Translating Alerts to ClusterEvents

The translation layer maps Alertmanager label structures to `ClusterEvent` objects. Each Prometheus alert name maps to an `EventType`:

```python
# src/api/webhooks.py
from src.monitoring.watchloop import ClusterEvent, EventType, Severity
from datetime import datetime

# Mapping from Prometheus alertname to EventType
ALERT_TO_EVENT_TYPE: dict[str, str] = {
    "KubePodCrashLooping":             "crash_loop",
    "KubeContainerWaiting":            "crash_loop",
    "KubeOOMKillingSoon":              "oom_killed",
    "KubeNodeNotReady":                "not_ready_node",
    "KubeDeploymentReplicasMismatch":  "replication_failure",
    "KubePodNotReady":                 "crash_loop",  # Conservative mapping
    "KubeCPUThrottling":               "resource_pressure",
    "KubePersistentVolumeFillingUp":   "storage_pressure",
}

SEVERITY_MAP: dict[str, str] = {
    "critical": "critical",
    "warning":  "high",
    "info":     "low",
}

def translate_alert(alert: dict) -> ClusterEvent | None:
    labels      = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    alertname   = labels.get("alertname", "")
    
    event_type = ALERT_TO_EVENT_TYPE.get(alertname)
    if not event_type:
        # Unknown alert — log and skip translation
        return None
    
    namespace     = labels.get("namespace", "default")
    resource_name = (
        labels.get("pod") or
        labels.get("deployment") or
        labels.get("node") or
        labels.get("alertname")  # Fallback
    )
    
    severity = SEVERITY_MAP.get(labels.get("severity", "warning"), "high")
    
    return ClusterEvent(
        event_type=event_type,
        namespace=namespace,
        resource_name=resource_name,
        severity=severity,
        source="alertmanager",
        detected_at=datetime.utcnow(),
        metadata={
            "alertname":    alertname,
            "fingerprint":  alert.get("fingerprint"),
            "description":  annotations.get("description", ""),
            "generator_url": alert.get("generatorURL", ""),
            # Carry over raw labels for the RCA engine
            "labels":       labels,
        }
    )
```
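Two details of the mapping are easy to miss: the resource-name lookup walks a fallback chain (a node alert carries no `pod` label), and unknown severities default to `"high"` because a missing label falls back to `"warning"` first. A standalone sketch of just those two rules, using plain dicts with no project imports:

```python
# Standalone sketch of translate_alert's label-mapping rules.
SEVERITY_MAP = {"critical": "critical", "warning": "high", "info": "low"}

def resolve_resource_name(labels: dict) -> str:
    # Fallback chain: pod -> deployment -> node -> alertname
    return (
        labels.get("pod")
        or labels.get("deployment")
        or labels.get("node")
        or labels.get("alertname", "unknown")
    )

def resolve_severity(labels: dict) -> str:
    # Missing severity label -> "warning" -> mapped to "high",
    # mirroring SEVERITY_MAP.get(labels.get("severity", "warning"), "high")
    return SEVERITY_MAP.get(labels.get("severity", "warning"), "high")

node_alert = {"alertname": "KubeNodeNotReady", "node": "worker-2", "severity": "critical"}
print(resolve_resource_name(node_alert))  # worker-2 (no pod/deployment label)
print(resolve_severity({}))               # high (default path)
```

The fallback order matters: a `KubeDeploymentReplicasMismatch` alert has a `deployment` label but no `pod` label, so it resolves to the deployment name rather than falling through to the alertname.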

### The `source` Field

`ClusterEvent` includes a `source` field set to `"alertmanager"` for webhook-originating events and `"watchloop"` for poll-originating events. This surfaces in logs, metrics, and the RCA context — it matters because the evidence collection strategy differs slightly. For Alertmanager events, the `generatorURL` points to the exact Prometheus query that fired, which is useful context for the RCA prompt.

***

## Webhook Receiver Implementation

```python
# src/api/webhooks.py
from fastapi import APIRouter, Request, HTTPException, BackgroundTasks
from fastapi.responses import JSONResponse

router = APIRouter(prefix="/api/alert")

@router.post("/webhook")
async def alertmanager_webhook(
    request: Request,
    background_tasks: BackgroundTasks,
):
    # Validate basic structure
    try:
        body = await request.json()
    except Exception:
        raise HTTPException(status_code=400, detail="Invalid JSON body")
    
    alerts = body.get("alerts", [])
    if not alerts:
        return JSONResponse({"status": "ok", "processed": 0})
    
    processed = 0
    for alert in alerts:
        if alert.get("status") == "resolved":
            event = translate_resolved(alert)
            if event:
                background_tasks.add_task(handle_resolved, event, request.app.state)
            continue
        
        event = translate_alert(alert)
        if event is None:
            continue
        
        # Route through the same rule engine as watch-loop events
        background_tasks.add_task(
            request.app.state.rule_engine.process,
            event,
            request.app.state.redis,
        )
        processed += 1
    
    return JSONResponse({"status": "ok", "processed": processed})
```

Returning quickly and processing in the background is essential. Alertmanager has a webhook send timeout (default 30 seconds) and will retry if the receiver times out. A slow RCA call must not block the acknowledgement.
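The firing/resolved split inside the loop can be factored into a small pure function, which makes the fast-ack path easy to unit-test without a running Alertmanager. The function name below is mine, not from the project:

```python
def partition_alerts(body: dict) -> tuple[list[dict], list[dict]]:
    """Split an Alertmanager webhook payload into (firing, resolved) lists.

    Anything that is not explicitly resolved is treated as firing,
    matching the handler's behavior above.
    """
    firing: list[dict] = []
    resolved: list[dict] = []
    for alert in body.get("alerts", []):
        if alert.get("status") == "resolved":
            resolved.append(alert)
        else:
            firing.append(alert)
    return firing, resolved

payload = {"alerts": [{"status": "firing"}, {"status": "resolved"}]}
firing, resolved = partition_alerts(payload)
print(len(firing), len(resolved))  # 1 1
```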

***

## Alertmanager Configuration

Alertmanager routes fired alerts to the agent webhook. All alerts from the Kubernetes namespace go to the `simple-ai-agent` receiver:

```yaml
# config/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: "default-receiver"
  group_by: ["alertname", "namespace"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  
  routes:
    # Send all K8s alerts to the AI agent
    - matchers:
        - namespace =~ ".+"
      receiver: "simple-ai-agent"
      group_wait: 15s
      group_interval: 2m
      repeat_interval: 1h
      continue: false  # Don't pass to further routes after matching

receivers:
  - name: "default-receiver"
    # No-op fallback

  - name: "simple-ai-agent"
    webhook_configs:
      - url: "http://simple-ai-agent:8000/api/alert/webhook"
        send_resolved: true
        max_alerts: 20   # Limit batch size per call
        timeout: 10s
        http_config:
          follow_redirects: false
```

### Group Wait and Group Interval

`group_wait: 15s` means Alertmanager waits 15 seconds after the first alert in a group before sending. This gives other alerts that would be grouped together time to accumulate. Without grouping, a node failure that triggers 10 pod alerts would generate 10 separate webhook calls.

`repeat_interval: 1h` means Alertmanager won't re-send the same firing alert to the webhook more than once per hour. The rule engine cooldown handles per-event deduplication, but this is a second layer of protection at the Alertmanager level.

***

## Deduplication Across Sources

The same underlying problem can create events from both the watch-loop and Alertmanager simultaneously. Consider a pod stuck in `CrashLoopBackOff`:

1. The watch-loop detects it via direct pod status check
2. Alertmanager fires `KubePodCrashLooping` 5 minutes later

Both events would trigger the rule engine unless deduplicated.

The dedup key accounts for source:

```python
# src/monitoring/watchloop.py (dedup logic)
async def is_duplicate(event: ClusterEvent, redis: Redis) -> bool:
    # Deduplicate across both sources within a 5-minute window
    dedup_key = f"event_seen:{event.event_type}:{event.namespace}:{event.resource_name}"
    exists = await redis.exists(dedup_key)
    return bool(exists)

async def record_event(event: ClusterEvent, redis: Redis) -> None:
    dedup_key = f"event_seen:{event.event_type}:{event.namespace}:{event.resource_name}"
    await redis.setex(dedup_key, 300, event.source)  # 5-minute dedup window
```

The dedup key is intentionally source-agnostic: both `watchloop` and `alertmanager` events for the same pod get the same key. The first one to arrive processes; the second one is silently dropped within the 5-minute window. This also means the cooldown set by the rule engine applies to both sources equally.
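The window behavior can be exercised with an in-memory stand-in for Redis. The `FakeRedis` class below is a test double I wrote for illustration, not part of the project:

```python
import asyncio

class FakeRedis:
    """Minimal async stand-in implementing just exists/setex."""
    def __init__(self):
        self.store: dict[str, str] = {}

    async def exists(self, key: str) -> int:
        return 1 if key in self.store else 0

    async def setex(self, key: str, ttl: int, value: str) -> None:
        self.store[key] = value  # TTL ignored in this toy double

def dedup_key(event_type: str, namespace: str, name: str) -> str:
    # Source-agnostic key: watchloop and alertmanager events
    # for the same resource collide on purpose
    return f"event_seen:{event_type}:{namespace}:{name}"

async def demo() -> tuple[int, int]:
    redis = FakeRedis()
    key = dedup_key("crash_loop", "production", "api-server-abc123-7f9b")
    first = await redis.exists(key)           # 0: not seen yet, processes
    await redis.setex(key, 300, "watchloop")  # watch-loop records it first
    second = await redis.exists(key)          # 1: alertmanager copy is dropped
    return first, second

print(asyncio.run(demo()))  # (0, 1)
```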

***

## Handling Resolved Alerts

When Alertmanager sends a `"status": "resolved"` notification, the underlying condition has cleared. The agent should:

1. Mark any active cooldowns as expired so the event can trigger again
2. Post a resolution notification to the channel
3. Clean up any pending approvals for that resource

```python
# src/api/webhooks.py (`log` is the module's structured logger)
async def handle_resolved(event: ClusterEvent, state: AppState) -> None:
    log.info("alert.resolved", 
             resource=event.resource_name, 
             namespace=event.namespace,
             event_type=event.event_type)
    
    # Remove dedup key so the event can trigger again if it re-occurs
    dedup_key = f"event_seen:{event.event_type}:{event.namespace}:{event.resource_name}"
    await state.redis.delete(dedup_key)
    
    # Notify resolution
    await state.notifier.send_resolved(
        namespace=event.namespace,
        resource=event.resource_name,
        event_type=event.event_type,
    )
    
    # Log the resolution for the RCA engine to reference
    resolved_key = f"resolved:{event.namespace}:{event.resource_name}"
    await state.redis.setex(resolved_key, 3600, event.metadata.get("fingerprint", ""))
```

***

## Real Alert Rules from the Project

These Prometheus alert rules in `config/alert_rules.yml` feed into the Alertmanager → agent pipeline:

```yaml
# config/alert_rules.yml (Prometheus rules)
groups:
  - name: kubernetes.rules
    interval: 1m
    rules:

      - alert: KubePodCrashLooping
        expr: |
          increase(kube_pod_container_status_restarts_total[15m]) > 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod has restarted {{ $value }} times in the last 15 minutes."

      - alert: KubePodNotReady
        expr: |
          kube_pod_status_phase{phase=~"Pending|Unknown", namespace!~"kube-system|kube-public"} > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready for 10+ minutes"

      - alert: KubeNodeNotReady
        expr: |
          kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} not ready"

      - alert: KubePersistentVolumeFillingUp
        expr: |
          (
            kubelet_volume_stats_available_bytes /
            kubelet_volume_stats_capacity_bytes
          ) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} has only {{ $value | humanize }}% space available"
```
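Note that `$value` in the PVC rule is the *percent available*, not percent full, because the expression divides available bytes by capacity. The arithmetic is easy to sanity-check in plain Python (illustration only, not project code):

```python
def pvc_percent_available(available_bytes: int, capacity_bytes: int) -> float:
    # Mirrors: (kubelet_volume_stats_available_bytes /
    #           kubelet_volume_stats_capacity_bytes) * 100
    return available_bytes / capacity_bytes * 100

def pvc_filling_up(available_bytes: int, capacity_bytes: int) -> bool:
    # The alert fires when less than 15% of the volume remains free
    return pvc_percent_available(available_bytes, capacity_bytes) < 15

# 10 GiB volume with 1 GiB free: 10% available -> alert fires
print(pvc_filling_up(1 * 2**30, 10 * 2**30))  # True
```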

***

## What I Learned

**Alertmanager grouping requires `group_by` to be correct.** My first configuration grouped only by `alertname`. A node failure that caused 10 pods to become NotReady sent one webhook group with 10 alerts. The rule engine processed all 10 — creating 10 approval requests, 10 RCA calls, 10 notifications. Setting `group_by: ["alertname", "namespace"]` reduced this to 1–2 groups. Better, but still not perfect. For node failures, `group_by: ["instance"]` or `group_by: ["node"]` is the right choice.

**`send_resolved: true` is critical.** Without it, I had no way to know when a problem resolved. Cooldowns would expire and the next recurrence would look like a new event, which is correct — but I lost the signal that "this alert resolved cleanly between occurrences" which is useful for distinguishing intermittent vs. persistent failures.

**The `max_alerts: 20` limit prevents webhook flooding during alert storms.** When I first set up the cluster without this, a single node failure produced a webhook call with 47 alerts in one batch. The agent processed all 47 sequentially, took 8 minutes, and flooded the Telegram channel with notifications. Batching at 20 with Alertmanager's repeat logic is a better tradeoff.

**The `generatorURL` is genuinely useful in the RCA prompt.** Including `"Prometheus query that fired: <url>"` in the evidence gives the LLM context that the diagnosis is metric-based, not just log-based. It also gives the on-call engineer a direct link to the Prometheus graph — something I embedded in the notification format.

***

**Next**: [Article 8 — Observability: Making the Agent Itself Observable](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-observability)
