Article 7: Alertmanager Webhook Integration
Introduction
The watch-loop is pull-based: the agent polls the Kubernetes API on a fixed interval. But Prometheus and Alertmanager have already built up years of alert rules that fire on real signals: CPU throttling, out-of-disk, response latency, deployment age, certificate expiry. Ignoring those signals and reinventing them in the watch-loop would be duplicate work.
The Alertmanager webhook integration is the bridge: Alertmanager pushes fired alerts to the agent as HTTP POST requests, and the agent translates them into ClusterEvent objects that flow through the same rule engine, playbook system, and RCA pipeline as watch-loop events.
This article covers src/api/webhooks.py and config/alertmanager.yml from simple-ai-agent.
Why Both Sources Coexist
The watch-loop and the Alertmanager webhook detect different things:
Watch-loop
Pod-level state (CrashLoopBackOff, OOMKilled, NotReady) detected by direct API polling
Alertmanager
Any metric-based threshold: CPU, memory, network, latency, disk, custom app metrics
A pod can be running and ready but consuming 95% of its CPU limit. The watch-loop won't see that. Alertmanager's KubePodCPUThrottling rule will. Conversely, Alertmanager's alert evaluation interval (often 1-5 minutes) means it may miss a crash-and-recover cycle that the watch-loop would catch.
Running both gives better coverage without requiring either source to do more than it's designed for.
Alertmanager Webhook Payload Format
When Alertmanager fires an alert group to a webhook receiver, it sends a JSON POST with this structure:
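A representative version-4 webhook payload looks like this (the values here are illustrative, not taken from the project):

```json
{
  "version": "4",
  "groupKey": "{}:{alertname=\"KubePodCrashLooping\"}",
  "status": "firing",
  "receiver": "simple-ai-agent",
  "groupLabels": {"alertname": "KubePodCrashLooping"},
  "commonLabels": {"alertname": "KubePodCrashLooping", "namespace": "default", "severity": "critical"},
  "commonAnnotations": {"summary": "Pod is crash looping"},
  "externalURL": "http://alertmanager:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "KubePodCrashLooping", "namespace": "default", "pod": "api-7f9c", "severity": "critical"},
      "annotations": {"summary": "Pod default/api-7f9c is restarting repeatedly"},
      "startsAt": "2024-01-01T12:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus:9090/graph?g0.expr=...",
      "fingerprint": "c2f9a1d4e8b3a6f1"
    }
  ]
}
```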
Key fields:
alerts[].labels.alertname: the Prometheus alert rule name
alerts[].labels.namespace / pod / node: resource identifiers
alerts[].labels.severity: set in the Prometheus rule's labels
alerts[].status: "firing" or "resolved"
alerts[].fingerprint: Alertmanager's dedup identifier
Translating Alerts to ClusterEvents
The translation layer maps Alertmanager label structures to ClusterEvent objects. Each Prometheus alert name maps to an EventType:
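A minimal sketch of that translation, assuming hypothetical EventType values and a dict-shaped ClusterEvent (the actual names in src/api/webhooks.py may differ):

```python
from enum import Enum

class EventType(Enum):
    CRASH_LOOP = "crash_loop"
    OOM_KILLED = "oom_killed"
    CPU_THROTTLING = "cpu_throttling"
    DISK_PRESSURE = "disk_pressure"
    UNKNOWN = "unknown"

# Hypothetical mapping from Prometheus alert names to internal event types.
ALERT_TO_EVENT_TYPE = {
    "KubePodCrashLooping": EventType.CRASH_LOOP,
    "KubePodOOMKilled": EventType.OOM_KILLED,
    "KubePodCPUThrottling": EventType.CPU_THROTTLING,
    "KubeNodeDiskPressure": EventType.DISK_PRESSURE,
}

def translate_alert(alert: dict) -> dict:
    """Map one Alertmanager alert to ClusterEvent-style fields."""
    labels = alert.get("labels", {})
    return {
        "event_type": ALERT_TO_EVENT_TYPE.get(labels.get("alertname"), EventType.UNKNOWN),
        "namespace": labels.get("namespace"),
        "pod": labels.get("pod"),
        "severity": labels.get("severity", "warning"),
        "source": "alertmanager",
        "generator_url": alert.get("generatorURL"),
    }
```

Unrecognized alert names fall through to UNKNOWN rather than being dropped, so new Prometheus rules still surface even before a mapping exists.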
The source Field
ClusterEvent includes a source field set to "alertmanager" for webhook-originating events and "watchloop" for poll-originating events. This surfaces in logs, metrics, and the RCA context. It matters because the evidence collection strategy differs slightly: for Alertmanager events, the generatorURL points to the exact Prometheus query that fired, which is useful context for the RCA prompt.
Webhook Receiver Implementation
Returning quickly and processing in the background is essential. Alertmanager has a webhook send timeout (default 30 seconds) and will retry if the receiver times out. A slow RCA call must not block the acknowledgement.
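A stdlib-only sketch of the ack-then-process pattern (the real receiver in src/api/webhooks.py may use an async framework; the queue, port, and handler names here are assumptions):

```python
import json
import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

event_queue: "queue.Queue[dict]" = queue.Queue()

def process_alert(alert: dict) -> None:
    ...  # translate to ClusterEvent, run rule engine, RCA, notifications

def worker() -> None:
    """Background worker: the slow RCA and rule-engine work happens here."""
    while True:
        payload = event_queue.get()
        for alert in payload.get("alerts", []):
            process_alert(alert)  # may take minutes; never blocks the ack
        event_queue.task_done()

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        event_queue.put(json.loads(body))  # enqueue only: O(1)
        self.send_response(200)            # ack immediately, well under the timeout
        self.end_headers()

threading.Thread(target=worker, daemon=True).start()
# HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```

The handler does nothing but parse, enqueue, and return 200; everything slow lives on the worker thread, so Alertmanager never sees a timeout and never retries a payload the agent already accepted.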
Alertmanager Configuration
Alertmanager routes fired alerts to the agent webhook. All alerts from the Kubernetes namespace go to the simple-ai-agent receiver:
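A config sketch in the shape of config/alertmanager.yml, using the grouping and repeat values discussed below (the webhook URL and group_interval are assumptions):

```yaml
route:
  receiver: simple-ai-agent
  group_by: ["alertname", "namespace"]
  group_wait: 15s
  group_interval: 1m      # assumed value, not stated in the article
  repeat_interval: 1h

receivers:
  - name: simple-ai-agent
    webhook_configs:
      - url: http://simple-ai-agent:8080/webhooks/alertmanager  # illustrative endpoint
        send_resolved: true
        max_alerts: 20
```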
Group Wait and Group Interval
group_wait: 15s means Alertmanager waits 15 seconds after the first alert in a group before sending. This gives other alerts that would be grouped together time to accumulate. Without grouping, a node failure that triggers 10 pod alerts would generate 10 separate webhook calls.
repeat_interval: 1h means Alertmanager won't re-send the same firing alert to the webhook more than once per hour. The rule engine cooldown handles per-event deduplication, but this is a second layer of protection at the Alertmanager level.
Deduplication Across Sources
The same underlying problem can create events from both the watch-loop and Alertmanager simultaneously. For a pod in CrashLoopBackOff:
The watch-loop detects it via a direct pod status check
Alertmanager fires KubePodCrashLooping 5 minutes later
Both events would trigger the rule engine unless deduplicated.
The dedup key handles this:
The dedup key is intentionally source-agnostic: both watchloop and alertmanager events for the same pod get the same key. The first one to arrive processes; the second one is silently dropped within the 5-minute window. This also means the cooldown set by the rule engine applies to both sources equally.
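A sketch of a source-agnostic dedup check under those assumptions (the key format is illustrative; the 5-minute window matches the article):

```python
import time
from typing import Optional

DEDUP_WINDOW_SECONDS = 300  # the article's 5-minute window

_seen: dict = {}

def dedup_key(event: dict) -> str:
    # Deliberately excludes event["source"]: watchloop and alertmanager
    # events for the same problem collapse to the same key.
    return f'{event["namespace"]}/{event["pod"]}/{event["event_type"]}'

def should_process(event: dict, now: Optional[float] = None) -> bool:
    """The first event for a key within the window processes; repeats drop."""
    now = time.monotonic() if now is None else now
    key = dedup_key(event)
    last = _seen.get(key)
    if last is not None and now - last < DEDUP_WINDOW_SECONDS:
        return False
    _seen[key] = now
    return True
```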
Handling Resolved Alerts
When Alertmanager sends a "status": "resolved" notification, the problem is over. The agent should:
Mark any active cooldowns as expired (event is resolvable again)
Post a resolution notification to the channel
Clean up any pending approvals for that resource
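Those three steps might look like this (the container names, key format, and notify callable are all hypothetical):

```python
def handle_resolved(alert: dict, cooldowns: dict, pending_approvals: dict, notify) -> None:
    """Apply the three resolution steps for one resolved alert."""
    labels = alert.get("labels", {})
    key = f'{labels.get("namespace")}/{labels.get("pod")}/{labels.get("alertname")}'
    cooldowns.pop(key, None)          # 1. expire the cooldown for this resource
    notify(f"Resolved: {labels.get('alertname')} on {key}")  # 2. post resolution
    pending_approvals.pop(key, None)  # 3. drop any stale approval requests
```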
Real Alert Rules from the Project
These Prometheus alert rules in config/alert_rules.yml feed into the Alertmanager → agent pipeline:
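A representative rule in the style of config/alert_rules.yml (the exact expressions and thresholds in the project may differ):

```yaml
groups:
  - name: kubernetes-pods
    rules:
      - alert: KubePodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[5m]) * 60 * 5 > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
```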
What I Learned
Alertmanager grouping requires group_by to be correct. My first configuration grouped only by alertname. A node failure that caused 10 pods to become NotReady sent one webhook group with 10 alerts. The rule engine processed all 10 β creating 10 approval requests, 10 RCA calls, 10 notifications. Setting group_by: ["alertname", "namespace"] reduced this to 1β2 groups. Better, but still not perfect. For node failures, group_by: ["instance"] or group_by: ["node"] is the right choice.
send_resolved: true is critical. Without it, I had no way to know when a problem resolved. Cooldowns would expire and the next recurrence would look like a new event, which is correct, but I lost the signal that "this alert resolved cleanly between occurrences", which is useful for distinguishing intermittent from persistent failures.
The max_alerts: 20 limit prevents webhook flooding during alert storms. When I first set up the cluster without this, a single node failure produced a webhook call with 47 alerts in one batch. The agent processed all 47 sequentially, took 8 minutes, and flooded the Telegram channel with notifications. Batching at 20 with Alertmanager's repeat logic is a better tradeoff.
The generatorURL is genuinely useful in the RCA prompt. Including "Prometheus query that fired: <url>" in the evidence gives the LLM context that the diagnosis is metric-based, not just log-based. It also gives the on-call engineer a direct link to the Prometheus graph β something I embedded in the notification format.
Next: Article 8 β Observability: Making the Agent Itself Observable