Article 7: Alertmanager Webhook Integration
Introduction
The watch-loop is pull-based: the agent polls the Kubernetes API on a fixed interval. But Prometheus and Alertmanager have already built up years of alert rules that fire on real signals: CPU throttling, out-of-disk, response latency, deployment age, certificate expiry. Ignoring those signals and reinventing them in the watch-loop would be duplicate work.
The Alertmanager webhook integration is the bridge: Alertmanager pushes fired alerts to the agent as HTTP POST requests, and the agent translates them into ClusterEvent objects that flow through the same rule engine, playbook system, and RCA pipeline as watch-loop events.
This article covers src/api/webhooks.py and config/alertmanager.yml from simple-ai-agent.
Why Both Sources Coexist
The watch-loop and the Alertmanager webhook detect different things:
Watch-loop
Pod-level state (CrashLoopBackOff, OOMKilled, NotReady) detected by direct API polling
Alertmanager
Any metric-based threshold: CPU, memory, network, latency, disk, custom app metrics
A pod can be running and ready but consuming 95% of its CPU limit. The watch-loop won't see that. Alertmanager's KubePodCPUThrottling rule will. Conversely, Alertmanager's alert evaluation interval (often 1-5 minutes) means it may miss a crash-and-recover cycle that the watch-loop would catch.
Running both gives better coverage without requiring either source to do more than it's designed for.
Alertmanager Webhook Payload Format
When Alertmanager fires an alert group to a webhook receiver, it sends a JSON POST with this structure:
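A representative version-4 webhook payload looks like this (the values here are illustrative, not taken from the project):

```json
{
  "version": "4",
  "groupKey": "{}:{alertname=\"KubePodCrashLooping\"}",
  "status": "firing",
  "receiver": "simple-ai-agent",
  "groupLabels": {"alertname": "KubePodCrashLooping"},
  "commonLabels": {"alertname": "KubePodCrashLooping", "namespace": "default", "severity": "critical"},
  "commonAnnotations": {"summary": "Pod is crash looping"},
  "externalURL": "http://alertmanager:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "KubePodCrashLooping", "namespace": "default", "pod": "api-7f9c", "severity": "critical"},
      "annotations": {"summary": "Pod default/api-7f9c is restarting repeatedly"},
      "startsAt": "2024-01-01T12:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus:9090/graph?g0.expr=...",
      "fingerprint": "c2f9a1d4e8b3a6f1"
    }
  ]
}
```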
Key fields:
alerts[].labels.alertname: the Prometheus alert rule name
alerts[].labels.namespace / pod / node: resource identifiers
alerts[].labels.severity: set in the Prometheus rule's labels
alerts[].status: "firing" or "resolved"
alerts[].fingerprint: Alertmanager's dedup identifier
Translating Alerts to ClusterEvents
The translation layer maps Alertmanager label structures to ClusterEvent objects. Each Prometheus alert name maps to an EventType:
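A minimal sketch of that translation, assuming hypothetical EventType values and a dict-shaped ClusterEvent (the actual names in src/api/webhooks.py may differ):

```python
from enum import Enum

class EventType(Enum):
    CRASH_LOOP = "crash_loop"
    OOM_KILLED = "oom_killed"
    CPU_THROTTLING = "cpu_throttling"
    DISK_PRESSURE = "disk_pressure"
    UNKNOWN = "unknown"

# Hypothetical mapping from Prometheus alert names to internal event types.
ALERT_TO_EVENT_TYPE = {
    "KubePodCrashLooping": EventType.CRASH_LOOP,
    "KubePodOOMKilled": EventType.OOM_KILLED,
    "KubePodCPUThrottling": EventType.CPU_THROTTLING,
    "KubeNodeDiskPressure": EventType.DISK_PRESSURE,
}

def translate_alert(alert: dict) -> dict:
    """Map one Alertmanager alert to ClusterEvent-style fields."""
    labels = alert.get("labels", {})
    return {
        "event_type": ALERT_TO_EVENT_TYPE.get(labels.get("alertname"), EventType.UNKNOWN),
        "namespace": labels.get("namespace"),
        "pod": labels.get("pod"),
        "severity": labels.get("severity", "warning"),
        "source": "alertmanager",
        "generator_url": alert.get("generatorURL"),
    }
```

Unrecognized alert names fall through to UNKNOWN rather than being dropped, so new Prometheus rules still surface even before a mapping exists.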
The source Field
ClusterEvent includes a source field set to "alertmanager" for webhook-originating events and "watchloop" for poll-originating events. This surfaces in logs, metrics, and the RCA context. It matters because the evidence collection strategy differs slightly: for Alertmanager events, the generatorURL points to the exact Prometheus query that fired, which is useful context for the RCA prompt.
Webhook Receiver Implementation
Returning quickly and processing in the background is essential. Alertmanager has a webhook send timeout (default 30 seconds) and will retry if the receiver times out. A slow RCA call must not block the acknowledgement.
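A stdlib-only sketch of the ack-then-process pattern (the real receiver in src/api/webhooks.py may use an async framework; the queue, port, and handler names here are assumptions):

```python
import json
import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

event_queue: "queue.Queue[dict]" = queue.Queue()

def process_alert(alert: dict) -> None:
    ...  # translate to ClusterEvent, run rule engine, RCA, notifications

def worker() -> None:
    """Background worker: the slow RCA and rule-engine work happens here."""
    while True:
        payload = event_queue.get()
        for alert in payload.get("alerts", []):
            process_alert(alert)  # may take minutes; never blocks the ack
        event_queue.task_done()

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        event_queue.put(json.loads(body))  # enqueue only: O(1)
        self.send_response(200)            # ack immediately, well under the timeout
        self.end_headers()

threading.Thread(target=worker, daemon=True).start()
# HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```

The handler does nothing but parse, enqueue, and return 200; everything slow lives on the worker thread, so Alertmanager never sees a timeout and never retries a payload the agent already accepted.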
Alertmanager Configuration
Alertmanager routes fired alerts to the agent webhook. All alerts from the Kubernetes namespace go to the simple-ai-agent receiver:
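A config sketch in the shape of config/alertmanager.yml, using the grouping and repeat values discussed below (the webhook URL and group_interval are assumptions):

```yaml
route:
  receiver: simple-ai-agent
  group_by: ["alertname", "namespace"]
  group_wait: 15s
  group_interval: 1m      # assumed value, not stated in the article
  repeat_interval: 1h

receivers:
  - name: simple-ai-agent
    webhook_configs:
      - url: http://simple-ai-agent:8080/webhooks/alertmanager  # illustrative endpoint
        send_resolved: true
        max_alerts: 20
```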
Group Wait and Group Interval
group_wait: 15s means Alertmanager waits 15 seconds after the first alert in a group before sending. This gives other alerts that would be grouped together time to accumulate. Without grouping, a node failure that triggers 10 pod alerts would generate 10 separate webhook calls.
repeat_interval: 1h means Alertmanager won't re-send the same firing alert to the webhook more than once per hour. The rule engine cooldown handles per-event deduplication, but this is a second layer of protection at the Alertmanager level.
Deduplication Across Sources
The same underlying problem can create events from both the watch-loop and Alertmanager simultaneously. For a pod in CrashLoopBackOff:
The watch-loop detects it via a direct pod status check
Alertmanager fires KubePodCrashLooping 5 minutes later
Both events would trigger the rule engine unless deduplicated.
The dedup key handles this:
The dedup key is intentionally source-agnostic: both watchloop and alertmanager events for the same pod get the same key. The first one to arrive processes; the second one is silently dropped within the 5-minute window. This also means the cooldown set by the rule engine applies to both sources equally.
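A sketch of a source-agnostic dedup check under those assumptions (the key format is illustrative; the 5-minute window matches the article):

```python
import time
from typing import Optional

DEDUP_WINDOW_SECONDS = 300  # the article's 5-minute window

_seen: dict = {}

def dedup_key(event: dict) -> str:
    # Deliberately excludes event["source"]: watchloop and alertmanager
    # events for the same problem collapse to the same key.
    return f'{event["namespace"]}/{event["pod"]}/{event["event_type"]}'

def should_process(event: dict, now: Optional[float] = None) -> bool:
    """The first event for a key within the window processes; repeats drop."""
    now = time.monotonic() if now is None else now
    key = dedup_key(event)
    last = _seen.get(key)
    if last is not None and now - last < DEDUP_WINDOW_SECONDS:
        return False
    _seen[key] = now
    return True
```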
Handling Resolved Alerts
When Alertmanager sends a "status": "resolved" notification, the problem is over. The agent should:
Mark any active cooldowns as expired (event is resolvable again)
Post a resolution notification to the channel
Clean up any pending approvals for that resource
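Those three steps might look like this (the container names, key format, and notify callable are all hypothetical):

```python
def handle_resolved(alert: dict, cooldowns: dict, pending_approvals: dict, notify) -> None:
    """Apply the three resolution steps for one resolved alert."""
    labels = alert.get("labels", {})
    key = f'{labels.get("namespace")}/{labels.get("pod")}/{labels.get("alertname")}'
    cooldowns.pop(key, None)          # 1. expire the cooldown for this resource
    notify(f"Resolved: {labels.get('alertname')} on {key}")  # 2. post resolution
    pending_approvals.pop(key, None)  # 3. drop any stale approval requests
```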
Real Alert Rules from the Project
These Prometheus alert rules in config/alert_rules.yml feed into the Alertmanager → agent pipeline:
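A representative rule in the style of config/alert_rules.yml (the exact expressions and thresholds in the project may differ):

```yaml
groups:
  - name: kubernetes-pods
    rules:
      - alert: KubePodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[5m]) * 60 * 5 > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
```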
What I Learned
Alertmanager grouping requires group_by to be correct. My first configuration grouped only by alertname. A node failure that caused 10 pods to become NotReady sent one webhook group with 10 alerts. The rule engine processed all 10 β creating 10 approval requests, 10 RCA calls, 10 notifications. Setting group_by: ["alertname", "namespace"] reduced this to 1β2 groups. Better, but still not perfect. For node failures, group_by: ["instance"] or group_by: ["node"] is the right choice.
send_resolved: true is critical. Without it, I had no way to know when a problem resolved. Cooldowns would expire and the next recurrence would look like a new event, which is correct, but I lost the signal that "this alert resolved cleanly between occurrences", which is useful for distinguishing intermittent from persistent failures.
The max_alerts: 20 limit prevents webhook flooding during alert storms. When I first set up the cluster without this, a single node failure produced a webhook call with 47 alerts in one batch. The agent processed all 47 sequentially, took 8 minutes, and flooded the Telegram channel with notifications. Batching at 20 with Alertmanager's repeat logic is a better tradeoff.
The generatorURL is genuinely useful in the RCA prompt. Including "Prometheus query that fired: <url>" in the evidence gives the LLM context that the diagnosis is metric-based, not just log-based. It also gives the on-call engineer a direct link to the Prometheus graph β something I embedded in the notification format.
Next: Article 8 β Observability: Making the Agent Itself Observable