Article 4: The Rule Engine — Turning Events into Actionable Alerts
Introduction
The watch-loop produces ClusterEvent objects. The rule engine decides what to do with them.
Without a rule engine, every ClusterEvent would trigger the same response — start RCA, run playbooks, notify the user. That's wrong. A pod in CrashLoopBackOff in the development namespace at 3 PM on a Wednesday is not the same as one in production at 2 AM. A node that's been NotReady for 2 minutes may be rebooting intentionally. The rule engine is where severity classification, routing decisions, and cooldown logic live.
This article covers src/aiops/rule_engine.py and config/alert_rules.yml from simple-ai-agent.
Design Goals
When I designed the rule engine, I had three requirements:
YAML configuration, not code. Alert rules should be editable without touching Python. Operations changes (adjust severity, add a new pattern, change a threshold) should not require a code deploy.
Rules are additive and ordered. The first matching rule wins. If no rule matches, the default severity from the ClusterEvent is used. This makes it easy to add namespace-specific overrides above general rules.
The rule engine is stateless between rules. It processes one ClusterEvent at a time. State (cooldowns, seen events) is managed at the watch-loop level and in Redis, not inside rule evaluation.
YAML Rule Format
Rules live in config/alert_rules.yml. Here's the schema:
Rule Fields

| Field | Type | Description |
|---|---|---|
| `name` | string | Unique identifier, used in logs and metrics |
| `event_type` | string | Must match `ClusterEvent.event_type` |
| `conditions` | object | Optional filters (namespace pattern, thresholds) |
| `severity` | string | Overrides the severity from the ClusterEvent if set |
| `actions` | list | What to do when this rule matches |
| `cooldown_seconds` | int | Minimum interval before this rule fires again for the same resource |
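A single rule using these fields might look like the following. This is a hedged illustration: the rule name, condition keys, and values here are hypothetical, not copied from the project's config.

```yaml
# Hypothetical example rule — names and thresholds are illustrative.
- name: crash_loop_production
  event_type: pod_crash_loop        # must match ClusterEvent.event_type
  conditions:
    namespace_pattern: "^production$"   # optional filter
    min_restarts: 5                     # optional threshold
  severity: critical                # overrides the event's default severity
  actions:
    - notify
    - rca
    - "playbook:restart_pod"
  cooldown_seconds: 900             # at most one alert per resource per 15 min
```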
Available Actions

| Action | Effect |
|---|---|
| `notify` | Send a message to the configured notification channel |
| `rca` | Trigger the RCA engine and attach the report to the notification |
| `playbook:<name>` | Execute the named playbook from `config/playbooks.yml` |
Rule Matching Logic
The rule engine evaluates rules in order. The first rule where event_type matches AND all conditions pass is the winning rule.
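A minimal sketch of this first-match evaluation, assuming the condition names from the schema above; this is not the project's exact implementation:

```python
import re
from typing import Optional

def match_rule(rules: list, event) -> Optional[dict]:
    """Return the first rule whose event_type and conditions all match."""
    for rule in rules:
        if rule["event_type"] != event.event_type:
            continue
        cond = rule.get("conditions", {})
        pattern = cond.get("namespace_pattern")
        if pattern and not re.match(pattern, event.namespace):
            continue
        min_restarts = cond.get("min_restarts")
        if min_restarts is not None and event.restart_count < min_restarts:
            continue
        return rule  # first match wins; later rules are never consulted
    return None  # caller falls back to the event's default severity
```

Because evaluation stops at the first match, a production-specific rule must appear above the general rule it overrides, which is exactly the ordering lesson discussed later in this article.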
Severity Mapping and Routing
After matching, the rule engine constructs an AlertContext that the rest of the pipeline consumes:
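The exact shape of AlertContext is not reproduced in this excerpt; a plausible sketch of such a structure, with field names that are my assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertContext:
    event: object             # the originating ClusterEvent
    rule_name: str            # which rule matched (or a default marker)
    severity: str             # final severity after any rule override
    actions: tuple = ()       # e.g. ("notify", "rca", "playbook:restart_pod")
    cooldown_seconds: int = 0
```

Freezing the dataclass keeps downstream consumers from mutating routing decisions after the rule engine has made them.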
This separation is deliberate. The rule says "what to do". The AlertContext makes those decisions explicit and testable.
Severity → Notification Priority Mapping

| Severity | Behavior |
|---|---|
| critical | Immediate message, includes RCA if `rca` action is set |
| high | Immediate message |
| medium | Message sent, lower visual priority |
| low | Message sent if `notify` action present, no RCA |
| info | Logged only, no message |
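The mapping above can be collapsed into a small decision helper. A sketch; the function name and return shape are my assumptions, not the project's API:

```python
def notification_plan(severity: str, actions: list) -> dict:
    """Map a severity plus the matched rule's actions to delivery decisions."""
    send = severity != "info" and (
        "notify" in actions or severity in ("critical", "high", "medium")
    )
    return {
        "send_message": send,
        "urgent": severity in ("critical", "high"),
        # RCA is expensive (an LLM call), so only run it for top severities.
        "run_rca": "rca" in actions and severity in ("critical", "high"),
    }
```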
Cooldown Windows and Deduplication
A pod stuck in CrashLoopBackOff for an hour would fire a matching rule every 30 seconds without cooldowns. That's 120 duplicate notifications per hour — worse than no automation at all.
Cooldowns are tracked in Redis:
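A minimal sketch of cooldown tracking using redis-py's atomic `SET` with `nx`/`ex`; the key layout follows the description in this section, but this is not the project's exact code:

```python
def try_acquire_cooldown(client, rule_name: str, namespace: str,
                         resource: str, ttl_seconds: int) -> bool:
    """Return True if the rule may fire (no active cooldown), else False.

    SET NX EX is atomic: the first caller creates the key and wins; later
    callers see the existing key and are suppressed until the TTL expires.
    `client` is any object exposing redis-py's `set` signature.
    """
    key = f"cooldown:{rule_name}:{namespace}:{resource}"
    return bool(client.set(key, "1", nx=True, ex=ttl_seconds))
```

Letting Redis expire the key means there is no cleanup job: a cooldown ends simply because the key is gone.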
The cooldown key is namespaced by rule.name + namespace + resource_name. This means:
- The same pod can trigger different rules with independent cooldowns
- A CrashLoopBackOff in production and one in staging have independent cooldowns
- Changing the rule name resets all cooldowns for that rule (intentional; useful after rule fixes)
Process Flow
Rule Engine Implementation
Putting it together, the full processing pipeline looks like this:
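As a rough self-contained sketch of that pipeline, using dict-based events and an in-memory stand-in for Redis (field names are assumptions, not the project's exact code):

```python
import re

def process_event(event: dict, rules: list, cooldown_store: dict) -> dict:
    """Sketch of the full pipeline: match -> cooldown check -> alert decision."""
    # 1. First-match rule lookup.
    matched = None
    for rule in rules:
        if rule["event_type"] != event["event_type"]:
            continue
        pattern = rule.get("conditions", {}).get("namespace_pattern")
        if pattern and not re.match(pattern, event["namespace"]):
            continue
        matched = rule
        break
    if matched is None:
        # No rule matched: fall back to the event's default severity.
        return {"action": "default", "severity": event["severity"]}

    # 2. Cooldown: suppress if this rule fired recently for this resource.
    key = (matched["name"], event["namespace"], event["resource"])
    if key in cooldown_store:
        return {"action": "suppressed", "rule": matched["name"]}
    cooldown_store[key] = matched.get("cooldown_seconds", 0)

    # 3. Build the decision the rest of the pipeline consumes.
    return {
        "action": "alert",
        "rule": matched["name"],
        "severity": matched.get("severity", event["severity"]),
        "actions": matched.get("actions", ["notify"]),
    }
```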
Real Rules from the Project
These are the actual rules I run in my homelab cluster:
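A representative set of rules, consistent with the namespaces and thresholds discussed in the lessons below, might look like this (illustrative values only, not the author's actual configuration):

```yaml
# Illustrative only — rule names, namespaces, and thresholds are assumptions.
# Specific rules come before general ones, because the first match wins.
- name: crash_loop_production
  event_type: pod_crash_loop
  conditions:
    namespace_pattern: "^production$"
  severity: critical
  actions: [notify, rca, "playbook:restart_pod"]
  cooldown_seconds: 900

- name: crash_loop_development
  event_type: pod_crash_loop
  conditions:
    namespace_pattern: "^development$"
    min_restarts: 10
  severity: low
  actions: [notify]
  cooldown_seconds: 3600

- name: crash_loop_general
  event_type: pod_crash_loop
  severity: high
  actions: [notify, rca]
  cooldown_seconds: 1800
```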
What I Learned
Rule order matters more than you think. I had a general crash_loop rule before a production-only crash loop rule, which meant the general rule always matched first and the production-specific cooldown and playbook never triggered. Moving specific rules before general rules is not optional.
Namespace patterns prevent alert storms during development. When I'm iterating on a service in development, pods crash constantly. Without a namespace filter or a significantly longer cooldown for non-production namespaces, my phone was buzzing every 30 seconds. Adding min_restarts: 10 and cooldown_seconds: 3600 for the development namespace made development usable again.
Cooldown keys in Redis need to be inspectable. When I suspected a rule wasn't firing, the first thing I needed was to know whether there was an active cooldown. Adding a command to list all active cooldowns was valuable:
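A sketch of such an inspection helper, using redis-py's `scan_iter` and `ttl`; the `cooldown:` key prefix is an assumption:

```python
def list_active_cooldowns(client, prefix: str = "cooldown:") -> dict:
    """Return {key: remaining_ttl_seconds} for every active cooldown key.

    Uses SCAN (redis-py's scan_iter) rather than KEYS so a large keyspace
    doesn't block Redis while you're debugging.
    """
    return {key: client.ttl(key) for key in client.scan_iter(match=prefix + "*")}
```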
The rca action is expensive. It makes an LLM API call. I don't add it to every rule — only critical and high severity where the information is worth the ~10-second wait and the API cost. Low-severity rules get notify only.
Load YAML rules at startup, not on every event. My first implementation reloaded the YAML file on every event to support hot-reload. The file I/O and YAML parsing added measurable overhead. Now I reload only on SIGHUP or at startup, which is fine for a homelab use case.