Article 4: The Rule Engine — Turning Events into Actionable Alerts

Introduction

The watch-loop produces ClusterEvent objects. The rule engine decides what to do with them.

Without a rule engine, every ClusterEvent would trigger the same response — start RCA, run playbooks, notify the user. That's wrong. A pod in CrashLoopBackOff in the development namespace at 3 PM on a Wednesday is not the same as one in production at 2 AM. A node that's been NotReady for 2 minutes may be rebooting intentionally. The rule engine is where severity classification, routing decisions, and cooldown logic live.

This article covers src/aiops/rule_engine.py and config/alert_rules.yml from simple-ai-agent.

Design Goals

When I designed the rule engine, I had three requirements:

  1. YAML configuration, not code. Alert rules should be editable without touching Python. Operations changes (adjust severity, add a new pattern, change a threshold) should not require a code deploy.

  2. Rules are additive and ordered. The first matching rule wins. If no rule matches, the default severity from the ClusterEvent is used. This makes it easy to add namespace-specific overrides above general rules.

  3. The rule engine is stateless between rules. It processes one ClusterEvent at a time. State (cooldowns, seen events) is managed at the watch-loop level and in Redis, not inside rule evaluation.


YAML Rule Format

Rules live in config/alert_rules.yml. Here's the schema:

Rule Fields

| Field | Type | Description |
| --- | --- | --- |
| name | string | Unique identifier, used in logs and metrics |
| event_type | string | Must match ClusterEvent.event_type |
| conditions | object | Optional filters (namespace pattern, thresholds) |
| severity | string | Overrides the severity from ClusterEvent if set |
| actions | list | What to do when this rule matches |
| cooldown_seconds | int | Minimum interval before this rule fires again for the same resource |

Available Actions

| Action | Behavior |
| --- | --- |
| notify | Send a message to the configured notification channel |
| rca | Trigger the RCA engine and attach the report to the notification |
| playbook:&lt;name&gt; | Execute the named playbook from config/playbooks.yml |
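
Put together, a rule with these fields might look like the following. The values here are illustrative examples of the schema, not entries from the project's actual config:

```yaml
# Illustrative rule — values are examples of the schema, not real config.
- name: crash_loop_production
  event_type: CrashLoopBackOff
  conditions:
    namespace: production
  severity: critical
  actions:
    - notify
    - rca
  cooldown_seconds: 600
```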


Rule Matching Logic

The rule engine evaluates rules in order. The first rule where event_type matches AND all conditions pass is the winning rule.


Severity Mapping and Routing

After matching, the rule engine constructs an AlertContext that the rest of the pipeline consumes:
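
A minimal sketch of what such a context object might contain — the class and field names here are assumptions, not the project's exact definition:

```python
from dataclasses import dataclass, field

@dataclass
class AlertContext:
    """Everything downstream stages need, resolved from rule + event.
    (Sketch — field names are assumptions, not the project's code.)"""
    rule_name: str
    severity: str
    namespace: str
    resource_name: str
    actions: list[str] = field(default_factory=list)

    @property
    def should_run_rca(self) -> bool:
        # RCA runs only when the matched rule requested it.
        return "rca" in self.actions

    @property
    def playbooks(self) -> list[str]:
        # "playbook:restart_pod" -> "restart_pod"
        return [a.split(":", 1)[1] for a in self.actions if a.startswith("playbook:")]
```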

This separation is deliberate: the rule declares what to do, and the AlertContext makes those decisions explicit and testable for the rest of the pipeline.

Severity → Notification Priority Mapping

| Severity | Notification behavior |
| --- | --- |
| critical | Immediate message, includes RCA if rca action is set |
| high | Immediate message |
| medium | Message sent, lower visual priority |
| low | Message sent if notify action present, no RCA |
| info | Logged only, no message |

Cooldown Windows and Deduplication

A pod stuck in CrashLoopBackOff for an hour would fire a matching rule every 30 seconds without cooldowns. That's 120 duplicate notifications per hour — worse than no automation at all.

Cooldowns are tracked in Redis:
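
One natural shape for this is a single atomic SET NX EX call — a sketch under the assumption of a redis-py-style client; the key prefix and helper name are mine, not the project's exact code:

```python
# `client` is any redis.Redis-compatible object; key prefix and helper
# name are assumptions, not the project's exact code.
def cooldown_active(client, rule_name: str, namespace: str,
                    resource_name: str, cooldown_seconds: int) -> bool:
    """Atomic check-and-set via SET NX EX.

    SET with nx=True succeeds only if the key does not already exist,
    and ex= gives it a TTL so the cooldown expires on its own. Returns
    True if a cooldown is already active (suppress the alert), False if
    the alert may fire (and the cooldown window has just started).
    """
    key = f"cooldown:{rule_name}:{namespace}:{resource_name}"
    return not client.set(key, "1", nx=True, ex=cooldown_seconds)
```

Doing the check and the set in one command avoids a race where two events for the same resource both see "no cooldown" and both fire.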

The cooldown key is namespaced by rule.name + namespace + resource_name. This means:

  • The same pod can trigger different rules with independent cooldowns

  • A CrashLoopBackOff in production and one in staging have independent cooldowns

  • Changing the rule name resets all cooldowns for that rule (intentional — useful after rule fixes)

Process Flow


Rule Engine Implementation

Putting it together, the full processing pipeline looks like this:


Real Rules from the Project

These are the actual rules I run in my homelab cluster:
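
(The listing below is a reconstruction in the same spirit, built from the thresholds mentioned later in this article — min_restarts: 10 and cooldown_seconds: 3600 for development, and a production-specific rule placed first — not a verbatim copy of the project's config; the playbook name is an assumption.)

```yaml
# Reconstruction, not a verbatim copy of config/alert_rules.yml.
# Specific rules must come before general ones: first match wins.
- name: crash_loop_production
  event_type: CrashLoopBackOff
  conditions:
    namespace: production
  severity: critical
  actions:
    - notify
    - rca
    - playbook:restart_pod   # playbook name is an assumption
  cooldown_seconds: 600

- name: crash_loop_development
  event_type: CrashLoopBackOff
  conditions:
    namespace: development
    min_restarts: 10
  severity: low
  actions:
    - notify
  cooldown_seconds: 3600
```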


What I Learned

Rule order matters more than you think. I had a general crash_loop rule before a production-only crash loop rule, which meant the general rule always matched first and the production-specific cooldown and playbook never triggered. Moving specific rules before general rules is not optional.

Namespace patterns prevent alert storms during development. When I'm iterating on a service in development, pods crash constantly. Without a namespace filter or a significantly longer cooldown for non-production namespaces, my phone was buzzing every 30 seconds. Adding min_restarts: 10 and cooldown_seconds: 3600 for the development namespace made development usable again.

Cooldown keys in Redis need to be inspectable. When I suspected a rule wasn't firing, the first thing I needed was to know whether there was an active cooldown. Adding a command to list all active cooldowns was valuable:

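A sketch of such a listing, assuming a redis-py-style client and the cooldown:* key prefix used above (both assumptions, not the project's exact code):

```python
def list_active_cooldowns(client) -> list[tuple[str, int]]:
    """List (key, remaining TTL seconds) for all active cooldowns,
    longest-remaining first. `client` is any redis.Redis-compatible
    object; the "cooldown:*" prefix is an assumption."""
    results = []
    # scan_iter avoids blocking Redis the way KEYS would on large keyspaces.
    for key in client.scan_iter(match="cooldown:*"):
        ttl = client.ttl(key)
        if ttl > 0:  # skip keys that expired between scan and ttl
            name = key.decode() if isinstance(key, bytes) else key
            results.append((name, ttl))
    return sorted(results, key=lambda kv: kv[1], reverse=True)
```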
The rca action is expensive. It makes an LLM API call. I don't add it to every rule — only critical and high severity where the information is worth the ~10-second wait and the API cost. Low-severity rules get notify only.

Load YAML rules at startup, not on every event. My first implementation reloaded the YAML file on every event to support hot-reload. The file I/O and YAML parsing was measurable overhead. Now I reload only on SIGHUP or startup, which is fine for a homelab use case.


Next: Article 5 — Playbooks and Human-in-the-Loop Approvals
