Article 4: The Rule Engine — Turning Events into Actionable Alerts

Introduction

The watch-loop produces ClusterEvent objects. The rule engine decides what to do with them.

Without a rule engine, every ClusterEvent would trigger the same response — start RCA, run playbooks, notify the user. That's wrong. A pod in CrashLoopBackOff in the development namespace at 3 PM on a Wednesday is not the same as one in production at 2 AM. A node that's been NotReady for 2 minutes may be rebooting intentionally. The rule engine is where severity classification, routing decisions, and cooldown logic live.

This article covers src/aiops/rule_engine.py and config/alert_rules.yml from simple-ai-agent.

Design Goals

When I designed the rule engine, I had three requirements:

  1. YAML configuration, not code. Alert rules should be editable without touching Python. Operations changes (adjust severity, add a new pattern, change a threshold) should not require a code deploy.

  2. Rules are additive and ordered. The first matching rule wins. If no rule matches, the default severity from the ClusterEvent is used. This makes it easy to add namespace-specific overrides above general rules.

  3. The rule engine is stateless between rules. It processes one ClusterEvent at a time. State (cooldowns, seen events) is managed at the watch-loop level and in Redis, not inside rule evaluation.


YAML Rule Format

Rules live in config/alert_rules.yml. Here's the schema:

Rule Fields

| Field | Type | Description |
| --- | --- | --- |
| name | string | Unique identifier, used in logs and metrics |
| event_type | string | Must match ClusterEvent.event_type |
| conditions | object | Optional filters (namespace pattern, thresholds) |
| severity | string | Overrides the severity from ClusterEvent if set |
| actions | list | What to do when this rule matches |
| cooldown_seconds | int | Minimum interval before this rule fires again for the same resource |

Available Actions

| Action | Behavior |
| --- | --- |
| notify | Send a message to the configured notification channel |
| rca | Trigger the RCA engine and attach the report to the notification |
| playbook:&lt;name&gt; | Execute the named playbook from config/playbooks.yml |
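
Put together, a rule with these fields might look like the following. The values here are illustrative examples of the schema, not entries from the project's actual config:

```yaml
# Illustrative rule — values are examples of the schema, not real config.
- name: crash_loop_production
  event_type: CrashLoopBackOff
  conditions:
    namespace: production
  severity: critical
  actions:
    - notify
    - rca
  cooldown_seconds: 600
```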


Rule Matching Logic

The rule engine evaluates rules in order. The first rule where event_type matches AND all conditions pass is the winning rule.


Severity Mapping and Routing

After matching, the rule engine constructs an AlertContext that the rest of the pipeline consumes:
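
A minimal sketch of what such a context object might contain — the class and field names here are assumptions, not the project's exact definition:

```python
from dataclasses import dataclass, field

@dataclass
class AlertContext:
    """Everything downstream stages need, resolved from rule + event.
    (Sketch — field names are assumptions, not the project's code.)"""
    rule_name: str
    severity: str
    namespace: str
    resource_name: str
    actions: list[str] = field(default_factory=list)

    @property
    def should_run_rca(self) -> bool:
        # RCA runs only when the matched rule requested it.
        return "rca" in self.actions

    @property
    def playbooks(self) -> list[str]:
        # "playbook:restart_pod" -> "restart_pod"
        return [a.split(":", 1)[1] for a in self.actions if a.startswith("playbook:")]
```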

This separation is deliberate: the rule declares what to do, and the AlertContext makes those decisions explicit and testable for the rest of the pipeline.

Severity → Notification Priority Mapping

| Severity | Notification behavior |
| --- | --- |
| critical | Immediate message, includes RCA if rca action is set |
| high | Immediate message |
| medium | Message sent, lower visual priority |
| low | Message sent if notify action present, no RCA |
| info | Logged only, no message |

Cooldown Windows and Deduplication

A pod stuck in CrashLoopBackOff for an hour would fire a matching rule every 30 seconds without cooldowns. That's 120 duplicate notifications per hour — worse than no automation at all.

Cooldowns are tracked in Redis:
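
One natural shape for this is a single atomic SET NX EX call — a sketch under the assumption of a redis-py-style client; the key prefix and helper name are mine, not the project's exact code:

```python
# `client` is any redis.Redis-compatible object; key prefix and helper
# name are assumptions, not the project's exact code.
def cooldown_active(client, rule_name: str, namespace: str,
                    resource_name: str, cooldown_seconds: int) -> bool:
    """Atomic check-and-set via SET NX EX.

    SET with nx=True succeeds only if the key does not already exist,
    and ex= gives it a TTL so the cooldown expires on its own. Returns
    True if a cooldown is already active (suppress the alert), False if
    the alert may fire (and the cooldown window has just started).
    """
    key = f"cooldown:{rule_name}:{namespace}:{resource_name}"
    return not client.set(key, "1", nx=True, ex=cooldown_seconds)
```

Doing the check and the set in one command avoids a race where two events for the same resource both see "no cooldown" and both fire.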

The cooldown key is namespaced by rule.name + namespace + resource_name. This means:

  • The same pod can trigger different rules with independent cooldowns

  • A CrashLoopBackOff in production and one in staging have independent cooldowns

  • Changing the rule name resets all cooldowns for that rule (intentional — useful after rule fixes)

Process Flow


Rule Engine Implementation

Putting it together, the full processing pipeline looks like this:


Real Rules from the Project

These are the actual rules I run in my homelab cluster:
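
(The listing below is a reconstruction in the same spirit, built from the thresholds mentioned later in this article — min_restarts: 10 and cooldown_seconds: 3600 for development, and a production-specific rule placed first — not a verbatim copy of the project's config; the playbook name is an assumption.)

```yaml
# Reconstruction, not a verbatim copy of config/alert_rules.yml.
# Specific rules must come before general ones: first match wins.
- name: crash_loop_production
  event_type: CrashLoopBackOff
  conditions:
    namespace: production
  severity: critical
  actions:
    - notify
    - rca
    - playbook:restart_pod   # playbook name is an assumption
  cooldown_seconds: 600

- name: crash_loop_development
  event_type: CrashLoopBackOff
  conditions:
    namespace: development
    min_restarts: 10
  severity: low
  actions:
    - notify
  cooldown_seconds: 3600
```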


What I Learned

Rule order matters more than you think. I had a general crash_loop rule before a production-only crash loop rule, which meant the general rule always matched first and the production-specific cooldown and playbook never triggered. Moving specific rules before general rules is not optional.

Namespace patterns prevent alert storms during development. When I'm iterating on a service in development, pods crash constantly. Without a namespace filter or a significantly longer cooldown for non-production namespaces, my phone was buzzing every 30 seconds. Adding min_restarts: 10 and cooldown_seconds: 3600 for the development namespace made development usable again.

Cooldown keys in Redis need to be inspectable. When I suspected a rule wasn't firing, the first thing I needed was to know whether there was an active cooldown. Adding a command to list all active cooldowns was valuable:

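A sketch of such a listing, assuming a redis-py-style client and the cooldown:* key prefix used above (both assumptions, not the project's exact code):

```python
def list_active_cooldowns(client) -> list[tuple[str, int]]:
    """List (key, remaining TTL seconds) for all active cooldowns,
    longest-remaining first. `client` is any redis.Redis-compatible
    object; the "cooldown:*" prefix is an assumption."""
    results = []
    # scan_iter avoids blocking Redis the way KEYS would on large keyspaces.
    for key in client.scan_iter(match="cooldown:*"):
        ttl = client.ttl(key)
        if ttl > 0:  # skip keys that expired between scan and ttl
            name = key.decode() if isinstance(key, bytes) else key
            results.append((name, ttl))
    return sorted(results, key=lambda kv: kv[1], reverse=True)
```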
The rca action is expensive. It makes an LLM API call. I don't add it to every rule — only critical and high severity where the information is worth the ~10-second wait and the API cost. Low-severity rules get notify only.

Load YAML rules at startup, not on every event. My first implementation reloaded the YAML file on every event to support hot-reload. The file I/O and YAML parsing was measurable overhead. Now I reload only on SIGHUP or startup, which is fine for a homelab use case.


Next: Article 5 — Playbooks and Human-in-the-Loop Approvals
