# Article 4: The Rule Engine — Turning Events into Actionable Alerts

## Introduction

The watch-loop produces `ClusterEvent` objects. The rule engine decides what to do with them.

Without a rule engine, every `ClusterEvent` would trigger the same response — start RCA, run playbooks, notify the user. That's wrong. A pod in `CrashLoopBackOff` in the `development` namespace at 3 PM on a Wednesday is not the same as one in `production` at 2 AM. A node that's been `NotReady` for 2 minutes may be rebooting intentionally. The rule engine is where severity classification, routing decisions, and cooldown logic live.

This article covers `src/aiops/rule_engine.py` and `config/alert_rules.yml` from [simple-ai-agent](https://github.com/Htunn/simple-ai-agent).

## Table of Contents

1. [Design Goals for the Rule Engine](#design-goals)
2. [YAML Rule Format](#yaml-rule-format)
3. [Rule Matching Logic](#rule-matching-logic)
4. [Severity Mapping and Routing](#severity-mapping-and-routing)
5. [Cooldown Windows and Deduplication](#cooldown-windows-and-deduplication)
6. [Rule Engine Implementation](#rule-engine-implementation)
7. [Real Rules from the Project](#real-rules-from-the-project)
8. [What I Learned](#what-i-learned)

***

## Design Goals

When I designed the rule engine, I had three requirements:

1. **YAML configuration, not code.** Alert rules should be editable without touching Python. Operations changes (adjust severity, add a new pattern, change a threshold) should not require a code deploy.
2. **Rules are ordered; the first match wins.** Rules are evaluated top to bottom and the first matching rule is applied. An event that matches no rule is logged and dropped, and a matching rule that omits `severity` falls back to the event's own severity. This makes it easy to add namespace-specific overrides above general rules.
3. **The rule engine is stateless between rules.** It processes one `ClusterEvent` at a time. State (cooldowns, seen events) is managed at the watch-loop level and in Redis, not inside rule evaluation.

***

## YAML Rule Format

Rules live in `config/alert_rules.yml`. Here's the schema:

```yaml
# config/alert_rules.yml
rules:
  - name: "production-crash-loop-critical"
    event_type: "crash_loop"
    conditions:
      namespace_pattern: "^production$"
      min_restarts: 3
    severity: "critical"
    actions:
      - "notify"
      - "rca"
      - "playbook:restart-pod"
    cooldown_seconds: 300

  - name: "staging-crash-loop-high"
    event_type: "crash_loop"
    conditions:
      namespace_pattern: "^(staging|uat)$"
      min_restarts: 5
    severity: "high"
    actions:
      - "notify"
      - "rca"
    cooldown_seconds: 600

  - name: "development-crash-loop-low"
    event_type: "crash_loop"
    conditions:
      namespace_pattern: "^(dev|development|feature-.*)$"
    severity: "low"
    actions:
      - "notify"
    cooldown_seconds: 1800

  - name: "oom-killed-any"
    event_type: "oom_killed"
    conditions: {}   # matches any namespace
    severity: "critical"
    actions:
      - "notify"
      - "rca"
      - "playbook:increase-memory-limit"
    cooldown_seconds: 600

  - name: "node-not-ready"
    event_type: "not_ready_node"
    conditions:
      min_duration_seconds: 120
    severity: "critical"
    actions:
      - "notify"
      - "rca"
    cooldown_seconds: 300
```

### Rule Fields

| Field              | Type   | Description                                                         |
| ------------------ | ------ | ------------------------------------------------------------------- |
| `name`             | string | Unique identifier, used in logs and metrics                         |
| `event_type`       | string | Must match `ClusterEvent.event_type`                                |
| `conditions`       | object | Optional filters (namespace pattern, thresholds)                    |
| `severity`         | string | Overrides the severity from `ClusterEvent` if set                   |
| `actions`          | list   | What to do when this rule matches                                   |
| `cooldown_seconds` | int    | Minimum interval before this rule fires again for the same resource |

### Available Actions

| Action            | Behavior                                                         |
| ----------------- | ---------------------------------------------------------------- |
| `notify`          | Send a message to the configured notification channel            |
| `rca`             | Trigger the RCA engine and attach the report to the notification |
| `playbook:<name>` | Execute the named playbook from `config/playbooks.yml`           |

***

## Rule Matching Logic

The rule engine evaluates rules in order. The first rule where `event_type` matches AND all `conditions` pass is the winning rule.

```python
# src/aiops/rule_engine.py (simplified)
import re
import yaml
from pathlib import Path
from dataclasses import dataclass
from typing import Any

@dataclass
class Rule:
    name: str
    event_type: str
    conditions: dict[str, Any]
    severity: str
    actions: list[str]
    cooldown_seconds: int

class RuleEngine:
    def __init__(self, rules_path: Path):
        self.rules: list[Rule] = self._load_rules(rules_path)
    
    def _load_rules(self, path: Path) -> list[Rule]:
        raw = yaml.safe_load(path.read_text()) or {}  # an empty file parses to None
        return [Rule(**r) for r in raw.get("rules", [])]
    
    # `ClusterEvent` is the event dataclass produced by the watch-loop (Article 3)
    def match(self, event: ClusterEvent) -> Rule | None:
        for rule in self.rules:
            if rule.event_type != event.event_type:
                continue
            if self._conditions_match(rule.conditions, event):
                return rule
        return None
    
    def _conditions_match(self, conditions: dict, event: ClusterEvent) -> bool:
        if not conditions:
            return True  # Empty conditions always match
        
        # Namespace pattern filter
        if ns_pattern := conditions.get("namespace_pattern"):
            if not re.match(ns_pattern, event.namespace):
                return False
        
        # Restart count threshold (for crash_loop)
        if min_restarts := conditions.get("min_restarts"):
            actual = event.metadata.get("restart_count", 0)
            if actual < min_restarts:
                return False
        
        # Duration filter (for not_ready_node)
        if min_duration := conditions.get("min_duration_seconds"):
            duration = event.metadata.get("duration_seconds", 0)
            if duration < min_duration:
                return False
        
        return True
```

***

## Severity Mapping and Routing

After matching, the rule engine constructs an `AlertContext` that the rest of the pipeline consumes:

```python
@dataclass
class AlertContext:
    event: ClusterEvent
    rule: Rule
    effective_severity: str  # rule.severity overrides event.severity
    should_notify: bool
    should_run_rca: bool
    playbook_name: str | None

def build_alert_context(event: ClusterEvent, rule: Rule) -> AlertContext:
    return AlertContext(
        event=event,
        rule=rule,
        effective_severity=rule.severity or event.severity,
        should_notify="notify" in rule.actions,
        should_run_rca="rca" in rule.actions,
        playbook_name=next(
            (a.removeprefix("playbook:") for a in rule.actions if a.startswith("playbook:")),
            None
        ),
    )
```

This separation is deliberate: the rule declares what to do, and the `AlertContext` turns that declaration into explicit, testable fields.
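
That testability is concrete: the context is built by a pure function, so routing decisions can be asserted directly. A sketch with a hypothetical `StubRule` stand-in, mirroring the decision logic in `build_alert_context`:

```python
from dataclasses import dataclass

@dataclass
class StubRule:  # hypothetical stand-in for Rule, just the fields used here
    severity: str
    actions: list[str]

def decisions(rule: StubRule, event_severity: str) -> dict:
    # Same decision logic as build_alert_context, returned as a plain dict
    return {
        "severity": rule.severity or event_severity,
        "notify": "notify" in rule.actions,
        "rca": "rca" in rule.actions,
        "playbook": next(
            (a.removeprefix("playbook:") for a in rule.actions if a.startswith("playbook:")),
            None,
        ),
    }

d = decisions(StubRule("critical", ["notify", "rca", "playbook:restart-pod"]), "high")
```

No Kubernetes client, no Redis, no mocks: the rule's severity override and the `playbook:` prefix parsing are checked with plain assertions.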

### Severity → Notification Priority Mapping

| Severity   | Notification behavior                                  |
| ---------- | ------------------------------------------------------ |
| `critical` | Immediate message, includes RCA if `rca` action is set |
| `high`     | Immediate message                                      |
| `medium`   | Message sent, lower visual priority                    |
| `low`      | Message sent if notify action present, no RCA          |
| `info`     | Logged only, no message                                |

***

## Cooldown Windows and Deduplication

A pod stuck in `CrashLoopBackOff` for an hour would fire a matching rule every 30 seconds without cooldowns. That's 120 duplicate notifications per hour — worse than no automation at all.

Cooldowns are tracked in Redis:

```python
# src/aiops/rule_engine.py
async def is_on_cooldown(
    self,
    rule: Rule,
    event: ClusterEvent,
    redis: Redis,
) -> bool:
    cooldown_key = f"cooldown:{rule.name}:{event.namespace}:{event.resource_name}"
    exists = await redis.exists(cooldown_key)
    return bool(exists)

async def set_cooldown(
    self,
    rule: Rule,
    event: ClusterEvent,
    redis: Redis,
) -> None:
    cooldown_key = f"cooldown:{rule.name}:{event.namespace}:{event.resource_name}"
    await redis.setex(cooldown_key, rule.cooldown_seconds, "1")
```

The cooldown key is namespaced by `rule.name + namespace + resource_name`. This means:

* The same pod can trigger different rules with independent cooldowns
* A `CrashLoopBackOff` in `production` and one in `staging` have independent cooldowns
* Changing the rule name resets all cooldowns for that rule (intentional — useful after rule fixes)
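
The key format makes those independence properties easy to see. A quick illustration (the helper name is mine, not the project's):

```python
def cooldown_key(rule_name: str, namespace: str, resource: str) -> str:
    # Same "cooldown:<rule>:<namespace>:<resource>" format used by set_cooldown
    return f"cooldown:{rule_name}:{namespace}:{resource}"

# Same pod, two different rules: separate keys, so separate cooldown timers
k1 = cooldown_key("crash-loop-production", "production", "api-server-abc123")
k2 = cooldown_key("oom-killed", "production", "api-server-abc123")

# Same rule, different namespaces: also independent
k3 = cooldown_key("crash-loop-production", "staging", "api-server-abc123")
```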

### Process Flow

```python
async def process(self, event: ClusterEvent, redis: Redis) -> None:
    rule = self.match(event)
    
    if rule is None:
        # No matching rule: log and drop the event
        log.info(
            "rule_engine.no_match",
            event_type=event.event_type,
            namespace=event.namespace,
            resource=event.resource_name,
        )
        return
    
    if await self.is_on_cooldown(rule, event, redis):
        log.debug("rule_engine.cooldown", rule=rule.name, resource=event.resource_name)
        RULE_COOLDOWNS.labels(rule=rule.name).inc()
        return
    
    # Build context and hand off to downstream pipeline
    context = build_alert_context(event, rule)
    await self.pipeline.handle(context)
    
    # Set cooldown after successful handling
    await self.set_cooldown(rule, event, redis)
    
    RULE_MATCHES.labels(rule=rule.name, severity=context.effective_severity).inc()
    log.info(
        "rule_engine.matched",
        rule=rule.name,
        severity=context.effective_severity,
        resource=event.resource_name,
        namespace=event.namespace,
    )
```

***

## Rule Engine Implementation

Putting it together, the full processing pipeline looks like this:

{% @mermaid/diagram content="sequenceDiagram
participant WL as Watch-Loop
participant RE as Rule Engine
participant RD as Redis (cooldown)
participant PE as Pipeline
participant PB as Playbook Executor
participant RC as RCA Engine
participant NT as Notifier

WL->>RE: process(ClusterEvent)
RE->>RE: match rules in order

alt no rule matches
    RE-->>WL: logged, return
else rule matched
    RE->>RD: check cooldown key
    alt on cooldown
        RD-->>RE: key exists
        RE-->>WL: skip (cooldown active)
    else not on cooldown
        RD-->>RE: key missing
        RE->>PE: handle(AlertContext)

        par Parallel actions
            PE->>RC: run_rca(event) [if rca action]
            PE->>NT: notify(event, severity) [if notify action]
            PE->>PB: execute(playbook_name) [if playbook action]
        end

        PE->>RD: set cooldown key (TTL)
    end
end" %}

***

## Real Rules from the Project

These are the actual rules I run in my homelab cluster:

```yaml
# config/alert_rules.yml
rules:

  # CrashLoopBackOff in production — act immediately
  - name: "crash-loop-production"
    event_type: "crash_loop"
    conditions:
      namespace_pattern: "^production$"
      min_restarts: 2
    severity: "critical"
    actions:
      - "notify"
      - "rca"
      - "playbook:restart-crashed-pod"
    cooldown_seconds: 300

  # CrashLoopBackOff anywhere else — notify but don't remediate
  - name: "crash-loop-other"
    event_type: "crash_loop"
    conditions: {}
    severity: "high"
    actions:
      - "notify"
      - "rca"
    cooldown_seconds: 600

  # OOMKilled — always act, anywhere
  - name: "oom-killed"
    event_type: "oom_killed"
    conditions: {}
    severity: "critical"
    actions:
      - "notify"
      - "rca"
    cooldown_seconds: 600

  # Node not ready — must be down for 2+ minutes to avoid restart-triggered alerts
  - name: "node-not-ready"
    event_type: "not_ready_node"
    conditions:
      min_duration_seconds: 120
    severity: "critical"
    actions:
      - "notify"
      - "rca"
    cooldown_seconds: 300

  # Zero replicas on a deployment
  - name: "zero-replicas"
    event_type: "replication_failure"
    conditions:
      min_duration_seconds: 60
    severity: "high"
    actions:
      - "notify"
    cooldown_seconds: 300
```

***

## What I Learned

**Rule order matters more than you think.** I had a general `crash_loop` rule before a `production-only` crash loop rule, which meant the general rule always matched first and the production-specific cooldown and playbook never triggered. Moving specific rules before general rules is not optional.

**Namespace patterns prevent alert storms during development.** When I'm iterating on a service in `development`, pods crash constantly. Without a namespace filter or a significantly longer cooldown for non-production namespaces, my phone was buzzing every 30 seconds. Adding `min_restarts: 10` and `cooldown_seconds: 3600` for the development namespace made development usable again.

**Cooldown keys in Redis need to be inspectable.** When I suspected a rule wasn't firing, the first thing I needed to know was whether there was an active cooldown. A quick way to list all active cooldowns proved valuable:

```bash
# List active cooldowns (--scan avoids blocking Redis the way KEYS can)
redis-cli --scan --pattern "cooldown:*" | sort
# Example output:
# cooldown:crash-loop-production:production:api-server-abc123
```

**The `rca` action is expensive.** It makes an LLM API call. I don't add it to every rule — only critical and high severity where the information is worth the \~10-second wait and the API cost. Low-severity rules get `notify` only.

**Load YAML rules at startup, not on every event.** My first implementation reloaded the YAML file on every event to support hot-reload. The file I/O and YAML parsing were measurable overhead. Now I reload only on SIGHUP or at startup, which is fine for a homelab use case.
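
The SIGHUP approach can be sketched like this. The names are illustrative, not the project's actual implementation, and `parse` is whatever turns file text into rules (`yaml.safe_load` in the real project); it's injected here to keep the sketch dependency-free:

```python
import signal
from pathlib import Path
from typing import Callable

class ReloadableRules:
    """Sketch: re-read the rules file on demand, not on every event."""

    def __init__(self, path: Path, parse: Callable[[str], list]):
        self.path = path
        self.parse = parse
        self.reload()

    def reload(self, signum=None, frame=None) -> None:
        # One file read and one parse per reload, instead of per event
        self.rules = self.parse(self.path.read_text())

def install_sighup_reload(engine: ReloadableRules) -> None:
    # `kill -HUP <pid>` now triggers a reload without restarting the process
    signal.signal(signal.SIGHUP, engine.reload)
```

The signal handler is just the `reload` method, so the same code path serves startup, SIGHUP, and tests.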

***

**Next**: [Article 5 — Playbooks and Human-in-the-Loop Approvals](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-playbooks-human-in-the-loop)
