Article 1: What is AIOps?

Introduction

I used to run Prometheus and Grafana against my Kubernetes homelab and feel like I had observability covered. Then I watched a CrashLoopBackOff eat through my API server for forty minutes before I noticed the Slack alert I had buried in a low-priority channel. The alert had fired correctly. The graph was there. I just hadn't looked.

That experience pushed me to think differently about the problem. Observability tells you what happened. AIOps is the layer that acts on what happened β€” detecting issues proactively, correlating noisy signals into coherent incidents, and either fixing them automatically or routing them to a human with enough context to resolve them quickly.

This is Article 1 of the AIOps 101 series. It covers what AIOps actually means at the engineering level, the design goals I had when I started building simple-ai-agent, and a tour of the full system we'll build through this series.

What AIOps Actually Means

AIOps is a marketing term that gets applied to everything from Prometheus dashboards to fully automated incident response platforms. At its core, the useful definition is this:

AIOps = applying ML/AI to operational data (metrics, logs, events) to reduce human toil in running software systems.

The "AI" part ranges from simple rule matching to LLM-powered root cause analysis. The "Ops" part means the system does something beyond just displaying information β€” it detects, decides, and acts.

The spectrum looks like this:

(Diagram: the AIOps spectrum, from A, static dashboards, through B, threshold alerting, to C and D, where code decides and acts on cluster state.)

Most teams operate at A or B. The interesting engineering starts at C and D β€” where you're writing code that decides whether to act on cluster state and how to do it safely.

What It Is Not

  • Not a replacement for Prometheus/Grafana. Those are still the source of truth for metrics. AIOps sits on top. In simple-ai-agent, Alertmanager delivers alerts to the AIOps engine.

  • Not magic. The watch-loop in my project polls the Kubernetes API every 30 seconds. There's nothing sophisticated about the detection mechanism β€” it's simple pattern matching against pod states. The "AI" comes into the RCA layer, not the detection layer.

  • Not always automated. The most important design decision in the whole project is that most remediation requires human approval. I'll return to this repeatedly.
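
The detection layer really is that simple. Here's a minimal sketch of one polling pass, using hypothetical names rather than the project's actual code:

```python
# Waiting-state reasons the loop treats as failures (illustrative subset).
FAILURE_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}

def detect_events(pods):
    """Simple pattern matching against pod states; no ML involved.

    `pods` is a list of dicts shaped like a trimmed-down Kubernetes pod
    status. A real loop would build these from the API response each poll.
    """
    events = []
    for pod in pods:
        for status in pod.get("container_statuses", []):
            reason = status.get("waiting_reason")
            if reason in FAILURE_REASONS:
                events.append({
                    "pod": pod["name"],
                    "namespace": pod["namespace"],
                    "reason": reason,
                })
    return events
```

Each event this emits is what gets handed to the rule engine; the intelligence lives downstream, not here.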


The Gap Between Alerts and Resolution

Here's the problem I was actually solving. A typical alert-to-resolution flow without AIOps: alert fires → notification lands in Slack → engineer sees it → engineer opens a terminal → triages → remediates.

Every step after the notification fires requires a human, a terminal, and context. If it's 2 AM, "engineer sees it" might be 45 minutes after the alert fires. If the engineer who sees it isn't familiar with that service, the triage takes longer.

What I wanted: alert fires → the agent detects it and assembles context automatically → posts an incident summary to chat → engineer replies with an approval → the agent remediates.

The engineer still made the decision. But the detection was automatic, the context was assembled for them, and the remediation was one chat message instead of opening a terminal at 2 AM.


Design Goals for simple-ai-agent

When I started the project, I had a few explicit goals that shaped every architectural decision:

1. No Credentials in Chat

I've seen people manage Kubernetes by pasting kubectl outputs into Slack and running commands from their laptop. That's fine for individual use, but the moment you share a bot with colleagues, you need proper auth boundaries. The agent holds the kubeconfig; users interact through the chat interface with no direct cluster access needed.

2. Human-in-the-Loop by Default

Auto-remediation is seductive and dangerous. My rule: only LOW risk steps run automatically. Anything that touches production workloads β€” pod restarts, deployment rollbacks, config changes β€” requires an explicit approve message in chat. This is implemented in src/services/approval_manager.py with Redis-backed TTL approvals.
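
The approval flow is easy to picture in miniature. The real service is Redis-backed (a TTL key expires the request for free); this in-memory sketch with hypothetical names shows the same one-shot, time-limited semantics:

```python
import secrets
import time

class ApprovalManager:
    """In-memory sketch of TTL-based approvals (the real one uses Redis)."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._pending = {}  # approval_id -> (action, expiry_timestamp)

    def request(self, action):
        """Register a pending action; return the id the user must approve."""
        approval_id = secrets.token_hex(3)  # e.g. "abc123"
        self._pending[approval_id] = (action, time.monotonic() + self.ttl)
        return approval_id

    def approve(self, approval_id):
        """Return the approved action, or None if unknown or expired."""
        entry = self._pending.pop(approval_id, None)
        if entry is None:
            return None
        action, expiry = entry
        if time.monotonic() > expiry:
            return None  # timed out; the playbook step is abandoned
        return action
```

Note that an approval is consumed on use: replaying `approve abc123` a second time does nothing, which matters in a shared chat channel.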

I'll cover this in detail in Article 5, but the short version: I've been burned by automation that was "safe" until it cascaded. Human approval adds 30 seconds to a response and prevents entire categories of self-inflicted outages.

3. LLM for Analysis, Not Control

The LLM (Claude via Anthropic API) is responsible for two things:

  • Answering natural language questions about cluster state

  • Writing root cause analysis reports

It does not execute kubectl commands directly. Tool execution goes through the MCP (Model Context Protocol) layer, which has explicit, enumerated tools with defined parameters. The LLM calls tools; it does not run shell commands.

This distinction matters enormously for security. The LLM can tell the MCP layer to call list_pods(namespace="production"). It cannot run os.system("kubectl delete pod --all").
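
The shape of that boundary can be sketched in a few lines. Everything here is illustrative, not the project's actual MCP implementation, but it shows why an enumerated tool surface is categorically safer than shell access:

```python
# Enumerated tool surface: the LLM can only name a registered tool and
# supply its declared parameters. There is no shell to escape to.
TOOLS = {
    "list_pods": {
        "params": {"namespace"},
        "fn": lambda namespace: f"pods in {namespace}",
    },
    "get_pod_logs": {
        "params": {"namespace", "pod"},
        "fn": lambda namespace, pod: f"logs for {pod}",
    },
}

def call_tool(name, **kwargs):
    """Dispatch a tool call requested by the LLM, rejecting anything
    outside the enumerated surface."""
    tool = TOOLS.get(name)
    if tool is None:
        raise ValueError(f"unknown tool: {name}")
    if set(kwargs) != tool["params"]:
        raise ValueError(f"bad parameters for {name}: {sorted(kwargs)}")
    return tool["fn"](**kwargs)
```

A request for an unregistered tool, or for a registered tool with unexpected parameters, fails before anything touches the cluster.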

4. Observable at Every Layer

When the watch-loop misses an event or an approval times out, I need to know why. simple-ai-agent exposes Prometheus metrics from the agent itself β€” watchloop run count, rule match count, playbook execution count, approval timeout count. The /health endpoint returns subsystem-level status for every component.
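
The subsystem-level health check is simple to reason about: overall status is healthy only if every subsystem is, and a crashing probe must not take the endpoint itself down. A sketch with hypothetical names:

```python
def health_report(checks):
    """Roll per-subsystem probes into a /health-style payload.

    `checks` maps subsystem name -> callable returning True when healthy.
    The shape is illustrative, not the project's actual endpoint.
    """
    subsystems = {}
    for name, check in checks.items():
        try:
            subsystems[name] = "ok" if check() else "degraded"
        except Exception as exc:  # a failing probe should not crash /health
            subsystems[name] = f"error: {exc}"
    overall = "ok" if all(s == "ok" for s in subsystems.values()) else "degraded"
    return {"status": overall, "subsystems": subsystems}
```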

5. Works on a Single VM

My homelab runs on a single Proxmox node. The whole stack β€” app, PostgreSQL, Redis, Prometheus, Grafana, Alertmanager β€” runs in Docker Compose. The design doesn't require Kubernetes to operate the agent (even though the agent manages Kubernetes).


A Tour of the Final System

Here's the full data flow through simple-ai-agent:

(Diagram: the full data flow. Reactive path: chat message → message handler → MCP tools → Kubernetes. Proactive path: watch-loop → rule engine → RCA → playbook execution → approval manager → chat.)

The Two Operational Modes

The system operates in two modes simultaneously:

Reactive mode (top half of the diagram): A user sends a message. The message handler detects intent β€” is it a Kubernetes query, a security scan request, a general question, or an approval response? It routes to the appropriate handler and returns a response.
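
The routing decision at the top of the reactive path can be sketched like this; the real handler is richer, and every name here is hypothetical:

```python
import re

# "approve abc123" style messages unblock a waiting playbook step.
APPROVAL_RE = re.compile(r"^approve\s+(?P<approval_id>\w+)$", re.IGNORECASE)
K8S_KEYWORDS = {"pod", "pods", "deployment", "namespace", "node", "logs"}

def detect_intent(message):
    """Classify an incoming chat message into one of the four intents."""
    text = message.strip()
    m = APPROVAL_RE.match(text)
    if m:
        return ("approval_response", m.group("approval_id"))
    words = set(text.lower().split())
    if words & K8S_KEYWORDS:
        return ("kubernetes_query", text)
    if "scan" in words:
        return ("security_scan", text)
    return ("general_question", text)
```

The approval pattern is checked first on purpose: an approval message must never be misread as a general question, or the pending playbook step would silently time out.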

Proactive mode (bottom half): The watch-loop runs independently as an async background task. It polls Kubernetes every 30 seconds, feeds events to the rule engine, which may trigger RCA analysis and then playbook execution. If a playbook step requires approval, the approval manager posts to the configured chat channel and waits. The user's approve abc123 message arrives in reactive mode, gets recognized as an approval response, and unblocks the playbook.

This two-mode architecture is a key design choice. The AIOps engine doesn't block on user interaction; it runs continuously, and user interaction enters only at the approval gate.


What This Series Covers

Here's what each article builds:

Article  | Component                | File(s) in project
1 (this) | Overview                 | —
2        | Architecture & stack     | src/main.py, docker-compose.yml
3        | Watch-Loop               | src/monitoring/watchloop.py, src/k8s/client.py
4        | Rule Engine              | src/aiops/rule_engine.py, config/alert_rules.yml
5        | Playbooks & Approvals    | src/aiops/playbooks.py, src/services/approval_manager.py
6        | RCA Engine               | src/aiops/rca_engine.py, src/ai/prompt_manager.py
7        | Alertmanager Integration | src/api/webhooks.py, config/alertmanager.yml
8        | Observability            | src/monitoring/prometheus.py, config/grafana/

Next: Article 2 β€” Architecture and Stack Decisions