# Article 1: What is AIOps?

## Introduction

I used to run Prometheus and Grafana against my Kubernetes homelab and feel like I had observability covered. Then I watched a `CrashLoopBackOff` eat through my API server for forty minutes before I noticed the Slack alert I had buried in a low-priority channel. The alert had fired correctly. The graph was there. I just hadn't looked.

That experience pushed me to think differently about the problem. Observability tells you what happened. AIOps is the layer that acts on what happened — detecting issues proactively, correlating noisy signals into coherent incidents, and either fixing them automatically or routing them to a human with enough context to resolve them quickly.

This is Article 1 of the AIOps 101 series. It covers what AIOps actually means at the engineering level, the design goals I had when I started building [simple-ai-agent](https://github.com/Htunn/simple-ai-agent), and a tour of the full system we'll build through this series.

## Table of Contents

1. [What AIOps Actually Means](#what-aiops-actually-means)
2. [The Gap Between Alerts and Resolution](#the-gap-between-alerts-and-resolution)
3. [Design Goals for simple-ai-agent](#design-goals-for-simple-ai-agent)
4. [A Tour of the Final System](#a-tour-of-the-final-system)
5. [What This Series Covers](#what-this-series-covers)

***

## What AIOps Actually Means

AIOps is a marketing term that gets applied to everything from Prometheus dashboards to fully automated incident response platforms. At its core, the useful definition is this:

> **AIOps = applying ML/AI to operational data (metrics, logs, events) to reduce human toil in running software systems.**

The "AI" part ranges from simple rule matching to LLM-powered root cause analysis. The "Ops" part means the system does something beyond just displaying information — it detects, decides, and acts.

The spectrum looks like this:

{% @mermaid/diagram content="graph LR
A[Static Alerts<br/>Prometheus rules] -->|detect only| B[Correlated Events<br/>deduplicated incidents]
B -->|detect + contextualize| C[Automated Triage<br/>severity, assignee]
C -->|detect + contextualize + act| D[Automated Remediation<br/>playbooks + approvals]
D -->|full loop| E[LLM-Powered RCA<br/>root cause + recommendation]

style A fill:#e0e0e0,color:#000
style B fill:#FFD580,color:#000
style C fill:#FFB347,color:#000
style D fill:#4ECDC4,color:#fff
style E fill:#326CE5,color:#fff" %}

Most teams operate at A or B. The interesting engineering starts at C and D — where you're writing code that decides whether to act on cluster state and how to do it safely.

### What It Is Not

* **Not a replacement for Prometheus/Grafana.** Those are still the source of truth for metrics. AIOps sits on top. In `simple-ai-agent`, Alertmanager delivers alerts *to* the AIOps engine.
* **Not magic.** The watch-loop in my project polls the Kubernetes API every 30 seconds. There's nothing sophisticated about the detection mechanism — it's simple pattern matching against pod states. The "AI" comes into the RCA layer, not the detection layer.
* **Not always automated.** The most important design decision in the whole project is that most remediation requires human approval. I'll return to this repeatedly.

***

## The Gap Between Alerts and Resolution

Here's the problem I was actually solving. A typical alert-to-resolution flow without AIOps:

```
CrashLoopBackOff starts → Prometheus detects it (usually < 1 min)
    → Alertmanager fires → notification to Slack/PagerDuty
    → engineer sees it (could be seconds, could be hours)
    → engineer opens terminal
    → kubectl get pods, kubectl describe pod, kubectl logs
    → diagnose root cause
    → apply fix (restart, scale, config change)
    → verify resolution
```

Every step after the notification fires requires a human, a terminal, and context. If it's 2 AM, "engineer sees it" might be 45 minutes after the alert fires. If the engineer who sees it isn't familiar with that service, the triage takes longer.

What I wanted:

```
CrashLoopBackOff starts → watch-loop detects it (< 30 sec)
    → rule engine classifies it (crash_loop, critical)
    → RCA engine queries cluster → generates root cause report
    → playbook executor suggests restart
    → approval manager posts to Telegram/Slack:
        "CrashLoopBackOff on api-server-xyz (production)
         Root cause: OOM, memory limit 256Mi hit 3x in 4h
         Action: restart pod [approve / reject]"
    → engineer approves with one message
    → pod restarted, confirmation posted
```

The engineer still made the decision. But detection happened within seconds, the context was assembled automatically, and remediation took a single chat message instead of a terminal session at 2 AM.
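
A classification step like the one above can be expressed as a declarative rule. This YAML is illustrative only; the project's actual schema lives in `config/alert_rules.yml` and is covered in Article 4:

```yaml
# Hypothetical rule shape: field names are assumptions, not the real schema.
- name: crash_loop
  match:
    reason: CrashLoopBackOff
  severity: critical
  actions:
    - run_rca
    - playbook: restart_pod
      requires_approval: true
```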

***

## Design Goals for simple-ai-agent

When I started the project, I had a few explicit goals that shaped every architectural decision:

### 1. No Credentials in Chat

I've seen people manage Kubernetes by pasting `kubectl` outputs into Slack and running commands from their laptop. That's fine for individual use, but the moment you share a bot with colleagues, you need proper auth boundaries. The agent holds the kubeconfig; users interact through the chat interface with no direct cluster access needed.

### 2. Human-in-the-Loop by Default

Auto-remediation is seductive and dangerous. My rule: only `LOW` risk steps run automatically. Anything that touches production workloads — pod restarts, deployment rollbacks, config changes — requires an explicit `approve` message in chat. This is implemented in `src/services/approval_manager.py` with Redis-backed TTL approvals.

I'll cover this in detail in [Article 5](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-playbooks-human-in-the-loop), but the short version: I've been burned by automation that was "safe" until it cascaded. Human approval adds 30 seconds to a response and prevents entire categories of self-inflicted outages.

### 3. LLM for Analysis, Not Control

The LLM (Claude via Anthropic API) is responsible for two things:

* Answering natural language questions about cluster state
* Writing root cause analysis reports

It does **not** execute `kubectl` commands directly. Tool execution goes through the MCP (Model Context Protocol) layer, which has explicit, enumerated tools with defined parameters. The LLM calls tools; it does not run shell commands.

This distinction matters enormously for security. The LLM can tell the MCP layer to call `list_pods(namespace="production")`. It cannot `os.system("kubectl delete pod --all")`.
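
The "enumerated tools" idea can be sketched as a registry that validates both the tool name and its parameters before anything runs. The names and shapes below are illustrative, not the project's actual MCP tool set:

```python
def list_pods(namespace: str) -> list[str]:
    # Stand-in for a real Kubernetes API call
    return [f"{namespace}/api-server-xyz"]

# The LLM can only request tools that exist here, with parameters declared here.
TOOLS = {
    "list_pods": {"handler": list_pods, "params": {"namespace"}},
}

def call_tool(name: str, **kwargs):
    tool = TOOLS.get(name)
    if tool is None:
        raise ValueError(f"unknown tool: {name}")  # no arbitrary commands
    if set(kwargs) - tool["params"]:
        raise ValueError(f"unexpected parameters: {sorted(kwargs)}")
    return tool["handler"](**kwargs)

print(call_tool("list_pods", namespace="production"))
# call_tool("delete_everything") raises ValueError; nothing reaches a shell
```

The attack surface is whatever you put in the registry, nothing more. Adding a capability means writing and reviewing a handler, not widening a shell escape.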

### 4. Observable at Every Layer

When the watch-loop misses an event or an approval times out, I need to know why. `simple-ai-agent` exposes Prometheus metrics from the agent itself — watchloop run count, rule match count, playbook execution count, approval timeout count. The `/health` endpoint returns subsystem-level status for every component.
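
Subsystem-level health aggregation is a small pattern worth showing. This sketch is in the spirit of the project's `/health` endpoint, but the response shape and function name are assumptions:

```python
from typing import Callable

def health(checks: dict[str, Callable[[], bool]]) -> dict:
    """Run each subsystem check and roll the results up into one status."""
    subsystems = {}
    for name, check in checks.items():
        try:
            subsystems[name] = "ok" if check() else "degraded"
        except Exception as exc:
            subsystems[name] = f"error: {exc}"  # a broken check is itself a signal
    overall = "ok" if all(v == "ok" for v in subsystems.values()) else "degraded"
    return {"status": overall, "subsystems": subsystems}

report = health({"redis": lambda: True, "watchloop": lambda: False})
print(report["status"])  # degraded
```

The key choice is returning per-subsystem detail rather than a bare 200/503, so "the agent is unhealthy" immediately answers *which part*.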

### 5. Works on a Single VM

My homelab runs on a single Proxmox node. The whole stack — app, PostgreSQL, Redis, Prometheus, Grafana, Alertmanager — runs in Docker Compose. The design doesn't require Kubernetes to operate the agent (even though the agent manages Kubernetes).
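
A minimal Docker Compose sketch of that single-VM layout might look like this. Service names and image tags are assumptions; the project's actual `docker-compose.yml` differs:

```yaml
# Illustrative single-VM stack, not the project's real compose file.
services:
  app:
    build: .
    depends_on: [postgres, redis]
  postgres:
    image: postgres:16
  redis:
    image: redis:7
  prometheus:
    image: prom/prometheus
  alertmanager:
    image: prom/alertmanager
  grafana:
    image: grafana/grafana
```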

***

## A Tour of the Final System

Here's the full data flow through `simple-ai-agent`:

{% @mermaid/diagram content="graph TB
subgraph "User Channels"
    TG[Telegram]
    SL[Slack]
end

subgraph "API Layer (FastAPI)"
    WH["/api/webhook/telegram<br/>/api/webhook/slack"]
    AW["/api/alert/webhook<br/>Alertmanager receiver"]
    H["/health /ready<br/>/metrics"]
end

subgraph "Business Logic"
    CH[Channel Router]
    MH[Message Handler<br/>intent detection]
    SM[Session Manager<br/>Redis]
    KH[Kubernetes Handler<br/>NL parser]
end

subgraph "AI Layer"
    LLM[Anthropic API<br/>claude-3-5-sonnet]
    MS[Model Selector]
    CB[Context Builder<br/>conversation window]
end

subgraph "AIOps Layer"
    WL[Watch-Loop<br/>30s polling]
    RE[Rule Engine<br/>YAML rules]
    PE[Playbook Executor]
    AM[Approval Manager<br/>Redis TTL]
    RCA[RCA Engine<br/>LLM + SRE prompt]
    LA[Log Analyzer]
end

subgraph "MCP Layer"
    MM[MCP Manager]
    ST[stdio transport<br/>K8s tools]
    SSE[SSE transport<br/>Security tools]
end

subgraph "Data Layer"
    PG[(PostgreSQL 16<br/>conversations + events)]
    RD[(Redis 7<br/>sessions + approvals)]
    K8S[Kubernetes API]
end

TG --> WH
SL --> WH
WH --> CH --> MH
MH --> SM
MH --> LLM
MH --> KH --> MM --> ST --> K8S
MM --> SSE
LLM --> CB --> PG

WL --> K8S
WL --> RE --> PE
PE --> AM --> RD
PE --> MM
AM --> TG
AM --> SL

RE --> RCA --> LLM
AW --> RE" %}

### The Two Operational Modes

The system operates in two modes simultaneously:

**Reactive mode** (top half of the diagram): A user sends a message. The message handler detects intent — is it a Kubernetes query, a security scan request, a general question, or an approval response? It routes to the appropriate handler and returns a response.

**Proactive mode** (bottom half): The watch-loop runs independently as an async background task. It polls Kubernetes every 30 seconds, feeds events to the rule engine, which may trigger RCA analysis and then playbook execution. If a playbook step requires approval, the approval manager posts to the configured chat channel and waits. The user's `approve abc123` message arrives in reactive mode, gets recognized as an approval response, and unblocks the playbook.

This two-mode architecture is a key design choice. The AIOps engine never blocks waiting for a user; it runs continuously, and user interaction serves only as the approval gate.
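
The coordination between the two modes can be sketched with an `asyncio.Event`: the playbook task waits on the approval, and the reactive chat handler sets it. This is a simplified illustration of the pattern, not the project's actual code:

```python
import asyncio

async def remediate(approval: asyncio.Event, timeout: float = 1.0) -> str:
    """Proactive side: a playbook step that blocks on approval, not on chat I/O."""
    try:
        await asyncio.wait_for(approval.wait(), timeout)
        return "restarted"
    except asyncio.TimeoutError:
        return "timed out"  # unapproved actions expire instead of running

async def chat_handler(approval: asyncio.Event) -> None:
    """Reactive side: an 'approve' message unblocks the waiting playbook."""
    await asyncio.sleep(0.05)  # the user replies a moment later
    approval.set()

async def main() -> str:
    approval = asyncio.Event()
    playbook = asyncio.create_task(remediate(approval))  # runs in the background
    await chat_handler(approval)
    return await playbook

print(asyncio.run(main()))  # restarted
```

Because the watch-loop and the chat handler share an event loop but never call each other directly, a slow (or absent) human degrades to a timeout, never to a stuck detection loop.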

***

## What This Series Covers

Here's what each article builds:

| Article  | Component                | File(s) in project                                           |
| -------- | ------------------------ | ------------------------------------------------------------ |
| 1 (this) | Overview                 | —                                                            |
| 2        | Architecture & stack     | `src/main.py`, `docker-compose.yml`                          |
| 3        | Watch-Loop               | `src/monitoring/watchloop.py`, `src/k8s/client.py`           |
| 4        | Rule Engine              | `src/aiops/rule_engine.py`, `config/alert_rules.yml`         |
| 5        | Playbooks & Approvals    | `src/aiops/playbooks.py`, `src/services/approval_manager.py` |
| 6        | RCA Engine               | `src/aiops/rca_engine.py`, `src/ai/prompt_manager.py`        |
| 7        | Alertmanager Integration | `src/api/webhooks.py`, `config/alertmanager.yml`             |
| 8        | Observability            | `src/monitoring/prometheus.py`, `config/grafana/`            |

**Next**: [Article 2 — Architecture and Stack Decisions](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-architecture)
