# Article 1: What is AIOps?

## Introduction

I used to run Prometheus and Grafana against my Kubernetes homelab and feel like I had observability covered. Then I watched a `CrashLoopBackOff` eat through my API server for forty minutes before I noticed the Slack alert I had buried in a low-priority channel. The alert had fired correctly. The graph was there. I just hadn't looked.

That experience pushed me to think differently about the problem. Observability tells you what happened. AIOps is the layer that acts on what happened — detecting issues proactively, correlating noisy signals into coherent incidents, and either fixing them automatically or routing them to a human with enough context to resolve them quickly.

This is Article 1 of the AIOps 101 series. It covers what AIOps actually means at the engineering level, the design goals I had when I started building [simple-ai-agent](https://github.com/Htunn/simple-ai-agent), and a tour of the full system we'll build through this series.

## Table of Contents

1. [What AIOps Actually Means](#what-aiops-actually-means)
2. [The Gap Between Alerts and Resolution](#the-gap-between-alerts-and-resolution)
3. [Design Goals for simple-ai-agent](#design-goals-for-simple-ai-agent)
4. [A Tour of the Final System](#a-tour-of-the-final-system)
5. [What This Series Covers](#what-this-series-covers)

***

## What AIOps Actually Means

AIOps is a marketing term that gets applied to everything from Prometheus dashboards to fully automated incident response platforms. At its core, the useful definition is this:

> **AIOps = applying ML/AI to operational data (metrics, logs, events) to reduce human toil in running software systems.**

The "AI" part ranges from simple rule matching to LLM-powered root cause analysis. The "Ops" part means the system does something beyond just displaying information — it detects, decides, and acts.

The spectrum looks like this:

{% @mermaid/diagram content="graph LR
A[Static Alerts<br/>Prometheus rules] -->|detect only| B[Correlated Events<br/>deduplicated incidents]
B -->|detect + contextualize| C[Automated Triage<br/>severity, assignee]
C -->|detect + contextualize + act| D[Automated Remediation<br/>playbooks + approvals]
D -->|full loop| E[LLM-Powered RCA<br/>root cause + recommendation]

style A fill:#e0e0e0,color:#000
style B fill:#FFD580,color:#000
style C fill:#FFB347,color:#000
style D fill:#4ECDC4,color:#fff
style E fill:#326CE5,color:#fff" %}

Most teams operate at A or B. The interesting engineering starts at C and D — where you're writing code that decides whether to act on cluster state and how to do it safely.

### What It Is Not

* **Not a replacement for Prometheus/Grafana.** Those are still the source of truth for metrics. AIOps sits on top. In `simple-ai-agent`, Alertmanager delivers alerts *to* the AIOps engine.
* **Not magic.** The watch-loop in my project polls the Kubernetes API every 30 seconds. There's nothing sophisticated about the detection mechanism — it's simple pattern matching against pod states. The "AI" comes into the RCA layer, not the detection layer.
* **Not always automated.** The most important design decision in the whole project is that most remediation requires human approval. I'll return to this repeatedly.

***

## The Gap Between Alerts and Resolution

Here's the problem I was actually solving. A typical alert-to-resolution flow without AIOps:

```
CrashLoopBackOff starts → Prometheus detects it (usually < 1 min)
    → Alertmanager fires → notification to Slack/PagerDuty
    → engineer sees it (could be seconds, could be hours)
    → engineer opens terminal
    → kubectl get pods, kubectl describe pod, kubectl logs
    → diagnose root cause
    → apply fix (restart, scale, config change)
    → verify resolution
```

Every step after the notification fires requires a human, a terminal, and context. If it's 2 AM, "engineer sees it" might be 45 minutes after the alert fires. If the engineer who sees it isn't familiar with that service, the triage takes longer.

What I wanted:

```
CrashLoopBackOff starts → watch-loop detects it (< 30 sec)
    → rule engine classifies it (crash_loop, critical)
    → RCA engine queries cluster → generates root cause report
    → playbook executor suggests restart
    → approval manager posts to Telegram/Slack:
        "CrashLoopBackOff on api-server-xyz (production)
         Root cause: OOM, memory limit 256Mi hit 3x in 4h
         Action: restart pod [approve / reject]"
    → engineer approves with one message
    → pod restarted, confirmation posted
```

The engineer still made the decision. But detection happened within seconds, the context was assembled automatically, and remediation took a single chat message instead of a terminal session at 2 AM.
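
A classification step like the one above can be expressed as a declarative rule. This YAML is illustrative only; the project's actual schema lives in `config/alert_rules.yml` and is covered in Article 4:

```yaml
# Hypothetical rule shape: field names are assumptions, not the real schema.
- name: crash_loop
  match:
    reason: CrashLoopBackOff
  severity: critical
  actions:
    - run_rca
    - playbook: restart_pod
      requires_approval: true
```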

***

## Design Goals for simple-ai-agent

When I started the project, I had a few explicit goals that shaped every architectural decision:

### 1. No Credentials in Chat

I've seen people manage Kubernetes by pasting `kubectl` outputs into Slack and running commands from their laptop. That's fine for individual use, but the moment you share a bot with colleagues, you need proper auth boundaries. The agent holds the kubeconfig; users interact through the chat interface with no direct cluster access needed.

### 2. Human-in-the-Loop by Default

Auto-remediation is seductive and dangerous. My rule: only `LOW` risk steps run automatically. Anything that touches production workloads — pod restarts, deployment rollbacks, config changes — requires an explicit `approve` message in chat. This is implemented in `src/services/approval_manager.py` with Redis-backed TTL approvals.

I'll cover this in detail in [Article 5](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-playbooks-human-in-the-loop), but the short version: I've been burned by automation that was "safe" until it cascaded. Human approval adds 30 seconds to a response and prevents entire categories of self-inflicted outages.

### 3. LLM for Analysis, Not Control

The LLM (Claude via Anthropic API) is responsible for two things:

* Answering natural language questions about cluster state
* Writing root cause analysis reports

It does **not** execute `kubectl` commands directly. Tool execution goes through the MCP (Model Context Protocol) layer, which has explicit, enumerated tools with defined parameters. The LLM calls tools; it does not run shell commands.

This distinction matters enormously for security. The LLM can tell the MCP layer to call `list_pods(namespace="production")`. It cannot `os.system("kubectl delete pod --all")`.
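
The "enumerated tools" idea can be sketched as a registry that validates both the tool name and its parameters before anything runs. The names and shapes below are illustrative, not the project's actual MCP tool set:

```python
def list_pods(namespace: str) -> list[str]:
    # Stand-in for a real Kubernetes API call
    return [f"{namespace}/api-server-xyz"]

# The LLM can only request tools that exist here, with parameters declared here.
TOOLS = {
    "list_pods": {"handler": list_pods, "params": {"namespace"}},
}

def call_tool(name: str, **kwargs):
    tool = TOOLS.get(name)
    if tool is None:
        raise ValueError(f"unknown tool: {name}")  # no arbitrary commands
    if set(kwargs) - tool["params"]:
        raise ValueError(f"unexpected parameters: {sorted(kwargs)}")
    return tool["handler"](**kwargs)

print(call_tool("list_pods", namespace="production"))
# call_tool("delete_everything") raises ValueError; nothing reaches a shell
```

The attack surface is whatever you put in the registry, nothing more. Adding a capability means writing and reviewing a handler, not widening a shell escape.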

### 4. Observable at Every Layer

When the watch-loop misses an event or an approval times out, I need to know why. `simple-ai-agent` exposes Prometheus metrics from the agent itself — watchloop run count, rule match count, playbook execution count, approval timeout count. The `/health` endpoint returns subsystem-level status for every component.
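
Subsystem-level health aggregation is a small pattern worth showing. This sketch is in the spirit of the project's `/health` endpoint, but the response shape and function name are assumptions:

```python
from typing import Callable

def health(checks: dict[str, Callable[[], bool]]) -> dict:
    """Run each subsystem check and roll the results up into one status."""
    subsystems = {}
    for name, check in checks.items():
        try:
            subsystems[name] = "ok" if check() else "degraded"
        except Exception as exc:
            subsystems[name] = f"error: {exc}"  # a broken check is itself a signal
    overall = "ok" if all(v == "ok" for v in subsystems.values()) else "degraded"
    return {"status": overall, "subsystems": subsystems}

report = health({"redis": lambda: True, "watchloop": lambda: False})
print(report["status"])  # degraded
```

The key choice is returning per-subsystem detail rather than a bare 200/503, so "the agent is unhealthy" immediately answers *which part*.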

### 5. Works on a Single VM

My homelab runs on a single Proxmox node. The whole stack — app, PostgreSQL, Redis, Prometheus, Grafana, Alertmanager — runs in Docker Compose. The design doesn't require Kubernetes to operate the agent (even though the agent manages Kubernetes).
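
A minimal Docker Compose sketch of that single-VM layout might look like this. Service names and image tags are assumptions; the project's actual `docker-compose.yml` differs:

```yaml
# Illustrative single-VM stack, not the project's real compose file.
services:
  app:
    build: .
    depends_on: [postgres, redis]
  postgres:
    image: postgres:16
  redis:
    image: redis:7
  prometheus:
    image: prom/prometheus
  alertmanager:
    image: prom/alertmanager
  grafana:
    image: grafana/grafana
```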

***

## A Tour of the Final System

Here's the full data flow through `simple-ai-agent`:

{% @mermaid/diagram content="graph TB
subgraph "User Channels"
    TG[Telegram]
    SL[Slack]
end

subgraph "API Layer (FastAPI)"
    WH["/api/webhook/telegram<br/>/api/webhook/slack"]
    AW["/api/alert/webhook<br/>Alertmanager receiver"]
    H["/health /ready<br/>/metrics"]
end

subgraph "Business Logic"
    CH[Channel Router]
    MH[Message Handler<br/>intent detection]
    SM[Session Manager<br/>Redis]
    KH[Kubernetes Handler<br/>NL parser]
end

subgraph "AI Layer"
    LLM[Anthropic API<br/>claude-3-5-sonnet]
    MS[Model Selector]
    CB[Context Builder<br/>conversation window]
end

subgraph "AIOps Layer"
    WL[Watch-Loop<br/>30s polling]
    RE[Rule Engine<br/>YAML rules]
    PE[Playbook Executor]
    AM[Approval Manager<br/>Redis TTL]
    RCA[RCA Engine<br/>LLM + SRE prompt]
    LA[Log Analyzer]
end

subgraph "MCP Layer"
    MM[MCP Manager]
    ST[stdio transport<br/>K8s tools]
    SSE[SSE transport<br/>Security tools]
end

subgraph "Data Layer"
    PG[(PostgreSQL 16<br/>conversations + events)]
    RD[(Redis 7<br/>sessions + approvals)]
    K8S[Kubernetes API]
end

TG --> WH
SL --> WH
WH --> CH --> MH
MH --> SM
MH --> LLM
MH --> KH --> MM --> ST --> K8S
MM --> SSE
LLM --> CB --> PG

WL --> K8S
WL --> RE --> PE
PE --> AM --> RD
PE --> MM
AM --> TG
AM --> SL

RE --> RCA --> LLM
AW --> RE" %}

### The Two Operational Modes

The system operates in two modes simultaneously:

**Reactive mode** (top half of the diagram): A user sends a message. The message handler detects intent — is it a Kubernetes query, a security scan request, a general question, or an approval response? It routes to the appropriate handler and returns a response.

**Proactive mode** (bottom half): The watch-loop runs independently as an async background task. It polls Kubernetes every 30 seconds, feeds events to the rule engine, which may trigger RCA analysis and then playbook execution. If a playbook step requires approval, the approval manager posts to the configured chat channel and waits. The user's `approve abc123` message arrives in reactive mode, gets recognized as an approval response, and unblocks the playbook.

This two-mode architecture is a key design choice. The AIOps engine never blocks waiting for a user; it runs continuously, and user interaction serves only as the approval gate.
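
The coordination between the two modes can be sketched with an `asyncio.Event`: the playbook task waits on the approval, and the reactive chat handler sets it. This is a simplified illustration of the pattern, not the project's actual code:

```python
import asyncio

async def remediate(approval: asyncio.Event, timeout: float = 1.0) -> str:
    """Proactive side: a playbook step that blocks on approval, not on chat I/O."""
    try:
        await asyncio.wait_for(approval.wait(), timeout)
        return "restarted"
    except asyncio.TimeoutError:
        return "timed out"  # unapproved actions expire instead of running

async def chat_handler(approval: asyncio.Event) -> None:
    """Reactive side: an 'approve' message unblocks the waiting playbook."""
    await asyncio.sleep(0.05)  # the user replies a moment later
    approval.set()

async def main() -> str:
    approval = asyncio.Event()
    playbook = asyncio.create_task(remediate(approval))  # runs in the background
    await chat_handler(approval)
    return await playbook

print(asyncio.run(main()))  # restarted
```

Because the watch-loop and the chat handler share an event loop but never call each other directly, a slow (or absent) human degrades to a timeout, never to a stuck detection loop.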

***

## What This Series Covers

Here's what each article builds:

| Article  | Component                | File(s) in project                                           |
| -------- | ------------------------ | ------------------------------------------------------------ |
| 1 (this) | Overview                 | —                                                            |
| 2        | Architecture & stack     | `src/main.py`, `docker-compose.yml`                          |
| 3        | Watch-Loop               | `src/monitoring/watchloop.py`, `src/k8s/client.py`           |
| 4        | Rule Engine              | `src/aiops/rule_engine.py`, `config/alert_rules.yml`         |
| 5        | Playbooks & Approvals    | `src/aiops/playbooks.py`, `src/services/approval_manager.py` |
| 6        | RCA Engine               | `src/aiops/rca_engine.py`, `src/ai/prompt_manager.py`        |
| 7        | Alertmanager Integration | `src/api/webhooks.py`, `config/alertmanager.yml`             |
| 8        | Observability            | `src/monitoring/prometheus.py`, `config/grafana/`            |

**Next**: [Article 2 — Architecture and Stack Decisions](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-architecture)
