# AIOps 101

A hands-on series on building AIOps capabilities using Python, Kubernetes, LLMs, and the tools I actually use in my personal projects. Every article draws directly from my open-source project [simple-ai-agent](https://github.com/Htunn/simple-ai-agent) — a production-ready AI agent that connects Telegram and Slack to Kubernetes cluster management, automated remediation, and LLM-powered root cause analysis.

## What This Series Is About

AIOps gets thrown around a lot as a buzzword. In practice, it means using AI and automation to reduce the toil of operating software systems — catching incidents before humans notice, correlating alerts into coherent events, and executing remediation steps with appropriate human oversight.

This series is not about the buzzword. It's about the engineering: what the components are, how they wire together, what the failure modes look like in real clusters, and how to implement them from scratch using tools you likely already have.

I built `simple-ai-agent` as a personal project to manage my own Kubernetes homelab through Telegram and Slack. What started as a chat bot that could list pods grew into a full AIOps engine — a background watch-loop detecting `CrashLoopBackOff` and `OOMKilled` events, a YAML rule engine, playbook-based remediation with risk gating, and an LLM that writes SRE-quality root cause analysis reports. This series documents that journey and the design decisions behind every component.

## Project Reference

All code examples, architecture diagrams, and configuration snippets in this series reference the actual implementation at:

[**github.com/Htunn/simple-ai-agent**](https://github.com/Htunn/simple-ai-agent)

The project is MIT-licensed. You can run it locally, adapt it for your own infrastructure, and follow along with real code rather than toy examples.

## Who This Series Is For

* **SREs and platform engineers** who manage Kubernetes and want to reduce on-call toil
* **Backend engineers** curious how LLMs integrate with operational tooling
* **Hobbyists** running a homelab who want something smarter than static Prometheus alerts
* **Anyone** building their own AI agent who wants to extend it toward operations

## Prerequisites

* Comfortable with Python (async/await, Pydantic, FastAPI basics)
* Basic Kubernetes knowledge (pods, deployments, events, `kubectl`)
* Familiarity with Prometheus and the concept of metrics/alerts
* A Kubernetes cluster to experiment against (local `kind` or `minikube` works fine)

## Series Structure

### Phase 1: Foundations (Articles 1–2)

Understanding what AIOps means in practice and how the overall system is architected.

1. [**What is AIOps — and Why I Built One**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-what-is-aiops)
   * The gap between Prometheus alerts and actual resolution
   * What an AIOps system actually does (watch, detect, decide, act)
   * My goals when starting simple-ai-agent
   * A tour of the final system
2. [**Architecture and Stack Decisions**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-architecture)
   * Layered architecture: Channel → API → Business Logic → AIOps → Data
   * Why FastAPI, PostgreSQL, and Redis for this problem
   * MCP (Model Context Protocol) for tool execution
   * How the chat channels (Telegram/Slack) connect to the operations backend

### Phase 2: The AIOps Engine (Articles 3–6)

Building the four core components — watch-loop, rule engine, playbook executor, and RCA engine.

3. [**The Watch-Loop — Continuous Cluster Health Polling**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-watch-loop)
   * The background polling pattern and why pull beats push for homelab use
   * What cluster events to watch for: CrashLoopBackOff, OOMKilled, NotReady nodes, zero replicas
   * `ClusterEvent` data model design
   * Graceful startup, shutdown, and error containment
4. [**The Rule Engine — Turning Events into Actionable Alerts**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-rule-engine)
   * YAML-defined alert rules and severity mapping
   * Pattern matching against `ClusterEvent` objects
   * Avoiding alert storms: deduplication and cooldown windows
   * Real rule examples from `config/alert_rules.yml`
5. [**Playbooks and Human-in-the-Loop Approvals**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-playbooks-human-in-the-loop)
   * Ordered remediation step sequences
   * Risk levels: LOW (auto-execute) / MEDIUM (gate on approval) / HIGH (warn only)
   * Redis-backed approval manager with TTL expiry
   * Chat-native `approve / reject` pattern
   * Why full auto-remediation is almost always wrong
6. [**LLM-Powered Root Cause Analysis**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-rca-engine)
   * Designing the SRE prompt for structured JSON output
   * Anthropic API: `claude-3-5-sonnet-20241022`
   * Confidence scoring and evidence extraction
   * Integrating RCA reports into the chat response
   * Hallucination mitigation: grounding the LLM with real cluster data
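
To make the risk gating from Article 5 concrete before you read it, here is a minimal sketch of the LOW / MEDIUM / HIGH decision. It is not the code from `simple-ai-agent`: the names `RiskLevel`, `Decision`, `PlaybookStep`, `gate`, and `plan` are hypothetical, and stopping a playbook at the first step that cannot auto-execute is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class Decision(Enum):
    EXECUTE = "execute"                # run immediately
    AWAIT_APPROVAL = "await_approval"  # wait for a human approve/reject
    WARN_ONLY = "warn_only"            # notify, never run automatically


@dataclass(frozen=True)
class PlaybookStep:
    # Hypothetical shape; the real project may model steps differently.
    name: str
    risk: RiskLevel


def gate(step: PlaybookStep) -> Decision:
    """Map a step's risk level to what the executor is allowed to do."""
    if step.risk is RiskLevel.LOW:
        return Decision.EXECUTE
    if step.risk is RiskLevel.MEDIUM:
        return Decision.AWAIT_APPROVAL
    return Decision.WARN_ONLY


def plan(steps: list[PlaybookStep]) -> list[tuple[str, Decision]]:
    """Walk the steps in order and stop at the first one that is not
    auto-executable (an assumption: later steps wait for that gate)."""
    decisions: list[tuple[str, Decision]] = []
    for step in steps:
        decision = gate(step)
        decisions.append((step.name, decision))
        if decision is not Decision.EXECUTE:
            break
    return decisions


if __name__ == "__main__":
    playbook = [
        PlaybookStep("collect_pod_logs", RiskLevel.LOW),
        PlaybookStep("restart_deployment", RiskLevel.MEDIUM),
        PlaybookStep("delete_persistent_volume", RiskLevel.HIGH),
    ]
    for name, decision in plan(playbook):
        print(f"{name}: {decision.value}")
```

Keeping the gate a pure function of the step makes it easy to unit test separately from the approval store, which Article 5 backs with Redis and a TTL.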

### Phase 3: Integration and Observability (Articles 7–8)

Wiring the AIOps engine into the production alerting pipeline and making the system itself observable.

7. [**Alertmanager Integration — Bringing Prometheus Alerts into the Agent**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-alertmanager-integration)
   * `POST /api/alert/webhook` receiver design
   * Alertmanager routing configuration (`alertmanager.yml`)
   * Translating Prometheus alert labels into `ClusterEvent` objects
   * Alert deduplication and resolved-alert handling
   * End-to-end flow: metric → Prometheus rule → Alertmanager → agent → playbook
8. [**Observability — Monitoring Your AIOps Agent**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-observability)
   * `structlog` JSON logging and why it matters for alert correlation
   * Prometheus metrics from the agent itself
   * Health endpoint design: subsystem-level status
   * Grafana dashboard setup
   * Debugging the watch-loop and approval flow
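
As a preview of the label translation covered in Article 7, the sketch below turns an Alertmanager webhook payload into event objects. The payload keys (`alerts`, `status`, `labels`, `fingerprint`) follow Alertmanager's documented webhook format. The `ClusterEvent` fields and the helper names `events_from_webhook` and `latest_by_fingerprint` are assumptions for illustration, as is the choice of which labels to read; the project's actual model may differ.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ClusterEvent:
    # Hypothetical fields, chosen for this sketch only.
    kind: str
    namespace: str
    resource: str
    severity: str
    resolved: bool
    fingerprint: str


def events_from_webhook(payload: dict[str, Any]) -> list[ClusterEvent]:
    """Translate each alert in an Alertmanager webhook body into an event."""
    events: list[ClusterEvent] = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        events.append(
            ClusterEvent(
                kind=labels.get("alertname", "unknown"),
                namespace=labels.get("namespace", "default"),
                # Which label names the workload depends on your rules;
                # "pod" and "deployment" are assumed here.
                resource=labels.get("pod") or labels.get("deployment") or "unknown",
                severity=labels.get("severity", "warning"),
                resolved=alert.get("status") == "resolved",
                fingerprint=alert.get("fingerprint", ""),
            )
        )
    return events


def latest_by_fingerprint(events: list[ClusterEvent]) -> dict[str, ClusterEvent]:
    """Deduplicate: keep only the most recent event per alert fingerprint."""
    latest: dict[str, ClusterEvent] = {}
    for event in events:
        latest[event.fingerprint] = event
    return latest


if __name__ == "__main__":
    labels = {
        "alertname": "KubePodCrashLooping",
        "namespace": "demo",
        "pod": "api-0",
        "severity": "critical",
    }
    payload = {
        "version": "4",
        "status": "firing",
        "alerts": [
            {"status": "firing", "labels": labels, "fingerprint": "abc123"},
            {"status": "resolved", "labels": labels, "fingerprint": "abc123"},
        ],
    }
    for event in latest_by_fingerprint(events_from_webhook(payload)).values():
        print(event)
```

Keying on the fingerprint means a later `resolved` notification replaces the earlier `firing` one for the same alert, which is one simple way to approach the resolved-alert handling listed above.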

## How to Follow Along

Clone the reference project before starting:

```bash
git clone https://github.com/Htunn/simple-ai-agent.git
cd simple-ai-agent
cp .env.example .env
# Fill in ANTHROPIC_API_KEY and one bot token (Telegram or Slack), then:
docker compose up -d postgres redis
./scripts/start_server.sh
```

Each article references specific files in the repository using paths like `src/aiops/watchloop.py`, so you can read the real implementation alongside the explanation.

## Series File Structure

```
artificial-intelligence/aiops-101/
├── README.md                                (this file)
├── aiops-101-what-is-aiops.md               (Article 1)
├── aiops-101-architecture.md                (Article 2)
├── aiops-101-watch-loop.md                  (Article 3)
├── aiops-101-rule-engine.md                 (Article 4)
├── aiops-101-playbooks-human-in-the-loop.md (Article 5)
├── aiops-101-rca-engine.md                  (Article 6)
├── aiops-101-alertmanager-integration.md    (Article 7)
└── aiops-101-observability.md               (Article 8)
```

***

Ready? Start with [Article 1: What is AIOps](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/aiops-101/aiops-101-what-is-aiops).
