AIOps 101
A hands-on series on building AIOps capabilities using Python, Kubernetes, LLMs, and the tools I actually use in my personal projects. Every article draws directly from my open-source project simple-ai-agent, a production-ready AI agent that connects Telegram and Slack to Kubernetes cluster management, automated remediation, and LLM-powered root cause analysis.
What This Series Is About
AIOps gets thrown around a lot as a buzzword. In practice, it means using AI and automation to reduce the toil of operating software systems: catching incidents before humans notice, correlating alerts into coherent events, and executing remediation steps with appropriate human oversight.
This series is not about the buzzword. It's about the engineering: what the components are, how they wire together, what the failure modes look like in real clusters, and how to implement them from scratch using tools you likely already have.
I built simple-ai-agent as a personal project to manage my own Kubernetes homelab through Telegram and Slack. What started as a chatbot that could list pods grew into a full AIOps engine: a background watch-loop that detects CrashLoopBackOff and OOMKilled events, a YAML rule engine, playbook-based remediation with risk gating, and an LLM that writes SRE-quality root cause analysis reports. This series documents that journey and the design decisions behind every component.
Project Reference
All code examples, architecture diagrams, and configuration snippets in this series reference the actual implementation at:
github.com/Htunn/simple-ai-agent
The project is MIT-licensed. You can run it locally, adapt it for your own infrastructure, and follow along with real code rather than toy examples.
Who This Series Is For
SREs and platform engineers who manage Kubernetes and want to reduce on-call toil
Backend engineers curious how LLMs integrate with operational tooling
Hobbyists who run a homelab and want something smarter than static Prometheus alerts
Anyone building their own AI agent who wants to extend it toward operations
Prerequisites
Comfortable with Python (async/await, Pydantic, FastAPI basics)
Basic Kubernetes knowledge (pods, deployments, events, kubectl)
Familiarity with Prometheus and the concept of metrics/alerts
A Kubernetes cluster to experiment against (a local kind or minikube cluster works fine)
Series Structure
Phase 1: Foundations (Articles 1-2)
Understanding what AIOps means in practice and how the overall system is architected.
What is AIOps, and Why I Built One
The gap between Prometheus alerts and actual resolution
What an AIOps system actually does (watch, detect, decide, act)
My goals when starting simple-ai-agent
A tour of the final system
Architecture and Stack Decisions
Layered architecture: Channel → API → Business Logic → AIOps → Data
Why FastAPI, PostgreSQL, and Redis for this problem
MCP (Model Context Protocol) for tool execution
How the chat channels (Telegram/Slack) connect to the operations backend
Phase 2: The AIOps Engine (Articles 3-6)
Building the four core components: the watch-loop, rule engine, playbook executor, and RCA engine.
The Watch-Loop: Continuous Cluster Health Polling
The background polling pattern and why pull beats push for homelab use
What cluster events to watch for: CrashLoopBackOff, OOMKilled, NotReady nodes, zero replicas
ClusterEvent data model design
Graceful startup, shutdown, and error containment
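
As a rough sketch of the polling pattern outlined above (not the project's actual code), the loop below polls on an interval, contains per-poll errors, and shuts down promptly. The ClusterEvent fields, the fetch_events callable, and the 30-second interval are all assumptions made for illustration; a real loop would query the Kubernetes API instead of taking a stub.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List


@dataclass(frozen=True)
class ClusterEvent:
    # Hypothetical fields; the real model in the repository may differ.
    kind: str        # e.g. "CrashLoopBackOff", "OOMKilled"
    namespace: str
    resource: str
    message: str = ""


async def watch_loop(
    fetch_events: Callable[[], Awaitable[List[ClusterEvent]]],
    handle: Callable[[ClusterEvent], None],
    stop: asyncio.Event,
    interval: float = 30.0,
) -> None:
    """Poll until `stop` is set; one failed poll never kills the loop."""
    while not stop.is_set():
        try:
            for event in await fetch_events():
                handle(event)
        except Exception as exc:  # error containment: report and keep polling
            print(f"poll failed: {exc!r}")
        try:
            # Sleep for `interval`, but wake immediately on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
```

Waiting on the stop event rather than calling asyncio.sleep means a shutdown request does not have to wait out the remainder of the polling interval.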
The Rule Engine: Turning Events into Actionable Alerts
YAML-defined alert rules and severity mapping
Pattern matching against ClusterEvent objects
Avoiding alert storms: deduplication and cooldown windows
Real rule examples from config/alert_rules.yml
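
To make the matching and cooldown ideas concrete, here is a minimal sketch under stated assumptions. The rule schema (name, match, severity, cooldown_seconds) is hypothetical and is not taken from config/alert_rules.yml; rules are written as plain dicts of the shape a YAML loader might produce, and events are plain dicts rather than ClusterEvent objects to keep the example self-contained.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical rules, shaped like what a YAML file might load into.
RULES: List[dict] = [
    {"name": "pod-crashloop", "match": {"kind": "CrashLoopBackOff"},
     "severity": "critical", "cooldown_seconds": 300},
    {"name": "pod-oom", "match": {"kind": "OOMKilled"},
     "severity": "warning", "cooldown_seconds": 600},
]


@dataclass
class RuleEngine:
    rules: List[dict]
    # (rule name, resource) -> time the last alert fired
    _last_fired: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def evaluate(self, event: dict, now: Optional[float] = None) -> Optional[dict]:
        """Return an alert for the first matching rule, or None if suppressed."""
        now = time.monotonic() if now is None else now
        for rule in self.rules:
            # A rule matches only if every key in `match` equals the event's value.
            if any(event.get(k) != v for k, v in rule["match"].items()):
                continue
            key = (rule["name"], event.get("resource", ""))
            last = self._last_fired.get(key)
            if last is not None and now - last < rule["cooldown_seconds"]:
                return None  # same rule and resource inside the cooldown window
            self._last_fired[key] = now
            return {"rule": rule["name"], "severity": rule["severity"], "event": event}
        return None
```

Keying the cooldown on rule and resource together means a crash-looping pod cannot flood the channel, while a second pod hitting the same rule still raises its own alert.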
Playbooks and Human-in-the-Loop Approvals
Ordered remediation step sequences
Risk levels: LOW (auto-execute) / MEDIUM (gate on approval) / HIGH (warn only)
Redis-backed approval manager with TTL expiry
Chat-native approve / reject pattern
Why full auto-remediation is almost always wrong
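
The risk gating described above might look something like the sketch below. It is illustrative only: the step format, the execute and request_approval callables, and the choice to stop at the first step that cannot run automatically are assumptions, and the Redis-backed approval store is left out entirely.

```python
from enum import Enum
from typing import Callable, Dict, List


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def run_playbook(
    steps: List[dict],
    execute: Callable[[str], None],
    request_approval: Callable[[str], None],
) -> Dict[str, str]:
    """Walk the steps in order and record what happened to each."""
    outcome: Dict[str, str] = {}
    for step in steps:
        action, risk = step["action"], step["risk"]
        if risk is Risk.LOW:
            execute(action)
            outcome[action] = "executed"
            continue
        if risk is Risk.MEDIUM:
            request_approval(action)  # e.g. post an approve / reject prompt to chat
            outcome[action] = "pending_approval"
        else:
            outcome[action] = "warn_only"  # HIGH: report, never run automatically
        break  # assumption: later steps wait until a human has acted
    return outcome
```

Stopping at the gate keeps the step order meaningful: a later low-risk step never runs ahead of an earlier step that is still waiting for a human decision.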
LLM-Powered Root Cause Analysis
Designing the SRE prompt for structured JSON output
Anthropic API: claude-3-5-sonnet-20241022
Confidence scoring and evidence extraction
Integrating RCA reports into the chat response
Hallucination mitigation: grounding the LLM with real cluster data
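
One way to picture the structured-output and grounding ideas listed above is a small validator that runs on the model's reply before anything reaches the chat. This is a sketch under assumptions: the JSON keys (root_cause, confidence, evidence) and the substring-based grounding check are hypothetical, and no API call is made here.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class RCAReport:
    root_cause: str
    confidence: float
    evidence: List[str]


def parse_rca(raw: str, cluster_context: str) -> RCAReport:
    """Parse the model's JSON reply and keep only evidence found in the context."""
    data = json.loads(raw)  # raises ValueError if the reply is not valid JSON
    # Clamp the self-reported score into [0, 1].
    confidence = min(1.0, max(0.0, float(data["confidence"])))
    # Keep only evidence strings that literally appear in the data we sent.
    grounded = [e for e in data.get("evidence", []) if e in cluster_context]
    if not grounded:
        # Nothing the model cited is verifiable, so do not trust its score.
        confidence = 0.0
    return RCAReport(str(data["root_cause"]), confidence, grounded)
```

Exact substring matching is deliberately strict. It will drop paraphrased evidence, which is the safer failure mode when the report may drive a remediation decision.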
Phase 3: Integration and Observability (Articles 7-8)
Wiring the AIOps engine into the production alerting pipeline and making the system itself observable.
Alertmanager Integration: Bringing Prometheus Alerts into the Agent
POST /api/alert/webhook receiver design
Alertmanager routing configuration (alertmanager.yml)
Translating Prometheus alert labels into ClusterEvent objects
Alert deduplication and resolved-alert handling
End-to-end flow: metric → Prometheus rule → Alertmanager → agent → playbook
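
The label translation step can be sketched as a pure function over the Alertmanager webhook payload, which carries an alerts list where each alert has a status, labels, and annotations. The output shape is an assumption (plain dicts standing in for ClusterEvent), and the namespace, pod, and severity labels and the summary annotation are common conventions that depend on how the alerting rules are written.

```python
from typing import Dict, List


def alerts_to_events(payload: dict) -> List[Dict[str, str]]:
    """Turn firing alerts into event dicts; resolved alerts are skipped here."""
    events = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # a fuller version might close out the matching open event
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        events.append({
            "kind": labels.get("alertname", "Unknown"),
            # The labels below are conventional, not guaranteed to be present.
            "namespace": labels.get("namespace", ""),
            "resource": labels.get("pod", ""),
            "severity": labels.get("severity", "warning"),
            "message": annotations.get("summary", ""),
        })
    return events
```

Keeping the translation free of I/O makes it easy to test against recorded payloads before wiring it into the webhook handler.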
Observability: Monitoring Your AIOps Agent
structlog JSON logging and why it matters for alert correlation
Prometheus metrics from the agent itself
Health endpoint design: subsystem-level status
Grafana dashboard setup
Debugging the watch-loop and approval flow
How to Follow Along
Clone the reference project before starting:

    git clone https://github.com/Htunn/simple-ai-agent.git
    cd simple-ai-agent

Each article references specific files in the repository using paths like src/aiops/watchloop.py, so you can read the real implementation alongside the explanation.
Series File Structure
Ready? Start with Article 1: What is AIOps.