Article 6: LLM-Powered Root Cause Analysis

Introduction

When a pod crashes in production, the watch-loop detects it, the rule engine classifies it as critical, and the notification arrives in Telegram. But the notification alone — "pod api-server-abc123 is in CrashLoopBackOff" — doesn't tell me why.

The RCA engine is the component that tries to answer "why". It gathers evidence from the cluster (pod logs, recent events, resource metrics), constructs a structured prompt for an LLM, and returns a diagnosis in a format the notification system can embed directly into the alert.

This article covers src/aiops/rca_engine.py and src/ai/prompt_manager.py from simple-ai-agent.


What RCA Means Here

Root cause analysis in the context of this project is not the formal incident postmortem process you'd run after a major outage. It's something smaller and faster: given a cluster anomaly that just fired, what's the most likely cause, what's the recommended next step, and how confident is the model in that assessment?

The output needs to be useful in a notification. It needs to be a paragraph, not a report. And it needs to be grounded in actual data from the cluster — not an LLM's general knowledge about Kubernetes.


Evidence Collection

Before any LLM call, the RCA engine collects evidence. Evidence is the raw material that goes into the prompt. Without it, the LLM is reasoning in a vacuum.

Why Partial Failure is Acceptable

The Kubernetes API can fail for individual resources even when the cluster is healthy. A pod that's mid-restart may not have logs yet. A node event may occur while the node is unreachable. If evidence collection fails completely, the RCA falls back to a generic analysis using only the ClusterEvent metadata. A lower-quality RCA is better than no notification.
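That per-source tolerance can be sketched as follows; the fetcher callables here are hypothetical stand-ins for the real Kubernetes API helpers:

```python
from typing import Callable, Dict, Optional

def collect_evidence(
    fetchers: Dict[str, Callable[[], str]],
) -> Dict[str, Optional[str]]:
    """Run each evidence fetcher independently; a failure in one
    source must not discard evidence from the others."""
    evidence: Dict[str, Optional[str]] = {}
    for name, fetch in fetchers.items():
        try:
            evidence[name] = fetch()
        except Exception:
            # e.g. a pod mid-restart has no logs yet, or the node is unreachable
            evidence[name] = None
    return evidence
```

If every value comes back `None`, the caller can fall back to the generic, metadata-only analysis described above.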


Prompt Design for SRE Context

The prompt is the most important part. A poorly structured prompt produces generic, unhelpful output. The SRE context prompt I use has four sections.
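A sketch of how such a prompt might be assembled. The section wording, helper name, and schema keys are illustrative assumptions, not the exact text of prompt_manager.py; the 30-line log cap and the grounding instruction follow the design choices discussed in this section:

```python
from typing import Optional

MAX_LOG_LINES = 30  # cap per the design choice below; numbers past this add cost, not signal

# Illustrative system prompt: the explicit "do not invent" grounding rule
SYSTEM_PROMPT = (
    "You are an SRE assistant. Ground your analysis in the provided "
    "evidence. Do not invent causes the evidence does not support. "
    "Respond only with a JSON object."
)

def build_rca_prompt(event: dict, logs: str, previous_rca: Optional[str]) -> str:
    # Keep only the tail: the crash-causing exception is almost always near the end
    tail = "\n".join(logs.splitlines()[-MAX_LOG_LINES:])
    sections = [
        f"## Anomaly\n{event}",
        f"## Recent logs (last {MAX_LOG_LINES} lines)\n{tail}",
        "## Required output\nRespond with JSON keys: summary, root_cause, "
        "recommended_action, confidence, evidence_used",
    ]
    if previous_rca:
        # Repeated crashes: the pattern matters more than a single incident
        sections.append(f"## Previous RCA for this resource\n{previous_rca}")
    return "\n\n".join(sections)
```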

Key Design Choices

System prompt explicitly forbids inventing causes. Without this, LLMs will confidently produce plausible-sounding diagnoses with no basis in the actual logs. Grounding instructions matter.

Log lines are capped at 30. Sending 500 lines of logs into a prompt is expensive and usually counterproductive — the important signal is almost always in the last few lines, particularly the final exception before the crash.

evidence_used field in output. This makes the diagnosis inspectable. If the model says "confidence: high" but evidence_used is empty, something is wrong. I can log this and investigate.

Previous RCA is included when available. If the same pod has crashed three times in the last hour and each RCA says "OOMKilled: memory limit 256Mi is too low", that pattern is more meaningful than any single incident.


Calling the Anthropic API

simple-ai-agent uses the Anthropic Python SDK directly. Install it with pip install anthropic and set ANTHROPIC_API_KEY in your environment.

Using temperature=0.1 for RCA is deliberate — this is a factual analysis task, not creative generation. Low temperature produces more deterministic, reproducible outputs across multiple calls for the same input.

The Anthropic API doesn't have a response_format parameter like OpenAI. Instead, JSON output is enforced via the system prompt instruction — tell the model to respond only with a JSON object — and validated with json.loads() on the response. The parse_rca_response function already handles JSONDecodeError gracefully.
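A hedged sketch of that call path using the Anthropic SDK. `run_rca` is my own naming, and the fallback fields in the parser are illustrative; `parse_rca_response` mirrors the function named above:

```python
import json

def run_rca(prompt: str, system: str) -> dict:
    # Lazy import so the parsing helper below also works without the SDK installed
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        temperature=0.1,   # factual analysis, not creative generation
        system=system,     # carries the JSON-only and grounding instructions
        messages=[{"role": "user", "content": prompt}],
    )
    return parse_rca_response(resp.content[0].text)

def parse_rca_response(text: str) -> dict:
    """Graceful handling of non-JSON replies, as described above
    (the exact fallback fields here are an assumption)."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {"summary": text.strip(), "confidence": "low", "evidence_used": []}
```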

Model Used for RCA

Model: claude-3-5-sonnet-20241022
Notes: Primary model for all RCA — strong reasoning over structured technical evidence, good at following JSON output instructions consistently


Structured JSON Output

The LLM returns JSON. The RCA engine parses and validates it:
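The validation code itself isn't reproduced here, so the following is an illustrative version; the field names and the degrade-to-low rule are assumptions consistent with the schema discussed above:

```python
from dataclasses import dataclass, field
from typing import List

ALLOWED_CONFIDENCE = {"high", "medium", "low"}

@dataclass
class RCAResult:
    summary: str
    root_cause: str
    recommended_action: str
    confidence: str
    evidence_used: List[str] = field(default_factory=list)

def validate_rca(payload: dict) -> RCAResult:
    # Unknown or missing confidence degrades to "low", never upgrades
    confidence = str(payload.get("confidence", "low")).lower()
    if confidence not in ALLOWED_CONFIDENCE:
        confidence = "low"
    return RCAResult(
        summary=payload.get("summary", ""),
        root_cause=payload.get("root_cause", ""),
        recommended_action=payload.get("recommended_action", ""),
        confidence=confidence,
        evidence_used=list(payload.get("evidence_used") or []),
    )
```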


Confidence Scoring and Hallucination Mitigation

Low confidence does not mean low quality — it means the model correctly recognized that the evidence was insufficient to make a strong claim. I treat this as a feature, not a failure.

The RCA result includes confidence in the notification:

  • High confidence: Notification leads with the diagnosis

  • Medium confidence: Notification includes diagnosis with "(medium confidence)"

  • Low confidence: Notification says "Insufficient diagnostic data — manual review required" and lists what to check

This prevents the alert system from presenting uncertain analysis as certain fact.
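The tiering above can be sketched as a single rendering function (field names assumed, not the project's actual code):

```python
def render_diagnosis(result: dict) -> str:
    confidence = result.get("confidence", "low")
    if confidence == "high":
        # High confidence: lead with the diagnosis itself
        return result["summary"]
    if confidence == "medium":
        return f"{result['summary']} (medium confidence)"
    # Low confidence: never present the guess as a diagnosis at all
    checks = result.get("recommended_action") or "pod logs, recent events, resource limits"
    return f"Insufficient diagnostic data — manual review required. Check: {checks}"
```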

Additional Guard Rails
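One guard rail follows directly from the evidence_used design choice described earlier; a sketch (illustrative, not the project's actual code):

```python
def enforce_grounding(result: dict) -> dict:
    """A diagnosis that claims high confidence while citing no evidence
    is suspect: downgrade its confidence so the notification does not
    present an ungrounded claim as certain (the mismatch can be logged)."""
    if result.get("confidence") == "high" and not result.get("evidence_used"):
        result = dict(result, confidence="low")
    return result
```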


RCA Engine Implementation

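The implementation itself isn't reproduced here. As a structural sketch, the pipeline this article describes (collect evidence, build the prompt, call the LLM, parse and validate) might be wired up like this, with all collaborators injected so the flow is testable without a cluster or an API key; the names are illustrative:

```python
from typing import Callable

class RCAEngine:
    """Structural sketch of the RCA pipeline, not the actual code."""

    def __init__(self, collect: Callable, build_prompt: Callable,
                 call_llm: Callable, parse: Callable):
        self.collect = collect            # evidence collection (logs, events, metrics)
        self.build_prompt = build_prompt  # SRE context prompt assembly
        self.call_llm = call_llm          # Anthropic API call
        self.parse = parse                # JSON parsing + validation

    def analyze(self, event: dict) -> dict:
        evidence = self.collect(event)
        if not any(evidence.values()):
            # Total evidence failure: generic analysis from event metadata only
            prompt = self.build_prompt(event, evidence={})
        else:
            prompt = self.build_prompt(event, evidence=evidence)
        return self.parse(self.call_llm(prompt))
```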

What a Real RCA Response Looks Like

Here's an example of what I see in Telegram when a pod crashes:
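The screenshot isn't reproduced here, but based on the description that follows, the JSON behind that notification would look roughly like this (reconstructed for illustration, not actual output):

```json
{
  "summary": "Pod is crash-looping: startup fails with 'connection refused' when opening a PostgreSQL connection; the database appears to have hit its connection limit.",
  "root_cause": "PostgreSQL refusing new connections, likely because max_connections is exhausted",
  "recommended_action": "Check pg_stat_activity for idle connections and verify the app's connection pool size",
  "confidence": "medium",
  "evidence_used": ["last 30 log lines", "CrashLoopBackOff events"]
}
```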

The medium confidence reflects that the LLM could diagnose the connection refused errors clearly, but couldn't confirm the root cause of the connection limit being reached without access to pg_stat_activity data.


What I Learned

The prompt is most of the work. I spent more time iterating the prompt than I spent writing the RCA engine code. Small changes in phrasing — especially in the system prompt — change output quality dramatically. "Ground your analysis in the provided evidence" reduced hallucination rate substantially compared to prompts without that constraint.

Log capping is necessary but blunt. Capping at 30 lines works most of the time because Kubernetes application logs usually contain the crash-causing exception near the end. But I've had cases where the relevant error was 200 lines before the final crash — a config parse error at startup that the app kept retrying silently. I'm considering semantic log compression (extract exception lines + last N lines) as an improvement.

LLM latency is real and must be factored into the UX. My RCA calls take 5–15 seconds. The notification appears immediately with "RCA in progress..." and then the message is updated with the analysis. This two-phase notification required restructuring how the notifier sent messages (edit-after-send via Telegram's editMessageText API) but was worth it — a 10-second blank pause in an alert notification was confusing.
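A sketch of that edit-after-send flow. `post` is an injected helper (e.g. a thin wrapper around an HTTP POST to the Bot API's sendMessage and editMessageText methods), so the flow can be exercised without network access; the wiring is illustrative:

```python
from typing import Callable

def notify_with_rca(post: Callable, chat_id: int, alert: str,
                    run_rca: Callable[[], str]) -> None:
    # Phase 1: alert immediately, with a placeholder for the analysis
    sent = post("sendMessage", {
        "chat_id": chat_id,
        "text": alert + "\n\nRCA in progress...",
    })
    message_id = sent["result"]["message_id"]

    # Phase 2: edit the same message once the (5-15s) LLM call returns
    diagnosis = run_rca()
    post("editMessageText", {
        "chat_id": chat_id,
        "message_id": message_id,
        "text": alert + "\n\n" + diagnosis,
    })
```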

Cache the summary, not the full result. I cache result.summary in Redis for 30 minutes per resource. If the same pod crashes five times in an hour, only the first crash gets a fresh LLM call. Subsequent crashes include the cached summary in context ("Previous RCA from 8 minutes ago: ...") so the second analysis can focus on what changed rather than re-diagnosing the same thing from scratch.
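A sketch of that caching flow against a redis-py style client (get/setex); the key format and the previous_rca parameter are illustrative assumptions:

```python
SUMMARY_TTL_SECONDS = 30 * 60  # 30 minutes per resource

def rca_with_cache(redis_client, resource_key: str, run_fresh_rca):
    """Only the first crash in the TTL window is diagnosed from scratch;
    later crashes feed the cached summary back in as context so the
    analysis can focus on what changed."""
    cache_key = f"rca:summary:{resource_key}"
    cached = redis_client.get(cache_key)
    if cached is not None:
        previous = cached.decode() if isinstance(cached, bytes) else cached
        result = run_fresh_rca(previous_rca=previous)
    else:
        result = run_fresh_rca(previous_rca=None)
    # Cache only the summary, not the full result
    redis_client.setex(cache_key, SUMMARY_TTL_SECONDS, result["summary"])
    return result
```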


Next: Article 7 — Alertmanager Webhook Integration
