Article 6: LLM-Powered Root Cause Analysis

Introduction

When a pod crashes in production, the watch-loop detects it, the rule engine classifies it as critical, and the notification arrives in Telegram. But the notification alone — "pod api-server-abc123 is in CrashLoopBackOff" — doesn't tell me why.

The RCA engine is the component that tries to answer "why". It gathers evidence from the cluster (pod logs, recent events, resource metrics), constructs a structured prompt for an LLM, and returns a diagnosis in a format the notification system can embed directly into the alert.

This article covers src/aiops/rca_engine.py and src/ai/prompt_manager.py from simple-ai-agent.


What RCA Means Here

Root cause analysis in the context of this project is not the formal incident postmortem process you'd run after a major outage. It's something smaller and faster: given a cluster anomaly that just fired, what's the most likely cause, what's the recommended next step, and how confident is the model in that assessment?

The output needs to be useful in a notification. It needs to be a paragraph, not a report. And it needs to be grounded in actual data from the cluster — not an LLM's general knowledge about Kubernetes.


Evidence Collection

Before any LLM call, the RCA engine collects evidence. Evidence is the raw material that goes into the prompt. Without it, the LLM is reasoning in a vacuum.

Why Partial Failure is Acceptable

The Kubernetes API can fail for individual resources even when the cluster is healthy. A pod that's mid-restart may not have logs yet. A node event may occur while the node is unreachable. If evidence collection fails completely, the RCA falls back to a generic analysis using only the ClusterEvent metadata. A lower-quality RCA is better than no notification.
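That per-source tolerance can be sketched as follows; the fetcher callables here are hypothetical stand-ins for the real Kubernetes API helpers:

```python
from typing import Callable, Dict, Optional

def collect_evidence(
    fetchers: Dict[str, Callable[[], str]],
) -> Dict[str, Optional[str]]:
    """Run each evidence fetcher independently; a failure in one
    source must not discard evidence from the others."""
    evidence: Dict[str, Optional[str]] = {}
    for name, fetch in fetchers.items():
        try:
            evidence[name] = fetch()
        except Exception:
            # e.g. a pod mid-restart has no logs yet, or the node is unreachable
            evidence[name] = None
    return evidence
```

If every value comes back `None`, the caller can fall back to the generic, metadata-only analysis described above.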


Prompt Design for SRE Context

The prompt is the most important part. A poorly structured prompt produces generic, unhelpful output. The SRE context prompt I use has four sections.
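A sketch of how such a prompt might be assembled. The section wording, helper name, and schema keys are illustrative assumptions, not the exact text of prompt_manager.py; the 30-line log cap and the grounding instruction follow the design choices discussed in this section:

```python
from typing import Optional

MAX_LOG_LINES = 30  # cap per the design choice below; numbers past this add cost, not signal

# Illustrative system prompt: the explicit "do not invent" grounding rule
SYSTEM_PROMPT = (
    "You are an SRE assistant. Ground your analysis in the provided "
    "evidence. Do not invent causes the evidence does not support. "
    "Respond only with a JSON object."
)

def build_rca_prompt(event: dict, logs: str, previous_rca: Optional[str]) -> str:
    # Keep only the tail: the crash-causing exception is almost always near the end
    tail = "\n".join(logs.splitlines()[-MAX_LOG_LINES:])
    sections = [
        f"## Anomaly\n{event}",
        f"## Recent logs (last {MAX_LOG_LINES} lines)\n{tail}",
        "## Required output\nRespond with JSON keys: summary, root_cause, "
        "recommended_action, confidence, evidence_used",
    ]
    if previous_rca:
        # Repeated crashes: the pattern matters more than a single incident
        sections.append(f"## Previous RCA for this resource\n{previous_rca}")
    return "\n\n".join(sections)
```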

Key Design Choices

System prompt explicitly forbids inventing causes. Without this, LLMs will confidently produce plausible-sounding diagnoses with no basis in the actual logs. Grounding instructions matter.

Log lines are capped at 30. Sending 500 lines of logs into a prompt is expensive and usually counterproductive — the important signal is almost always in the last few lines, particularly the final exception before the crash.

evidence_used field in output. This makes the diagnosis inspectable. If the model says "confidence: high" but evidence_used is empty, something is wrong. I can log this and investigate.

Previous RCA is included when available. If the same pod has crashed three times in the last hour and each RCA says "OOMKilled: memory limit 256Mi is too low", that pattern is more meaningful than any single incident.


Calling the Anthropic API

simple-ai-agent uses the Anthropic Python SDK directly. Install it with pip install anthropic and set ANTHROPIC_API_KEY in your environment.

Using temperature=0.1 for RCA is deliberate — this is a factual analysis task, not creative generation. Low temperature produces more deterministic, reproducible outputs across multiple calls for the same input.

The Anthropic API doesn't have a response_format parameter like OpenAI. Instead, JSON output is enforced via the system prompt instruction — tell the model to respond only with a JSON object — and validated with json.loads() on the response. The parse_rca_response function already handles JSONDecodeError gracefully.
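A hedged sketch of that call path using the Anthropic SDK. `run_rca` is my own naming, and the fallback fields in the parser are illustrative; `parse_rca_response` mirrors the function named above:

```python
import json

def run_rca(prompt: str, system: str) -> dict:
    # Lazy import so the parsing helper below also works without the SDK installed
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        temperature=0.1,   # factual analysis, not creative generation
        system=system,     # carries the JSON-only and grounding instructions
        messages=[{"role": "user", "content": prompt}],
    )
    return parse_rca_response(resp.content[0].text)

def parse_rca_response(text: str) -> dict:
    """Graceful handling of non-JSON replies, as described above
    (the exact fallback fields here are an assumption)."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {"summary": text.strip(), "confidence": "low", "evidence_used": []}
```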

Model Used for RCA

Model: claude-3-5-sonnet-20241022
Notes: Primary model for all RCA — strong reasoning over structured technical evidence, good at following JSON output instructions consistently


Structured JSON Output

The LLM returns JSON. The RCA engine parses and validates it:
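The validation code itself isn't reproduced here, so the following is an illustrative version; the field names and the degrade-to-low rule are assumptions consistent with the schema discussed above:

```python
from dataclasses import dataclass, field
from typing import List

ALLOWED_CONFIDENCE = {"high", "medium", "low"}

@dataclass
class RCAResult:
    summary: str
    root_cause: str
    recommended_action: str
    confidence: str
    evidence_used: List[str] = field(default_factory=list)

def validate_rca(payload: dict) -> RCAResult:
    # Unknown or missing confidence degrades to "low", never upgrades
    confidence = str(payload.get("confidence", "low")).lower()
    if confidence not in ALLOWED_CONFIDENCE:
        confidence = "low"
    return RCAResult(
        summary=payload.get("summary", ""),
        root_cause=payload.get("root_cause", ""),
        recommended_action=payload.get("recommended_action", ""),
        confidence=confidence,
        evidence_used=list(payload.get("evidence_used") or []),
    )
```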


Confidence Scoring and Hallucination Mitigation

Low confidence does not mean low quality — it means the model correctly recognized that the evidence was insufficient to make a strong claim. I treat this as a feature, not a failure.

The RCA result includes confidence in the notification:

  • High confidence: Notification leads with the diagnosis

  • Medium confidence: Notification includes diagnosis with "(medium confidence)"

  • Low confidence: Notification says "Insufficient diagnostic data — manual review required" and lists what to check

This prevents the alert system from presenting uncertain analysis as certain fact.
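The tiering above can be sketched as a single rendering function (field names assumed, not the project's actual code):

```python
def render_diagnosis(result: dict) -> str:
    confidence = result.get("confidence", "low")
    if confidence == "high":
        # High confidence: lead with the diagnosis itself
        return result["summary"]
    if confidence == "medium":
        return f"{result['summary']} (medium confidence)"
    # Low confidence: never present the guess as a diagnosis at all
    checks = result.get("recommended_action") or "pod logs, recent events, resource limits"
    return f"Insufficient diagnostic data — manual review required. Check: {checks}"
```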

Additional Guard Rails
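One guard rail follows directly from the evidence_used design choice described earlier; a sketch (illustrative, not the project's actual code):

```python
def enforce_grounding(result: dict) -> dict:
    """A diagnosis that claims high confidence while citing no evidence
    is suspect: downgrade its confidence so the notification does not
    present an ungrounded claim as certain (the mismatch can be logged)."""
    if result.get("confidence") == "high" and not result.get("evidence_used"):
        result = dict(result, confidence="low")
    return result
```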


RCA Engine Implementation

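The implementation itself isn't reproduced here. As a structural sketch, the pipeline this article describes (collect evidence, build the prompt, call the LLM, parse and validate) might be wired up like this, with all collaborators injected so the flow is testable without a cluster or an API key; the names are illustrative:

```python
from typing import Callable

class RCAEngine:
    """Structural sketch of the RCA pipeline, not the actual code."""

    def __init__(self, collect: Callable, build_prompt: Callable,
                 call_llm: Callable, parse: Callable):
        self.collect = collect            # evidence collection (logs, events, metrics)
        self.build_prompt = build_prompt  # SRE context prompt assembly
        self.call_llm = call_llm          # Anthropic API call
        self.parse = parse                # JSON parsing + validation

    def analyze(self, event: dict) -> dict:
        evidence = self.collect(event)
        if not any(evidence.values()):
            # Total evidence failure: generic analysis from event metadata only
            prompt = self.build_prompt(event, evidence={})
        else:
            prompt = self.build_prompt(event, evidence=evidence)
        return self.parse(self.call_llm(prompt))
```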

What a Real RCA Response Looks Like

Here's an example of what I see in Telegram when a pod crashes:
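The screenshot isn't reproduced here, but based on the description that follows, the JSON behind that notification would look roughly like this (reconstructed for illustration, not actual output):

```json
{
  "summary": "Pod is crash-looping: startup fails with 'connection refused' when opening a PostgreSQL connection; the database appears to have hit its connection limit.",
  "root_cause": "PostgreSQL refusing new connections, likely because max_connections is exhausted",
  "recommended_action": "Check pg_stat_activity for idle connections and verify the app's connection pool size",
  "confidence": "medium",
  "evidence_used": ["last 30 log lines", "CrashLoopBackOff events"]
}
```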

The medium confidence reflects that the LLM could diagnose the connection refused errors clearly, but couldn't confirm the root cause of the connection limit being reached without access to pg_stat_activity data.


What I Learned

The prompt is most of the work. I spent more time iterating the prompt than I spent writing the RCA engine code. Small changes in phrasing — especially in the system prompt — change output quality dramatically. "Ground your analysis in the provided evidence" reduced hallucination rate substantially compared to prompts without that constraint.

Log capping is necessary but blunt. Capping at 30 lines works most of the time because Kubernetes application logs usually contain the crash-causing exception near the end. But I've had cases where the relevant error was 200 lines before the final crash — a config parse error at startup that the app kept retrying silently. I'm considering semantic log compression (extract exception lines + last N lines) as an improvement.

LLM latency is real and must be factored into the UX. My RCA calls take 5–15 seconds. The notification appears immediately with "RCA in progress..." and then the message is updated with the analysis. This two-phase notification required restructuring how the notifier sent messages (edit-after-send via Telegram's editMessageText API) but was worth it — a 10-second blank pause in an alert notification was confusing.
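A sketch of that edit-after-send flow. `post` is an injected helper (e.g. a thin wrapper around an HTTP POST to the Bot API's sendMessage and editMessageText methods), so the flow can be exercised without network access; the wiring is illustrative:

```python
from typing import Callable

def notify_with_rca(post: Callable, chat_id: int, alert: str,
                    run_rca: Callable[[], str]) -> None:
    # Phase 1: alert immediately, with a placeholder for the analysis
    sent = post("sendMessage", {
        "chat_id": chat_id,
        "text": alert + "\n\nRCA in progress...",
    })
    message_id = sent["result"]["message_id"]

    # Phase 2: edit the same message once the (5-15s) LLM call returns
    diagnosis = run_rca()
    post("editMessageText", {
        "chat_id": chat_id,
        "message_id": message_id,
        "text": alert + "\n\n" + diagnosis,
    })
```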

Cache the summary, not the full result. I cache result.summary in Redis for 30 minutes per resource. If the same pod crashes five times in an hour, only the first crash gets a fresh LLM call. Subsequent crashes include the cached summary in context ("Previous RCA from 8 minutes ago: ...") so the second analysis can focus on what changed rather than re-diagnosing the same thing from scratch.
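A sketch of that caching flow against a redis-py style client (get/setex); the key format and the previous_rca parameter are illustrative assumptions:

```python
SUMMARY_TTL_SECONDS = 30 * 60  # 30 minutes per resource

def rca_with_cache(redis_client, resource_key: str, run_fresh_rca):
    """Only the first crash in the TTL window is diagnosed from scratch;
    later crashes feed the cached summary back in as context so the
    analysis can focus on what changed."""
    cache_key = f"rca:summary:{resource_key}"
    cached = redis_client.get(cache_key)
    if cached is not None:
        previous = cached.decode() if isinstance(cached, bytes) else cached
        result = run_fresh_rca(previous_rca=previous)
    else:
        result = run_fresh_rca(previous_rca=None)
    # Cache only the summary, not the full result
    redis_client.setex(cache_key, SUMMARY_TTL_SECONDS, result["summary"])
    return result
```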


Next: Article 7 — Alertmanager Webhook Integration
