# Part 5: RAG, Fine-Tuning, and Prompt Engineering

*Part of the* [*AI Fundamentals 101 Series*](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-fundamentals-101)

## The Problem: Foundation Models Don't Know Your Stuff

Foundation models are impressive — but out of the box, they have three fundamental gaps:

1. **They don't know your data.** Claude has never seen your internal runbooks, your architecture diagrams, or your Kubernetes cluster configuration.
2. **Their knowledge is frozen.** Whatever was in the training data is all they know. They can't tell you about last night's incident.
3. **They're general-purpose.** They know a little about everything, but they're not specialized in your domain.

There are three strategies to bridge these gaps, and choosing the right one is one of the most important decisions in AI engineering. I've used all three in my own projects, and each has a sweet spot.

***

## The Three Strategies at a Glance

```python
customization_strategies = {
    "Prompt Engineering": {
        "what": "Craft better instructions and provide examples in the prompt",
        "when": "Always — this is your first tool",
        "cost": "Free (no training, no infrastructure)",
        "data_needed": "None to a few examples",
        "latency_impact": "Minimal (slightly longer prompts)",
        "best_for": "Formatting, tone, task definition, few-shot learning"
    },
    "RAG (Retrieval-Augmented Generation)": {
        "what": "Retrieve relevant documents and include them in the prompt",
        "when": "Model needs access to your specific data or current information",
        "cost": "Moderate (vector DB, embedding compute)",
        "data_needed": "Your documents/knowledge base",
        "latency_impact": "Moderate (retrieval step adds 100-500ms)",
        "best_for": "Q&A over docs, knowledge bases, current data access"
    },
    "Fine-Tuning": {
        "what": "Further train the model on your task-specific data",
        "when": "Need consistent behavior that prompting can't achieve",
        "cost": "High (GPU compute, labeled data, ongoing maintenance)",
        "data_needed": "Hundreds to thousands of labeled examples",
        "latency_impact": "None (model runs at same speed)",
        "best_for": "Specialized tone, domain-specific patterns, consistent formatting"
    }
}

for strategy, details in customization_strategies.items():
    print(f"\n{'='*50}")
    print(f"  {strategy}")
    print(f"  When: {details['when']}")
    print(f"  Cost: {details['cost']}")
    print(f"  Best for: {details['best_for']}")
```

***

## Strategy 1: Prompt Engineering

**Prompt engineering** is the practice of designing effective inputs to get better outputs from an LLM. It's not guesswork — it's a systematic approach to communicating with the model.

### Basic Techniques

```python
# Technique 1: Be specific
# Bad prompt:
bad_prompt = "Tell me about servers"
# Could mean anything: hardware, software, configuration, history...

# Good prompt:
good_prompt = """Explain the difference between horizontal and vertical scaling 
for a Kubernetes deployment running a stateless API service. 
Include when to use each approach and the trade-offs."""

# Technique 2: Provide context
context_prompt = """You are a DevOps engineer reviewing infrastructure metrics.

Current cluster state:
- 3 nodes, each with 8 CPU cores and 32GB RAM
- 45 pods running across production and staging namespaces
- Node-1 CPU at 92%, Nodes 2-3 at 45%

Based on these metrics, what actions should I take?"""

# Technique 3: Specify output format
format_prompt = """Analyze this error log and return a JSON response with:
- severity: "critical" | "warning" | "info"
- category: "compute" | "network" | "storage"
- root_cause: one-sentence explanation
- action: recommended fix

Error log: Pod payment-svc-7d8b OOMKilled, memory limit 512Mi exceeded

Return only the JSON, no markdown formatting."""
```

### Few-Shot Prompting

Provide examples of input→output pairs so the model understands the pattern:

```python
few_shot_prompt = """Classify these Kubernetes events by severity.

Examples:
Event: "Pod scheduled on node-2" → Severity: info
Event: "Container image pulled successfully" → Severity: info
Event: "Readiness probe failed, 3 consecutive errors" → Severity: warning
Event: "OOMKilled: container exceeded memory limit" → Severity: critical
Event: "Node disk pressure detected" → Severity: critical

Now classify:
Event: "Liveness probe failed for container api-server" → Severity:"""

# The model will most likely respond "warning", matching the probe-failure
# pattern in the examples (few-shot examples steer the output,
# they don't guarantee it)
```
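When the example set grows, it helps to keep the examples as data and assemble the prompt in code. A minimal sketch, reusing the pattern above (the events and labels are illustrative):

```python
# Few-shot examples kept as (event, severity) pairs so they can be
# edited or extended without touching the prompt template.
EXAMPLES = [
    ("Pod scheduled on node-2", "info"),
    ("Container image pulled successfully", "info"),
    ("Readiness probe failed, 3 consecutive errors", "warning"),
    ("OOMKilled: container exceeded memory limit", "critical"),
]

def build_few_shot_prompt(new_event: str) -> str:
    """Assemble a few-shot classification prompt from the example pairs."""
    lines = ["Classify these Kubernetes events by severity.", "", "Examples:"]
    for event, severity in EXAMPLES:
        lines.append(f'Event: "{event}" → Severity: {severity}')
    lines += ["", "Now classify:", f'Event: "{new_event}" → Severity:']
    return "\n".join(lines)

print(build_few_shot_prompt("Node disk pressure detected"))
```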

### Chain-of-Thought Prompting

Ask the model to reason step-by-step:

```python
cot_prompt = """A Kubernetes pod is periodically restarting (CrashLoopBackOff). 
The pod runs a Java application with these resource limits:
- Memory limit: 512Mi
- CPU limit: 500m
- JVM heap: -Xmx256m

The last 3 restarts show exit code 137 (OOMKilled).

Think through this step by step:
1. What does exit code 137 mean?
2. What's the relationship between the JVM heap and memory limit?
3. What other memory consumers exist in a Java container?
4. What's the likely root cause?
5. What's the recommended fix?"""

# Without chain-of-thought, the model might jump to "increase memory"
# With CoT, it reasons through the Java memory model:
# - JVM heap is 256Mi, but total JVM memory includes metaspace, threads, GC overhead
# - Total JVM memory can be 1.5-2x heap size
# - 512Mi limit is tight for a 256Mi heap
# - Fix: either lower heap or increase memory limit
```

### System Prompts

The system prompt defines the model's behavior for an entire conversation:

```python
system_prompt = """You are a senior DevOps engineer specializing in Kubernetes 
and cloud infrastructure. You give concise, actionable answers.

Rules:
- Always suggest the simplest solution first
- Include kubectl commands when applicable  
- Warn about production safety (always use --dry-run first)
- If you're unsure, say so instead of guessing
- Format commands as code blocks

Context: The user manages a production Kubernetes cluster on AWS EKS
with 5 nodes and ~100 pods across 3 namespaces."""

# This system prompt shapes every response in the conversation.
# The model will consistently give Kubernetes-focused, production-safe answers.
```
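In practice, the system prompt is usually not concatenated into one big string but sent as its own field in the API request. A sketch of the common chat-message structure (field names and placement vary by provider; some accept the system prompt as a top-level parameter rather than a message):

```python
def build_messages(system_prompt: str, history: list[tuple[str, str]],
                   new_user_message: str) -> list[dict]:
    """Assemble a chat-API messages list with the system prompt first."""
    messages = [{"role": "system", "content": system_prompt}]
    for role, content in history:
        messages.append({"role": role, "content": content})
    messages.append({"role": "user", "content": new_user_message})
    return messages

# Illustrative conversation: the system prompt shapes every turn.
msgs = build_messages(
    "You are a senior DevOps engineer specializing in Kubernetes.",
    [("user", "How do I restart a deployment?"),
     ("assistant", "Use `kubectl rollout restart deployment/<name>`.")],
    "Is that safe in production?",
)
```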

***

## Strategy 2: RAG (Retrieval-Augmented Generation)

**RAG** combines information retrieval with text generation. Instead of relying on what the model learned during training, you retrieve relevant documents from your own data and include them in the prompt.

### How RAG Works

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  User Query │────▶│  Retriever  │────▶│  Retrieved  │
│             │     │ (vector DB) │     │  Documents  │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
                                               ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Response   │◀────│  Generator  │◀────│ Prompt with │
│             │     │    (LLM)    │     │ context + Q │
└─────────────┘     └─────────────┘     └─────────────┘
```

### The RAG Pipeline Step by Step

```python
# Step 1: Prepare your documents (one-time setup)
# (simulating document preparation with a small in-memory list)
documents = [
    {
        "id": 1,
        "title": "Scaling payment-service",
        "content": "The payment-service should be scaled to minimum 3 replicas. "
                   "Use HPA with CPU target 70%. Memory limit 1Gi per pod.",
    },
    {
        "id": 2,
        "title": "Database connection pooling",
        "content": "PostgreSQL connection pool max is 100 per pod. "
                   "Use PgBouncer for connection multiplexing in production.",
    },
    {
        "id": 3,
        "title": "Incident response for OOMKilled",
        "content": "OOMKilled indicates container exceeded memory limit. "
                   "Steps: 1) Check current memory usage with kubectl top. "
                   "2) Review memory limits in deployment spec. "
                   "3) Profile application memory with pprof or JVM tools.",
    },
]

# Step 2: Create embeddings (convert text to vectors)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# In production, you'd use sentence-transformers or an API for embeddings
# Here we use TF-IDF for illustration
vectorizer = TfidfVectorizer()
doc_texts = [d["content"] for d in documents]
doc_vectors = vectorizer.fit_transform(doc_texts)

# Step 3: Retrieve relevant documents for a query
def retrieve(query: str, top_k: int = 2) -> list[dict]:
    """Find the most relevant documents for a query."""
    query_vector = vectorizer.transform([query])
    similarities = cosine_similarity(query_vector, doc_vectors)[0]

    # Get top-k most similar documents
    top_indices = similarities.argsort()[-top_k:][::-1]

    results = []
    for idx in top_indices:
        results.append({
            "document": documents[idx],
            "similarity": similarities[idx]
        })
    return results

# Step 4: Build the augmented prompt
def build_rag_prompt(query: str) -> str:
    """Retrieve context and build the prompt."""
    retrieved = retrieve(query, top_k=2)

    context = "\n\n".join([
        f"Document: {r['document']['title']}\n{r['document']['content']}"
        for r in retrieved
        if r["similarity"] > 0.1  # Only include if somewhat relevant
    ])

    return f"""Answer the question based on the provided context.
If the context doesn't contain enough information, say so.

Context:
{context}

Question: {query}

Answer:"""

# Test it
query = "A pod keeps getting OOMKilled. What should I do?"
prompt = build_rag_prompt(query)
print(prompt)
```

Output:

```
Answer the question based on the provided context.
If the context doesn't contain enough information, say so.

Context:
Document: Incident response for OOMKilled
OOMKilled indicates container exceeded memory limit. Steps: 1) Check current 
memory usage with kubectl top. 2) Review memory limits in deployment spec. 
3) Profile application memory with pprof or JVM tools.

Document: Scaling payment-service
The payment-service should be scaled to minimum 3 replicas. Use HPA with CPU 
target 70%. Memory limit 1Gi per pod.

Question: A pod keeps getting OOMKilled. What should I do?

Answer:
```

**The key insight:** The LLM now has your specific documentation as context. It will answer based on *your* runbooks, not generic knowledge.
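One detail the example glosses over: real documents are usually too long to embed whole, so they're split into overlapping chunks before indexing, and each chunk gets its own vector. A minimal word-based chunker sketch (sizes are illustrative; production pipelines often split on tokens or sentences instead):

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word-based chunks.

    The overlap keeps sentences that straddle a chunk boundary
    retrievable from at least one chunk.
    """
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Each chunk (not each whole document) would get its own embedding.
chunks = chunk_text("word " * 120, chunk_size=50, overlap=10)
```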

### Multimodal RAG

Standard RAG works with text. **Multimodal RAG** extends this to images, tables, diagrams, and more.

```python
# Multimodal RAG: retrieving and reasoning over multiple data types

multimodal_rag_sources = {
    "Text Documents": {
        "examples": ["Runbooks", "Architecture docs", "API documentation"],
        "embedding": "Text embeddings (sentence-transformers)",
        "retrieval": "Semantic search over text chunks"
    },
    "Tables and CSVs": {
        "examples": ["Metrics dashboards", "SLA data", "Capacity plans"],
        "embedding": "Convert to text descriptions, then embed",
        "retrieval": "Match queries to table descriptions"
    },
    "Images and Diagrams": {
        "examples": ["Architecture diagrams", "Network topology", "Dashboards"],
        "embedding": "Vision embeddings (CLIP) or OCR + text embedding",
        "retrieval": "Match visual concepts or extracted text"
    },
    "Code": {
        "examples": ["Kubernetes manifests", "Terraform configs", "Scripts"],
        "embedding": "Code-specific embeddings (CodeBERT, StarCoder)",
        "retrieval": "Semantic search over code patterns"
    }
}

# In practice, multimodal RAG means:
# User: "Show me the architecture for the payment system"
# System: 
#   1. Retrieves the architecture diagram (image)
#   2. Retrieves the related text documentation
#   3. Sends both to a multimodal LLM (Claude 3, GPT-4V)
#   4. LLM generates an answer referencing both the diagram and docs
```
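For the tables row above, "convert to text descriptions, then embed" can be as simple as flattening each row into a sentence that a text embedder can index. A minimal sketch (the column names are made up for illustration):

```python
import csv
import io

def table_to_text(csv_text: str) -> str:
    """Flatten a CSV table into sentences suitable for text embedding."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sentences = []
    for row in reader:
        parts = [f"{col} is {val}" for col, val in row.items()]
        sentences.append("; ".join(parts) + ".")
    return " ".join(sentences)

# Illustrative capacity-plan table
csv_text = """service,replicas,memory_limit
payment-svc,3,1Gi
auth-svc,2,512Mi"""
description = table_to_text(csv_text)
```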

***

## Strategy 3: Fine-Tuning

**Fine-tuning** means further training a pre-trained model on your task-specific data. The model's weights are actually updated — it learns new patterns specific to your domain.

### When Fine-Tuning Makes Sense

```python
fine_tuning_scenarios = {
    "Good use cases": [
        "Consistent output format that prompting can't achieve reliably",
        "Domain-specific language (medical, legal, internal jargon)",
        "Specific tone/style across all responses",
        "Reducing token usage (fine-tuned model needs shorter prompts)",
        "Classified/sensitive data that can't leave your infrastructure"
    ],
    "Bad use cases": [
        "Just need access to specific documents (use RAG instead)",
        "Don't have enough examples (need 100+ minimum, 1000+ ideal)",
        "The task changes frequently (retraining is expensive)",
        "Prompt engineering already works well enough"
    ]
}

for category, items in fine_tuning_scenarios.items():
    print(f"\n{category}:")
    for item in items:
        print(f"  - {item}")
```

### Fine-Tuning Conceptual Example

```python
# Fine-tuning training data format (for classification)
# Each example teaches the model the behavior you want

training_examples = [
    {
        "messages": [
            {"role": "system", "content": "Classify infrastructure alerts."},
            {"role": "user", "content": "Pod payment-svc OOMKilled, memory at 98%"},
            {"role": "assistant", "content": '{"severity": "critical", "category": "compute", "action": "increase_memory_limit"}'}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "Classify infrastructure alerts."},
            {"role": "user", "content": "Certificate expiring in 30 days"},
            {"role": "assistant", "content": '{"severity": "info", "category": "security", "action": "schedule_renewal"}'}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "Classify infrastructure alerts."},
            {"role": "user", "content": "Network latency spike to 500ms on us-east-1"},
            {"role": "assistant", "content": '{"severity": "warning", "category": "network", "action": "investigate_routing"}'}
        ]
    },
    # ... hundreds or thousands more examples
]

# After fine-tuning, the model consistently:
# 1. Returns valid JSON (learned the format)
# 2. Uses your severity definitions (learned your standards)
# 3. Suggests actions from your runbook (learned your procedures)
# 4. Needs a much shorter prompt (the knowledge is in the weights)
```
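Fine-tuning services commonly ingest this kind of data as JSONL, one JSON object per line. A sketch of exporting a couple of examples in that shape (the exact schema varies by provider, so treat this as illustrative):

```python
import json

# Two examples in the chat-message format shown above
# (illustrative content, not a real training set).
training_examples = [
    {"messages": [
        {"role": "system", "content": "Classify infrastructure alerts."},
        {"role": "user", "content": "Pod payment-svc OOMKilled, memory at 98%"},
        {"role": "assistant", "content": '{"severity": "critical"}'},
    ]},
    {"messages": [
        {"role": "system", "content": "Classify infrastructure alerts."},
        {"role": "user", "content": "Certificate expiring in 30 days"},
        {"role": "assistant", "content": '{"severity": "info"}'},
    ]},
]

def write_jsonl(examples: list[dict], path: str) -> None:
    """Write training examples as JSONL: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")

write_jsonl(training_examples, "alerts_train.jsonl")
```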

### Fine-Tuning vs RAG: The Decision

```python
# The simplest way to decide:

decision_tree = """
Do you need the model to access specific documents/data?
├── YES → Use RAG
│   (The model needs INFORMATION it doesn't have)
│
└── NO → Does prompting get the right output format/style?
    ├── YES → Use Prompt Engineering
    │   (The model CAN do it, it just needs instructions)
    │
    └── NO → Use Fine-Tuning
        (The model needs to LEARN a new behavior)
"""

print(decision_tree)
```
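The same tree can be encoded as a small function, which makes the logic easy to drop into a planning script or checklist:

```python
def choose_strategy(needs_external_data: bool,
                    prompting_gets_right_output: bool) -> str:
    """Encode the RAG / prompting / fine-tuning decision tree."""
    if needs_external_data:
        # The model needs INFORMATION it doesn't have
        return "RAG"
    if prompting_gets_right_output:
        # The model CAN do it, it just needs instructions
        return "Prompt Engineering"
    # The model needs to LEARN a new behavior
    return "Fine-Tuning"

print(choose_strategy(needs_external_data=False,
                      prompting_gets_right_output=True))
# → Prompt Engineering
```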

***

## Comparing All Three: Same Task, Three Approaches

Let's see how each strategy handles the same task: analyzing infrastructure health.

### Prompt Engineering Approach

```python
prompt_engineering_approach = """You are an infrastructure health analyst.

Given the following metrics, provide a health assessment.
Format your response as:
- Status: HEALTHY / WARNING / CRITICAL
- Issues: list any problems
- Actions: recommended steps

Metrics:
  CPU: 92%
  Memory: 78%
  Disk: 45%
  Error Rate: 8.5%
  Response Time p99: 2.3s
"""

# Pros: Free, instant, flexible
# Cons: Long prompt, model might not follow format consistently,
#        doesn't know your specific thresholds or runbooks
```

### RAG Approach

```python
# First, retrieve relevant context from your knowledge base
retrieved_context = """
[From: runbook-thresholds.md]
CPU > 85%: WARNING. CPU > 95%: CRITICAL.
Memory > 80%: WARNING. Memory > 90%: CRITICAL.
Error Rate > 5%: WARNING. Error Rate > 10%: CRITICAL.
p99 > 2s: WARNING. p99 > 5s: CRITICAL.

[From: incident-response.md]
For CPU warnings: Check for recent deployments, runaway processes.
Use 'kubectl top pods' to identify resource-heavy pods.
For Error Rate warnings: Check application logs for new error patterns.
Verify downstream dependencies are healthy.
"""

rag_prompt = f"""Based on the context below, analyze the infrastructure health.

Context:
{retrieved_context}

Metrics:
  CPU: 92%
  Memory: 78%
  Disk: 45%
  Error Rate: 8.5%
  Response Time p99: 2.3s

Provide assessment using the thresholds and procedures from the context."""

# Pros: Uses YOUR thresholds and procedures, always current
# Cons: Retrieval adds latency, need to maintain the knowledge base
```

### Fine-Tuning Approach

```python
# After fine-tuning on 1000+ metric→assessment pairs from your team:

fine_tuned_prompt = """CPU: 92%, Memory: 78%, Disk: 45%, Error Rate: 8.5%, p99: 2.3s"""

# The fine-tuned model already knows:
# - Your threshold definitions (learned from training data)
# - Your output format (consistently JSON)
# - Your action recommendations (learned from historical assessments)
# - Your severity scale (matches your Slack alert format)

# Expected output (consistent, no need for long instructions):
# {"status": "WARNING", "issues": ["CPU at 92% exceeds warning threshold",
#  "Error rate 8.5% above 5% baseline", "p99 2.3s exceeds 2s SLO"],
#  "actions": ["kubectl top pods -n production", "check recent deployments",
#  "review error logs in Grafana dashboard"]}

# Pros: Short prompt, consistent output, fast
# Cons: Expensive to create training data, needs retraining when thresholds change
```

### Side-by-Side Comparison

```python
comparison = {
    "Metric": ["Setup Time", "Per-Request Cost", "Latency",
               "Accuracy", "Maintainability", "Data Freshness"],
    "Prompt Engineering": ["Minutes", "Low (just prompt tokens)",
                           "Fast", "Good", "Easy (edit prompt)",
                           "Depends on model's training"],
    "RAG": ["Days", "Medium (embed + retrieve + generate)",
            "Moderate (+100-500ms)", "Very Good",
            "Moderate (maintain KB)", "Always current"],
    "Fine-Tuning": ["Weeks", "Low (short prompts)",
                    "Fast", "Excellent (for trained patterns)",
                    "Hard (retrain for changes)",
                    "Frozen at training time"]
}

# Print as a table (column widths sized to the longest entries)
header = f"{'Metric':<18} {'Prompt Eng.':<29} {'RAG':<38} {'Fine-Tuning':<25}"
print(header)
print("-" * len(header))
for i, metric in enumerate(comparison["Metric"]):
    print(f"{metric:<18} {comparison['Prompt Engineering'][i]:<29} "
          f"{comparison['RAG'][i]:<38} {comparison['Fine-Tuning'][i]:<25}")
```

***

## Combining Strategies: The Practical Approach

In my projects, I rarely use just one strategy. Here's how I combine them:

```python
# My actual approach for the home lab monitoring assistant:

# Layer 1: Prompt Engineering (always)
# - System prompt defines the assistant's role and output format
# - Few-shot examples show the expected analysis style

# Layer 2: RAG (for knowledge)
# - Retrieves relevant runbooks for the specific alert type
# - Retrieves recent incident history for similar patterns
# - Retrieves current infrastructure configuration

# Layer 3: Fine-tuning is NOT used here (not worth the investment
# for a personal project — prompt engineering + RAG is sufficient)

system_prompt = """You are a DevOps monitoring assistant for a home lab 
Kubernetes cluster. Analyze alerts using the provided context from runbooks 
and recent history.

Output format:
1. Severity assessment (info/warning/critical)
2. Root cause analysis (based on metrics + context)
3. Recommended actions (from runbooks when available)
4. Related recent incidents (if any)"""

# RAG retrieves context based on the alert (retrieve_relevant_docs and
# alert_text are placeholders for the project's own retrieval code)
retrieved_context = retrieve_relevant_docs(alert_text)

# Final prompt combines prompt engineering + RAG
final_prompt = f"""{system_prompt}

Relevant runbooks and history:
{retrieved_context}

Current alert:
{alert_text}

Analysis:"""

# This gives me 90% of the benefit at 10% of the complexity
# of a fully fine-tuned, production-grade system
```

**The takeaway:** Start with prompt engineering. Add RAG when the model needs your data. Consider fine-tuning only when the other two aren't enough. Most projects never need fine-tuning.

***

## What's Next

Now that you understand how to customize AI models, we'll explore the most exciting frontier: **AI Agents** — systems that don't just generate text, but take actions. We'll cover agent architectures, communication protocols (MCP, A2A, gRPC), and human-in-the-loop patterns.

***

*Next:* [*Part 6 — AI Agents and Communication Protocols*](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-fundamentals-101/part-6-ai-agents-and-protocols)

***

[← Part 4: LLMs and Generative AI](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-fundamentals-101/part-4-llms-and-generative-ai) · [Series Overview](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-fundamentals-101) · [Next →](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-fundamentals-101/part-6-ai-agents-and-protocols)
