# Part 4: Large Language Models and Generative AI

*Part of the* [*AI Fundamentals 101 Series*](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-fundamentals-101)

## The First Time I Saw GPT-3 Generate Code

I was debugging a Kubernetes networking issue late one night. Out of frustration, I pasted the error log into GPT-3 (this was early 2023) and asked "what's wrong?" The model responded with a correct diagnosis — a misconfigured NetworkPolicy — and suggested the exact fix.

That moment changed how I think about tools. Not because the AI "understood" Kubernetes networking, but because a model trained on internet text contained enough patterns about Kubernetes error messages to produce a useful answer. Understanding *how* that's possible — and where it breaks down — is what this article is about.

***

## What is a Large Language Model?

A **Large Language Model (LLM)** is a neural network trained on massive amounts of text data to predict the next word in a sequence. That's the entire core mechanism: next-word prediction. At sufficient scale, though, this simple objective produces behavior that looks like understanding, reasoning, and creativity.

```
Input:  "The capital of France is"
Model:  "Paris" (highest probability next token)

Input:  "def fibonacci(n):"
Model:  "\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)"
```

The model doesn't "know" that Paris is a city or that Fibonacci is a mathematical sequence. It has learned statistical patterns from billions of text examples that make these continuations the most likely.
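To make "statistical patterns" concrete, here's a toy next-token predictor built from bigram counts over a tiny corpus. This is a deliberately crude sketch of the same principle, not how any real LLM works:

```python
from collections import Counter, defaultdict

# Tiny "training corpus" -- real LLMs train on trillions of tokens
corpus = (
    "the capital of france is paris . "
    "the capital of japan is tokyo . "
    "the capital of france is paris ."
).split()

# Count which token follows each token (a bigram model)
bigram_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    bigram_counts[current][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most frequent next token seen in training."""
    return bigram_counts[token].most_common(1)[0][0]

print(predict_next("is"))  # "paris" -- seen twice, vs "tokyo" once
```

A real model replaces these raw counts with billions of learned parameters and attends to far more than one preceding token, but the job is the same: given what came before, score every possible continuation.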

### What Makes Them "Large"?

Three dimensions of scale:

```python
llm_scale = {
    "parameters": {
        "GPT-2 (2019)": "1.5 billion",
        "GPT-3 (2020)": "175 billion",
        "GPT-4 (2023)": "~1.8 trillion (estimated)",
        "LLaMA 3 (2024)": "8B / 70B / 405B variants",
    },
    "training_data": {
        "GPT-2": "40GB of internet text",
        "GPT-3": "~570GB (300B tokens)",
        "LLaMA 3": "15 trillion tokens",
    },
    "compute": {
        "GPT-3": "~3.6 million GPU-hours",
        "GPT-4": "estimated $100M+ in compute",
    }
}

for dimension, models in llm_scale.items():
    print(f"\n{dimension.upper()}:")
    for model, value in models.items():
        print(f"  {model}: {value}")
```

**Parameters** are the learned weights inside the model. More parameters = more capacity to store patterns. A model with 175 billion parameters has 175 billion numbers that were adjusted during training to minimize prediction error.

But bigger isn't always better. **LLaMA 3 8B** (8 billion parameters) often outperforms **GPT-3** (175 billion) because it was trained on better data with better techniques. Data quality and training methodology matter more than raw size.
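A back-of-the-envelope way to feel what those parameter counts mean: just holding the weights in memory takes parameters times bytes-per-parameter. Real deployments also need room for activations and the KV cache, so treat this as a lower bound:

```python
def model_memory_gb(parameters: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to hold the weights."""
    return parameters * bytes_per_param / 1e9

# fp16 = 2 bytes per parameter; int4 quantization gets this down to ~0.5
for name, params in [("GPT-3 (175B)", 175e9), ("LLaMA 3 8B", 8e9)]:
    print(f"{name}: {model_memory_gb(params, 2):.0f} GB in fp16")
```

This is why an 8B model runs on a single consumer GPU while a 175B model needs a multi-GPU server just to load.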

***

## How Transformers Work — Plain Language

Every major LLM is built on the **transformer architecture**, introduced in the 2017 paper "Attention Is All You Need." Here's how it works without the math.

### The Core Idea: Attention

Before transformers, models processed text word-by-word (RNNs). The problem: by the time you reach the end of a long sentence, you've partially "forgotten" the beginning.

Transformers solve this with **attention** — the model can look at all words simultaneously and decide which ones are most relevant to each other.

```
Sentence: "The server that runs the payment service crashed because its memory was full"

When processing "its", the model needs to know "its" refers to "server"
(not "service" or "payment")

Attention scores for "its":
  "server"  → 0.45  (highest — "its" refers to the server)
  "service" → 0.15
  "payment" → 0.08
  "crashed" → 0.12
  "memory"  → 0.10
  "full"    → 0.05
  others    → 0.05
```

The model computes these attention scores for every word relative to every other word, in parallel. This is why transformers are fast (parallelizable on GPUs) and good at understanding context.
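Under the hood, those scores come from dot products between learned vectors, scaled and normalized with a softmax. A minimal sketch with made-up two-dimensional vectors (real models use hundreds of dimensions and separate learned projections for queries, keys, and values):

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query: list[float], keys: list[list[float]]) -> list[float]:
    """Scaled dot-product attention for one query over several keys."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    return softmax(scores)

# Made-up vectors: the query for "its" against three candidate words
query = [1.0, 0.2]
keys = {"server": [0.9, 0.1], "service": [0.3, 0.4], "payment": [0.1, 0.2]}
for word, w in zip(keys, attention_weights(query, list(keys.values()))):
    print(f"{word}: {w:.2f}")  # "server" gets the highest weight
```

The vectors here are invented for illustration; in a trained model they're produced by weight matrices learned during training.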

### The Transformer Architecture — Simplified

```
┌─────────────────────────────────────────┐
│  Input: "The server crashed because"     │
├─────────────────────────────────────────┤
│  1. Tokenization                         │
│     → [The] [server] [crashed] [because] │
├─────────────────────────────────────────┤
│  2. Token Embeddings                     │
│     Each token → a vector of numbers     │
│     [The] → [0.12, -0.34, 0.56, ...]    │
├─────────────────────────────────────────┤
│  3. Positional Encoding                  │
│     Add position info so model knows     │
│     word order (transformers don't have  │
│     sequential processing like RNNs)     │
├─────────────────────────────────────────┤
│  4. Attention Layers (×N times)          │
│     Each token attends to all others     │
│     Learns relationships and context     │
├─────────────────────────────────────────┤
│  5. Feed-Forward Layers                  │
│     Process the attention output         │
│     through dense neural network layers  │
├─────────────────────────────────────────┤
│  6. Output Probabilities                 │
│     → "its": 0.25, "memory": 0.18,      │
│       "the": 0.12, "it": 0.10, ...      │
│     → Select "its" (or sample)           │
└─────────────────────────────────────────┘
```

### Tokens and Context Windows

LLMs don't see words — they see **tokens**. A token is a piece of text, roughly 3-4 characters for English.

```python
# Approximate token counting (for intuition)
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English."""
    return len(text) // 4

# Context window examples
context_windows = {
    "GPT-3.5": 4_096,      # ~3,000 words
    "GPT-4": 128_000,       # ~96,000 words
    "Claude 3.5": 200_000,  # ~150,000 words
    "LLaMA 3": 128_000,     # ~96,000 words
}

sample_text = """
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
"""

tokens = estimate_tokens(sample_text)
print(f"This YAML is roughly {tokens} tokens\n")

for model, window in context_windows.items():
    pages = window * 4 // 2000  # Rough: 2000 chars per page
    print(f"{model}: {window:,} tokens (~{pages} pages of text)")
```

**Why this matters:** The context window is the model's working memory for a single conversation. If your prompt + conversation exceeds it, earlier content gets truncated. This is why I learned to be concise with prompts — fitting more relevant context into fewer tokens gives better results.
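One common coping strategy is to trim the oldest conversation turns until everything fits a token budget. A simplified sketch, reusing the rough 4-characters-per-token estimate from above:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English."""
    return len(text) // 4

def trim_to_budget(system_prompt: str, messages: list[str], budget: int) -> list[str]:
    """Drop the oldest messages until system prompt + history fits the budget."""
    kept = list(messages)
    while kept and estimate_tokens(system_prompt) + sum(map(estimate_tokens, kept)) > budget:
        kept.pop(0)  # sacrifice the oldest turn first
    return kept

# Three ~100-token messages with a ~50-token system prompt and a 260-token budget
history = ["a" * 400, "b" * 400, "c" * 400]
print(len(trim_to_budget("s" * 200, history, budget=260)))  # 2 -- oldest turn dropped
```

Production systems often do something smarter (summarize old turns instead of dropping them), but the budget arithmetic is the same.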

***

## Generative AI: Beyond Text

LLMs are the most visible part of **Generative AI**, but the field is broader.

### Types of Generative AI

```python
generative_ai_types = {
    "Text Generation": {
        "models": ["GPT-4", "Claude", "LLaMA", "Gemini"],
        "use_cases": [
            "Content writing",
            "Code generation",
            "Summarization",
            "Translation"
        ],
        "how": "Predicts next token based on transformer architecture"
    },
    "Image Generation": {
        "models": ["DALL-E 3", "Stable Diffusion", "Midjourney"],
        "use_cases": [
            "Art creation",
            "Product mockups",
            "Image editing"
        ],
        "how": "Diffusion models: start with noise, iteratively refine to image"
    },
    "Code Generation": {
        "models": ["Copilot", "Claude Code", "CodeLlama"],
        "use_cases": [
            "Autocomplete",
            "Code explanation",
            "Bug fixing",
            "Test generation"
        ],
        "how": "LLMs trained on code repositories"
    },
    "Audio Generation": {
        "models": ["Whisper (STT)", "ElevenLabs (TTS)", "Suno (music)"],
        "use_cases": [
            "Speech-to-text",
            "Text-to-speech",
            "Music composition"
        ],
        "how": "Specialized transformers for audio waveforms"
    },
    "Multimodal": {
        "models": ["GPT-4V", "Gemini", "Claude 3"],
        "use_cases": [
            "Image + text understanding",
            "Video analysis",
            "Document parsing"
        ],
        "how": "Single model processes multiple input types"
    }
}

for category, details in generative_ai_types.items():
    print(f"\n{category}")
    print(f"  Models: {', '.join(details['models'])}")
    print(f"  How:    {details['how']}")
```

### How Text Generation Actually Works

When an LLM generates text, it's not retrieving pre-written answers. It's predicting one token at a time:

```python
# Conceptual illustration of text generation

def generate_text_conceptual(prompt: str, max_tokens: int = 10):
    """How LLMs generate text (simplified)."""
    generated = prompt

    for _ in range(max_tokens):
        # Model looks at ALL generated text so far
        # and predicts probability of every possible next token
        # (vocabulary is ~50,000–100,000 tokens)

        next_token_probabilities = {
            "crashed": 0.25,
            "is": 0.15,
            "was": 0.12,
            "running": 0.10,
            "responded": 0.08,
            # ... 50,000 more options
        }

        # Selection strategy depends on temperature:
        # temperature=0: always pick the highest probability (deterministic)
        # temperature=0.7: sample proportionally (creative but coherent)
        # temperature=1.5: sample more uniformly (very random)

        selected_token = select_by_temperature(
            next_token_probabilities, temperature=0.7
        )
        generated += " " + selected_token

    return generated
```
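The `select_by_temperature` helper above is left undefined; here's one plausible implementation. This is an illustrative sketch, not any provider's actual sampler (real samplers operate on logits before the softmax):

```python
import math
import random

def select_by_temperature(probs: dict[str, float], temperature: float) -> str:
    """Pick a token: argmax at temperature 0, weighted sampling otherwise."""
    if temperature == 0:
        return max(probs, key=probs.get)
    # Rescale: p ** (1/T). Low temperature sharpens the distribution,
    # high temperature flattens it toward uniform.
    scaled = {t: math.exp(math.log(p) / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    tokens, weights = zip(*((t, s / total) for t, s in scaled.items()))
    return random.choices(tokens, weights=weights)[0]

probs = {"crashed": 0.25, "is": 0.15, "was": 0.12}
print(select_by_temperature(probs, temperature=0))  # "crashed" (deterministic)
```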

### Temperature, Top-p, and Sampling

These parameters control **how** the model selects the next token:

```python
# Temperature controls randomness
generation_params = {
    "temperature_0": {
        "value": 0.0,
        "behavior": "Always pick the most probable token",
        "result": "Deterministic, repetitive, safe",
        "use_for": "Code generation, factual Q&A, structured output"
    },
    "temperature_0.7": {
        "value": 0.7,
        "behavior": "Mostly pick probable tokens, sometimes surprise",
        "result": "Balanced creativity and coherence",
        "use_for": "General conversation, writing, brainstorming"
    },
    "temperature_1.5": {
        "value": 1.5,
        "behavior": "Flatten probabilities, lots of randomness",
        "result": "Creative but often incoherent",
        "use_for": "Poetry, creative fiction (rarely used this high)"
    }
}

# Top-p (nucleus sampling): only consider tokens whose cumulative
# probability adds up to p
# If top_p=0.9: only consider tokens in the top 90% of probability mass
# This dynamic cutoff adapts to each prediction step

# In my projects, I use:
# - temperature=0 for code generation and structured output
# - temperature=0.3 for log analysis (some flexibility, mostly factual)
# - temperature=0.7 for generating documentation or explanations
```
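The top-p cutoff described in the comments above can be sketched as a filter over the probability table (illustrative only):

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches top_p, then renormalize to sum to 1."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

probs = {"crashed": 0.5, "is": 0.3, "was": 0.15, "ran": 0.05}
print(top_p_filter(probs, top_p=0.9))  # the long-tail "ran" is cut off
```

The cutoff is dynamic: when the model is confident, one or two tokens cover 90% of the mass and everything else is excluded; when it's uncertain, many tokens survive.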

***

## The Cost Equation: Why LLMs Are Expensive

Understanding LLM costs is essential for building real systems. I learned this the hard way when my first RAG prototype's API bill was 10x what I budgeted.

### What Makes LLMs Expensive

```python
llm_cost_factors = {
    "Training": {
        "what": "Training from scratch on massive data",
        "cost": "GPT-4 estimated at $100M+",
        "who_pays": "The model provider (OpenAI, Anthropic, Meta)",
        "your_cost": "$0 (unless you're fine-tuning)"
    },
    "Inference_compute": {
        "what": "Running the model on your input",
        "cost": "Each request uses GPU time proportional to input + output tokens",
        "who_pays": "You (via API pricing)",
        "your_cost": "$0.001-0.10 per 1K tokens (varies by model)"
    },
    "Context_window": {
        "what": "Longer inputs = more compute per request",
        "cost": "Processing 100K tokens costs ~25x more than 4K tokens",
        "who_pays": "You",
        "your_cost": "Design prompts to be concise"
    },
    "Output_length": {
        "what": "Generating more text costs more",
        "cost": "Output tokens are typically 3-5x more expensive than input",
        "who_pays": "You",
        "your_cost": "Set max_tokens to limit response length"
    }
}

# Cost comparison for a real task: analyzing a Kubernetes incident
example_analysis = {
    "Input tokens": 2000,  # Error logs + cluster state + prompt
    "Output tokens": 500,   # Root cause analysis + recommendation
}

# Model pricing (approximate, per 1M tokens)
pricing = {
    "GPT-4 Turbo": {"input": 10.00, "output": 30.00},
    "Claude 3.5 Sonnet": {"input": 3.00, "output": 15.00},
    "GPT-4o Mini": {"input": 0.15, "output": 0.60},
    "LLaMA 3 70B (self-hosted)": {"input": 0.00, "output": 0.00},
}

print("Cost per Kubernetes incident analysis:")
for model, prices in pricing.items():
    input_cost = (example_analysis["Input tokens"] / 1_000_000) * prices["input"]
    output_cost = (example_analysis["Output tokens"] / 1_000_000) * prices["output"]
    total = input_cost + output_cost
    daily_cost = total * 100  # 100 incidents/day
    monthly_cost = daily_cost * 30
    print(f"  {model}")
    print(f"    Per analysis: ${total:.5f}")
    print(f"    Per month (100/day): ${monthly_cost:.2f}")
```

### Prompt Caching: Reducing Cost and Latency

**Prompt caching** is a technique where the model provider caches the computed representations of your system prompt and repeated context, so you don't pay to process them on every request.

```python
# Without prompt caching: every request processes the full prompt
# Request 1: [system prompt (1000 tokens)] + [user message (100 tokens)]
# Request 2: [system prompt (1000 tokens)] + [user message (150 tokens)]
# Request 3: [system prompt (1000 tokens)] + [user message (80 tokens)]
# Total input tokens processed: 3000 + 330 = 3330

# With prompt caching: system prompt processed once, cached
# Request 1: [system prompt (1000 tokens, computed)] + [user message (100 tokens)]
# Request 2: [system prompt (cached, nearly free)] + [user message (150 tokens)]
# Request 3: [system prompt (cached, nearly free)] + [user message (80 tokens)]
# Total input tokens processed: 1000 + 330 = 1330 (60% reduction)

# Anthropic, for example, bills cache reads at ~10% of the normal input rate
# (cache writes cost slightly more than uncached input)
# This means: large system prompts become practical
```

**When I use caching:** My log analysis tool has a 1,500-token system prompt that describes our infrastructure, naming conventions, and common failure modes. Without caching, I'd pay full price for those 1,500 tokens on every API call. With caching, I pay full price once and roughly a tenth of it on each subsequent call.
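To see what that's worth in dollars, here's a rough estimate assuming cached tokens are billed at ~10% of the normal input rate; the `monthly_input_cost` helper and the traffic numbers are hypothetical:

```python
def monthly_input_cost(system_tokens: int, avg_user_tokens: int,
                       requests: int, price_per_m: float,
                       cached: bool = False) -> float:
    """Input-token cost only; cached system-prompt tokens billed at ~10%."""
    system_rate = price_per_m * 0.1 if cached else price_per_m
    system_cost = requests * system_tokens / 1e6 * system_rate
    user_cost = requests * avg_user_tokens / 1e6 * price_per_m
    return system_cost + user_cost

# 1,500-token system prompt, ~200-token messages, 3,000 requests/month, $3/M input
without = monthly_input_cost(1500, 200, 3000, 3.00)
with_cache = monthly_input_cost(1500, 200, 3000, 3.00, cached=True)
print(f"${without:.2f} without caching vs ${with_cache:.2f} with caching")
```

The bigger the system prompt relative to the per-request message, the bigger the saving.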

***

## Limitations: What LLMs Can't Do

This section exists because I learned every one of these the hard way.

### 1. Hallucination

LLMs generate plausible-sounding text that's factually wrong. They don't "know" things — they predict likely token sequences.

```python
# Example of hallucination risk
# If you ask: "What Kubernetes version introduced the PodMonitor CRD?"
# An LLM might confidently say "Kubernetes 1.23" 
# when in fact PodMonitor is from the Prometheus Operator, not Kubernetes itself

# Mitigation: always verify factual claims
# In my projects, I never trust LLM output for:
# - Version numbers
# - API endpoints
# - Configuration syntax
# - Security-sensitive decisions
```

### 2. Reasoning Gaps

LLMs can appear to reason but sometimes fail on problems that require genuine logical inference.

```python
# LLMs struggle with certain types of reasoning:

# Spatial reasoning: 
# "I put a book on the table, put a cup on the book, 
#  then flipped the table. Where is the cup?"
# LLMs often get this wrong.

# Multi-step math:
# "If I have 7 servers, each with 32GB RAM, and I need to reserve 
#  25% for the OS, how much total application RAM do I have?"
# LLMs can do this but sometimes make arithmetic errors.

# Lesson: use LLMs for language tasks (summarization, classification, 
# generation) and use actual code for computation.

# My approach: let the LLM generate the formula, but execute it in Python
def calculate_available_ram(servers: int, ram_per_server: int, os_reserve: float) -> float:
    """Don't ask an LLM to do math. Use Python."""
    total_ram = servers * ram_per_server
    reserved = total_ram * os_reserve
    return total_ram - reserved

print(f"Available RAM: {calculate_available_ram(7, 32, 0.25)}GB")
# Available RAM: 168.0GB
```

### 3. Knowledge Cutoff

LLMs only know what was in their training data. They can't access current information unless given it through context (RAG) or tool use.

```python
# Knowledge cutoff means the model doesn't know about:
# - Events after its training date
# - Your internal documentation
# - Your specific infrastructure
# - Recent CVEs, new API versions, current prices

# This is exactly why RAG exists (covered in Part 5):
# Instead of relying on the model's training data,
# you RETRIEVE relevant context and include it in the prompt

knowledge_cutoff_issues = [
    "What's the latest Kubernetes version?",     # Stale answer
    "What CVEs affect our current stack?",        # No internal knowledge
    "What happened in production last night?",    # No access to real-time data
    "What's the current price of GPT-4o?",        # Pricing changes frequently
]
```
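The basic shape of that fix, which Part 5 covers properly, is to fetch current facts yourself and inject them into the prompt. A minimal sketch; the document snippet and prompt wording are hypothetical:

```python
def build_prompt_with_context(question: str, retrieved_docs: list[str]) -> str:
    """Inject retrieved, up-to-date context so the model isn't limited
    to what was in its training data."""
    context = "\n\n".join(
        f"[Doc {i + 1}] {doc}" for i, doc in enumerate(retrieved_docs)
    )
    return (
        "Answer using ONLY the context below. "
        "If the answer isn't there, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

docs = ["Internal wiki: our clusters run Kubernetes 1.29 as of last quarter."]
print(build_prompt_with_context("What Kubernetes version do we run?", docs))
```

The "answer ONLY from the context" instruction also doubles as a hallucination guard: the model is told to admit ignorance rather than fall back on stale training data.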

### 4. Determinism

Same input can produce different outputs (unless temperature=0), making testing and debugging harder.

```python
# Non-deterministic: run the same prompt 3 times, get 3 different answers
# This is fine for creative tasks but problematic for:
# - Automated pipelines where you parse the output
# - Tests that need reproducible results
# - Compliance-sensitive applications

# My approach: 
# 1. Use temperature=0 for structured/deterministic outputs
# 2. Use structured output (JSON mode) to get parseable responses
# 3. Add validation — never trust the shape of LLM output

import json

def safe_parse_llm_response(response: str) -> dict:
    """Always validate LLM output before using it."""
    try:
        parsed = json.loads(response)
        # Validate expected fields exist
        required_fields = ["severity", "category", "summary"]
        for field in required_fields:
            if field not in parsed:
                return {"error": f"Missing required field: {field}"}
        return parsed
    except json.JSONDecodeError:
        return {"error": "Response was not valid JSON"}
```

### 5. No True Understanding

This is the philosophical one, but it has practical implications. LLMs are pattern matchers, not reasoners. They can produce correct answers to novel questions — but through statistical patterns, not comprehension.

**Why this matters for engineering:** Don't anthropomorphize the model. When it says "I think the issue is...", it's generating text, not thinking. Base your system design on what the model *reliably does* (pattern matching, text generation, classification), not on what it *appears to do* (understanding, reasoning).

***

## Local vs API Models: When to Use Which

```python
model_comparison = {
    "API Models (Claude, GPT-4)": {
        "pros": [
            "State-of-the-art quality",
            "No infrastructure to manage",
            "Always up-to-date",
            "Scale automatically"
        ],
        "cons": [
            "Per-token cost",
            "Data sent to external servers",
            "Rate limits",
            "Vendor dependency"
        ],
        "best_for": [
            "Complex reasoning tasks",
            "Production applications with variable load",
            "When quality is the top priority"
        ]
    },
    "Local Models (LLaMA, Mistral, TinyLlama)": {
        "pros": [
            "No API costs after hardware",
            "Data stays on your machine",
            "No rate limits",
            "No vendor dependency"
        ],
        "cons": [
            "Lower quality (for small models)",
            "Need GPU hardware",
            "You manage infrastructure",
            "Model updates are manual"
        ],
        "best_for": [
            "Privacy-sensitive applications",
            "High-volume, low-complexity tasks",
            "Offline environments",
            "Experimentation and learning"
        ]
    }
}

for approach, details in model_comparison.items():
    print(f"\n{approach}")
    print(f"  Best for: {', '.join(details['best_for'])}")
```

**My setup:** I run TinyLlama locally for high-volume, simple tasks (log classification, basic text extraction) and use Claude via API for complex tasks (root cause analysis, code generation, documentation writing). The local model handles 90% of requests at zero marginal cost; the API model handles the 10% that need quality.
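That routing decision is easy to sketch in code. The `classify_complexity` heuristic and the model labels here are hypothetical stand-ins for whatever local/API pair you actually run:

```python
def classify_complexity(task: str) -> str:
    """Crude keyword heuristic for routing (hypothetical; a real router
    might use a small classifier model instead)."""
    complex_markers = ["root cause", "explain", "generate code", "design"]
    return "complex" if any(m in task.lower() for m in complex_markers) else "simple"

def route_task(task: str) -> str:
    """Send simple tasks to the local model, complex ones to the API."""
    if classify_complexity(task) == "simple":
        return "local:tinyllama"
    return "api:claude"

print(route_task("classify this log line: OOMKilled"))          # local:tinyllama
print(route_task("find the root cause of these pod restarts"))  # api:claude
```

Even a crude router like this captures the economics: the cheap model absorbs the bulk of the volume, and only genuinely hard requests pay API rates.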

***

## Practical Example: Building a Simple Model Comparison

```python
# Compare different "generation" approaches for a practical task

def rule_based_summary(metrics: dict) -> str:
    """Traditional approach: hand-coded logic."""
    issues = []
    if metrics.get("cpu", 0) > 80:
        issues.append(f"High CPU ({metrics['cpu']}%)")
    if metrics.get("memory", 0) > 80:
        issues.append(f"High memory ({metrics['memory']}%)")
    if metrics.get("disk", 0) > 80:
        issues.append(f"High disk ({metrics['disk']}%)")
    if metrics.get("error_rate", 0) > 5:
        issues.append(f"Elevated error rate ({metrics['error_rate']}%)")

    if not issues:
        return "All systems normal."
    return f"Issues detected: {'; '.join(issues)}. Investigate immediately."


def template_based_summary(metrics: dict) -> str:
    """Slightly smarter: templates with per-metric thresholds."""
    # Resource metrics alert above 80%; error rate alerts above 5%
    thresholds = {"cpu": 80, "memory": 80, "disk": 80, "error_rate": 5}
    breached = {m: v for m, v in metrics.items() if v > thresholds.get(m, 80)}

    if not breached:
        return "System health: nominal across all metrics."

    # Critical if any metric is more than 10 points past its threshold
    severity = "critical" if any(
        v > thresholds.get(m, 80) + 10 for m, v in breached.items()
    ) else "warning"

    parts = [f"{m} at {v}%" for m, v in breached.items()]
    return (
        f"[{severity.upper()}] System health degraded. "
        f"Affected: {', '.join(parts)}. "
        f"{'Immediate action required.' if severity == 'critical' else 'Monitor closely.'}"
    )


def llm_summary_prompt(metrics: dict) -> str:
    """What you'd send to an LLM — natural language, context-aware."""
    return f"""Analyze these server metrics and provide a brief health summary 
with recommended actions:

CPU: {metrics.get('cpu', 'N/A')}%
Memory: {metrics.get('memory', 'N/A')}%
Disk: {metrics.get('disk', 'N/A')}%  
Error Rate: {metrics.get('error_rate', 'N/A')}%

Context: This is a production Kubernetes node running payment services.
Priority: customer-facing, zero-downtime requirement."""


# Test all three approaches
test_metrics = {"cpu": 92, "memory": 78, "disk": 45, "error_rate": 8}

print("Rule-Based:")
print(f"  {rule_based_summary(test_metrics)}\n")

print("Template-Based:")
print(f"  {template_based_summary(test_metrics)}\n")

print("LLM Prompt (what you'd send to Claude):")
print(f"  {llm_summary_prompt(test_metrics)}")

# Expected LLM response:
# "The production payment node shows concerning metrics: CPU at 92% is critical 
#  and likely causing the elevated 8% error rate. Memory at 78% is approaching 
#  limits. Recommended: 1) Check for runaway processes or recent deployments 
#  2) Consider horizontal scaling 3) Investigate the correlation between CPU 
#  spike and error rate increase."
```

**The LLM adds what the other approaches can't:** correlation analysis ("CPU likely causing error rate"), domain-specific reasoning ("check for runaway processes"), and actionable recommendations in natural language.

***

## What's Next

Now that you understand how LLMs work, their capabilities, and their limitations, we'll explore the three main strategies for adapting these models to your specific needs: **RAG, Fine-Tuning, and Prompt Engineering.**

***

*Next:* [*Part 5 — RAG, Fine-Tuning, and Prompt Engineering*](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-fundamentals-101/part-5-rag-finetuning-prompt-engineering)

***

[← Part 3: NLP, NLU, and NLG](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-fundamentals-101/part-3-nlp-nlu-nlg) · [Series Overview](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-fundamentals-101) · [Next →](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-fundamentals-101/part-5-rag-finetuning-prompt-engineering)
