# Part 3: How LLMs Work — A Practical Guide

## You Don't Need to Train Models. You Need to Understand Them.

When I first started using LLM APIs, I treated them as black boxes. Send text in, get text out. It worked — until it didn't. My prompts were getting truncated with no error. Responses were inconsistent between runs. The model would confidently generate wrong information. I couldn't debug any of it because I didn't understand what was happening inside.

I'm not suggesting you need to read the "Attention Is All You Need" paper and implement a transformer from scratch (though I did work through that in my [PyTorch 101](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/pytorch-101) series). What I am suggesting is that understanding a few core concepts — tokenization, context windows, attention, and sampling — transforms you from someone who uses LLMs to someone who can debug and optimize LLM-powered systems.

This article covers exactly what I needed to know to build AI systems that work reliably.

***

## Tokenization — The Foundation of Everything

The single biggest source of bugs in my early AI code was not understanding tokenization.

LLMs don't see text. They see tokens — integer IDs that represent pieces of words. When you send `"Hello, world!"` to an API, the model sees something like `[15339, 11, 1917, 0]`. Every operation the model performs — attention, generation, context tracking — happens at the token level.

### Why This Matters for Your Code

```python
import tiktoken

# GPT-4o uses the o200k_base tokenizer
enc = tiktoken.get_encoding("o200k_base")

# Simple words: often 1 token each
tokens = enc.encode("Hello world")
print(f"'Hello world' = {len(tokens)} tokens")  # 2 tokens

# Technical terms: often split into sub-words
tokens = enc.encode("Kubernetes")
print(f"'Kubernetes' = {len(tokens)} tokens")  # 1-2 tokens

# Code: tokens are expensive
code = """
def calculate_embeddings(texts: list[str]) -> list[list[float]]:
    return model.encode(texts, normalize_embeddings=True).tolist()
"""
tokens = enc.encode(code)
print(f"Code snippet = {len(tokens)} tokens")  # ~30 tokens

# JSON is token-heavy
import json
data = {"name": "AI Engineer", "skills": ["Python", "LLMs", "RAG"]}
json_str = json.dumps(data, indent=2)
tokens = enc.encode(json_str)
print(f"JSON = {len(tokens)} tokens")  # ~30 tokens
```

Things I learned the hard way about tokens:

1. **Whitespace and formatting cost tokens.** Pretty-printed JSON with indentation uses significantly more tokens than compact JSON. When I'm stuffing context into a prompt, I use `json.dumps(data, separators=(",", ":"))` instead of `indent=2`.
2. **Different models use different tokenizers.** A prompt that fits in GPT-4o's context window might not fit in Claude's because they tokenize differently. Always count tokens with the right tokenizer.
3. **Non-English text is more token-expensive.** If your system handles multiple languages, budget more tokens for non-English content.
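The compact-JSON trick from point 1 is easy to see with nothing but the standard library. Character count is only a rough proxy for token count, but for ASCII JSON the two track each other closely, so the savings carry over:

```python
import json

data = {"name": "AI Engineer", "skills": ["Python", "LLMs", "RAG"]}

# Pretty-printed: every newline and indent space costs characters (and tokens)
pretty = json.dumps(data, indent=2)

# Compact: no whitespace between keys, values, or items
compact = json.dumps(data, separators=(",", ":"))

print(f"pretty: {len(pretty)} chars, compact: {len(compact)} chars")
```

The gap grows with nesting depth, since each level of indentation multiplies the whitespace.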

### Token Counting in Practice

I wrote a utility function I use in every AI project:

```python
# src/ai_engineer/llm/tokens.py
import tiktoken


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for a given text and model."""
    encoding_name = _get_encoding_for_model(model)
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))


def truncate_to_tokens(text: str, max_tokens: int, model: str = "gpt-4o") -> str:
    """Truncate text to fit within a token limit."""
    encoding_name = _get_encoding_for_model(model)
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])


def _get_encoding_for_model(model: str) -> str:
    """Map model names to tiktoken encoding names."""
    # Check "gpt-4o" before "gpt-4": the 4o family uses o200k_base,
    # while gpt-4 and gpt-3.5 use cl100k_base
    if "gpt-4o" in model:
        return "o200k_base"
    if "gpt-4" in model or "gpt-3.5" in model:
        return "cl100k_base"
    # Default to o200k_base for unknown models
    return "o200k_base"
```

I use `count_tokens()` before every LLM call to make sure I'm not exceeding the context window. It's saved me from silent truncation more times than I can count.
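A sketch of that pre-flight check, assuming a `count_tokens`-style callable like the one above. The guard takes the counter as a parameter so this example stays dependency-free; `ContextOverflowError` and `approx_tokens` are names I made up for illustration:

```python
from typing import Callable


class ContextOverflowError(ValueError):
    """Raised when a prompt would exceed the input-token budget."""


def ensure_fits(prompt: str, max_input_tokens: int, counter: Callable[[str], int]) -> int:
    """Fail fast instead of letting the API silently truncate the prompt."""
    n = counter(prompt)
    if n > max_input_tokens:
        raise ContextOverflowError(f"prompt is {n} tokens, limit is {max_input_tokens}")
    return n


def approx_tokens(text: str) -> int:
    # Stub counter: ~4 characters per token is a common English-text rule of thumb
    return max(1, len(text) // 4)


print(ensure_fits("What is RAG?", 100, approx_tokens))  # 3
```

In a real project I pass `count_tokens` from the module above; the stub is only there so the sketch runs on its own.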

***

## Context Windows — Bigger Isn't Always Better

A context window is the maximum number of tokens a model can process in a single request — both input and output combined. GPT-4o supports 128k tokens. Claude supports 200k. That sounds like a lot, but I've learned to be conservative.

### The Context Window Math

```
Context Window = Input Tokens + Output Tokens

Example with GPT-4o (128k context):
- System prompt: ~500 tokens
- Retrieved context (RAG): ~3,000 tokens
- User question: ~50 tokens
- Reserved for response: ~1,000 tokens
- Total: 4,550 tokens used out of 128,000

That leaves 123,450 tokens unused — and that's fine.
```

### Why I Don't Fill the Context Window

When I first built my RAG system, I thought "more context = better answers" and stuffed as many retrieved chunks as possible into the prompt. The results got worse:

1. **Attention degrades with length.** Models are better at using information at the beginning and end of the context than in the middle. This is called the "lost in the middle" problem. I found that 5 highly relevant chunks outperformed 20 mixed-quality chunks.
2. **Cost scales linearly with tokens.** At $2.50 per million input tokens (GPT-4o), including 50k tokens of context per request adds up fast. In my RAG service, keeping retrieval to 3k tokens per request cut costs by 90% compared to my naive first implementation.
3. **Latency increases.** More input tokens means slower time-to-first-token. For an interactive API, this matters.

### My Context Budget Strategy

```python
# How I budget tokens in a RAG request
CONTEXT_BUDGET = {
    "system_prompt": 500,       # Instructions for the model
    "retrieved_context": 3000,  # Top-k chunks from vector search
    "user_question": 200,       # The actual question
    "response_reserve": 1024,   # Max output tokens
    "safety_margin": 500,       # Buffer for tokenizer differences
}
# Total: ~5,224 tokens per request
# Well within any model's context window
```
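A minimal check against a budget dict like the one above keeps this from drifting as the budget changes. The 128k figure is GPT-4o's context window; swap it for whatever model you target:

```python
# Mirrors the CONTEXT_BUDGET dict above
CONTEXT_BUDGET = {
    "system_prompt": 500,
    "retrieved_context": 3000,
    "user_question": 200,
    "response_reserve": 1024,
    "safety_margin": 500,
}


def fits_context(budget: dict[str, int], context_window: int = 128_000) -> bool:
    """True if the summed token budget fits inside the model's context window."""
    return sum(budget.values()) <= context_window


print(sum(CONTEXT_BUDGET.values()), fits_context(CONTEXT_BUDGET))  # 5224 True
```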

***

## The Transformer Architecture — The 5-Minute Version

You don't need to implement a transformer to be an AI engineer. But understanding the high-level architecture helps you reason about model behavior.

### The Core Idea: Attention

The transformer's key innovation is the attention mechanism. For every token, the model computes how much it should "attend to" every other token in the context.

```
Input:  "The cat sat on the mat"
         │    │   │   │  │   │
         ▼    ▼   ▼   ▼  ▼   ▼
      ┌──────────────────────────┐
      │   Self-Attention Layer   │
      │                          │
      │  "cat" attends strongly  │
      │  to "sat" and "mat"      │
      │                          │
      │  "mat" attends to "on"   │
      │  and "the"               │
      └──────────────────────────┘
         │    │   │   │  │   │
         ▼    ▼   ▼   ▼  ▼   ▼
      Context-aware representations
```
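The score computation behind that diagram is short enough to sketch: a scaled dot product of one query vector against every key vector, pushed through a softmax. This is only the attention-weight step, not a full layer with learned projections:

```python
import math


def attention_weights(query: list[float], keys: list[list[float]]) -> list[float]:
    """How strongly one token attends to each other token: softmax(q·k / sqrt(d))."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return [e / sum(exps) for e in exps]


# The query attends most to the key it aligns with, least to the orthogonal one
weights = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print([round(w, 2) for w in weights])  # [0.46, 0.22, 0.32]
```

Every weight is positive and the weights sum to 1, so each output is a weighted average of the value vectors (omitted here).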

Why this matters for AI engineers:

* **Attention is why context works.** The model doesn't just see a bag of words — it understands relationships between tokens based on their positions and meanings.
* **Attention is why "lost in the middle" happens.** Attention scores are strongest for tokens near the query position. Tokens buried deep in long context get less attention.
* **Attention is why prompt structure matters.** Putting instructions at the beginning (system prompt) and the question at the end places both where the model attends most strongly.

### How Generation Works

LLMs generate text one token at a time, left to right:

```
Step 1: Input "What is" → predict next token → "RAG"
Step 2: Input "What is RAG" → predict next token → "?"
Step 3: Input "What is RAG?" → predict next token → "\n"
Step 4: Input "What is RAG?\n" → predict next token → "RAG"
Step 5: Input "What is RAG?\nRAG" → predict next token → " stands"
...continues until stop token or max_tokens
```

Each step:

1. The model processes all tokens so far through transformer layers
2. It outputs a probability distribution over all possible next tokens
3. A sampling strategy selects the next token
4. That token is appended and the process repeats

This is why:

* **Streaming works token by token.** Each token is available as soon as it's generated.
* **Generation cost is proportional to output length.** More output tokens = more forward passes.
* **The model can't "go back."** Once a token is generated, it influences all subsequent tokens. A wrong early token can derail the entire response.
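The loop above can be sketched with a hand-written lookup table standing in for the model. A real LLM replaces the table with a transformer forward pass over all tokens so far, but the control flow (predict, append, repeat until stop token) is the same:

```python
# Toy "model": maps the sequence so far to the next token. "<eos>" is the stop token.
NEXT: dict[tuple[str, ...], str] = {
    (): "RAG",
    ("RAG",): " stands",
    ("RAG", " stands"): " for",
    ("RAG", " stands", " for"): "<eos>",
}


def generate(max_tokens: int = 10) -> list[str]:
    out: list[str] = []
    for _ in range(max_tokens):
        token = NEXT.get(tuple(out), "<eos>")  # predict from everything generated so far
        if token == "<eos>":                   # stop token ends generation early
            break
        out.append(token)                      # append and feed the longer sequence back in
    return out


print("".join(generate()))  # RAG stands for
```

Note that each iteration re-reads the whole sequence, which is why output cost scales with length, and why streaming can hand you each token the moment the loop produces it.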

***

## Temperature and Sampling — Controlling Randomness

When I first started building with LLMs, every request used the default temperature. Then I noticed my structured data extraction was unreliable — sometimes returning valid JSON, sometimes not. Understanding sampling fixed this.

### What Temperature Does

After processing the input through transformer layers, the model produces a probability distribution (logits) over all possible next tokens. Temperature scales these logits before sampling:

```python
# Simplified illustration of how temperature affects token selection
import math


def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Scale logits by temperature and convert to probabilities."""
    if temperature == 0:
        # Greedy: always pick the highest probability token
        max_token = max(logits, key=logits.get)
        return {t: 1.0 if t == max_token else 0.0 for t in logits}

    # Scale logits
    scaled = {t: v / temperature for t, v in logits.items()}

    # Softmax to get probabilities
    max_val = max(scaled.values())
    exp_vals = {t: math.exp(v - max_val) for t, v in scaled.items()}
    total = sum(exp_vals.values())
    return {t: v / total for t, v in exp_vals.items()}


# Example: model thinks next token is probably "Python" or "JavaScript"
logits = {"Python": 2.0, "JavaScript": 1.5, "Rust": 0.5, "Go": 0.3}

# temperature=0: always picks "Python"
print(apply_temperature(logits, 0.0))
# {"Python": 1.0, "JavaScript": 0.0, "Rust": 0.0, "Go": 0.0}

# temperature=0.1: strongly favors "Python" but not 100%
print(apply_temperature(logits, 0.1))
# {"Python": 0.99, "JavaScript": 0.007, "Rust": ~0, "Go": ~0}

# temperature=1.0: uses raw probabilities
print(apply_temperature(logits, 1.0))
# {"Python": 0.50, "JavaScript": 0.30, "Rust": 0.11, "Go": 0.09}

# temperature=2.0: flattens distribution, more random
print(apply_temperature(logits, 2.0))
# {"Python": 0.37, "JavaScript": 0.29, "Rust": 0.18, "Go": 0.16}
```

### My Temperature Guidelines

| Task               | Temperature | Why                                               |
| ------------------ | ----------- | ------------------------------------------------- |
| JSON extraction    | 0.0         | Deterministic output, consistent structure        |
| Code generation    | 0.0–0.2     | Correctness matters more than creativity          |
| Factual Q\&A (RAG) | 0.1         | Slight variation is fine, but accuracy is primary |
| Summarization      | 0.3         | Some phrasing variation improves readability      |
| Creative writing   | 0.7–1.0     | Higher diversity, more natural language           |

### Top-p (Nucleus Sampling)

Top-p is complementary to temperature. Instead of scaling all probabilities, it cuts off the long tail:

```
top_p=0.9 means: "Only consider tokens whose cumulative probability
is within the top 90%. Ignore the rest."

Token probabilities (after temperature):
  "Python":     0.42  ← cumulative: 0.42 ✓
  "JavaScript": 0.26  ← cumulative: 0.68 ✓
  "Rust":       0.09  ← cumulative: 0.77 ✓
  "Go":         0.07  ← cumulative: 0.84 ✓
  "TypeScript": 0.06  ← cumulative: 0.90 ✓
  "C++":        0.04  ← cumulative: 0.94 ✗ (cut off)
  "Java":       0.03  ← cumulative: 0.97 ✗
  ...
```
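That cutoff is straightforward to implement: sort tokens by probability, keep the smallest set whose cumulative probability reaches `top_p`, then renormalize so the kept probabilities sum to 1 again:

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Nucleus sampling filter: keep the top tokens covering top_p probability mass."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p - 1e-9:  # nucleus reached (tolerance for float rounding)
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}  # renormalize the survivors


probs = {
    "Python": 0.42, "JavaScript": 0.26, "Rust": 0.09, "Go": 0.07,
    "TypeScript": 0.06, "C++": 0.04, "Java": 0.03, "Ruby": 0.03,
}
print(sorted(top_p_filter(probs, 0.9)))
# ['Go', 'JavaScript', 'Python', 'Rust', 'TypeScript']
```

With the distribution from the diagram, `top_p=0.9` keeps five tokens and drops the tail, exactly as shown.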

In practice, I set `temperature` for the overall "creativity" level and leave `top_p` at `1.0` (disabled). Tuning both simultaneously makes behavior harder to reason about.

***

## Local Models vs API Models

I use both in my projects. Here's how I decide:

### API Models (GitHub Models, OpenAI, Anthropic)

```python
# Calling an API model with httpx
import httpx


async def call_api_model(
    prompt: str,
    model: str = "gpt-4o",
    api_key: str = "",
) -> str:
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            "https://models.inference.ai.azure.com/chat/completions",
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
            json={
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.1,
                "max_tokens": 512,
            },
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
```

**When I use API models:**

* Production systems where quality matters
* Complex reasoning tasks (multi-step analysis, code review)
* When I need the latest model capabilities
* When I can tolerate network latency

### Local Models (via Ollama or llama.cpp)

```python
# Calling a local model through Ollama's API
import httpx


async def call_local_model(
    prompt: str,
    model: str = "llama3.2:3b",
) -> str:
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": False,
                "options": {
                    "temperature": 0.1,
                    "num_predict": 512,
                },
            },
        )
        response.raise_for_status()
        return response.json()["response"]
```

**When I use local models:**

* Development and prototyping (no API costs during iteration)
* Privacy-sensitive data that can't leave my machine
* Simple tasks where a 3B parameter model is sufficient (classification, extraction)
* When I need guaranteed availability (no network dependency)

### The Provider Abstraction

I always abstract the model provider so I can switch between local and API models with a config change:

```python
# src/ai_engineer/llm/base.py
from typing import Protocol


class LLMProvider(Protocol):
    async def generate(
        self,
        prompt: str,
        *,
        max_tokens: int = 512,
        temperature: float = 0.1,
    ) -> str: ...


# src/ai_engineer/llm/github.py
import httpx
from ai_engineer.config import settings


class GitHubModelsProvider:
    def __init__(self) -> None:
        self._client = httpx.AsyncClient(
            base_url="https://models.inference.ai.azure.com",
            headers={"Authorization": f"Bearer {settings.llm_api_key}"},
            timeout=30.0,
        )

    async def generate(
        self,
        prompt: str,
        *,
        max_tokens: int = 512,
        temperature: float = 0.1,
    ) -> str:
        response = await self._client.post(
            "/chat/completions",
            json={
                "model": settings.llm_model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
                "temperature": temperature,
            },
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]


# src/ai_engineer/llm/local.py
import httpx


class OllamaProvider:
    def __init__(self, model: str = "llama3.2:3b") -> None:
        self._model = model
        self._client = httpx.AsyncClient(
            base_url="http://localhost:11434",
            timeout=60.0,
        )

    async def generate(
        self,
        prompt: str,
        *,
        max_tokens: int = 512,
        temperature: float = 0.1,
    ) -> str:
        response = await self._client.post(
            "/api/generate",
            json={
                "model": self._model,
                "prompt": prompt,
                "stream": False,
                "options": {
                    "temperature": temperature,
                    "num_predict": max_tokens,
                },
            },
        )
        response.raise_for_status()
        return response.json()["response"]
```

In `config.py`, a single environment variable (`LLM_PROVIDER=github` or `LLM_PROVIDER=local`) determines which implementation gets used.
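A minimal sketch of that dispatch, using a registry keyed by provider name. The registry and function names here are made up for illustration; in the real modules the registered factories would be `GitHubModelsProvider` and `OllamaProvider`, and the demo uses string stand-ins so it runs without `httpx`:

```python
import os
from typing import Callable

# Hypothetical registry mapping provider names to zero-argument factories
_PROVIDERS: dict[str, Callable[[], object]] = {}


def register_provider(name: str, factory: Callable[[], object]) -> None:
    _PROVIDERS[name] = factory


def get_provider() -> object:
    """Pick the provider implementation from the LLM_PROVIDER env var."""
    name = os.environ.get("LLM_PROVIDER", "github")
    if name not in _PROVIDERS:
        raise KeyError(f"unknown LLM_PROVIDER: {name!r}")
    return _PROVIDERS[name]()


# Stand-in factories so the sketch is self-contained
register_provider("github", lambda: "github-provider")
register_provider("local", lambda: "ollama-provider")

os.environ["LLM_PROVIDER"] = "local"
print(get_provider())  # ollama-provider
```

Failing loudly on an unknown name beats silently falling back: a typo in the env var should surface at startup, not as mystery responses from the wrong model.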

***

## Key Takeaways

After building several LLM-powered systems, these are the concepts I rely on daily:

1. **Count tokens before calling the API.** Silent truncation is a real source of bugs. Always know how many tokens your prompt uses.
2. **Use `temperature=0` for structured output.** If you need JSON, code, or deterministic results, eliminate randomness.
3. **Don't fill the context window.** 5 relevant chunks beat 50 random chunks. Quality of context matters more than quantity.
4. **Abstract your model provider.** You will switch between models — for cost, quality, latency, or availability. Make it a config change, not a rewrite.
5. **Start with API models, optimize with local.** API models give you the best quality for development. Once you understand your task, evaluate whether a smaller local model can handle it.

***

**Previous:** [**Part 2 — Python Tooling for AI Engineers**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-engineer-101/part-2-python-tooling)

**Next:** [**Part 4 — Embeddings and Vector Search**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-engineer-101/part-4-embeddings-and-vector-search)
