Part 3: How LLMs Work — A Practical Guide

You Don't Need to Train Models. You Need to Understand Them.

When I first started using LLM APIs, I treated them as black boxes. Send text in, get text out. It worked — until it didn't. My prompts were getting truncated with no error. Responses were inconsistent between runs. The model would confidently generate wrong information. I couldn't debug any of it because I didn't understand what was happening inside.

I'm not suggesting you need to read the "Attention Is All You Need" paper and implement a transformer from scratch (though I did work through that in my PyTorch 101 series). What I am suggesting is that understanding a few core concepts — tokenization, context windows, attention, and sampling — transforms you from someone who uses LLMs to someone who can debug and optimize LLM-powered systems.

This article covers exactly what I needed to know to build AI systems that work reliably.


Tokenization — The Foundation of Everything

The single biggest source of bugs in my early AI code was not understanding tokenization.

LLMs don't see text. They see tokens — integer IDs that represent pieces of words. When you send "Hello, world!" to an API, the model sees something like [15339, 11, 1917, 0]. Every operation the model performs — attention, generation, context tracking — happens at the token level.

Why This Matters for Your Code

import tiktoken

# GPT-4o uses the o200k_base tokenizer
enc = tiktoken.get_encoding("o200k_base")

# Simple words: often 1 token each
tokens = enc.encode("Hello world")
print(f"'Hello world' = {len(tokens)} tokens")  # 2 tokens

# Technical terms: often split into sub-words
tokens = enc.encode("Kubernetes")
print(f"'Kubernetes' = {len(tokens)} tokens")  # 1-2 tokens

# Code: tokens are expensive
code = """
def calculate_embeddings(texts: list[str]) -> list[list[float]]:
    return model.encode(texts, normalize_embeddings=True).tolist()
"""
tokens = enc.encode(code)
print(f"Code snippet = {len(tokens)} tokens")  # ~30 tokens

# JSON is token-heavy
import json
data = {"name": "AI Engineer", "skills": ["Python", "LLMs", "RAG"]}
json_str = json.dumps(data, indent=2)
tokens = enc.encode(json_str)
print(f"JSON = {len(tokens)} tokens")  # ~30 tokens

Things I learned the hard way about tokens:

  1. Whitespace and formatting cost tokens. Pretty-printed JSON with indentation uses significantly more tokens than compact JSON. When I'm stuffing context into a prompt, I use json.dumps(data, separators=(",", ":")) instead of indent=2.

  2. Different models use different tokenizers. A prompt that fits in GPT-4o's context window might not fit in Claude's because they tokenize differently. Always count tokens with the right tokenizer.

  3. Non-English text is more token-expensive. Tokenizers are trained predominantly on English text, so the same sentence in German, Japanese, or Hindi often splits into noticeably more tokens. If your system handles multiple languages, budget more tokens for non-English content.

Token Counting in Practice

I wrote a utility function I use in every AI project:

I use count_tokens() before every LLM call to make sure I'm not exceeding the context window. It's saved me from silent truncation more times than I can count.


Context Windows — Bigger Isn't Always Better

A context window is the maximum number of tokens a model can process in a single request — both input and output combined. GPT-4o supports 128k tokens. Claude supports 200k. That sounds like a lot, but I've learned to be conservative.

The Context Window Math
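The arithmetic is simple addition, but writing it out keeps me honest: input and output share one budget. The specific allocations below are illustrative, not fixed rules:

```python
CONTEXT_WINDOW = 128_000      # GPT-4o's total budget (input + output combined)

MAX_OUTPUT_TOKENS = 4_000     # reserved for the model's response
SYSTEM_PROMPT_TOKENS = 500    # instructions, persona, format rules
QUESTION_TOKENS = 300         # the user's actual query
SAFETY_MARGIN = 1_000         # token estimates are approximate; leave headroom

available_for_retrieved_context = (
    CONTEXT_WINDOW
    - MAX_OUTPUT_TOKENS
    - SYSTEM_PROMPT_TOKENS
    - QUESTION_TOKENS
    - SAFETY_MARGIN
)
print(available_for_retrieved_context)  # 122200
```

In theory that leaves over 120k tokens for retrieved context. In practice, as the next section explains, I use only a small fraction of it.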

Why I Don't Fill the Context Window

When I first built my RAG system, I thought "more context = better answers" and stuffed as many retrieved chunks as possible into the prompt. The results got worse:

  1. Attention degrades with length. Models are better at using information at the beginning and end of the context than in the middle. This is called the "lost in the middle" problem. I found that 5 highly relevant chunks outperformed 20 mixed-quality chunks.

  2. Cost scales linearly with tokens. At $2.50 per million input tokens (GPT-4o), including 50k tokens of context per request adds up fast. In my RAG service, keeping retrieval to 3k tokens per request cut costs by 90% compared to my naive first implementation.

  3. Latency increases. More input tokens means slower time-to-first-token. For an interactive API, this matters.

My Context Budget Strategy


The Transformer Architecture — The 5-Minute Version

You don't need to implement a transformer to be an AI engineer. But understanding the high-level architecture helps you reason about model behavior.

The Core Idea: Attention

The transformer's key innovation is the attention mechanism. For every token, the model computes how much it should "attend to" every other token in the context.

Why this matters for AI engineers:

  • Attention is why context works. The model doesn't just see a bag of words — it understands relationships between tokens based on their positions and meanings.

  • Attention is why "lost in the middle" happens. Attention scores are strongest for tokens near the query position. Tokens buried deep in long context get less attention.

  • Attention is why prompt structure matters. Putting instructions at the beginning (system prompt) and the question at the end places both where attention is strongest.

How Generation Works

LLMs generate text one token at a time, left to right:

Each step:

  1. The model processes all tokens so far through transformer layers

  2. It outputs a probability distribution over all possible next tokens

  3. A sampling strategy selects the next token

  4. That token is appended and the process repeats
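The loop above can be sketched in a few lines. The `model` here is a stand-in for the transformer forward pass: any callable that maps the tokens so far to a probability distribution over next tokens:

```python
import random

def generate(model, prompt_tokens: list[str], max_new_tokens: int = 20,
             eos: str = "<eos>") -> list[str]:
    """Autoregressive decoding: one full forward pass per generated token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)                    # 1. process all tokens so far
        candidates = list(probs.keys())          # 2. distribution over next tokens
        weights = list(probs.values())
        next_token = random.choices(candidates, weights=weights)[0]  # 3. sample
        tokens.append(next_token)                # 4. append and repeat
        if next_token == eos:
            break
    return tokens
```

With a real model, `probs` covers the whole vocabulary and step 1 is the expensive part, which is exactly why output length drives generation cost.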

This is why:

  • Streaming works token by token. Each token is available as soon as it's generated.

  • Generation cost is proportional to output length. More output tokens = more forward passes.

  • The model can't "go back." Once a token is generated, it influences all subsequent tokens. A wrong early token can derail the entire response.


Temperature and Sampling — Controlling Randomness

When I first started building with LLMs, every request used the default temperature. Then I noticed my structured data extraction was unreliable — sometimes returning valid JSON, sometimes not. Understanding sampling fixed this.

What Temperature Does

After processing the input through transformer layers, the model produces a probability distribution (logits) over all possible next tokens. Temperature scales these logits before sampling:
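Concretely, the logits are divided by the temperature before the softmax. Low temperature sharpens the distribution toward the top token; high temperature flattens it. A minimal sketch:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """p_i = exp(logit_i / T) / sum_j exp(logit_j / T)."""
    if temperature == 0:
        # Degenerate case: greedy decoding, all probability on the argmax.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Run it on the same logits at different temperatures and the effect is obvious: at T=0.1 nearly all the mass lands on the top token, at T=2.0 the distribution spreads out.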

My Temperature Guidelines

Task              | Temperature | Why
------------------|-------------|--------------------------------------------------
JSON extraction   | 0.0         | Deterministic output, consistent structure
Code generation   | 0.0–0.2     | Correctness matters more than creativity
Factual Q&A (RAG) | 0.1         | Slight variation is fine, but accuracy is primary
Summarization     | 0.3         | Some phrasing variation improves readability
Creative writing  | 0.7–1.0     | Higher diversity, more natural language

Top-p (Nucleus Sampling)

Top-p is complementary to temperature. Instead of scaling all probabilities, it cuts off the long tail:
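A sketch of the cutoff: sort tokens by probability, keep the smallest set whose cumulative probability reaches top_p, and renormalize what's left:

```python
def top_p_filter(probs: dict[str, float], top_p: float = 0.9) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability
    reaches top_p, then renormalize. Everything outside the nucleus is dropped."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept: dict[str, float] = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}
```

The key property: a long tail of individually unlikely tokens can never be sampled, which prevents the occasional bizarre word choice without sharpening the remaining distribution.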

In practice, I set temperature for the overall "creativity" level and leave top_p at 1.0 (disabled). Tuning both simultaneously makes behavior harder to reason about.


Local Models vs API Models

I use both in my projects. Here's how I decide:

API Models (GitHub Models, OpenAI, Anthropic)

When I use API models:

  • Production systems where quality matters

  • Complex reasoning tasks (multi-step analysis, code review)

  • When I need the latest model capabilities

  • When I can tolerate network latency

Local Models (via Ollama or llama.cpp)

When I use local models:

  • Development and prototyping (no API costs during iteration)

  • Privacy-sensitive data that can't leave my machine

  • Simple tasks where a 3B parameter model is sufficient (classification, extraction)

  • When I need guaranteed availability (no network dependency)

The Provider Abstraction

I always abstract the model provider so I can switch between local and API models with a config change:
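The real clients carry HTTP details I'm omitting here; the shape of the abstraction is the point. A sketch using a Protocol, with class and method names that are illustrative rather than my exact project code:

```python
import os
from typing import Protocol

class LLMClient(Protocol):
    def complete(self, prompt: str, temperature: float = 0.0) -> str: ...

class GitHubModelsClient:
    """Calls a hosted API model (endpoint and auth details omitted)."""
    def complete(self, prompt: str, temperature: float = 0.0) -> str:
        raise NotImplementedError  # real HTTP call goes here

class LocalOllamaClient:
    """Talks to a local Ollama server (request details omitted)."""
    def complete(self, prompt: str, temperature: float = 0.0) -> str:
        raise NotImplementedError  # real local call goes here

def get_client() -> LLMClient:
    """Pick the provider from the environment; default to the API model."""
    provider = os.environ.get("LLM_PROVIDER", "github")
    if provider == "local":
        return LocalOllamaClient()
    return GitHubModelsClient()
```

Because both classes satisfy the same Protocol, the rest of the codebase only ever sees an LLMClient and never knows which backend answered.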

In config.py, a single environment variable (LLM_PROVIDER=github or LLM_PROVIDER=local) determines which implementation gets used.


Key Takeaways

After building several LLM-powered systems, these are the concepts I rely on daily:

  1. Count tokens before calling the API. Silent truncation is a real source of bugs. Always know how many tokens your prompt uses.

  2. Use temperature=0 for structured output. If you need JSON, code, or deterministic results, eliminate randomness.

  3. Don't fill the context window. 5 relevant chunks beat 50 random chunks. Quality of context matters more than quantity.

  4. Abstract your model provider. You will switch between models — for cost, quality, latency, or availability. Make it a config change, not a rewrite.

  5. Start with API models, optimize with local. API models give you the best quality for development. Once you understand your task, evaluate whether a smaller local model can handle it.


Previous: Part 2 — Python Tooling for AI Engineers

Next: Part 4 — Embeddings and Vector Search
