Part 3: How LLMs Work — A Practical Guide

You Don't Need to Train Models. You Need to Understand Them.

When I first started using LLM APIs, I treated them as black boxes. Send text in, get text out. It worked — until it didn't. My prompts were getting truncated with no error. Responses were inconsistent between runs. The model would confidently generate wrong information. I couldn't debug any of it because I didn't understand what was happening inside.

I'm not suggesting you need to read the "Attention Is All You Need" paper and implement a transformer from scratch (though I did work through that in my PyTorch 101 series). What I am suggesting is that understanding a few core concepts — tokenization, context windows, attention, and sampling — transforms you from someone who uses LLMs to someone who can debug and optimize LLM-powered systems.

This article covers exactly what I needed to know to build AI systems that work reliably.


Tokenization — The Foundation of Everything

The single biggest source of bugs in my early AI code was not understanding tokenization.

LLMs don't see text. They see tokens — integer IDs that represent pieces of words. When you send "Hello, world!" to an API, the model sees something like [15339, 11, 1917, 0]. Every operation the model performs — attention, generation, context tracking — happens at the token level.

Why This Matters for Your Code

import tiktoken

# GPT-4o uses the o200k_base tokenizer
enc = tiktoken.get_encoding("o200k_base")

# Simple words: often 1 token each
tokens = enc.encode("Hello world")
print(f"'Hello world' = {len(tokens)} tokens")  # 2 tokens

# Technical terms: often split into sub-words
tokens = enc.encode("Kubernetes")
print(f"'Kubernetes' = {len(tokens)} tokens")  # 1-2 tokens

# Code: tokens are expensive
code = """
def calculate_embeddings(texts: list[str]) -> list[list[float]]:
    return model.encode(texts, normalize_embeddings=True).tolist()
"""
tokens = enc.encode(code)
print(f"Code snippet = {len(tokens)} tokens")  # ~30 tokens

# JSON is token-heavy
import json
data = {"name": "AI Engineer", "skills": ["Python", "LLMs", "RAG"]}
json_str = json.dumps(data, indent=2)
tokens = enc.encode(json_str)
print(f"JSON = {len(tokens)} tokens")  # ~30 tokens

Things I learned the hard way about tokens:

  1. Whitespace and formatting cost tokens. Pretty-printed JSON with indentation uses significantly more tokens than compact JSON. When I'm stuffing context into a prompt, I use json.dumps(data, separators=(",", ":")) instead of indent=2.

  2. Different models use different tokenizers. A prompt that fits in GPT-4o's context window might not fit in Claude's because they tokenize differently. Always count tokens with the right tokenizer.

  3. Non-English text is more token-expensive. Tokenizers are trained predominantly on English text, so the same sentence in German, Japanese, or Hindi often splits into noticeably more tokens. If your system handles multiple languages, budget more tokens for non-English content.

Token Counting in Practice

I wrote a utility function I use in every AI project:

I use count_tokens() before every LLM call to make sure I'm not exceeding the context window. It's saved me from silent truncation more times than I can count.


Context Windows — Bigger Isn't Always Better

A context window is the maximum number of tokens a model can process in a single request — both input and output combined. GPT-4o supports 128k tokens. Claude supports 200k. That sounds like a lot, but I've learned to be conservative.

The Context Window Math
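The arithmetic is simple addition, but writing it out keeps me honest: input and output share one budget. The specific allocations below are illustrative, not fixed rules:

```python
CONTEXT_WINDOW = 128_000      # GPT-4o's total budget (input + output combined)

MAX_OUTPUT_TOKENS = 4_000     # reserved for the model's response
SYSTEM_PROMPT_TOKENS = 500    # instructions, persona, format rules
QUESTION_TOKENS = 300         # the user's actual query
SAFETY_MARGIN = 1_000         # token estimates are approximate; leave headroom

available_for_retrieved_context = (
    CONTEXT_WINDOW
    - MAX_OUTPUT_TOKENS
    - SYSTEM_PROMPT_TOKENS
    - QUESTION_TOKENS
    - SAFETY_MARGIN
)
print(available_for_retrieved_context)  # 122200
```

In theory that leaves over 120k tokens for retrieved context. In practice, as the next section explains, I use only a small fraction of it.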

Why I Don't Fill the Context Window

When I first built my RAG system, I thought "more context = better answers" and stuffed as many retrieved chunks as possible into the prompt. The results got worse:

  1. Attention degrades with length. Models are better at using information at the beginning and end of the context than in the middle. This is called the "lost in the middle" problem. I found that 5 highly relevant chunks outperformed 20 mixed-quality chunks.

  2. Cost scales linearly with tokens. At $2.50 per million input tokens (GPT-4o), including 50k tokens of context per request adds up fast. In my RAG service, keeping retrieval to 3k tokens per request cut costs by 90% compared to my naive first implementation.

  3. Latency increases. More input tokens means slower time-to-first-token. For an interactive API, this matters.

My Context Budget Strategy


The Transformer Architecture — The 5-Minute Version

You don't need to implement a transformer to be an AI engineer. But understanding the high-level architecture helps you reason about model behavior.

The Core Idea: Attention

The transformer's key innovation is the attention mechanism. For every token, the model computes how much it should "attend to" every other token in the context.

Why this matters for AI engineers:

  • Attention is why context works. The model doesn't just see a bag of words — it understands relationships between tokens based on their positions and meanings.

  • Attention is why "lost in the middle" happens. Attention scores are strongest for tokens near the query position. Tokens buried deep in long context get less attention.

  • Attention is why prompt structure matters. Putting instructions at the beginning (system prompt) and the question at the end places both where attention is strongest.

How Generation Works

LLMs generate text one token at a time, left to right:

Each step:

  1. The model processes all tokens so far through transformer layers

  2. It outputs a probability distribution over all possible next tokens

  3. A sampling strategy selects the next token

  4. That token is appended and the process repeats
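The loop above can be sketched in a few lines. The `model` here is a stand-in for the transformer forward pass: any callable that maps the tokens so far to a probability distribution over next tokens:

```python
import random

def generate(model, prompt_tokens: list[str], max_new_tokens: int = 20,
             eos: str = "<eos>") -> list[str]:
    """Autoregressive decoding: one full forward pass per generated token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)                    # 1. process all tokens so far
        candidates = list(probs.keys())          # 2. distribution over next tokens
        weights = list(probs.values())
        next_token = random.choices(candidates, weights=weights)[0]  # 3. sample
        tokens.append(next_token)                # 4. append and repeat
        if next_token == eos:
            break
    return tokens
```

With a real model, `probs` covers the whole vocabulary and step 1 is the expensive part, which is exactly why output length drives generation cost.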

This is why:

  • Streaming works token by token. Each token is available as soon as it's generated.

  • Generation cost is proportional to output length. More output tokens = more forward passes.

  • The model can't "go back." Once a token is generated, it influences all subsequent tokens. A wrong early token can derail the entire response.


Temperature and Sampling — Controlling Randomness

When I first started building with LLMs, every request used the default temperature. Then I noticed my structured data extraction was unreliable — sometimes returning valid JSON, sometimes not. Understanding sampling fixed this.

What Temperature Does

After processing the input through transformer layers, the model produces a probability distribution (logits) over all possible next tokens. Temperature scales these logits before sampling:
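Concretely, the logits are divided by the temperature before the softmax. Low temperature sharpens the distribution toward the top token; high temperature flattens it. A minimal sketch:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """p_i = exp(logit_i / T) / sum_j exp(logit_j / T)."""
    if temperature == 0:
        # Degenerate case: greedy decoding, all probability on the argmax.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Run it on the same logits at different temperatures and the effect is obvious: at T=0.1 nearly all the mass lands on the top token, at T=2.0 the distribution spreads out.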

My Temperature Guidelines

Task              | Temperature | Why
------------------|-------------|--------------------------------------------------
JSON extraction   | 0.0         | Deterministic output, consistent structure
Code generation   | 0.0–0.2     | Correctness matters more than creativity
Factual Q&A (RAG) | 0.1         | Slight variation is fine, but accuracy is primary
Summarization     | 0.3         | Some phrasing variation improves readability
Creative writing  | 0.7–1.0     | Higher diversity, more natural language

Top-p (Nucleus Sampling)

Top-p is complementary to temperature. Instead of scaling all probabilities, it cuts off the long tail:
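A sketch of the cutoff: sort tokens by probability, keep the smallest set whose cumulative probability reaches top_p, and renormalize what's left:

```python
def top_p_filter(probs: dict[str, float], top_p: float = 0.9) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability
    reaches top_p, then renormalize. Everything outside the nucleus is dropped."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept: dict[str, float] = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}
```

The key property: a long tail of individually unlikely tokens can never be sampled, which prevents the occasional bizarre word choice without sharpening the remaining distribution.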

In practice, I set temperature for the overall "creativity" level and leave top_p at 1.0 (disabled). Tuning both simultaneously makes behavior harder to reason about.


Local Models vs API Models

I use both in my projects. Here's how I decide:

API Models (GitHub Models, OpenAI, Anthropic)

When I use API models:

  • Production systems where quality matters

  • Complex reasoning tasks (multi-step analysis, code review)

  • When I need the latest model capabilities

  • When I can tolerate network latency

Local Models (via Ollama or llama.cpp)

When I use local models:

  • Development and prototyping (no API costs during iteration)

  • Privacy-sensitive data that can't leave my machine

  • Simple tasks where a 3B parameter model is sufficient (classification, extraction)

  • When I need guaranteed availability (no network dependency)

The Provider Abstraction

I always abstract the model provider so I can switch between local and API models with a config change:
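The real clients carry HTTP details I'm omitting here; the shape of the abstraction is the point. A sketch using a Protocol, with class and method names that are illustrative rather than my exact project code:

```python
import os
from typing import Protocol

class LLMClient(Protocol):
    def complete(self, prompt: str, temperature: float = 0.0) -> str: ...

class GitHubModelsClient:
    """Calls a hosted API model (endpoint and auth details omitted)."""
    def complete(self, prompt: str, temperature: float = 0.0) -> str:
        raise NotImplementedError  # real HTTP call goes here

class LocalOllamaClient:
    """Talks to a local Ollama server (request details omitted)."""
    def complete(self, prompt: str, temperature: float = 0.0) -> str:
        raise NotImplementedError  # real local call goes here

def get_client() -> LLMClient:
    """Pick the provider from the environment; default to the API model."""
    provider = os.environ.get("LLM_PROVIDER", "github")
    if provider == "local":
        return LocalOllamaClient()
    return GitHubModelsClient()
```

Because both classes satisfy the same Protocol, the rest of the codebase only ever sees an LLMClient and never knows which backend answered.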

In config.py, a single environment variable (LLM_PROVIDER=github or LLM_PROVIDER=local) determines which implementation gets used.


Key Takeaways

After building several LLM-powered systems, these are the concepts I rely on daily:

  1. Count tokens before calling the API. Silent truncation is a real source of bugs. Always know how many tokens your prompt uses.

  2. Use temperature=0 for structured output. If you need JSON, code, or deterministic results, eliminate randomness.

  3. Don't fill the context window. 5 relevant chunks beat 50 random chunks. Quality of context matters more than quantity.

  4. Abstract your model provider. You will switch between models — for cost, quality, latency, or availability. Make it a config change, not a rewrite.

  5. Start with API models, optimize with local. API models give you the best quality for development. Once you understand your task, evaluate whether a smaller local model can handle it.


Previous: Part 2 — Python Tooling for AI Engineers

Next: Part 4 — Embeddings and Vector Search
