Part 3: How LLMs Work — A Practical Guide
You Don't Need to Train Models. You Need to Understand Them.
Tokenization — The Foundation of Everything
Why This Matters for Your Code
```python
import tiktoken

# GPT-4o uses the o200k_base tokenizer
enc = tiktoken.get_encoding("o200k_base")

# Simple words: often 1 token each
tokens = enc.encode("Hello world")
print(f"'Hello world' = {len(tokens)} tokens")  # 2 tokens

# Technical terms: often split into sub-words
tokens = enc.encode("Kubernetes")
print(f"'Kubernetes' = {len(tokens)} tokens")  # 1-2 tokens

# Code: tokens are expensive
code = """
def calculate_embeddings(texts: list[str]) -> list[list[float]]:
    return model.encode(texts, normalize_embeddings=True).tolist()
"""
tokens = enc.encode(code)
print(f"Code snippet = {len(tokens)} tokens")  # ~30 tokens

# JSON is token-heavy
import json

data = {"name": "AI Engineer", "skills": ["Python", "LLMs", "RAG"]}
json_str = json.dumps(data, indent=2)
tokens = enc.encode(json_str)
print(f"JSON = {len(tokens)} tokens")  # ~30 tokens
```

Token Counting in Practice
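In practice you count tokens for whole chat requests, not single strings. A minimal sketch of that, with the encoder passed in as a plain function so it works with any tokenizer (the ~4-token per-message overhead and the +2 reply primer are approximations; exact message framing varies by model and API version):

```python
def count_chat_tokens(messages, encode, per_message_overhead=4):
    """Approximate token count for a chat-style request.

    `encode` is any function mapping str -> list of token ids
    (e.g. tiktoken's enc.encode). Overheads are rough estimates.
    """
    total = 0
    for msg in messages:
        total += per_message_overhead          # role/content framing
        total += len(encode(msg["role"]))
        total += len(encode(msg["content"]))
    return total + 2  # the assistant reply is primed with a short prefix
```

With tiktoken this would be called as `count_chat_tokens(messages, enc.encode)`.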
Context Windows — Bigger Isn't Always Better
The Context Window Math
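The arithmetic is simple but easy to forget: the window holds your input *and* the model's output. A sketch, using a 128k window as an illustrative figure (actual limits differ per model):

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int,
                    context_window: int = 128_000) -> bool:
    """The window is shared: input + requested output must fit inside it."""
    return prompt_tokens + max_output_tokens <= context_window

# 120k of retrieved documents plus a 16k output request overflows 128k
assert not fits_in_context(120_000, 16_000)
assert fits_in_context(100_000, 16_000)
```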
Why I Don't Fill the Context Window
My Context Budget Strategy
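One way to make a budget strategy concrete is to carve the window into named slices up front. The percentages below are hypothetical placeholders, not the author's actual numbers:

```python
def context_budget(window: int = 128_000, reserve_output: float = 0.25,
                   system: float = 0.05, history: float = 0.20) -> dict[str, int]:
    """Split a context window into labeled token budgets.

    Fractions here are illustrative; tune them per application.
    Whatever is left after output/system/history goes to documents.
    """
    output = int(window * reserve_output)
    system_b = int(window * system)
    history_b = int(window * history)
    documents = window - output - system_b - history_b
    return {"system": system_b, "history": history_b,
            "documents": documents, "output": output}
```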
The Transformer Architecture — The 5-Minute Version
The Core Idea: Attention
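Attention reduces to one formula: softmax(QKᵀ/√d)·V. A toy pure-Python version (real implementations are batched matrix ops on tensors, not list loops):

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q: list[list[float]], K: list[list[float]],
              V: list[list[float]]) -> list[list[float]]:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # how much each position attends to each key
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

A query that points at the first key pulls the output toward the first value row, which is the whole mechanism in miniature.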
How Generation Works
Temperature and Sampling — Controlling Randomness
What Temperature Does
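Mechanically, temperature divides the logits before the softmax: T < 1 sharpens the distribution toward the top token, T > 1 flattens it. A minimal sketch:

```python
import math

def sample_distribution(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Softmax over temperature-scaled logits.

    As temperature -> 0 this approaches greedy argmax; large values
    approach a uniform distribution over the vocabulary.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # stability shift
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]
```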
My Temperature Guidelines
Task | Temperature | Why
Top-p (Nucleus Sampling)
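Top-p keeps only the smallest set of tokens whose cumulative probability reaches p, then samples within that "nucleus". A sketch of the filtering step:

```python
def top_p_filter(probs: list[float], p: float = 0.9) -> dict[int, float]:
    """Return the nucleus: token index -> renormalized probability.

    Tokens are added in descending probability order until their
    cumulative mass reaches p; everything else is cut off.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}
```

With p = 0.9 a long tail of barely-plausible tokens is excluded entirely, which is why top-p tends to cut off degenerate completions that pure temperature sampling lets through.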
Local Models vs API Models
API Models (GitHub Models, OpenAI, Anthropic)
Local Models (via Ollama or llama.cpp)
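Talking to a local Ollama server is one HTTP call. A sketch using only the standard library; the model name and host are assumptions to adjust for your setup:

```python
import json
import urllib.request

def ollama_request(prompt: str, model: str = "llama3.1",
                   temperature: float = 0.2,
                   host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a request for Ollama's /api/generate endpoint.

    stream=False asks for a single JSON response instead of chunks.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Sending it requires a running Ollama daemon:
# resp = json.load(urllib.request.urlopen(ollama_request("Say hi")))
```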
The Provider Abstraction
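The point of a provider abstraction is that application code depends on one tiny interface, not on any vendor SDK. A minimal sketch using a `typing.Protocol` (the class and method names here are illustrative, not from the article):

```python
from typing import Protocol

class LLMProvider(Protocol):
    """The only surface the rest of the codebase is allowed to see."""
    def complete(self, prompt: str, temperature: float = 0.0) -> str: ...

class EchoProvider:
    """Stand-in for tests; real implementations would wrap an
    OpenAI, Anthropic, or Ollama client behind the same method."""
    def complete(self, prompt: str, temperature: float = 0.0) -> str:
        return f"echo: {prompt}"

def summarize(provider: LLMProvider, text: str) -> str:
    # Application code only sees the protocol, so providers are swappable
    # (API today, local model tomorrow) without touching call sites.
    return provider.complete(f"Summarize: {text}", temperature=0.2)
```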
Key Takeaways