Article 3: Document Loading and Chunking Strategies

Introduction

Chunking is the most underappreciated part of a RAG pipeline. The embedding model and the LLM get all the attention, but chunking is where retrieval quality is actually won or lost.

A chunk is a segment of a source document that gets embedded and stored as an independent unit. When a user asks a question, the system retrieves the top-k most similar chunks, not documents. This means the quality of retrieval depends entirely on whether the right information ends up in its own chunk, with enough surrounding context to be meaningful on its own.

This article covers the loader and chunker I built for the RAG service, including the decisions I made and the mistakes I corrected.


Why Chunking Matters

Consider a 2,000-word article about Kubernetes ingress TLS configuration. It covers:

  • Why TLS matters (200 words)

  • Cert-manager setup (400 words)

  • Ingress annotation reference (300 words)

  • A worked example with Let's Encrypt (600 words)

  • Troubleshooting common errors (500 words)

If I embed the whole article as one chunk, a question about "how to configure cert-manager" and a question about "TLS troubleshooting errors" will both retrieve the same chunk, because both are semantically close to the article as a whole. But the LLM will receive 2,000 words of context when it only needs 400. The useful signal is buried.

If I chunk the article well, "cert-manager setup" and "TLS troubleshooting" become separate retrievals. The right content surfaces for each question without unnecessary noise.

Chunk size trade-offs:

  Chunk size                   Retrieval precision                             Context completeness
  Too small (< 100 tokens)     High: only returns exactly relevant sentences   Low: misses surrounding context
  Good (~300–512 tokens)       Good                                            Good
  Too large (> 1000 tokens)    Low: always returns the whole section           High, but context is noisy


The Document Loader

The loader reads files from disk and returns a RawDocument dataclass. For this project, I load markdown files (.md) from the git-book directory. The loader is intentionally simple:
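A sketch of that loader. The RawDocument name and the git-book directory come from the article; the exact fields (path, title, text) and the title heuristic are my reconstruction:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RawDocument:
    path: str
    title: str
    text: str


def load_documents(root: str) -> list[RawDocument]:
    """Read every .md file under `root`, skipping navigation files."""
    root_path = Path(root)
    docs = []
    for md in sorted(root_path.rglob("*.md")):
        # SUMMARY.md and the top-level README.md are navigation, not content.
        if md.name == "SUMMARY.md":
            continue
        if md.name == "README.md" and md.parent == root_path:
            continue
        text = md.read_text(encoding="utf-8")
        # Use the first H1 as the title, falling back to the filename.
        title = next(
            (line.lstrip("# ").strip() for line in text.splitlines()
             if line.startswith("# ")),
            md.stem,
        )
        docs.append(RawDocument(path=str(md), title=title, text=text))
    return docs
```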

I exclude SUMMARY.md and top-level README.md files because they're navigation aids: their content is mostly links and section headers, which don't contribute useful semantic content to knowledge base queries.


Chunking Strategy 1: Fixed-Size with Overlap

The simplest strategy: split text every N characters, with a stride so consecutive chunks overlap.
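In code this is just slicing with a stride (a sketch; the character-based size and overlap defaults are illustrative):

```python
def fixed_size_chunks(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split `text` every `size` characters; consecutive chunks share `overlap` characters."""
    stride = size - overlap
    return [
        text[i:i + size]
        for i in range(0, max(len(text) - overlap, 1), stride)
    ]
```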

Problem: This slices through sentences, code blocks, and even words mid-stream. A chunk might start in the middle of a sentence. The embedding for that fragment will be noisy because the beginning of the chunk lacks context.

I used this in my first version. Retrieval quality was poor for anything that wasn't a single-sentence factual statement.


Chunking Strategy 2: Sentence-Boundary Aware

Split at sentence boundaries (., !, ? followed by whitespace) but accumulate sentences until reaching the target token count:
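A sketch of the accumulator, using word count as a stand-in for real token counting (the tiktoken-based counter is covered later, under Token Counting):

```python
import re


def sentence_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Accumulate whole sentences until the chunk reaches `max_tokens`.

    Tokens are approximated as whitespace-separated words for simplicity.
    """
    # Split after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # Flush the current chunk if adding this sentence would overflow it.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```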

This is much better than fixed-size for prose content. But for markdown with headers, code blocks, and lists, sentence splitting breaks down β€” code blocks don't end with ., and bullet points get fused into a single "sentence" that spans the whole list.


Chunking Strategy 3: Markdown-Aware Splitting

For a corpus of markdown files, splitting on heading boundaries makes the most sense semantically. Each section (H2 or H3) becomes a chunk, possibly with sub-chunking if the section is very long.
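The split itself can be a one-line regex: a zero-width lookahead at each H2/H3 heading keeps every heading attached to its own section (a sketch):

```python
import re


def split_on_headings(markdown: str) -> list[str]:
    """Split a markdown document into sections at H2/H3 headings.

    Text before the first matching heading becomes its own section.
    """
    parts = re.split(r"(?m)^(?=#{2,3} )", markdown)
    return [p.strip() for p in parts if p.strip()]
```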


The Strategy I Use and Why

I use markdown-aware splitting for this project. Specifically:

  • Primary split: H2 and H3 headings (##, ###)

  • Secondary split: sentence boundaries when a section exceeds 512 tokens

  • Overlap: 64 tokens between sub-chunks within the same section

This works well for my git-book because the content is well-structured markdown. Each section under a heading is a coherent semantic unit. A chunk covering "HNSW Index Setup" is more useful to retrieve than a chunk that starts in the middle of the HNSW section and ends in the middle of the IVFFlat section.

I don't use H1 splits because H1 is the article title. Splitting on H1 would make each chunk "the entire article minus the title", which defeats the purpose.


Token Counting

The chunks need an accurate token estimate, not just a character count estimate. Different embedding models tokenize differently, but for budget estimation I use tiktoken with the cl100k_base encoding (same as GPT-4 family):

I store this count in the chunks.tokens column. It's used to:

  1. Enforce the max-token limit during chunking

  2. Estimate prompt length at query time (to avoid exceeding the LLM context window)

  3. Provide chunk size distribution metrics in the /health endpoint


Full Loader and Chunker Implementation

Note that embedding=None after ingestion. Embedding runs as a separate step (Article 4). This separation means I can ingest files quickly and queue embedding as a background task, which is important when processing large batches.


What I Learned

Heading-based chunking requires consistent document structure. My early articles in this git-book had inconsistent heading levels: some used H2 for sections, some used H3, some mixed both. The markdown-aware chunker produced wildly different chunk sizes depending on which convention the article used. Fixing the articles (normalizing heading levels) was more effective than trying to make the chunker smarter about inconsistent documents.

Overlap is necessary but easy to over-tune. Overlap ensures that information near a chunk boundary doesn't disappear: the last sentences of one chunk appear again at the start of the next. Too little overlap (0) means boundary content gets fragmented. Too much overlap (> 20%) inflates the chunk count and makes retrieval noisy (two very similar chunks both get returned). 64 tokens is a practical default.

Empty and near-empty chunks waste index space. Navigation sections, tables of contents, and "see also" sections produce chunks with almost no semantic content when split. Filtering out chunks where tokens < 20 eliminated ~8% of chunks with zero loss in retrieval quality.

Code blocks should probably stay together. A code block split in half across two chunks will be retrieved by semantically unrelated queries, because the code's vector is mostly noise without its context. I'm working on a pre-processing step that identifies fenced code blocks and keeps them in the same chunk as their immediately preceding explanation paragraph.


Next: Article 4, "Generating and Storing Embeddings in pgvector"
