Article 3: Document Loading and Chunking Strategies

Introduction

Chunking is the most underappreciated part of a RAG pipeline. The embedding model and the LLM get all the attention, but chunking is where retrieval quality is actually won or lost.

A chunk is a segment of a source document that gets embedded and stored as an independent unit. When a user asks a question, the system retrieves the top-k most similar chunks, not documents. This means the quality of retrieval depends entirely on whether the right information ends up in its own chunk, with enough surrounding context to be meaningful on its own.

This article covers the loader and chunker I built for the RAG service, including the decisions I made and the mistakes I corrected.


Why Chunking Matters

Consider a 2,000-word article about Kubernetes ingress TLS configuration. It covers:

  • Why TLS matters (200 words)

  • Cert-manager setup (400 words)

  • Ingress annotation reference (300 words)

  • A worked example with Let's Encrypt (600 words)

  • Troubleshooting common errors (500 words)

If I embed the whole article as one chunk, a question about "how to configure cert-manager" and a question about "TLS troubleshooting errors" will both retrieve the same chunk, because both are semantically close to the article as a whole. But the LLM will receive 2,000 words of context when it only needs 400. The useful signal is buried.

If I chunk the article well, "cert-manager setup" and "TLS troubleshooting" become separate retrievals. The right content surfaces for each question without unnecessary noise.

Chunk size trade-offs:

  Chunk size                   Retrieval precision                             Context completeness
  Too small (< 100 tokens)     High: only returns exactly relevant sentences   Low: misses surrounding context
  Good (~300–512 tokens)       Good                                            Good
  Too large (> 1000 tokens)    Low: always returns the whole section           High, but context is noisy


The Document Loader

The loader reads files from disk and returns a RawDocument dataclass. For this project, I load markdown files (.md) from the git-book directory. The loader is intentionally simple:
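A sketch of that loader. The RawDocument name and the git-book directory come from the article; the exact fields (path, title, text) and the title heuristic are my reconstruction:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RawDocument:
    path: str
    title: str
    text: str


def load_documents(root: str) -> list[RawDocument]:
    """Read every .md file under `root`, skipping navigation files."""
    root_path = Path(root)
    docs = []
    for md in sorted(root_path.rglob("*.md")):
        # SUMMARY.md and the top-level README.md are navigation, not content.
        if md.name == "SUMMARY.md":
            continue
        if md.name == "README.md" and md.parent == root_path:
            continue
        text = md.read_text(encoding="utf-8")
        # Use the first H1 as the title, falling back to the filename.
        title = next(
            (line.lstrip("# ").strip() for line in text.splitlines()
             if line.startswith("# ")),
            md.stem,
        )
        docs.append(RawDocument(path=str(md), title=title, text=text))
    return docs
```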

I exclude SUMMARY.md and top-level README.md files because they're navigation aids: their content is mostly links and section headers, which don't contribute useful semantic content to knowledge base queries.


Chunking Strategy 1: Fixed-Size with Overlap

The simplest strategy: split text every N characters, with a stride so consecutive chunks overlap.
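In code this is just slicing with a stride (a sketch; the character-based size and overlap defaults are illustrative):

```python
def fixed_size_chunks(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split `text` every `size` characters; consecutive chunks share `overlap` characters."""
    stride = size - overlap
    return [
        text[i:i + size]
        for i in range(0, max(len(text) - overlap, 1), stride)
    ]
```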

Problem: This slices through sentences, code blocks, and even words mid-stream. A chunk might start in the middle of a sentence. The embedding for that fragment will be noisy because the beginning of the chunk lacks context.

I used this in my first version. Retrieval quality was poor for anything that wasn't a single-sentence factual statement.


Chunking Strategy 2: Sentence-Boundary Aware

Split at sentence boundaries (., !, ? followed by whitespace) but accumulate sentences until reaching the target token count:
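A sketch of the accumulator, using word count as a stand-in for real token counting (the tiktoken-based counter is covered later, under Token Counting):

```python
import re


def sentence_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Accumulate whole sentences until the chunk reaches `max_tokens`.

    Tokens are approximated as whitespace-separated words for simplicity.
    """
    # Split after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # Flush the current chunk if adding this sentence would overflow it.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```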

This is much better than fixed-size for prose content. But for markdown with headers, code blocks, and lists, sentence splitting breaks down β€” code blocks don't end with ., and bullet points get fused into a single "sentence" that spans the whole list.


Chunking Strategy 3: Markdown-Aware Splitting

For a corpus of markdown files, splitting on heading boundaries makes the most sense semantically. Each section (H2 or H3) becomes a chunk, possibly with sub-chunking if the section is very long.
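The split itself can be a one-line regex: a zero-width lookahead at each H2/H3 heading keeps every heading attached to its own section (a sketch):

```python
import re


def split_on_headings(markdown: str) -> list[str]:
    """Split a markdown document into sections at H2/H3 headings.

    Text before the first matching heading becomes its own section.
    """
    parts = re.split(r"(?m)^(?=#{2,3} )", markdown)
    return [p.strip() for p in parts if p.strip()]
```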


The Strategy I Use and Why

I use markdown-aware splitting for this project. Specifically:

  • Primary split: H2 and H3 headings (##, ###)

  • Secondary split: sentence boundaries when a section exceeds 512 tokens

  • Overlap: 64 tokens between sub-chunks within the same section

This works well for my git-book because the content is well-structured markdown. Each section under a heading is a coherent semantic unit. A chunk covering "HNSW Index Setup" is more useful to retrieve than a chunk that starts in the middle of the HNSW section and ends in the middle of the IVFFlat section.

I don't use H1 splits because H1 is the article title. Splitting on H1 would make each chunk "the entire article minus the title", which defeats the purpose.


Token Counting

The chunks need an accurate token estimate, not just a character count estimate. Different embedding models tokenize differently, but for budget estimation I use tiktoken with the cl100k_base encoding (same as GPT-4 family):

I store this count in the chunks.tokens column. It's used to:

  1. Enforce the max-token limit during chunking

  2. Estimate prompt length at query time (to avoid exceeding the LLM context window)

  3. Provide chunk size distribution metrics in the /health endpoint


Full Loader and Chunker Implementation

Note that embedding=None after ingestion. Embedding runs as a separate step (Article 4). This separation means I can ingest files quickly and queue embedding as a background task, which is important when processing large batches.


What I Learned

Heading-based chunking requires consistent document structure. My early articles in this git-book had inconsistent heading levels: some used H2 for sections, some used H3, some mixed both. The markdown-aware chunker produced wildly different chunk sizes depending on which convention the article used. Fixing the articles (normalizing heading levels) was more effective than trying to make the chunker smarter about inconsistent documents.

Overlap is necessary but easy to over-tune. Overlap ensures that information near a chunk boundary doesn't disappear: the last sentences of one chunk appear again at the start of the next. Too little overlap (0) means boundary content gets fragmented. Too much overlap (> 20%) inflates the chunk count and makes retrieval noisy (two very similar chunks both get returned). 64 tokens is a practical default.

Empty and near-empty chunks waste index space. Navigation sections, tables of contents, and "see also" sections produce chunks with almost no semantic content when split. Filtering out chunks where tokens < 20 eliminated ~8% of chunks with zero loss in retrieval quality.

Code blocks should probably stay together. A code block split in half across two chunks will be retrieved by semantically unrelated queries, because the code's vector is mostly noise without its context. I'm working on a pre-processing step that identifies fenced code blocks and keeps them in the same chunk as their immediately preceding explanation paragraph.


Next: Article 4, "Generating and Storing Embeddings in pgvector"
