# Article 1: What is RAG and Why I Built One

## Introduction

I've been writing technical articles in this git-book for a couple of years. By now it has hundreds of pages across Kubernetes, AIOps, architecture patterns, DevOps, security, and data engineering. Reading it cover-to-cover isn't practical, and the folder structure only helps if you already know what section something lives in.

The problem became concrete when I noticed I was rewriting information I'd already documented. I'd be writing about Kubernetes ingress configuration and realize I'd covered basically the same TLS setup in a different section six months ago — but I couldn't find it quickly. Full-text search helps, but only if you remember the exact words you used.

What I needed was something that understood the meaning of my question and found the relevant content regardless of exact wording. That's what RAG enabled me to build.

***

## Table of Contents

1. [The Context Window Problem](#the-context-window-problem)
2. [What Retrieval-Augmented Generation Is](#what-retrieval-augmented-generation-is)
3. [How RAG Actually Works](#how-rag-actually-works)
4. [Why Not Just Use a Search Engine](#why-not-just-use-a-search-engine)
5. [Why PostgreSQL + pgvector, Not a Dedicated Vector DB](#why-postgresql-pgvector)
6. [What We're Building](#what-were-building)
7. [What I Learned Building My First RAG](#what-i-learned)

***

## The Context Window Problem

Modern LLMs can accept large context windows — GPT-4o supports up to 128k tokens. That sounds like a lot until you realize:

* 128k tokens is roughly 96,000 words — about one large non-fiction book
* Every token in context costs money on API-billed models
* Latency increases with context length
* LLMs are known to "lose" information buried in the middle of long contexts (the "lost in the middle" problem from the 2023 paper by Liu et al.)

You can't just paste your entire knowledge base into the context and ask questions. You need to select the most relevant pieces.

That selection problem is what RAG solves.

***

## What Retrieval-Augmented Generation Is

RAG is a two-phase process:

1. **Retrieval**: Given a user's question, find the most semantically relevant chunks from your document corpus
2. **Generation**: Provide those chunks as context to an LLM, then ask it to answer the question using that context

The key insight is that you're not asking the LLM to remember everything — you're giving it a focused reading list at query time.

```
User question
     ↓
[Embedding model] → question vector
     ↓
[Vector database] → top-k similar document chunks
     ↓
[Prompt builder] → "Answer this question using the following context: ..."
     ↓
[LLM] → grounded answer
```

The addition of retrieval changes the LLM's role fundamentally. Without RAG, the LLM is a compressor of its training data — it knows what was in its training set, and nothing else. With RAG, the LLM becomes a reader and reasoner over documents you provide at runtime. It can answer questions about content that didn't exist when it was trained.
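The two phases compose into one query-time function. A minimal skeleton, with `retrieve` and `generate` as placeholder stubs standing in for the real vector search and LLM call described later in this article:

```python
# Minimal two-phase RAG skeleton. `retrieve` and `generate` are stand-ins:
# a real implementation embeds the question, queries pgvector, and calls an LLM.

def retrieve(question: str, top_k: int = 5) -> list[str]:
    # Placeholder retrieval: canned chunks keyed on topic words.
    corpus = {
        "tls": "Ingress TLS is configured via a Secret referenced in spec.tls.",
        "dns": "ExternalDNS manages DNS records for ingress hosts.",
    }
    return [text for key, text in corpus.items() if key in question.lower()][:top_k]

def generate(question: str, context_chunks: list[str]) -> str:
    # Placeholder generation: a real implementation calls an LLM with a prompt
    # built from the question and the retrieved context.
    context = "\n".join(context_chunks)
    return f"Answer grounded in {len(context_chunks)} chunk(s):\n{context}"

def answer(question: str) -> str:
    chunks = retrieve(question)        # Phase 1: retrieval
    return generate(question, chunks)  # Phase 2: generation
```

Everything else in this series is refinement of those two functions.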

***

## How RAG Actually Works

There are two distinct pipelines:

### Ingestion Pipeline (offline or background)

```
Raw documents (markdown, PDF, text)
         ↓
    [Loader]          → read file contents
         ↓
    [Chunker]         → split into ~512 token segments with overlap
         ↓  
[Embedding model]     → convert each chunk to a dense vector (e.g. 384 dims)
         ↓
[Vector database]     → store (chunk_text, vector, metadata) in pgvector
```

This runs once per document (and again when documents change). For my git-book it runs as a background job.
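Sketched in code, the ingestion steps look like this. The chunker here is a naïve fixed-size splitter for brevity, `embed` is a deterministic hash-based stand-in for `all-MiniLM-L6-v2` (the stub only has the right shape — a 384-dim unit vector — and carries no semantic meaning), and the store is a plain list instead of pgvector:

```python
import hashlib
import math

def embed(text: str, dims: int = 384) -> list[float]:
    # Stand-in for all-MiniLM-L6-v2: derives a deterministic unit vector
    # from a hash so the pipeline runs without the model installed.
    digest = hashlib.sha256(text.encode()).digest()
    raw = [(digest[i % len(digest)] - 128) / 128 for i in range(dims)]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]

def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    # Fixed-size character windows with overlap between consecutive chunks.
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def ingest(doc_path: str, text: str, store: list[dict]) -> int:
    # Loader -> chunker -> embedder -> store, mirroring the diagram above.
    for c in chunk(text):
        store.append({"source": doc_path, "chunk_text": c, "embedding": embed(c)})
    return len(store)
```

Swapping the stubs for the real model and a pgvector `INSERT` is the only change needed for production.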

### Query Pipeline (request time)

```
User question: "How do I configure TLS on a Kubernetes ingress?"
         ↓
[Embedding model]     → question vector
         ↓
[pgvector query]      → SELECT ... ORDER BY embedding <=> query_vec LIMIT 5
         ↓
[Retrieved chunks]    → 5 most semantically similar passages
         ↓
[Prompt builder]      → system + retrieved context + user question
         ↓
[LLM API call]        → GPT-4o with 5 context chunks
         ↓
Answer: "To configure TLS on a Kubernetes ingress, you need to..."
```

The query pipeline runs on every user request. The embedding call is fast (local model: \~5ms, API: \~200ms), and the vector search in pgvector is fast too (< 10ms for small corpora with an HNSW index). The LLM call dominates latency.
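For reference, the `<=>` in the query above is pgvector's cosine distance operator (1 minus cosine similarity), so `ORDER BY ... LIMIT 5` returns the five nearest chunks. The same ranking in plain Python, on made-up three-dimensional vectors:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    # Same metric as pgvector's <=> operator: 1 - cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

query_vec = [1.0, 0.0, 0.0]
chunks = {
    "tls-ingress": [0.9, 0.1, 0.0],  # nearly parallel -> small distance
    "dns-setup":   [0.0, 1.0, 0.0],  # orthogonal -> distance 1.0
}

# Equivalent of ORDER BY embedding <=> query_vec LIMIT 5:
ranked = sorted(chunks, key=lambda k: cosine_distance(query_vec, chunks[k]))
# ranked[0] is "tls-ingress"
```

Real embeddings have hundreds of dimensions, but the math is identical.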

***

## Why Not Just Use a Search Engine

Full-text search (like PostgreSQL's `tsvector` / `tsquery`) matches on words. Two sentences with opposite meanings but the same keywords will score identically. Synonyms score zero unless explicitly configured. A paraphrased query returns nothing: ask about "certificate expiry" and full-text search will never surface a document that says "TLS cert renewal".

Vector search operates on semantic similarity. The embedding of "TLS certificate expiry" and "certificate renewal time" will be close in vector space even though they share no keywords, because they mean similar things and appeared in similar training contexts.

Vector search is not universally better than keyword search — it's better for semantic matching and paraphrase, worse for exact terms, code snippets, and proper nouns. The best systems use hybrid search (vector + full-text combined), which I cover in Article 5.
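The keyword failure mode is easy to demonstrate with a toy scorer (real full-text search adds stemming and ranking, but the core limitation is the same):

```python
def keyword_score(query: str, doc: str) -> int:
    # Count query words that appear in the document: no stemming, no synonyms.
    return len(set(query.lower().split()) & set(doc.lower().split()))

doc = "TLS cert renewal is handled by cert-manager on the ingress controller."

keyword_score("certificate expiry", doc)    # 0 -- the paraphrase misses entirely
keyword_score("cert renewal ingress", doc)  # 3 -- exact terms hit
```

The first query is a perfectly reasonable way to ask about the document's topic, yet it scores zero. An embedding model places both phrasings close together in vector space.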

***

## Why PostgreSQL + pgvector, Not a Dedicated Vector DB <a href="#why-postgresql-pgvector" id="why-postgresql-pgvector"></a>

I use PostgreSQL for everything else in my personal projects — user records, event logs, configuration, application state. Adding `pgvector` to an existing PostgreSQL instance means:

* One less service to run and maintain
* Vectors live alongside relational metadata in the same database
* All the standard PostgreSQL tooling (backups, replication, `psql`, pgAdmin) works unchanged
* Transactions span both relational and vector operations
* No new infrastructure billing

Dedicated vector databases like Pinecone, Weaviate, or Qdrant have advantages at scale — billion-vector indices, specialized distributed retrieval — but those advantages don't matter for a personal knowledge base with a few thousand chunks. PostgreSQL with pgvector is the correct choice for this scale.

***

## What We're Building

A self-hosted RAG service that answers questions against my git-book content. The full stack:

* **Python 3.12** with async/await throughout
* **FastAPI** for the REST API (ingestion and query endpoints)
* **PostgreSQL 16 + pgvector** as the vector store
* **SQLAlchemy 2 async** for database access
* **sentence-transformers** (`all-MiniLM-L6-v2`) for local embedding
* **GitHub Models API** as an alternative for API-based embedding and for generation (GPT-4o)

The end result is a service where I can POST a directory of markdown files to `/ingest` and then POST questions to `/query` and get grounded, sourced answers.

***

## What I Learned Building My First RAG

**Chunking is more important than the embedding model.** I spent time comparing embedding models and got incremental improvements. Switching from naïve fixed-size chunking to sentence-boundary-aware chunking with overlap produced more relevant retrievals than any model swap.
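For the curious, here is a sketch of sentence-boundary-aware chunking with overlap (the regex sentence splitter is a simplification, and word counts stand in for token counts):

```python
import re

def chunk_by_sentences(text: str, max_words: int = 120,
                       overlap_sents: int = 1) -> list[str]:
    # Split on sentence boundaries, then pack sentences into chunks of up to
    # max_words, carrying the last overlap_sents sentence(s) into the next
    # chunk so context is never cut mid-thought.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, new_in_current = [], [], 0
    for sent in sentences:
        current.append(sent)
        new_in_current += 1
        if sum(len(s.split()) for s in current) >= max_words:
            chunks.append(" ".join(current))
            current = current[-overlap_sents:]  # overlap into the next chunk
            new_in_current = 0
    if new_in_current:  # flush any trailing sentences not yet emitted
        chunks.append(" ".join(current))
    return chunks
```

Compared with slicing every N characters, this never splits a sentence in half, and the overlap means a fact stated at a chunk boundary is retrievable from either side.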

**The grounding instruction in the prompt matters.** Without an explicit instruction to only use the provided context, the LLM will blend its training knowledge with the retrieved content — and won't tell you which parts came from where. Adding "Answer using only the provided context. If the answer is not in the context, say so." dramatically improved faithfulness.
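In code, the grounding instruction becomes the system message of the prompt builder. A sketch (the message shape follows the OpenAI-style chat format used by the GitHub Models API; numbering the chunks is my own convention so answers can cite passages):

```python
GROUNDING = (
    "Answer using only the provided context. "
    "If the answer is not in the context, say so."
)

def build_messages(question: str, chunks: list[str]) -> list[dict]:
    # Number the chunks so the model can reference which passage it used.
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return [
        {"role": "system", "content": GROUNDING},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

The resulting message list goes straight into the chat-completion call in place of a bare question.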

**Not every question is a RAG question.** "What is the capital of France?" should not hit the vector database. Simple factual questions are better answered directly. Routing logic at the query endpoint — deciding whether to retrieve before answering — is worth adding.
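The routing check can start as something very cheap. A heuristic sketch, with a hypothetical topic vocabulary drawn from the corpus (a classifier or an LLM-based router would be more robust):

```python
# Hypothetical topic vocabulary; in practice, derive it from the corpus itself.
KNOWN_TOPICS = {"kubernetes", "ingress", "tls", "pgvector", "aiops", "devops"}

def should_retrieve(question: str) -> bool:
    # Retrieve only if the question touches topics the corpus covers;
    # otherwise let the LLM answer directly from its training knowledge.
    words = {w.strip("?.,!").lower() for w in question.split()}
    return bool(words & KNOWN_TOPICS)

should_retrieve("How do I configure TLS on an ingress?")  # True -> do RAG
should_retrieve("What is the capital of France?")         # False -> answer directly
```

Even this crude check avoids a pointless embedding call and vector query on questions the knowledge base can't help with.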

**Storage is cheap; re-embedding is slow.** Storing the original chunk text alongside the vector is non-negotiable. I made the mistake of thinking I could always re-embed on demand. In practice, re-embedding a corpus of \~1,500 chunks takes 45 seconds with the local model. Storing the source text once costs kilobytes.

***

**Next**: [Article 2 — pgvector on PostgreSQL: Setup, Vector Types, and Indexes](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/rag-101/rag-101-pgvector-setup)
