Article 5: Semantic Search, Cosine Similarity, and Hybrid Retrieval
Introduction
Retrieval is the "R" in RAG. Given a user's question, the retrieval layer finds the chunks most likely to contain the answer. The quality of everything downstream (the generated response, the cited sources, the accuracy of the answer) depends on retrieval doing its job.
This article covers the two retrieval strategies I use: pure vector search (cosine similarity via pgvector) and hybrid search (vector + PostgreSQL full-text search combined). I'll show the actual queries I run, why hybrid search outperforms pure vector search for certain query types, and how to re-rank results before passing them to the LLM.
Pure Vector Search
The simplest possible retrieval: embed the query, find the k nearest chunks by cosine distance.
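A sketch of the kind of query this describes, assuming a chunks table with an embedding vector column and a content text column (names are illustrative, not necessarily the article's schema), using SQLAlchemy-style named parameters:

```python
# Pure vector search against pgvector. The <=> operator is cosine distance,
# so 1 - distance recovers cosine similarity. Table and column names here
# (chunks, embedding, content) are assumptions for illustration.
VECTOR_SEARCH_SQL = """
SELECT
    id,
    content,
    1 - (embedding <=> :query_embedding) AS similarity
FROM chunks
WHERE 1 - (embedding <=> :query_embedding) >= :min_sim
ORDER BY embedding <=> :query_embedding
LIMIT :limit
"""
```

The ORDER BY on raw distance (ascending, nearest first) lets pgvector use its index; the similarity expression in the SELECT list is just for readability downstream.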
The WHERE ... >= :min_sim filter prevents returning chunks that are technically nearest but semantically unrelated. Without it, the query always returns exactly limit rows, even if none of them are relevant to the question.
Cosine Similarity vs Cosine Distance
pgvector's <=> operator returns cosine distance (0 = identical, 2 = maximally different for normalized vectors). Most people (and most tutorials) think in terms of cosine similarity (1 = identical, -1 = opposite), so I convert in the SELECT list:
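As a sanity check, the conversion can be reproduced in plain Python (a minimal sketch mirroring what pgvector's <=> computes; not the article's code):

```python
import math

def cosine_distance(a, b):
    """Cosine distance as pgvector's <=> computes it: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def to_similarity(distance):
    """The conversion done in the SELECT list: similarity = 1 - distance."""
    return 1.0 - distance

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))    # 0.0  (identical)
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))   # 2.0  (opposite)
```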
This means:
similarity = 1.0 → identical vectors (exact match)
similarity = 0.9 → very similar
similarity = 0.7 → somewhat similar, usually still relevant
similarity = 0.5 → weak semantic overlap
similarity < 0.3 → likely unrelated
My min_similarity = 0.3 threshold is empirical. I tested it on a sample of questions against my git-book corpus and found that results below 0.3 were almost never useful.
Setting a Minimum Similarity Threshold
The right threshold depends on:
The embedding model (models with more dimensions can have lower typical similarity scores for "related" content)
The domain (technical documentation tends to have narrower topic clusters than general prose)
The query type (exact factual questions have higher similarity ceilings than broad conceptual questions)
I tune threshold by logging similarity distributions for actual queries and looking at where the useful/useless content splits:
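A minimal sketch of that kind of debug logging (the bucketing and ASCII histogram are my own illustration of the idea, not the article's actual code):

```python
from collections import Counter

def similarity_histogram(similarities):
    """Bucket similarity scores to one decimal place and print an ASCII
    histogram, so the useful/useless cliff is visible at a glance."""
    buckets = Counter(round(s, 1) for s in similarities)
    for bucket in sorted(buckets, reverse=True):
        print(f"{bucket:>5} | {'#' * buckets[bucket]}")
    return buckets

# Feed it the similarity column from retrieval results across many queries:
similarity_histogram([0.91, 0.92, 0.72, 0.31, 0.12])
```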
Running this in a debug session over a set of representative queries gave me a clear picture of where the cliff was in my corpus.
Full-Text Search in PostgreSQL
Vector search misses exact keyword matches. If someone asks "what is the ef_search parameter in HNSW?", the cosine similarity of "ef_search" to the chunk that explains it might not be highest: the word is a proper noun with no semantic embedding context. Full-text search (tsvector / tsquery) handles exact tokens perfectly.
plainto_tsquery converts the query string into a PostgreSQL text search query safely, without requiring the caller to escape special characters or know the tsquery syntax.
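A sketch of what the full-text query might look like (same assumed chunks table as above; a real deployment would typically keep a precomputed tsvector column with a GIN index rather than calling to_tsvector inline):

```python
# PostgreSQL full-text search over chunk content. plainto_tsquery safely
# turns the raw user question into a tsquery; @@ tests for a match and
# ts_rank orders matches by relevance. Table/column names are assumptions.
FULLTEXT_SEARCH_SQL = """
SELECT
    id,
    content,
    ts_rank(to_tsvector('english', content),
            plainto_tsquery('english', :query)) AS rank
FROM chunks
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', :query)
ORDER BY rank DESC
LIMIT :limit
"""
```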
Hybrid Search: RRF Fusion
Neither vector search nor full-text search is universally better. The best approach is to run both and combine the results. I use Reciprocal Rank Fusion (RRF), a simple and effective algorithm for merging ranked lists:
$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}$$
Where:
$R$ is the set of ranked lists (vector results, fulltext results)
$r(d)$ is the rank of document $d$ in list $r$
$k$ is a constant that dampens the impact of high ranks (typically 60)
In practice: a chunk that appears at rank 1 in both lists gets a much higher combined score than one that appears at rank 1 in one list and rank 20 in the other.
The vector_weight: 0.7 / fulltext_weight: 0.3 split is my default. For queries with explicit technical terms (command names, exact parameters, config keys), I increase fulltext_weight. For vague conceptual questions ("how does the watch-loop work"), I increase vector_weight.
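One way the weighted variant can be sketched in Python (the exact weighting scheme, multiplying each list's reciprocal-rank contribution by its weight, is my assumption about how the weights enter the formula):

```python
def rrf_fuse(vector_ids, fulltext_ids, k=60,
             vector_weight=0.7, fulltext_weight=0.3):
    """Weighted Reciprocal Rank Fusion over two best-first ranked lists
    of chunk ids. Returns all seen ids, best fused score first."""
    scores = {}
    for weight, ranked in ((vector_weight, vector_ids),
                           (fulltext_weight, fulltext_ids)):
        for rank, chunk_id in enumerate(ranked, start=1):
            # Each list contributes weight / (k + rank); ids in both lists
            # accumulate both contributions.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

With the 0.7/0.3 default, an id ranked 1st in the vector list and 2nd in the full-text list edges out one with the ranks reversed, which is the intended bias toward semantic matches.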
Retrieval Context Struct
The retrieval result is returned as a list of RetrievedChunk objects that the generation layer turns into prompt context:
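A sketch of what these types might look like (field names here are my guesses at a reasonable shape, not the article's actual definitions):

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedChunk:
    chunk_id: int
    content: str
    score: float        # fused RRF score, or cosine similarity for pure vector search
    source_path: str    # provenance, used for citations in the generated answer
    token_count: int    # per-chunk size, used to enforce the context budget

@dataclass
class RetrievalContext:
    chunks: list[RetrievedChunk] = field(default_factory=list)

    @property
    def total_tokens(self) -> int:
        return sum(c.token_count for c in self.chunks)
```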
total_tokens is used in the prompt builder to ensure the retrieved context doesn't exceed the LLM's context window. If the top-5 chunks collectively exceed the context budget, the generation layer drops the lowest-scoring chunk and re-checks.
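The drop-and-recheck loop is simple enough to sketch directly (the Chunk stand-in and function name are illustrative; only score and token count matter here):

```python
from collections import namedtuple

Chunk = namedtuple("Chunk", "score token_count")  # minimal stand-in

def fit_to_budget(chunks, budget):
    """Drop the lowest-scoring chunk until the total token count fits
    the context budget, then return the survivors best-first."""
    kept = sorted(chunks, key=lambda c: c.score, reverse=True)
    while kept and sum(c.token_count for c in kept) > budget:
        kept.pop()  # lowest-scoring chunk is last after the sort
    return kept
```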
Full Retrieval Implementation
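In outline, the full path is: embed the question, run both searches with over-fetched candidate counts, fuse with weighted RRF, and return the top results. A sketch of how that might compose (the function names and the dependency-injected search callables are my own illustration, not the article's actual code; in the real system the two callables would wrap the pgvector and full-text queries):

```python
def retrieve(question, vector_search, fulltext_search,
             limit=5, vector_weight=0.7, fulltext_weight=0.3, k=60):
    """Hybrid retrieval: run both searches, fuse with weighted RRF,
    and return the top `limit` chunk ids.

    vector_search / fulltext_search are callables taking (question, n)
    and returning chunk ids ranked best-first.
    """
    n = limit * 2  # over-fetch so the two candidate sets can overlap
    ranked_lists = (
        (vector_weight, vector_search(question, n)),
        (fulltext_weight, fulltext_search(question, n)),
    )
    scores = {}
    for weight, ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:limit]
```

Injecting the two searches as callables keeps the fusion logic testable without a database connection; that structuring choice is mine, not necessarily the article's.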
What I Learned
Hybrid search meaningfully outperforms pure vector search for technical content. I measured this informally by running 30 representative questions from my own notes against both strategies and checking whether the top result contained the actual answer. Pure vector: 72% top-1 accuracy. Hybrid: 87% top-1 accuracy. The biggest wins were on questions containing specific technical terms (command names, acronyms, proper nouns) that the vector model doesn't embed well.
limit * 2 in each sub-search is important for RRF. RRF needs enough candidates from each source to fuse effectively. If I request limit=5 from each source and they return disjoint sets, the fusion has no overlapping results to boost. Fetching 10 candidates from each source before re-ranking to 5 gives RRF more material to work with.
The similarity threshold needs to be re-tuned per domain and model. When I switched embedding models, I had to re-tune min_similarity from 0.3 to 0.25 because the new model's score distribution was shifted. A threshold calibrated for one model is not portable.
Logging what retrieval returns is the fastest way to debug bad answers. When the LLM gives a wrong or hallucinated answer, the first thing I check is what chunks were retrieved. If the right chunk wasn't in the retrieved set, it's a retrieval problem (chunking or embedding). If the right chunk was retrieved but the answer is still wrong, it's a generation problem (prompt construction or model reasoning). These require completely different fixes.
Next: Article 6, Prompt Construction and the Generation Layer