Article 5: Semantic Search, Cosine Similarity, and Hybrid Retrieval

Introduction

Retrieval is the "R" in RAG. Given a user's question, the retrieval layer finds the chunks most likely to contain the answer. The quality of everything downstream (the generated response, the cited sources, the accuracy of the answer) depends on retrieval doing its job.

This article covers the two retrieval strategies I use: pure vector search (cosine similarity via pgvector) and hybrid search (vector + PostgreSQL full-text search combined). I'll show the actual queries I run, why hybrid search outperforms pure vector search for certain query types, and how to re-rank results before passing them to the LLM.


Pure Vector Search

The simplest possible retrieval: embed the query, then find the k nearest chunks by cosine distance.
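A minimal sketch of that query, using SQLAlchemy-style named parameters. The table and column names (`chunks`, `embedding`, `content`) are assumptions, not the article's actual schema, and `execute` stands in for whatever DB session wrapper you use:

```python
# Sketch of the nearest-neighbour query. pgvector's <=> is cosine
# distance, so ordering ascending returns the most similar chunks first.
VECTOR_SEARCH_SQL = """
SELECT id, content,
       1 - (embedding <=> :query_embedding) AS similarity
FROM chunks
WHERE 1 - (embedding <=> :query_embedding) >= :min_sim
ORDER BY embedding <=> :query_embedding
LIMIT :limit
"""

def vector_search(execute, query_embedding, limit=5, min_sim=0.3):
    # Drop anything below the similarity floor; without it the query
    # always returns exactly `limit` rows, relevant or not.
    return execute(VECTOR_SEARCH_SQL, {
        "query_embedding": query_embedding,
        "limit": limit,
        "min_sim": min_sim,
    })
```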

The WHERE ... >= :min_sim filter prevents returning chunks that are technically nearest but semantically unrelated. Without it, the query always returns exactly limit rows β€” even if none of them are relevant to the question.


Cosine Similarity vs Cosine Distance

pgvector's <=> operator returns cosine distance (0 = identical, 2 = maximally different for normalized vectors). Most people (and most tutorials) think in terms of cosine similarity (1 = identical, -1 = opposite), so I convert in the SELECT list:
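The conversion is just `1 - (embedding <=> :query_embedding)` in the SELECT list. To see why that mapping holds, here is the arithmetic in plain Python (a sketch of the math, not pgvector's implementation):

```python
def cosine_distance(a, b):
    # What pgvector's <=> computes: 1 minus the cosine of the angle
    # between the two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return 1 - dot / (norm_a * norm_b)

def similarity(a, b):
    # The SELECT-list conversion: similarity = 1 - distance.
    return 1 - cosine_distance(a, b)

# identical vectors -> distance 0, similarity 1
# opposite vectors  -> distance 2, similarity -1
```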

This means:

  • similarity = 1.0 β†’ identical vectors (exact match)

  • similarity = 0.9 β†’ very similar

  • similarity = 0.7 β†’ somewhat similar, usually still relevant

  • similarity = 0.5 β†’ weak semantic overlap

  • similarity < 0.3 β†’ likely unrelated

My min_similarity = 0.3 threshold is empirical. I tested it on a sample of questions against my git-book corpus and found that results below 0.3 were almost never useful.


Setting a Minimum Similarity Threshold

The right threshold depends on:

  1. The embedding model (models with more dimensions can have lower typical similarity scores for "related" content)

  2. The domain (technical documentation tends to have narrower topic clusters than general prose)

  3. The query type (exact factual questions have higher similarity ceilings than broad conceptual questions)

I tune the threshold by logging similarity distributions for actual queries and looking at where the useful/useless content splits:
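A debugging sketch of that logging; the function shape and output format here are mine, not the article's exact code:

```python
import statistics

def log_similarity_distribution(query, results):
    """Print a one-line summary plus a crude bar per result.
    results: list of (chunk_id, similarity) pairs, best first."""
    sims = [sim for _, sim in results]
    print(f"query: {query!r}  n={len(sims)}  "
          f"max={max(sims):.2f}  median={statistics.median(sims):.2f}  "
          f"min={min(sims):.2f}")
    for chunk_id, sim in results:
        # A bar proportional to similarity makes the cliff easy to spot.
        print(f"  {sim:.2f} {'#' * int(sim * 40)}  chunk={chunk_id}")
```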

Running this in a debug session over a set of representative queries gave me a clear picture of where the cliff was in my corpus.


Full-Text Search

Vector search misses exact keyword matches. If someone asks "what is the ef_search parameter in HNSW?", the cosine similarity of "ef_search" to the chunk that explains it might not be highest; the word is a proper noun with no semantic embedding context. Full-text search (tsvector / tsquery) handles exact tokens perfectly.
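A sketch of the full-text sub-search. Table and column names are assumptions, and a production version would store a precomputed tsvector column with a GIN index rather than calling to_tsvector per row:

```python
# Rank rows by full-text relevance against the user's query.
FULLTEXT_SEARCH_SQL = """
SELECT id, content,
       ts_rank(to_tsvector('english', content),
               plainto_tsquery('english', :query)) AS rank
FROM chunks
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', :query)
ORDER BY rank DESC
LIMIT :limit
"""

def fulltext_search(execute, query, limit=5):
    # `execute` stands in for whatever DB session wrapper you use.
    return execute(FULLTEXT_SEARCH_SQL, {"query": query, "limit": limit})
```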

plainto_tsquery converts the query string into a PostgreSQL text search query safely, without requiring the caller to escape special characters or know the tsquery syntax.


Hybrid Search with Reciprocal Rank Fusion

Neither vector search nor full-text search is universally better. The best approach is to run both and combine the results. I use Reciprocal Rank Fusion (RRF), a simple and effective algorithm for merging ranked lists:

$\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}$

Where:

  • $R$ is the set of ranked lists (vector results, fulltext results)

  • $r(d)$ is the rank of document $d$ in list $R$

  • $k$ is a constant that dampens the impact of high ranks (typically 60)

In practice: a chunk that appears at rank 1 in both lists gets a much higher combined score than one that appears at rank 1 in one list and rank 20 in the other.

The vector_weight: 0.7 / fulltext_weight: 0.3 split is my default. For queries with explicit technical terms (command names, exact parameters, config keys), I increase fulltext_weight. For vague conceptual questions ("how does the watch-loop work"), I increase vector_weight.
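The fusion itself is only a few lines. A sketch of weighted RRF over two ranked lists of chunk ids (the function name and exact shape are mine, with the default weights above):

```python
def rrf_fuse(vector_ids, fulltext_ids, k=60,
             vector_weight=0.7, fulltext_weight=0.3):
    """Weighted Reciprocal Rank Fusion: each list contributes
    weight / (k + rank) per document; ids present in both lists
    accumulate score from both."""
    scores = {}
    for weight, ranked in ((vector_weight, vector_ids),
                           (fulltext_weight, fulltext_ids)):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    # Best fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk ranked high in both lists ("b" below) beats a chunk ranked first in only one:

```python
rrf_fuse(["a", "b"], ["b", "c"])
```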


Retrieval Context Struct

The retrieval result is returned as a list of RetrievedChunk objects that the generation layer turns into prompt context:
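A sketch of what that struct might look like. The article names RetrievedChunk; the individual field names here (beyond a score and a token count) are my assumptions:

```python
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    """One retrieval hit handed to the generation layer."""
    chunk_id: int
    content: str
    score: float        # fused RRF score or raw cosine similarity
    source_path: str    # where the chunk came from, for citations (assumed field)
    token_count: int    # summed into total_tokens by the prompt builder
```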

total_tokens is used in the prompt builder to ensure the retrieved context doesn't exceed the LLM's context window. If the top-5 chunks collectively exceed the context budget, the generation layer drops the lowest-scoring chunk and re-checks.
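The drop-and-re-check loop can be sketched in a few lines; this uses plain tuples and a hypothetical function name, since the real code belongs to the generation layer:

```python
def trim_to_budget(chunks, max_tokens):
    """chunks: list of (score, token_count, content) tuples, any order.
    Repeatedly drop the lowest-scoring chunk until the remaining
    context fits within max_tokens."""
    kept = sorted(chunks, key=lambda c: c[0], reverse=True)
    while kept and sum(tok for _, tok, _ in kept) > max_tokens:
        kept.pop()  # lowest-scoring chunk is last after the sort
    return kept
```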


Full Retrieval Implementation
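As a sketch, the whole pipeline wires together like this. The sub-searches are injected as callables returning chunk ids (best first), and every name here is an assumption rather than the article's actual code:

```python
def retrieve(query, embed, vector_search, fulltext_search,
             limit=5, k=60, vector_weight=0.7, fulltext_weight=0.3):
    """Hybrid retrieval: embed the query, over-fetch limit * 2
    candidates from each source, fuse with weighted RRF, truncate."""
    query_embedding = embed(query)
    # Over-fetch so RRF has overlapping candidates to boost.
    vec_ids = vector_search(query_embedding, limit * 2)
    txt_ids = fulltext_search(query, limit * 2)
    scores = {}
    for weight, ranked in ((vector_weight, vec_ids),
                           (fulltext_weight, txt_ids)):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:limit]
```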


What I Learned

Hybrid search meaningfully outperforms pure vector search for technical content. I measured this informally by running 30 representative questions from my own notes against both strategies and checking whether the top result contained the actual answer. Pure vector: 72% top-1 accuracy. Hybrid: 87% top-1 accuracy. The biggest wins were on questions containing specific technical terms (command names, acronyms, proper nouns) that the vector model doesn't embed well.

limit * 2 in each sub-search is important for RRF. RRF needs enough candidates from each source to fuse effectively. If I request limit=5 from each source and they return disjoint sets, the fusion has no overlapping results to boost. Fetching 10 candidates from each source before re-ranking to 5 gives RRF more material to work with.

The similarity threshold needs to be re-tuned per domain and model. When I switched embedding models, I had to re-tune min_similarity from 0.3 to 0.25 because the new model's score distribution was shifted. A threshold calibrated for one model is not portable.

Logging what retrieval returns is the fastest way to debug bad answers. When the LLM gives a wrong or hallucinated answer, the first thing I check is what chunks were retrieved. If the right chunk wasn't in the retrieved set, it's a retrieval problem (chunking or embedding). If the right chunk was retrieved but the answer is still wrong, it's a generation problem (prompt construction or model reasoning). These require completely different fixes.


Next: Article 6, Prompt Construction and the Generation Layer
