Article 5: Semantic Search, Cosine Similarity, and Hybrid Retrieval

Introduction

Retrieval is the "R" in RAG. Given a user's question, the retrieval layer finds the chunks most likely to contain the answer. The quality of everything downstream (the generated response, the cited sources, the accuracy of the answer) depends on retrieval doing its job.

This article covers the two retrieval strategies I use: pure vector search (cosine similarity via pgvector) and hybrid search (vector + PostgreSQL full-text search combined). I'll show the actual queries I run, why hybrid search outperforms pure vector search for certain query types, and how to re-rank results before passing them to the LLM.


Pure Vector Search

The simplest possible retrieval: embed the query, then find the k nearest chunks by cosine distance.
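A minimal sketch of that query, using SQLAlchemy-style named parameters. The table and column names (`chunks`, `embedding`, `content`) are assumptions, not the article's actual schema, and `execute` stands in for whatever DB session wrapper you use:

```python
# Sketch of the nearest-neighbour query. pgvector's <=> is cosine
# distance, so ordering ascending returns the most similar chunks first.
VECTOR_SEARCH_SQL = """
SELECT id, content,
       1 - (embedding <=> :query_embedding) AS similarity
FROM chunks
WHERE 1 - (embedding <=> :query_embedding) >= :min_sim
ORDER BY embedding <=> :query_embedding
LIMIT :limit
"""

def vector_search(execute, query_embedding, limit=5, min_sim=0.3):
    # Drop anything below the similarity floor; without it the query
    # always returns exactly `limit` rows, relevant or not.
    return execute(VECTOR_SEARCH_SQL, {
        "query_embedding": query_embedding,
        "limit": limit,
        "min_sim": min_sim,
    })
```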

The WHERE ... >= :min_sim filter prevents returning chunks that are technically nearest but semantically unrelated. Without it, the query always returns exactly limit rows β€” even if none of them are relevant to the question.


Cosine Similarity vs Cosine Distance

pgvector's <=> operator returns cosine distance (0 = identical, 2 = maximally different for normalized vectors). Most people (and most tutorials) think in terms of cosine similarity (1 = identical, -1 = opposite), so I convert in the SELECT list:
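The conversion is just `1 - (embedding <=> :query_embedding)` in the SELECT list. To see why that mapping holds, here is the arithmetic in plain Python (a sketch of the math, not pgvector's implementation):

```python
def cosine_distance(a, b):
    # What pgvector's <=> computes: 1 minus the cosine of the angle
    # between the two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return 1 - dot / (norm_a * norm_b)

def similarity(a, b):
    # The SELECT-list conversion: similarity = 1 - distance.
    return 1 - cosine_distance(a, b)

# identical vectors -> distance 0, similarity 1
# opposite vectors  -> distance 2, similarity -1
```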

This means:

  • similarity = 1.0 β†’ identical vectors (exact match)

  • similarity = 0.9 β†’ very similar

  • similarity = 0.7 β†’ somewhat similar, usually still relevant

  • similarity = 0.5 β†’ weak semantic overlap

  • similarity < 0.3 β†’ likely unrelated

My min_similarity = 0.3 threshold is empirical. I tested it on a sample of questions against my git-book corpus and found that results below 0.3 were almost never useful.


Setting a Minimum Similarity Threshold

The right threshold depends on:

  1. The embedding model (models with more dimensions can have lower typical similarity scores for "related" content)

  2. The domain (technical documentation tends to have narrower topic clusters than general prose)

  3. The query type (exact factual questions have higher similarity ceilings than broad conceptual questions)

I tune the threshold by logging similarity distributions for actual queries and looking at where the useful/useless content splits:
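A debugging sketch of that logging; the function shape and output format here are mine, not the article's exact code:

```python
import statistics

def log_similarity_distribution(query, results):
    """Print a one-line summary plus a crude bar per result.
    results: list of (chunk_id, similarity) pairs, best first."""
    sims = [sim for _, sim in results]
    print(f"query: {query!r}  n={len(sims)}  "
          f"max={max(sims):.2f}  median={statistics.median(sims):.2f}  "
          f"min={min(sims):.2f}")
    for chunk_id, sim in results:
        # A bar proportional to similarity makes the cliff easy to spot.
        print(f"  {sim:.2f} {'#' * int(sim * 40)}  chunk={chunk_id}")
```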

Running this in a debug session over a set of representative queries gave me a clear picture of where the cliff was in my corpus.


Full-Text Search

Vector search misses exact keyword matches. If someone asks "what is the ef_search parameter in HNSW?", the cosine similarity of "ef_search" to the chunk that explains it might not be highest; the word is a proper noun with no semantic embedding context. Full-text search (tsvector / tsquery) handles exact tokens perfectly.
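A sketch of the full-text sub-search. Table and column names are assumptions, and a production version would store a precomputed tsvector column with a GIN index rather than calling to_tsvector per row:

```python
# Rank rows by full-text relevance against the user's query.
FULLTEXT_SEARCH_SQL = """
SELECT id, content,
       ts_rank(to_tsvector('english', content),
               plainto_tsquery('english', :query)) AS rank
FROM chunks
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', :query)
ORDER BY rank DESC
LIMIT :limit
"""

def fulltext_search(execute, query, limit=5):
    # `execute` stands in for whatever DB session wrapper you use.
    return execute(FULLTEXT_SEARCH_SQL, {"query": query, "limit": limit})
```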

plainto_tsquery converts the query string into a PostgreSQL text search query safely, without requiring the caller to escape special characters or know the tsquery syntax.


Hybrid Search with Reciprocal Rank Fusion

Neither vector search nor full-text search is universally better. The best approach is to run both and combine the results. I use Reciprocal Rank Fusion (RRF), a simple and effective algorithm for merging ranked lists:

$\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}$

Where:

  • $R$ is the set of ranked lists (vector results, fulltext results)

  • $r(d)$ is the rank of document $d$ in list $R$

  • $k$ is a constant that dampens the impact of high ranks (typically 60)

In practice: a chunk that appears at rank 1 in both lists gets a much higher combined score than one that appears at rank 1 in one list and rank 20 in the other.

The vector_weight: 0.7 / fulltext_weight: 0.3 split is my default. For queries with explicit technical terms (command names, exact parameters, config keys), I increase fulltext_weight. For vague conceptual questions ("how does the watch-loop work"), I increase vector_weight.
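The fusion itself is only a few lines. A sketch of weighted RRF over two ranked lists of chunk ids (the function name and exact shape are mine, with the default weights above):

```python
def rrf_fuse(vector_ids, fulltext_ids, k=60,
             vector_weight=0.7, fulltext_weight=0.3):
    """Weighted Reciprocal Rank Fusion: each list contributes
    weight / (k + rank) per document; ids present in both lists
    accumulate score from both."""
    scores = {}
    for weight, ranked in ((vector_weight, vector_ids),
                           (fulltext_weight, fulltext_ids)):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    # Best fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk ranked high in both lists ("b" below) beats a chunk ranked first in only one:

```python
rrf_fuse(["a", "b"], ["b", "c"])
```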


Retrieval Context Struct

The retrieval result is returned as a list of RetrievedChunk objects that the generation layer turns into prompt context:
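A sketch of what that struct might look like. The article names RetrievedChunk; the individual field names here (beyond a score and a token count) are my assumptions:

```python
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    """One retrieval hit handed to the generation layer."""
    chunk_id: int
    content: str
    score: float        # fused RRF score or raw cosine similarity
    source_path: str    # where the chunk came from, for citations (assumed field)
    token_count: int    # summed into total_tokens by the prompt builder
```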

total_tokens is used in the prompt builder to ensure the retrieved context doesn't exceed the LLM's context window. If the top-5 chunks collectively exceed the context budget, the generation layer drops the lowest-scoring chunk and re-checks.
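The drop-and-re-check loop can be sketched in a few lines; this uses plain tuples and a hypothetical function name, since the real code belongs to the generation layer:

```python
def trim_to_budget(chunks, max_tokens):
    """chunks: list of (score, token_count, content) tuples, any order.
    Repeatedly drop the lowest-scoring chunk until the remaining
    context fits within max_tokens."""
    kept = sorted(chunks, key=lambda c: c[0], reverse=True)
    while kept and sum(tok for _, tok, _ in kept) > max_tokens:
        kept.pop()  # lowest-scoring chunk is last after the sort
    return kept
```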


Full Retrieval Implementation
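As a sketch, the whole pipeline wires together like this. The sub-searches are injected as callables returning chunk ids (best first), and every name here is an assumption rather than the article's actual code:

```python
def retrieve(query, embed, vector_search, fulltext_search,
             limit=5, k=60, vector_weight=0.7, fulltext_weight=0.3):
    """Hybrid retrieval: embed the query, over-fetch limit * 2
    candidates from each source, fuse with weighted RRF, truncate."""
    query_embedding = embed(query)
    # Over-fetch so RRF has overlapping candidates to boost.
    vec_ids = vector_search(query_embedding, limit * 2)
    txt_ids = fulltext_search(query, limit * 2)
    scores = {}
    for weight, ranked in ((vector_weight, vec_ids),
                           (fulltext_weight, txt_ids)):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:limit]
```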


What I Learned

Hybrid search meaningfully outperforms pure vector search for technical content. I measured this informally by running 30 representative questions from my own notes against both strategies and checking whether the top result contained the actual answer. Pure vector: 72% top-1 accuracy. Hybrid: 87% top-1 accuracy. The biggest wins were on questions containing specific technical terms (command names, acronyms, proper nouns) that the vector model doesn't embed well.

limit * 2 in each sub-search is important for RRF. RRF needs enough candidates from each source to fuse effectively. If I request limit=5 from each source and they return disjoint sets, the fusion has no overlapping results to boost. Fetching 10 candidates from each source before re-ranking to 5 gives RRF more material to work with.

The similarity threshold needs to be re-tuned per domain and model. When I switched embedding models, I had to re-tune min_similarity from 0.3 to 0.25 because the new model's score distribution was shifted. A threshold calibrated for one model is not portable.

Logging what retrieval returns is the fastest way to debug bad answers. When the LLM gives a wrong or hallucinated answer, the first thing I check is what chunks were retrieved. If the right chunk wasn't in the retrieved set, it's a retrieval problem (chunking or embedding). If the right chunk was retrieved but the answer is still wrong, it's a generation problem (prompt construction or model reasoning). These require completely different fixes.


Next: Article 6, Prompt Construction and the Generation Layer
