Part 4: Embeddings and Vector Search

The Concept That Changed How I Build Everything

Before I understood embeddings, my search systems were all keyword-based. If a user asked "how to deploy containers" and my documentation said "pushing Docker images to a registry," the system returned nothing. Same concept, different words, zero results.

Embeddings changed that. An embedding converts text into a high-dimensional vector (a list of numbers) that captures the semantic meaning of the text. Similar meanings produce similar vectors, regardless of the exact words used. When I integrated embedding-based search into my personal knowledge base, the quality of retrieval improved dramatically.

This article covers embeddings from the AI engineer's perspective: what they are, how to generate them, how to store and search them, and the practical decisions I made building my own systems.


What Embeddings Actually Represent

An embedding is a list of floating-point numbers (a vector) that represents the meaning of a piece of text in a geometric space. Texts with similar meanings have vectors that are close together.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "How to deploy a Docker container",
    "Pushing images to a container registry",
    "Making pasta from scratch",
]

embeddings = model.encode(texts)

print(f"Shape: {embeddings.shape}")    # (3, 384)
print(f"First 5 dims: {embeddings[0][:5]}")  # [-0.034, 0.089, ...]

Each text becomes a 384-dimensional vector (with all-MiniLM-L6-v2). You can't read individual dimensions; they don't correspond to human-understandable features. But you can measure distances between vectors:
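To make "close together" concrete, here's a minimal numpy sketch; the toy 3-dimensional vectors stand in for real 384-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; range -1 to 1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings
docker = np.array([0.9, 0.1, 0.0])
registry = np.array([0.7, 0.3, 0.1])
pasta = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(docker, registry))  # high: related concepts
print(cosine_similarity(docker, pasta))     # near zero: unrelated
```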

The Docker and container registry texts have high similarity (~0.65) because they describe related concepts. The Docker and pasta texts have near-zero similarity because they're completely unrelated. This is the core mechanism behind semantic search.


Generating Embeddings

I've used two approaches in my projects: local models and API providers.

Local Embeddings with sentence-transformers

This is what I use for development and for projects where I want zero API dependency:

Key decisions I made here:

  • normalize_embeddings=True: This normalizes vectors to unit length. When vectors are normalized, cosine similarity equals dot product, which is faster to compute. Always normalize.

  • all-MiniLM-L6-v2: A 22M parameter model that produces 384-dimensional embeddings. It's small enough to run on a laptop CPU in milliseconds, and the quality is good enough for most retrieval tasks I've needed.

API-Based Embeddings

For production systems where I need higher-quality embeddings:

Choosing Between Local and API Embeddings

Factor       Local (all-MiniLM-L6-v2)     API (text-embedding-3-small)
Dimension    384                          1536
Quality      Good for general retrieval   Better for nuanced similarity
Latency      <10ms per batch              100-300ms per batch (network)
Cost         Free                         ~$0.02 per 1M tokens
Offline      Yes                          No
Batch size   Limited by RAM               Limited by API (2048 texts)

For my git-book RAG project, I started with local embeddings during development and switched to API embeddings when I wanted better retrieval quality for production. The provider abstraction from Part 2 made this a config change.


Vector Similarity: How Search Actually Works

Once you have embeddings stored, you need to find the most similar ones to a query. There are three common distance metrics:

Cosine Similarity

Measures the angle between two vectors. Range: -1 (opposite) to 1 (identical).

Euclidean (L2) Distance

Measures straight-line distance between two points. Range: 0 (identical) to ∞.

Inner Product (Dot Product)

Measures alignment. For normalized vectors, equals cosine similarity.
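All three side by side in numpy; note that cosine ignores the magnitude difference that Euclidean distance picks up:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
inner = np.dot(a, b)

print(cosine)     # ~1.0: same direction, maximal similarity
print(euclidean)  # ~3.74: nonzero, because magnitudes differ
print(inner)      # 28.0
```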

Which One to Use?

I use cosine similarity (or equivalently, cosine distance) for almost everything. Here's why:

  • It's magnitude-independent: it only cares about direction, not vector length

  • Most embedding models are designed to be evaluated with cosine similarity

  • When vectors are normalized, cosine similarity equals dot product, so you get the speed benefit of dot product with the semantics of cosine

In pgvector, cosine distance is the operator <=>:
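A sketch of the query shape, as a SQL string (the chunks table and embedding column are my naming):

```python
# <=> is cosine DISTANCE (0 = identical, 2 = opposite), so we order
# ascending; 1 - distance recovers cosine similarity for display.
NEAREST_SQL = """
SELECT id, content, 1 - (embedding <=> :query_vec) AS similarity
FROM chunks
ORDER BY embedding <=> :query_vec
LIMIT 5;
"""
```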


Storing Vectors with pgvector

I use PostgreSQL with the pgvector extension for vector storage. I considered dedicated vector databases (Pinecone, Weaviate, Qdrant), but for my project scale, pgvector has significant advantages:

  • I already use PostgreSQL for application data, so no extra infrastructure

  • Full SQL capabilities alongside vector search

  • ACID transactions: I can insert documents and their embeddings atomically

  • Mature tooling: SQLAlchemy, Alembic, pg_dump all work

Setting Up pgvector
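pgvector ships as a separate extension, so after installing the package on the database server (the package name varies by platform) it has to be enabled once per database. A sketch of the statement, which I run from an Alembic migration:

```python
# Run once per database; requires the pgvector package installed
# on the PostgreSQL server itself.
ENABLE_PGVECTOR = "CREATE EXTENSION IF NOT EXISTS vector;"
print(ENABLE_PGVECTOR)
```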

Database Schema
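A sketch of the table shape (names are mine, and it assumes a parent documents table; the vector dimension must match your embedding model, 384 for all-MiniLM-L6-v2 or 1536 for text-embedding-3-small):

```python
# DDL as a string; in my projects this lives in an Alembic migration.
CREATE_CHUNKS = """
CREATE TABLE chunks (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    document_id BIGINT NOT NULL REFERENCES documents(id),
    content TEXT NOT NULL,
    embedding vector(384) NOT NULL
);
"""
```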

Async Database Engine


The Ingestion Pipeline

Before you can search, you need to embed and store your text. Here's the pipeline I built for my knowledge base:

Step 1: Load Documents
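My sources are markdown files on disk, so loading is just a recursive directory walk. A sketch (the .md glob matches my content; adjust for yours):

```python
from pathlib import Path

def load_documents(root: str) -> dict[str, str]:
    """Read every markdown file under root, keyed by relative path."""
    base = Path(root)
    return {
        str(path.relative_to(base)): path.read_text(encoding="utf-8")
        for path in sorted(base.rglob("*.md"))
    }
```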

Step 2: Chunk Text

Chunking is where most RAG systems go wrong. The goal is to split text into pieces small enough to embed meaningfully but large enough to preserve context.

I tried three chunking strategies on my content:

Strategy                   Recall@5   Notes
Fixed size (500 chars)     0.52       Splits mid-sentence, loses context
Recursive text splitting   0.64       Better boundaries but still arbitrary
Header-based splitting     0.78       Best for structured markdown content

Header-based splitting worked best because my git-book articles use consistent heading structure. Each section under an H2 or H3 is a self-contained unit of knowledge.
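A sketch of that splitter, assuming ATX-style ## / ### headings (my articles use them consistently):

```python
import re

def chunk_by_headers(markdown: str) -> list[str]:
    """Split markdown at H2/H3 headings; each section (heading plus
    body) becomes one self-contained chunk."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if re.match(r"^#{2,3} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```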

Step 3: Embed and Store


With documents ingested, searching is straightforward:

In my knowledge base, the top result is exactly the article about pgvector setup, even though the query and the document use slightly different phrasing.


Indexing for Performance

Without an index, pgvector does a sequential scan: it compares the query vector against every stored vector. This is fine for thousands of chunks but becomes slow at hundreds of thousands.

IVFFlat Index

IVFFlat partitions vectors into clusters (lists). At query time, it only searches a subset of clusters. The lists parameter controls the number of partitions; the pgvector docs recommend rows / 1000 for up to 1M rows and sqrt(rows) beyond that.
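As a sketch (index DDL as a string; chunks/embedding are my naming, and vector_cosine_ops matches cosine-distance searches):

```python
# Build AFTER bulk-loading data: IVFFlat picks its cluster centers
# from the rows present at index-creation time.
CREATE_IVFFLAT = """
CREATE INDEX ON chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
"""
```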

HNSW Index

HNSW (Hierarchical Navigable Small World) is a graph-based index. In my testing:

Index                 Build Time   Query Time   Recall@10   Memory
None (seq scan)       0s           45ms         1.00        Baseline
IVFFlat (100 lists)   2s           3ms          0.95        +15%
HNSW (m=16)           8s           1ms          0.99        +40%

For my knowledge base (~10k chunks), either index is fast enough. I use HNSW because the recall is better and the query time is consistently low.
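The index I ended up with, as a sketch (m = 16 as in the benchmark; ef_construction = 64 is pgvector's default):

```python
# Unlike IVFFlat, HNSW can be built on an empty table and stays
# accurate as rows are added; the cost is build time and memory.
CREATE_HNSW = """
CREATE INDEX ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
"""
```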


Practical Lessons

Things I learned building embedding-based search into my projects:

  1. Normalize your embeddings. Always set normalize_embeddings=True when using sentence-transformers. This makes cosine distance equivalent to dot product, which is faster and avoids bugs where magnitude affects results.

  2. Chunk size matters more than embedding model. Switching from 500-char fixed chunks to header-based chunks improved my Recall@5 by 50%. Switching from all-MiniLM-L6-v2 to text-embedding-3-small improved it by 10%. Invest in chunking first.

  3. Set a minimum similarity threshold. Without it, every query returns results β€” even completely irrelevant ones. I use 0.3 as a minimum. Below that, it's better to return "I don't have information about this" than to surface noise.

  4. Embed queries and documents with the same model. This sounds obvious, but I once had a bug where ingestion used all-MiniLM-L6-v2 and search used text-embedding-3-small. The similarity scores were meaningless because the vector spaces were different.

  5. pgvector is good enough. For sub-million vector collections, PostgreSQL with pgvector handles everything I need. I get SQL joins (fetching document metadata alongside chunks), ACID transactions, and familiar tooling. I'd consider a dedicated vector database if I needed billions of vectors or real-time index updates at scale.


Previous: Part 3 – How LLMs Work

Next: Part 5 – Prompt Engineering for Production Systems
