Article 4: Generating and Storing Embeddings in pgvector

Introduction

After the chunker produces a list of text segments, each segment needs to be converted to a vector: a list of floating-point numbers that represents its semantic meaning in a high-dimensional space. That vector is what gets stored in pgvector and queried during retrieval.

This article covers the two embedding providers I use (sentence-transformers for local inference and the GitHub Models API for API-based inference), the batching strategy that makes bulk ingestion practical, and how the vectors are stored in the database.


What an Embedding Is

An embedding model takes text and outputs a fixed-length vector of floats. Models are trained so that semantically similar texts produce vectors that are close together in vector space (measured by cosine distance), while unrelated texts are far apart.

For example:

  • "configure TLS on Kubernetes ingress" and "set up HTTPS certificate for ingress controller" should have high cosine similarity

  • "configure TLS on Kubernetes ingress" and "Python list comprehension syntax" should have low cosine similarity

The model learns this structure from large text corpora. It starts from the same transformer pre-training as language models, but is then fine-tuned so that the output is a dense vector capturing similarity rather than a next-token prediction.

The embedding is not interpretable directly. You can't look at position 142 of a 384-dimensional vector and know what it means. What matters is the relative distances, not the absolute values.
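To make "close in vector space" concrete, here is a minimal sketch of cosine similarity over plain Python lists. The vectors below are made-up toy values, not real model output; real embeddings have hundreds of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for the examples above.
tls_ingress = [0.9, 0.1, 0.2, 0.0]   # "configure TLS on Kubernetes ingress"
https_cert  = [0.8, 0.2, 0.3, 0.1]   # "set up HTTPS certificate for ingress"
python_list = [0.0, 0.9, 0.0, 0.8]   # "Python list comprehension syntax"

print(cosine_similarity(tls_ingress, https_cert))   # high: related texts
print(cosine_similarity(tls_ingress, python_list))  # low: unrelated texts
```

A vector compared with itself scores exactly 1.0, which is the upper bound for cosine similarity.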


Embedding Provider Design

I use an abstract base class so the ingestion pipeline and query pipeline are decoupled from the specific embedding model.
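A minimal sketch of what such a base class might look like; the method and attribute names here are my assumptions, not necessarily the originals:

```python
from abc import ABC, abstractmethod

class EmbeddingProvider(ABC):
    """Interface the ingestion and query pipelines depend on."""

    #: Vector dimensionality, needed to declare the pgvector column width.
    dimensions: int

    @abstractmethod
    async def embed(self, texts: list[str]) -> list[list[float]]:
        """Embed a batch of texts; returns one vector per input text."""
        ...
```

The async signature matters: the local provider wraps a synchronous model call in an executor, while the API provider awaits an HTTP request, and callers can't tell the difference.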

Both the local provider and the API provider implement this interface, and the ingestion pipeline only knows about EmbeddingProvider. Switching providers requires changing one environment variable.


Local Embeddings with sentence-transformers

all-MiniLM-L6-v2 is the model I use for local inference. It produces 384-dimensional vectors, is fast (on the order of 10 ms for a small batch on CPU), has a 256-token input limit, and its quality is good enough for sentence- and short-paragraph-level similarity.

normalize_embeddings=True is important. It L2-normalizes each vector before returning it, which makes cosine similarity equivalent to a plain dot product, so scores from pgvector's <=> (cosine distance) operator stay on a consistent scale across providers.

Running model inference in run_in_executor is necessary because SentenceTransformer.encode() is synchronous and CPU-bound; calling it directly from an async function would block the event loop.
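The pattern looks roughly like this. I've stubbed the model call with a placeholder so the shape of the code is visible without the sentence-transformers dependency; the real version would call SentenceTransformer.encode(texts, normalize_embeddings=True) inside _encode_sync:

```python
import asyncio
import math

def _encode_sync(texts: list[str]) -> list[list[float]]:
    # Stand-in for the CPU-bound model call, e.g.:
    #   model.encode(texts, normalize_embeddings=True)
    # Here: a fake 4-dim vector per text, L2-normalized like the real output.
    raw = [[float(len(t)), 1.0, 0.5, 0.25] for t in texts]
    return [[x / math.sqrt(sum(v * v for v in vec)) for x in vec] for vec in raw]

async def embed(texts: list[str]) -> list[list[float]]:
    # Run the synchronous, CPU-bound encode off the event loop so other
    # coroutines (HTTP handlers, DB I/O) keep making progress meanwhile.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, _encode_sync, texts)

vectors = asyncio.run(embed(["hello", "world and more"]))
```

Passing None as the executor uses the default thread pool, which is fine here because the heavy lifting happens inside the model's native code, outside the GIL for most of the time.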


API Embeddings with GitHub Models

When I want higher-quality embeddings (particularly for longer documents), I use text-embedding-3-small via the GitHub Models API. The same OpenAI SDK client works because GitHub Models uses the OpenAI API format.

The GitHub Models API has a batch limit of 2,048 input items per call. In practice my batches are much smaller (32–64 texts), so this is not an issue.


Batching for Bulk Ingestion

Embedding models, both local and API, are much more efficient in batch mode than one-at-a-time. A batch of 64 texts takes roughly the same time as a batch of 1 on the local model (because the GPU/CPU can parallelize the matrix multiplications), and using API batches reduces the number of HTTP round-trips.
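The batching itself is just slicing the chunk list into fixed-size groups; a sketch, assuming a batch size of 64:

```python
from typing import Iterator

def batched(texts: list[str], batch_size: int = 64) -> Iterator[list[str]]:
    """Yield successive fixed-size slices of the input list."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

# 150 chunks split into batches of 64, 64, and 22.
sizes = [len(b) for b in batched([f"chunk {i}" for i in range(150)])]
print(sizes)  # [64, 64, 22]
```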

The bulk update uses per-row UPDATE statements. For very large corpora, this could be improved with executemany or a temporary table approach, but for a personal knowledge base of a few thousand chunks it's fast enough.
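To illustrate the difference, here is a sketch using stdlib sqlite3 as a stand-in for PostgreSQL; the real code goes through SQLAlchemy and a vector column, and the table layout here is simplified:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, embedding TEXT)")
conn.executemany("INSERT INTO chunks (id) VALUES (?)", [(i,) for i in range(5)])

# Per-row UPDATEs: one statement execution per chunk (the current approach).
for chunk_id in range(5):
    conn.execute("UPDATE chunks SET embedding = ? WHERE id = ?",
                 (f"[vector for {chunk_id}]", chunk_id))

# executemany: one prepared statement, many parameter sets; fewer round-trips
# when there is a real network between the application and the database.
conn.executemany("UPDATE chunks SET embedding = ? WHERE id = ?",
                 [(f"[vector for {i}]", i) for i in range(5)])
conn.commit()
```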


Storing Embeddings in pgvector

The pgvector Python package provides Vector for SQLAlchemy and handles serialization automatically. When writing a Python list[float] to a Vector(384) column, the package converts it to the binary wire format that PostgreSQL expects.

For the raw SQL path (used in retrieval), vectors are cast to the PostgreSQL type using ::vector.

The :query_vec::vector cast in the raw SQL query converts the bind parameter, passed as the string representation of the vector, into pgvector's native type.
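pgvector's text representation is a bracketed, comma-separated list, so the bind parameter can be built from a plain Python list before the ::vector cast parses it. A sketch of the formatting; the SQL string is illustrative, not the exact production query:

```python
def to_pgvector_literal(vec: list[float]) -> str:
    """Format a Python list as pgvector's text representation, e.g. '[1,2,3]'."""
    return "[" + ",".join(str(x) for x in vec) + "]"

query_vec = to_pgvector_literal([0.12, -0.08, 0.97])

sql = """
    SELECT id, content
    FROM chunks
    ORDER BY embedding <=> :query_vec::vector
    LIMIT 5
"""
# query_vec is passed as the :query_vec bind parameter:
print(query_vec)  # [0.12,-0.08,0.97]
```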


The Embedding Background Worker

The ingestion pipeline creates chunks with embedding=NULL and then triggers a background task to embed them asynchronously. This keeps the ingestion HTTP response fast (the file is loaded and chunked immediately) while the embedding computation happens in the background.

This worker is started as a background task in FastAPI's lifespan. If the process restarts mid-embedding (e.g., power outage, container restart), the worker will find the partially-embedded chunks on next startup and resume from where it left off β€” because the chunk embedding column is NULL until it's successfully committed.
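The resumability argument in miniature: the worker just keeps asking "which chunks still have no embedding?" until the answer is none. A sketch with an in-memory stand-in for the chunks table; the real version issues a WHERE embedding IS NULL query and commits per batch:

```python
import asyncio

# Stand-in for the chunks table: embedding is None until committed.
chunks = [{"id": i, "text": f"chunk {i}", "embedding": None} for i in range(5)]

async def fake_embed(texts: list[str]) -> list[list[float]]:
    return [[0.1, 0.2, 0.3] for _ in texts]  # placeholder vectors

async def embedding_worker(batch_size: int = 2) -> None:
    while True:
        # Equivalent of: SELECT ... WHERE embedding IS NULL LIMIT batch_size
        pending = [c for c in chunks if c["embedding"] is None][:batch_size]
        if not pending:
            break  # nothing left; a long-running worker would sleep instead
        vectors = await fake_embed([c["text"] for c in pending])
        for chunk, vec in zip(pending, vectors):
            chunk["embedding"] = vec  # "committed": a restart picks up the rest

asyncio.run(embedding_worker())
```

Because each batch is committed before the next one starts, a crash loses at most one batch of work, and the NULL check makes the loop naturally idempotent.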


What I Learned

normalize_embeddings=True is easy to forget and hard to diagnose. Without normalization, cosine similarity and dot product stop being equivalent: any scoring that uses dot products or L2 distance gets polluted by vector magnitudes instead of reflecting angle alone. The symptom is "retrieval returns random-looking results". Always normalize.
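The effect is easy to see numerically: after L2 normalization the dot product of two vectors is exactly their cosine similarity, while without it magnitude dominates. A toy demonstration:

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

a = [3.0, 4.0]     # magnitude 5
b = [30.0, 40.0]   # same direction, magnitude 50

print(dot(a, b))                              # 250.0: magnitude pollutes
print(dot(l2_normalize(a), l2_normalize(b)))  # 1.0: pure angle, cosine sim
```

The two vectors point in exactly the same direction, so any angle-based score should say "identical" (1.0); the raw dot product instead reports an arbitrary magnitude-dependent number.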

Batch size of 64 is a reasonable default but test on your hardware. On my laptop with an Apple M2, the local model processes 64 texts in about 60ms. On a CPU-only x86 machine in a cloud VM, the same batch takes ~350ms. There's no universal optimal batch size; measure on your target hardware.

Embedding after ingestion, not during, keeps the API responsive. My first version embedded synchronously inside the HTTP handler. Ingesting 200 files took 40 seconds and the HTTP connection timed out. The background worker pattern eliminates this: the ingestion call returns in under 2 seconds, and embedding happens asynchronously over the following minutes.

Track which model embedded each chunk. I store embedding_model on each Chunk row. When I switched from all-MiniLM-L6-v2 to text-embedding-3-small, I needed to know which chunks were embedded with which model, because mixing models in the same similarity search produces garbage results: they live in different vector spaces. The embedding_model column made it easy to find and re-embed the old chunks.
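With that column in place, finding stale chunks is a single filter. Sketched with stdlib sqlite3 as a stand-in; the column name matches the description above, the rest is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, embedding_model TEXT)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?)",
    [(1, "all-MiniLM-L6-v2"), (2, "all-MiniLM-L6-v2"), (3, "text-embedding-3-small")],
)

CURRENT_MODEL = "text-embedding-3-small"

# Chunks embedded by any other model need re-embedding before they can
# participate in the same similarity search as the new vectors.
stale = conn.execute(
    "SELECT id FROM chunks WHERE embedding_model != ?", (CURRENT_MODEL,)
).fetchall()
print(stale)  # [(1,), (2,)]
```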


Next: Article 5 - Semantic Search, Cosine Similarity, and Hybrid Retrieval
