Article 2: pgvector on PostgreSQL — Setup, Vector Types, and Indexes

Introduction

pgvector is a PostgreSQL extension that adds a native vector data type and similarity search operators. It's the difference between needing a separate vector database service and storing your embeddings in the same PostgreSQL instance you're already running.

This article covers installing pgvector, the vector column type, the two index types (HNSW and IVFFlat), and the SQLAlchemy 2 async schema I use in the RAG service. I also explain which index I chose and why.


Installing pgvector

Docker Compose (what I run)

The easiest setup is pgvector/pgvector:pg16 — a pre-built image with the extension already installed:
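A minimal compose file along these lines is enough; the service name, credentials, and volume name are placeholders:

```yaml
services:
  db:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: rag        # placeholder credentials
      POSTGRES_PASSWORD: rag
      POSTGRES_DB: rag
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
```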

Once the container starts, enable the extension in the target database:
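Enabling it is a single statement per database:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
```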

Alembic handles this in the initial migration (covered below), so I don't run it manually.

On an existing PostgreSQL instance

If you're adding pgvector to an existing PostgreSQL 16 installation on Debian/Ubuntu:
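Something along these lines, assuming the PGDG apt repository is configured (the package name follows the PGDG convention for PostgreSQL 16):

```shell
sudo apt install postgresql-16-pgvector
```

Then run CREATE EXTENSION vector; in each database that needs it.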


The vector Data Type

vector(n) stores a fixed-dimension floating-point array. The dimension must match the embedding model you use:

Model                                            Dimensions
all-MiniLM-L6-v2 (sentence-transformers)         384
text-embedding-3-small (OpenAI / GitHub Models)  1536
text-embedding-3-large (OpenAI / GitHub Models)  3072
togethercomputer/m2-bert-80M-8k-retrieval        768

The dimension is fixed at column definition time. In the RAG service I use 384 (local model) by default and make it configurable via an environment variable for when I want to switch to an API-based model.


Similarity Operators

pgvector provides three distance operators:

Operator  Distance Metric          Use Case
<=>       Cosine distance          Most embedding models (normalized vectors)
<->       Euclidean (L2) distance  When vectors are not normalized
<#>       Negative inner product   When using dot-product similarity

For sentence-transformer embeddings I always use cosine distance (<=>). The models produce normalized vectors, and cosine similarity correctly captures directional closeness (same meaning) vs. magnitude.

1 - cosine_distance converts to cosine similarity (higher = more similar). I return this as a score in the API response so callers can apply a minimum-similarity threshold.
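As a sketch, assuming a chunks table with an embedding column (as in the schema section below) and a parameterized query vector:

```sql
-- :query_embedding is the query vector literal, e.g. '[0.12, -0.03, ...]'::vector
SELECT id,
       content,
       1 - (embedding <=> :query_embedding) AS score  -- cosine similarity, higher = closer
FROM chunks
ORDER BY embedding <=> :query_embedding               -- cosine distance, index-friendly
LIMIT 5;
```

Ordering by the raw <=> distance (not by the derived score) is what lets the planner use the vector index.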


Index Types: HNSW vs IVFFlat

Without an index, pgvector does an exact sequential scan of every row. For a few thousand chunks this is fast enough, but with tens of thousands of chunks (or more) you want an approximate nearest-neighbor (ANN) index.

HNSW (Hierarchical Navigable Small World)

  • Builds a multi-layer graph. Query traverses from the top layer down to approximate neighbors.

  • Pros: High recall, fast queries, no need to specify the number of clusters up front

  • Cons: Higher memory usage during build; INSERT is slower than IVFFlat because the graph must be updated on every new row

  • Best for: Real-time ingestion with frequent inserts

IVFFlat (Inverted File with Flat Quantization)

  • Clusters vectors into lists (Voronoi cells) at build time; a query searches only the probes nearest cells.

  • Pros: Faster build, lower memory

  • Cons: Must rebuild when data distribution changes significantly; needs existing data to create

  • Best for: Batch-ingested, relatively static corpora

I use HNSW. My ingestion pipeline runs incrementally — new articles are added continuously. HNSW handles inserts cleanly without requiring a full index rebuild. The memory overhead is not a concern on any machine running PostgreSQL.
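The index DDL looks like this; the index name is illustrative, vector_cosine_ops matches the <=> operator, and m / ef_construction are shown at their defaults:

```sql
CREATE INDEX idx_chunks_embedding ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```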


Database Schema for RAG

Three tables: documents, chunks, and an ingestion_jobs tracking table.

The file_hash column on documents is how I detect changes. Before re-ingesting a file, the pipeline computes SHA-256 of the current content and compares it against the stored hash. If they match, the file hasn't changed and re-ingestion is skipped.
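In SQL terms, the shape of the schema is roughly this; column names beyond file_hash and embedding are illustrative, and the authoritative definitions live in the SQLAlchemy models:

```sql
CREATE TABLE documents (
    id         BIGSERIAL PRIMARY KEY,
    path       TEXT NOT NULL UNIQUE,
    file_hash  TEXT NOT NULL,           -- SHA-256 of the file content
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE chunks (
    id          BIGSERIAL PRIMARY KEY,
    document_id BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    content     TEXT NOT NULL,
    embedding   vector(384)             -- dimension matches the embedding model
);

CREATE TABLE ingestion_jobs (
    id          BIGSERIAL PRIMARY KEY,
    status      TEXT NOT NULL,
    started_at  TIMESTAMPTZ,
    finished_at TIMESTAMPTZ
);
```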


SQLAlchemy 2 Async Setup

pgvector.sqlalchemy.Vector is the SQLAlchemy column type provided by the pgvector Python package. It maps directly to the vector(n) PostgreSQL type.


Alembic Migration

The initial migration creates the extension and all tables:
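Sketched as the upgrade step of the initial revision (table creation elided with a comment; the essential detail is creating the extension before any vector column exists):

```python
from alembic import op


def upgrade() -> None:
    # The extension must exist before any vector(n) column is created
    op.execute("CREATE EXTENSION IF NOT EXISTS vector")
    # ... op.create_table(...) calls for documents, chunks, ingestion_jobs,
    # plus the HNSW and GIN indexes ...
```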

Run migrations with alembic upgrade head.


Verifying the Setup

After running migrations and starting the PostgreSQL container, verify everything is working:
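A couple of quick checks from psql:

```sql
-- the extension is installed and reports its version
SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';

-- a vector literal round-trips and distance works
SELECT '[1,2,3]'::vector <=> '[1,2,3]'::vector;  -- cosine distance of identical vectors is 0

-- the tables exist
\dt
```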


What I Learned

The HNSW index build is automatic for inserts but requires tuning. The default m=16, ef_construction=64 is a reasonable starting point. m controls the number of bidirectional links per node — higher means better recall but more memory. ef_construction controls build-time search effort — higher means better quality index but slower inserts. For a personal knowledge base I haven't needed to touch these defaults.

ef_search at query time is separate from ef_construction. You can tune recall at query time without rebuilding the index:
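The knob is a session-level setting:

```sql
SET hnsw.ef_search = 100;  -- default is 40; applies to the current session
```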

Higher ef_search = higher recall at the cost of query latency. I run with 100 because I care more about not missing relevant chunks than about shaving milliseconds off the search.

The full-text index deserves as much thought as the vector index. The gin(to_tsvector('english', content)) index is what makes hybrid search possible. I set it up from the start rather than adding it later, because adding a GIN index to a large table is slow and locks the table briefly.

Dimension is not a free variable. When I considered switching from all-MiniLM-L6-v2 (384 dims) to text-embedding-3-small (1536 dims), it required a migration to change the vector(384) column to vector(1536) and re-embed everything. Choosing the embedding model early and sticking with it saves this pain.


Next: Article 3 — Document Loading and Chunking Strategies
