# Article 2: pgvector on PostgreSQL — Setup, Vector Types, and Indexes

## Introduction

pgvector is a PostgreSQL extension that adds a native `vector` data type and similarity search operators. It's the difference between needing a separate vector database service and storing your embeddings in the same PostgreSQL instance you're already running.

This article covers installing pgvector, the `vector` column type, the two index types (HNSW and IVFFlat), and the SQLAlchemy 2 async schema I use in the RAG service. I also explain which index I chose and why.

***

## Table of Contents

1. [Installing pgvector](#installing-pgvector)
2. [The vector Data Type](#the-vector-data-type)
3. [Similarity Operators](#similarity-operators)
4. [Index Types: HNSW vs IVFFlat](#index-types)
5. [Database Schema for RAG](#database-schema)
6. [SQLAlchemy 2 Async Setup](#sqlalchemy-setup)
7. [Alembic Migration](#alembic-migration)
8. [Verifying the Setup](#verifying-the-setup)
9. [What I Learned](#what-i-learned)

***

## Installing pgvector

### Docker Compose (what I run)

The easiest setup is `pgvector/pgvector:pg16` — a pre-built image with the extension already installed:

```yaml
# config/docker-compose.yml
services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: rag
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: rag_db
    volumes:
      - pgdata:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U rag -d rag_db"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  pgdata:
```

Once the container starts, enable the extension in the target database:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
```

Alembic handles this in the initial migration (covered below), so I don't run it manually.

### On an existing PostgreSQL instance

If you're adding pgvector to an existing PostgreSQL 16 installation on Debian/Ubuntu:

```bash
# Install the pgvector extension package
apt install postgresql-16-pgvector
```

Then in `psql`, connected to your database:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
```

***

## The vector Data Type

`vector(n)` stores a fixed-dimension floating-point array. The dimension must match the embedding model you use:

| Model                                             | Dimensions |
| ------------------------------------------------- | ---------- |
| `all-MiniLM-L6-v2` (sentence-transformers)        | 384        |
| `text-embedding-3-small` (OpenAI / GitHub Models) | 1536       |
| `text-embedding-3-large` (OpenAI / GitHub Models) | 3072       |
| `togethercomputer/m2-bert-80M-8k-retrieval`       | 768        |

The dimension is fixed at column definition time. In the RAG service I use `384` (local model) by default and make it configurable via an environment variable for when I want to switch to an API-based model.

```sql
-- A chunks table with a 384-dimension embedding column
CREATE TABLE chunks (
    id          BIGSERIAL PRIMARY KEY,
    document_id BIGINT REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    content     TEXT NOT NULL,
    tokens      INTEGER,
    embedding   vector(384),
    created_at  TIMESTAMPTZ DEFAULT NOW()
);
```
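Making the dimension configurable can be as simple as reading an environment variable at service startup. This is a sketch of the idea — the variable name `EMBEDDING_DIM` and the `check_dim` helper are my assumptions, not the service's actual config:

```python
import os

# Read the embedding dimension from the environment, defaulting to the
# local sentence-transformers model's 384 dimensions.
EMBEDDING_DIM = int(os.environ.get("EMBEDDING_DIM", "384"))

# Known model -> dimension pairs, useful as a startup sanity check.
MODEL_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def check_dim(model_name: str, dim: int) -> None:
    """Fail fast if the configured dimension doesn't match the model."""
    expected = MODEL_DIMS.get(model_name)
    if expected is not None and expected != dim:
        raise ValueError(
            f"{model_name} produces {expected}-dim vectors, but EMBEDDING_DIM is {dim}"
        )
```

Failing fast here is worth it: without the check, a mismatch only surfaces at insert time as a pgvector dimension error.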

***

## Similarity Operators

pgvector provides three distance operators:

| Operator | Distance Metric         | Use Case                                   |
| -------- | ----------------------- | ------------------------------------------ |
| `<=>`    | Cosine distance         | Most embedding models (normalized vectors) |
| `<->`    | Euclidean (L2) distance | When vectors are not normalized            |
| `<#>`    | Negative inner product  | When using dot-product similarity          |

For sentence-transformer embeddings I always use cosine distance (`<=>`). The models produce normalized vectors, and cosine distance ranks results by direction (semantic closeness) rather than magnitude. (With normalized vectors, cosine and L2 distance actually yield the same ranking, so `<=>` is as much a convention as a requirement here.)

```sql
-- Top 5 most similar chunks to a query embedding
SELECT
    c.id,
    c.content,
    d.file_path,
    1 - (c.embedding <=> '[0.021, -0.043, ...]'::vector) AS similarity
FROM chunks c
JOIN documents d ON d.id = c.document_id
ORDER BY c.embedding <=> '[0.021, -0.043, ...]'::vector
LIMIT 5;
```

`1 - cosine_distance` converts to cosine similarity (higher = more similar). I return this as a score in the API response so callers can apply a minimum-similarity threshold.
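The distance-to-similarity conversion is easy to sanity-check in plain Python. This is the same arithmetic `<=>` performs, written out:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, as computed by pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def similarity(a: list[float], b: list[float]) -> float:
    """1 - cosine_distance: higher means more similar."""
    return 1.0 - cosine_distance(a, b)
```

Identical directions score 1.0, orthogonal vectors 0.0, and opposite directions -1.0 — which is what makes a minimum-similarity threshold in the API meaningful.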

***

## Index Types: HNSW vs IVFFlat <a href="#index-types" id="index-types"></a>

Without an index, pgvector does an exact sequential scan of every row. For a few thousand chunks this is fast enough, but with tens of thousands of chunks (or more) you want an approximate nearest-neighbor (ANN) index.

### HNSW (Hierarchical Navigable Small World)

```sql
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```

* Builds a multi-layer graph. Query traverses from the top layer down to approximate neighbors.
* **Pros**: High recall, fast queries, no need to specify the number of clusters up front
* **Cons**: Higher memory usage during build; `INSERT` is slower than IVFFlat because the graph must be updated on every new row
* **Best for**: Real-time ingestion with frequent inserts

### IVFFlat (Inverted File with Flat Quantization)

```sql
-- Must have data before building IVFFlat (needs to choose centroids)
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```

* Clusters vectors into `lists` Voronoi cells at build time; queries search `probes` cells.
* **Pros**: Faster build, lower memory
* **Cons**: Must rebuild when data distribution changes significantly; needs existing data to create
* **Best for**: Batch-ingested, relatively static corpora

**I use HNSW.** My ingestion pipeline runs incrementally — new articles are added continuously. HNSW handles inserts cleanly without requiring a full index rebuild, and at the scale of a personal knowledge base its extra memory use is negligible next to PostgreSQL's own footprint.

***

## Database Schema for RAG <a href="#database-schema" id="database-schema"></a>

Three tables: `documents`, `chunks`, and an `ingestion_jobs` tracking table.

```sql
-- Documents: one row per source file
CREATE TABLE documents (
    id          BIGSERIAL PRIMARY KEY,
    file_path   TEXT NOT NULL UNIQUE,  -- e.g. "artificial-intelligence/aiops-101/README.md"
    file_hash   TEXT NOT NULL,         -- SHA-256 of file content, for change detection
    title       TEXT,
    source_type TEXT DEFAULT 'markdown',
    char_count  INTEGER,
    ingested_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at  TIMESTAMPTZ DEFAULT NOW()
);

-- Chunks: sections of documents after splitting
CREATE TABLE chunks (
    id            BIGSERIAL PRIMARY KEY,
    document_id   BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index   INTEGER NOT NULL,    -- Position within document
    content       TEXT NOT NULL,       -- Raw text of the chunk
    tokens        INTEGER,             -- Approximate token count
    embedding     vector(384),         -- Null until embedded
    embedding_model TEXT DEFAULT 'all-MiniLM-L6-v2',
    created_at    TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (document_id, chunk_index)
);

CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Also a full-text index for hybrid search
CREATE INDEX ON chunks USING gin(to_tsvector('english', content));

-- Ingestion jobs: track async ingestion status
CREATE TABLE ingestion_jobs (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    status      TEXT NOT NULL DEFAULT 'pending',  -- pending/running/done/failed
    file_path   TEXT,
    file_count  INTEGER,
    chunks_created INTEGER DEFAULT 0,
    error       TEXT,
    started_at  TIMESTAMPTZ DEFAULT NOW(),
    finished_at TIMESTAMPTZ
);
```

The `file_hash` column on `documents` is how I detect changes. Before re-ingesting a file, the pipeline computes SHA-256 of the current content and compares it against the stored hash. If they match, the file hasn't changed and re-ingestion is skipped.

***

## SQLAlchemy 2 Async Setup <a href="#sqlalchemy-setup" id="sqlalchemy-setup"></a>

```python
# src/db/base.py
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker, AsyncSession
from sqlalchemy.orm import DeclarativeBase
from src.core.config import settings

engine = create_async_engine(
    settings.database_url,          # postgresql+asyncpg://...
    pool_size=10,
    max_overflow=20,
    pool_pre_ping=True,
    echo=settings.db_echo,
)

AsyncSessionLocal = async_sessionmaker(
    bind=engine,
    class_=AsyncSession,
    expire_on_commit=False,
)

class Base(DeclarativeBase):
    pass

from collections.abc import AsyncGenerator

async def get_db() -> AsyncGenerator[AsyncSession, None]:
    async with AsyncSessionLocal() as session:
        yield session
```

```python
# src/db/models.py
from __future__ import annotations
from datetime import datetime
from sqlalchemy import BigInteger, Integer, Text, DateTime, ForeignKey, String, UniqueConstraint
from sqlalchemy.orm import Mapped, mapped_column, relationship
from sqlalchemy.sql import func
from pgvector.sqlalchemy import Vector
from src.db.base import Base

class Document(Base):
    __tablename__ = "documents"

    id:           Mapped[int]      = mapped_column(BigInteger, primary_key=True)
    file_path:    Mapped[str]      = mapped_column(Text, unique=True, nullable=False)
    file_hash:    Mapped[str]      = mapped_column(Text, nullable=False)
    title:        Mapped[str | None] = mapped_column(Text)
    source_type:  Mapped[str]      = mapped_column(Text, default="markdown")
    char_count:   Mapped[int | None] = mapped_column(Integer)
    ingested_at:  Mapped[datetime] = mapped_column(DateTime(timezone=True), server_default=func.now())
    updated_at:   Mapped[datetime] = mapped_column(DateTime(timezone=True), server_default=func.now(), onupdate=func.now())

    chunks: Mapped[list[Chunk]] = relationship("Chunk", back_populates="document", cascade="all, delete-orphan")


class Chunk(Base):
    __tablename__ = "chunks"
    __table_args__ = (UniqueConstraint("document_id", "chunk_index"),)

    id:               Mapped[int]        = mapped_column(BigInteger, primary_key=True)
    document_id:      Mapped[int]        = mapped_column(BigInteger, ForeignKey("documents.id", ondelete="CASCADE"), nullable=False)
    chunk_index:      Mapped[int]        = mapped_column(Integer, nullable=False)
    content:          Mapped[str]        = mapped_column(Text, nullable=False)
    tokens:           Mapped[int | None] = mapped_column(Integer)
    embedding:        Mapped[list[float] | None] = mapped_column(Vector(384))
    embedding_model:  Mapped[str]        = mapped_column(String(128), default="all-MiniLM-L6-v2")
    created_at:       Mapped[datetime]   = mapped_column(DateTime(timezone=True), server_default=func.now())

    document: Mapped[Document] = relationship("Document", back_populates="chunks")
```

`pgvector.sqlalchemy.Vector` is the SQLAlchemy column type provided by the `pgvector` Python package. It maps directly to the `vector(n)` PostgreSQL type.

***

## Alembic Migration

The initial migration creates the extension and all tables:

```python
# alembic/versions/0001_initial.py
from alembic import op
import sqlalchemy as sa
from pgvector.sqlalchemy import Vector

revision = '0001'
down_revision = None

def upgrade() -> None:
    # Enable pgvector extension
    op.execute("CREATE EXTENSION IF NOT EXISTS vector")

    op.create_table(
        "documents",
        sa.Column("id",          sa.BigInteger(), primary_key=True),
        sa.Column("file_path",   sa.Text(),   nullable=False, unique=True),
        sa.Column("file_hash",   sa.Text(),   nullable=False),
        sa.Column("title",       sa.Text()),
        sa.Column("source_type", sa.Text(),   server_default="markdown"),
        sa.Column("char_count",  sa.Integer()),
        sa.Column("ingested_at", sa.DateTime(timezone=True), server_default=sa.func.now()),
        sa.Column("updated_at",  sa.DateTime(timezone=True), server_default=sa.func.now()),
    )

    op.create_table(
        "chunks",
        sa.Column("id",              sa.BigInteger(), primary_key=True),
        sa.Column("document_id",     sa.BigInteger(), sa.ForeignKey("documents.id", ondelete="CASCADE"), nullable=False),
        sa.Column("chunk_index",     sa.Integer(),    nullable=False),
        sa.Column("content",         sa.Text(),       nullable=False),
        sa.Column("tokens",          sa.Integer()),
        sa.Column("embedding",       Vector(384)),
        sa.Column("embedding_model", sa.String(128),  server_default="all-MiniLM-L6-v2"),
        sa.Column("created_at",      sa.DateTime(timezone=True), server_default=sa.func.now()),
        sa.UniqueConstraint("document_id", "chunk_index"),
    )

    # HNSW index for vector similarity
    op.execute("""
        CREATE INDEX chunks_embedding_hnsw_idx
        ON chunks USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
    """)

    # Full-text index for hybrid search
    op.execute("""
        CREATE INDEX chunks_fts_idx
        ON chunks USING gin(to_tsvector('english', content))
    """)

def downgrade() -> None:
    op.drop_table("chunks")
    op.drop_table("documents")
    op.execute("DROP EXTENSION IF EXISTS vector")
```

Run migrations:

```bash
alembic upgrade head
```

***

## Verifying the Setup

After running migrations and starting the PostgreSQL container, verify everything is working:

```bash
# Connect to the database
docker exec -it rag-postgres-1 psql -U rag -d rag_db
```

Then inside `psql`:

```sql
-- Check the extension is active
SELECT * FROM pg_extension WHERE extname = 'vector';

-- Check the tables exist
\dt

-- Check the HNSW index
\di chunks_embedding_hnsw_idx

-- Insert a test vector and query it
INSERT INTO documents (file_path, file_hash) VALUES ('test.md', 'abc123');

INSERT INTO chunks (document_id, chunk_index, content, embedding)
VALUES (1, 0, 'This is a test chunk', '[0.1, 0.2, 0.3, ...]'::vector(384));

-- Similarity search (replace with an actual 384-dimension vector)
SELECT id, content, 1 - (embedding <=> '[0.1, 0.2, ...]'::vector) AS similarity
FROM chunks
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 3;
```
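The `...` literals above have to be full 384-dimension vectors in practice. A throwaway helper can generate one to paste into the test statements — this is a convenience script of my own, not part of the service:

```python
import random

def random_vector_literal(dim: int = 384, seed: int = 0) -> str:
    """A pgvector-compatible literal with `dim` components, e.g. '[0.1, -0.4, ...]'."""
    rng = random.Random(seed)  # seeded so the literal is reproducible
    values = [round(rng.uniform(-1.0, 1.0), 4) for _ in range(dim)]
    return "[" + ", ".join(str(v) for v in values) + "]"

if __name__ == "__main__":
    # Paste the output into the INSERT / SELECT statements.
    print(random_vector_literal())
```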

***

## What I Learned

**HNSW maintains itself on insert, but the build parameters are worth understanding.** The default `m = 16, ef_construction = 64` is a reasonable starting point. `m` controls the number of bidirectional links per node — higher means better recall but more memory. `ef_construction` controls build-time search effort — higher means a better-quality index but slower inserts. For a personal knowledge base I haven't needed to touch these defaults.

**`ef_search` at query time is separate from `ef_construction`.** You can tune recall at query time without rebuilding the index:

```sql
SET hnsw.ef_search = 100;  -- Default is 40
SELECT ... ORDER BY embedding <=> $1 LIMIT 5;
```

Higher `ef_search` = higher recall at the cost of query latency. I run with 100 because I care more about not missing relevant chunks than about shaving milliseconds off the search.

**The full-text index deserves as much thought as the vector index.** The `gin(to_tsvector('english', content))` index is what makes hybrid search possible. I set it up from the start rather than adding it later, because adding a GIN index to a large table is slow and locks the table briefly.

**Dimension is not a free variable.** When I considered switching from `all-MiniLM-L6-v2` (384 dims) to `text-embedding-3-small` (1536 dims), it required a migration to change the `vector(384)` column to `vector(1536)` and re-embed everything. Choosing the embedding model early and sticking with it saves this pain.

***

**Next**: [Article 3 — Document Loading and Chunking Strategies](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/rag-101/rag-101-chunking)
