Part 4: Embeddings and Vector Search

The Concept That Changed How I Build Everything

Before I understood embeddings, my search systems were all keyword-based. If a user asked "how to deploy containers" and my documentation said "pushing Docker images to a registry," the system returned nothing. Same concept, different words, zero results.

Embeddings changed that. An embedding converts text into a high-dimensional vector (a list of numbers) that captures the semantic meaning of the text. Similar meanings produce similar vectors, regardless of the exact words used. When I integrated embedding-based search into my personal knowledge base, the quality of retrieval improved dramatically.

This article covers embeddings from the AI engineer's perspective: what they are, how to generate them, how to store and search them, and the practical decisions I made building my own systems.


What Embeddings Actually Represent

An embedding is a list of floating-point numbers (a vector) that represents the meaning of a piece of text in a geometric space. Texts with similar meanings have vectors that are close together.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "How to deploy a Docker container",
    "Pushing images to a container registry",
    "Making pasta from scratch",
]

embeddings = model.encode(texts)

print(f"Shape: {embeddings.shape}")    # (3, 384)
print(f"First 5 dims: {embeddings[0][:5]}")  # [-0.034, 0.089, ...]

Each text becomes a 384-dimensional vector (with all-MiniLM-L6-v2). You can't read individual dimensions; they don't correspond to human-understandable features. But you can measure distances between vectors:
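To make "close together" concrete, here's a minimal numpy sketch; the toy 3-dimensional vectors stand in for real 384-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; range -1 to 1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings
docker = np.array([0.9, 0.1, 0.0])
registry = np.array([0.7, 0.3, 0.1])
pasta = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(docker, registry))  # high: related concepts
print(cosine_similarity(docker, pasta))     # near zero: unrelated
```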

The Docker and container registry texts have high similarity (~0.65) because they describe related concepts. The Docker and pasta texts have near-zero similarity because they're completely unrelated. This is the core mechanism behind semantic search.


Generating Embeddings

I've used two approaches in my projects: local models and API providers.

Local Embeddings with sentence-transformers

This is what I use for development and for projects where I want zero API dependency:

Key decisions I made here:

  • normalize_embeddings=True: This normalizes vectors to unit length. When vectors are normalized, cosine similarity equals dot product, which is faster to compute. Always normalize.

  • all-MiniLM-L6-v2: A 22M parameter model that produces 384-dimensional embeddings. It's small enough to run on a laptop CPU in milliseconds, and the quality is good enough for most retrieval tasks I've needed.

API-Based Embeddings

For production systems where I need higher-quality embeddings:

Choosing Between Local and API Embeddings

Factor       Local (all-MiniLM-L6-v2)     API (text-embedding-3-small)
Dimension    384                          1536
Quality      Good for general retrieval   Better for nuanced similarity
Latency      <10ms per batch              100-300ms per batch (network)
Cost         Free                         ~$0.02 per 1M tokens
Offline      Yes                          No
Batch size   Limited by RAM               Limited by API (2048 texts)

For my git-book RAG project, I started with local embeddings during development and switched to API embeddings when I wanted better retrieval quality for production. The provider abstraction from Part 2 made this a config change.


Vector Similarity: How Search Actually Works

Once you have embeddings stored, you need to find the most similar ones to a query. There are three common distance metrics:

Cosine Similarity

Measures the angle between two vectors. Range: -1 (opposite) to 1 (identical).

Euclidean (L2) Distance

Measures straight-line distance between two points. Range: 0 (identical) to ∞.

Inner Product (Dot Product)

Measures alignment. For normalized vectors, equals cosine similarity.
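All three side by side in numpy; note that cosine ignores the magnitude difference that Euclidean distance picks up:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
inner = np.dot(a, b)

print(cosine)     # ~1.0: same direction, maximal similarity
print(euclidean)  # ~3.74: nonzero, because magnitudes differ
print(inner)      # 28.0
```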

Which One to Use?

I use cosine similarity (or equivalently, cosine distance) for almost everything. Here's why:

  • It's magnitude-independent: it only cares about direction, not vector length

  • Most embedding models are designed to be evaluated with cosine similarity

  • When vectors are normalized, cosine similarity equals dot product, so you get the speed benefit of dot product with the semantics of cosine

In pgvector, cosine distance is the operator <=>:
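A sketch of the query shape, as a SQL string (the chunks table and embedding column are my naming):

```python
# <=> is cosine DISTANCE (0 = identical, 2 = opposite), so we order
# ascending; 1 - distance recovers cosine similarity for display.
NEAREST_SQL = """
SELECT id, content, 1 - (embedding <=> :query_vec) AS similarity
FROM chunks
ORDER BY embedding <=> :query_vec
LIMIT 5;
"""
```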


Storing Vectors with pgvector

I use PostgreSQL with the pgvector extension for vector storage. I considered dedicated vector databases (Pinecone, Weaviate, Qdrant), but for my project scale, pgvector has significant advantages:

  • I already use PostgreSQL for application data, so no extra infrastructure

  • Full SQL capabilities alongside vector search

  • ACID transactions: I can insert documents and their embeddings atomically

  • Mature tooling: SQLAlchemy, Alembic, pg_dump all work

Setting Up pgvector
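pgvector ships as a separate extension, so after installing the package on the database server (the package name varies by platform) it has to be enabled once per database. A sketch of the statement, which I run from an Alembic migration:

```python
# Run once per database; requires the pgvector package installed
# on the PostgreSQL server itself.
ENABLE_PGVECTOR = "CREATE EXTENSION IF NOT EXISTS vector;"
print(ENABLE_PGVECTOR)
```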

Database Schema
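A sketch of the table shape (names are mine, and it assumes a parent documents table; the vector dimension must match your embedding model, 384 for all-MiniLM-L6-v2 or 1536 for text-embedding-3-small):

```python
# DDL as a string; in my projects this lives in an Alembic migration.
CREATE_CHUNKS = """
CREATE TABLE chunks (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    document_id BIGINT NOT NULL REFERENCES documents(id),
    content TEXT NOT NULL,
    embedding vector(384) NOT NULL
);
"""
```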

Async Database Engine


The Ingestion Pipeline

Before you can search, you need to embed and store your text. Here's the pipeline I built for my knowledge base:

Step 1: Load Documents
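My sources are markdown files on disk, so loading is just a recursive directory walk. A sketch (the .md glob matches my content; adjust for yours):

```python
from pathlib import Path

def load_documents(root: str) -> dict[str, str]:
    """Read every markdown file under root, keyed by relative path."""
    base = Path(root)
    return {
        str(path.relative_to(base)): path.read_text(encoding="utf-8")
        for path in sorted(base.rglob("*.md"))
    }
```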

Step 2: Chunk Text

Chunking is where most RAG systems go wrong. The goal is to split text into pieces small enough to embed meaningfully but large enough to preserve context.

I tried three chunking strategies on my content:

Strategy                   Recall@5   Notes
Fixed size (500 chars)     0.52       Splits mid-sentence, loses context
Recursive text splitting   0.64       Better boundaries but still arbitrary
Header-based splitting     0.78       Best for structured markdown content

Header-based splitting worked best because my git-book articles use consistent heading structure. Each section under an H2 or H3 is a self-contained unit of knowledge.
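A sketch of that splitter, assuming ATX-style ## / ### headings (my articles use them consistently):

```python
import re

def chunk_by_headers(markdown: str) -> list[str]:
    """Split markdown at H2/H3 headings; each section (heading plus
    body) becomes one self-contained chunk."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if re.match(r"^#{2,3} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```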

Step 3: Embed and Store


With documents ingested, searching is straightforward:

In my knowledge base, the top result is exactly the article about pgvector setup, even though the query and the document use slightly different phrasing.


Indexing for Performance

Without an index, pgvector does a sequential scan: it compares the query vector against every stored vector. This is fine for thousands of chunks but becomes slow at hundreds of thousands.

IVFFlat Index

IVFFlat partitions vectors into clusters (lists). At query time, it only searches a subset of clusters. The lists parameter controls the number of partitions; the pgvector docs recommend rows / 1000 for up to 1M rows and sqrt(rows) beyond that.
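As a sketch (index DDL as a string; chunks/embedding are my naming, and vector_cosine_ops matches cosine-distance searches):

```python
# Build AFTER bulk-loading data: IVFFlat picks its cluster centers
# from the rows present at index-creation time.
CREATE_IVFFLAT = """
CREATE INDEX ON chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
"""
```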

HNSW Index

HNSW (Hierarchical Navigable Small World) is a graph-based index. In my testing:

Index                 Build Time   Query Time   Recall@10   Memory
None (seq scan)       0s           45ms         1.00        Baseline
IVFFlat (100 lists)   2s           3ms          0.95        +15%
HNSW (m=16)           8s           1ms          0.99        +40%

For my knowledge base (~10k chunks), either index is fast enough. I use HNSW because the recall is better and the query time is consistently low.
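The index I ended up with, as a sketch (m = 16 as in the benchmark; ef_construction = 64 is pgvector's default):

```python
# Unlike IVFFlat, HNSW can be built on an empty table and stays
# accurate as rows are added; the cost is build time and memory.
CREATE_HNSW = """
CREATE INDEX ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
"""
```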


Practical Lessons

Things I learned building embedding-based search into my projects:

  1. Normalize your embeddings. Always set normalize_embeddings=True when using sentence-transformers. This makes cosine distance equivalent to dot product, which is faster and avoids bugs where magnitude affects results.

  2. Chunk size matters more than embedding model. Switching from 500-char fixed chunks to header-based chunks improved my Recall@5 by 50%. Switching from all-MiniLM-L6-v2 to text-embedding-3-small improved it by 10%. Invest in chunking first.

  3. Set a minimum similarity threshold. Without it, every query returns results β€” even completely irrelevant ones. I use 0.3 as a minimum. Below that, it's better to return "I don't have information about this" than to surface noise.

  4. Embed queries and documents with the same model. This sounds obvious, but I once had a bug where ingestion used all-MiniLM-L6-v2 and search used text-embedding-3-small. The similarity scores were meaningless because the vector spaces were different.

  5. pgvector is good enough. For sub-million vector collections, PostgreSQL with pgvector handles everything I need. I get SQL joins (fetching document metadata alongside chunks), ACID transactions, and familiar tooling. I'd consider a dedicated vector database if I needed billions of vectors or real-time index updates at scale.


Previous: Part 3 – How LLMs Work

Next: Part 5 – Prompt Engineering for Production Systems
