# Part 4: Embeddings and Vector Search

## The Concept That Changed How I Build Everything

Before I understood embeddings, my search systems were all keyword-based. If a user asked "how to deploy containers" and my documentation said "pushing Docker images to a registry," the system returned nothing. Same concept, different words, zero results.

Embeddings changed that. An embedding converts text into a high-dimensional vector — a list of numbers — that captures the semantic meaning of the text. Similar meanings produce similar vectors, regardless of the exact words used. When I integrated embedding-based search into my personal knowledge base, the quality of retrieval improved dramatically.

This article covers embeddings from the AI engineer's perspective: what they are, how to generate them, how to store and search them, and the practical decisions I made building my own systems.

***

## What Embeddings Actually Represent

An embedding is a list of floating-point numbers (a vector) that represents the meaning of a piece of text in a geometric space. Texts with similar meanings have vectors that are close together.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "How to deploy a Docker container",
    "Pushing images to a container registry",
    "Making pasta from scratch",
]

embeddings = model.encode(texts)

print(f"Shape: {embeddings.shape}")    # (3, 384)
print(f"First 5 dims: {embeddings[0][:5]}")  # [-0.034, 0.089, ...]
```

Each text becomes a 384-dimensional vector (with `all-MiniLM-L6-v2`). You can't read individual dimensions — they don't correspond to human-understandable features. But you can measure distances between vectors:

```python
from sentence_transformers.util import cos_sim

# Docker deployment vs container registry — semantically related
similarity_related = cos_sim(embeddings[0], embeddings[1])
print(f"Docker ↔ Registry: {similarity_related.item():.3f}")  # ~0.65

# Docker deployment vs pasta — semantically unrelated
similarity_unrelated = cos_sim(embeddings[0], embeddings[2])
print(f"Docker ↔ Pasta: {similarity_unrelated.item():.3f}")    # ~0.05
```

The Docker and container registry texts have high similarity (\~0.65) because they describe related concepts. The Docker and pasta texts have near-zero similarity because they're completely unrelated. This is the core mechanism behind semantic search.

***

## Generating Embeddings

I've used two approaches in my projects: local models and API providers.

### Local Embeddings with sentence-transformers

This is what I use for development and for projects where I want zero API dependency:

```python
# src/ai_engineer/embeddings/local.py
from sentence_transformers import SentenceTransformer


class LocalEmbeddingProvider:
    """Generate embeddings locally with sentence-transformers."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2") -> None:
        self._model = SentenceTransformer(model_name)
        self._dimension = self._model.get_sentence_embedding_dimension()

    @property
    def dimension(self) -> int:
        return self._dimension

    async def embed(self, texts: list[str]) -> list[list[float]]:
        """Generate embeddings for a list of texts.

        Note: sentence-transformers is synchronous, so we're running it
        in the calling thread. For production with high concurrency,
        use run_in_executor.
        """
        embeddings = self._model.encode(
            texts,
            normalize_embeddings=True,  # Unit vectors for cosine similarity
            show_progress_bar=False,
        )
        return embeddings.tolist()
```

Key decisions I made here:

* **`normalize_embeddings=True`**: This normalizes vectors to unit length. When vectors are normalized, cosine similarity equals dot product, which is faster to compute. Always normalize.
* **`all-MiniLM-L6-v2`**: A 22M parameter model that produces 384-dimensional embeddings. It's small enough to run on a laptop CPU in milliseconds, and the quality is good enough for most retrieval tasks I've needed.
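
The normalization claim is easy to check directly: for unit vectors, the full cosine formula collapses to a plain dot product. A toy numpy sketch, no model required:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)
b = rng.normal(size=384)

# Full cosine formula on the raw vectors
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product of the normalized (unit-length) vectors
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot = np.dot(a_unit, b_unit)

print(np.isclose(cosine, dot))  # True
```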

### API-Based Embeddings

For production systems where I need higher-quality embeddings:

```python
# src/ai_engineer/embeddings/api.py
import httpx
from ai_engineer.config import settings


class APIEmbeddingProvider:
    """Generate embeddings via API (GitHub Models / OpenAI)."""

    def __init__(self) -> None:
        self._client = httpx.AsyncClient(
            base_url="https://models.inference.ai.azure.com",
            headers={"Authorization": f"Bearer {settings.llm_api_key}"},
            timeout=30.0,
        )
        self._model = "text-embedding-3-small"
        self._dimension = 1536

    @property
    def dimension(self) -> int:
        return self._dimension

    async def embed(self, texts: list[str]) -> list[list[float]]:
        """Generate embeddings via API."""
        response = await self._client.post(
            "/embeddings",
            json={
                "model": self._model,
                "input": texts,
                "encoding_format": "float",
            },
        )
        response.raise_for_status()
        data = response.json()

        # API returns embeddings in order, but let's be safe
        sorted_data = sorted(data["data"], key=lambda x: x["index"])
        return [item["embedding"] for item in sorted_data]

    async def aclose(self) -> None:
        """Close the underlying HTTP client on shutdown."""
        await self._client.aclose()
```

### Choosing Between Local and API Embeddings

| Factor     | Local (`all-MiniLM-L6-v2`) | API (`text-embedding-3-small`) |
| ---------- | -------------------------- | ------------------------------ |
| Dimension  | 384                        | 1536                           |
| Quality    | Good for general retrieval | Better for nuanced similarity  |
| Latency    | <10ms per batch            | 100-300ms per batch (network)  |
| Cost       | Free                       | \~$0.02 per 1M tokens          |
| Offline    | Yes                        | No                             |
| Batch size | Limited by RAM             | Limited by API (2048 texts)    |

For my git-book RAG project, I started with local embeddings during development and switched to API embeddings when I wanted better retrieval quality for production. The provider abstraction from Part 2 made this a config change.
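
The abstraction isn't reproduced here, but the switch amounts to a small factory keyed on a config value. A sketch with illustrative names (the real Part 2 interface may differ):

```python
# Sketch only: the provider classes exist above, but this factory and
# the "local"/"api" config values are illustrative
def get_embedding_provider(provider_name: str):
    """Return the configured embedding provider."""
    if provider_name == "local":
        from ai_engineer.embeddings.local import LocalEmbeddingProvider
        return LocalEmbeddingProvider()
    if provider_name == "api":
        from ai_engineer.embeddings.api import APIEmbeddingProvider
        return APIEmbeddingProvider()
    raise ValueError(f"Unknown embedding provider: {provider_name!r}")
```

Because both providers expose the same `embed()` and `dimension` interface, ingestion and search code never need to know which one is active.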

***

## Vector Similarity — How Search Actually Works

Once you have embeddings stored, you need to find the ones most similar to a query. Three ways of comparing vectors are common:

### Cosine Similarity

Measures the angle between two vectors. Range: -1 (opposite) to 1 (identical).

```python
import numpy as np


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a_arr = np.array(a)
    b_arr = np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))


# When vectors are normalized (unit length), this simplifies to:
def cosine_similarity_normalized(a: list[float], b: list[float]) -> float:
    """Cosine similarity for normalized vectors = dot product."""
    return float(np.dot(a, b))
```

### Euclidean (L2) Distance

Measures straight-line distance between two points. Range: 0 (identical) to ∞.

```python
def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Compute L2 distance between two vectors."""
    a_arr = np.array(a)
    b_arr = np.array(b)
    return float(np.linalg.norm(a_arr - b_arr))
```

### Inner Product (Dot Product)

Measures alignment. For normalized vectors, equals cosine similarity.

```python
def inner_product(a: list[float], b: list[float]) -> float:
    """Compute inner product between two vectors."""
    return float(np.dot(a, b))
```

### Which One to Use?

I use **cosine similarity** (or its complement, cosine distance, which produces the same ranking) for almost everything. Here's why:

* It's magnitude-independent — it only cares about direction, not vector length
* Most embedding models are designed to be evaluated with cosine similarity
* When vectors are normalized, cosine similarity equals dot product, so you get the speed benefit of dot product with the semantics of cosine
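
With normalized embeddings, in-memory search is nothing more than a matrix-vector product followed by a sort. This toy version is a useful mental model (and perfectly adequate for collections of a few thousand vectors):

```python
import numpy as np


def top_k(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Brute-force cosine search; assumes `query` and the rows of `corpus` are unit vectors."""
    scores = corpus @ query            # One dot product per stored vector
    order = np.argsort(-scores)[:k]    # Indices of the k highest scores
    return [(int(i), float(scores[i])) for i in order]
```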

In pgvector, cosine distance is the operator `<=>`:

```sql
-- Find the 5 most similar chunks to a query embedding
SELECT content, 1 - (embedding <=> $1) AS similarity
FROM chunks
ORDER BY embedding <=> $1
LIMIT 5;
```

***

## Storing Vectors with pgvector

I use PostgreSQL with the pgvector extension for vector storage. I considered dedicated vector databases (Pinecone, Weaviate, Qdrant), but for my project scale, pgvector has significant advantages:

* I already use PostgreSQL for application data — no extra infrastructure
* Full SQL capabilities alongside vector search
* ACID transactions — I can insert documents and their embeddings atomically
* Mature tooling — SQLAlchemy, Alembic, pg\_dump all work

### Setting Up pgvector

```bash
# docker-compose.yml already uses pgvector/pgvector:pg16
docker compose up -d

# Connect and enable the extension
docker compose exec postgres psql -U postgres -d ai_engineer -c "CREATE EXTENSION IF NOT EXISTS vector;"
```

### Database Schema

```python
# src/ai_engineer/db/models.py
from sqlalchemy import Column, Integer, String, Text, DateTime, ForeignKey, func
from sqlalchemy.orm import DeclarativeBase
from pgvector.sqlalchemy import Vector

from ai_engineer.config import settings


class Base(DeclarativeBase):
    pass


class Document(Base):
    """A source document (e.g., a markdown file)."""

    __tablename__ = "documents"

    id = Column(Integer, primary_key=True)
    title = Column(String(500), nullable=False)
    source_path = Column(String(1000), nullable=False, unique=True)
    created_at = Column(DateTime(timezone=True), server_default=func.now())


class Chunk(Base):
    """A chunk of text from a document, with its embedding."""

    __tablename__ = "chunks"

    id = Column(Integer, primary_key=True)
    document_id = Column(Integer, ForeignKey("documents.id"), nullable=False, index=True)
    content = Column(Text, nullable=False)
    chunk_index = Column(Integer, nullable=False)
    embedding = Column(Vector(settings.embedding_dimension), nullable=False)
    created_at = Column(DateTime(timezone=True), server_default=func.now())

### Async Database Engine

```python
# src/ai_engineer/db/engine.py
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker, AsyncSession

from ai_engineer.config import settings
from ai_engineer.db.models import Base

engine = create_async_engine(settings.database_url, echo=settings.debug)
async_session = async_sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)


async def init_db() -> None:
    """Create tables if they don't exist."""
    async with engine.begin() as conn:
        await conn.run_sync(Base.metadata.create_all)


async def close_db() -> None:
    """Dispose of the engine."""
    await engine.dispose()
```

***

## The Ingestion Pipeline

Before you can search, you need to embed and store your text. Here's the pipeline I built for my knowledge base:

### Step 1: Load Documents

```python
# src/ai_engineer/ingestion/loader.py
from pathlib import Path


def load_markdown_files(directory: str) -> list[dict[str, str]]:
    """Load all markdown files from a directory."""
    documents = []
    base = Path(directory)

    for md_file in sorted(base.rglob("*.md")):
        content = md_file.read_text(encoding="utf-8")
        if len(content.strip()) < 50:
            continue  # Skip near-empty files

        documents.append({
            "title": md_file.stem.replace("-", " ").title(),
            "source_path": str(md_file.relative_to(base)),
            "content": content,
        })

    return documents
```

### Step 2: Chunk Text

Chunking is where most RAG systems go wrong. The goal is to split text into pieces small enough to embed meaningfully but large enough to preserve context.

```python
# src/ai_engineer/ingestion/chunker.py


def chunk_by_headers(content: str, max_chunk_size: int = 1000) -> list[str]:
    """Split markdown content by headers, respecting size limits.

    This is the strategy that worked best for my git-book content.
    Headers create natural semantic boundaries.
    """
    lines = content.split("\n")
    chunks: list[str] = []
    current_chunk: list[str] = []
    current_size = 0

    for line in lines:
        # Check if this line is a header (## or ###)
        is_header = line.startswith("## ") or line.startswith("### ")

        if is_header and current_size > 200:
            # Save current chunk and start a new one
            chunk_text = "\n".join(current_chunk).strip()
            if chunk_text:
                chunks.append(chunk_text)
            current_chunk = [line]
            current_size = len(line)
        else:
            current_chunk.append(line)
            current_size += len(line) + 1

        # Force split if chunk is too large
        if current_size > max_chunk_size:
            chunk_text = "\n".join(current_chunk).strip()
            if chunk_text:
                chunks.append(chunk_text)
            current_chunk = []
            current_size = 0

    # Don't forget the last chunk
    if current_chunk:
        chunk_text = "\n".join(current_chunk).strip()
        if chunk_text:
            chunks.append(chunk_text)

    return chunks
```

I tried three chunking strategies on my content:

| Strategy                 | Recall\@5 | Notes                                 |
| ------------------------ | --------- | ------------------------------------- |
| Fixed size (500 chars)   | 0.52      | Splits mid-sentence, loses context    |
| Recursive text splitting | 0.64      | Better boundaries but still arbitrary |
| Header-based splitting   | 0.78      | Best for structured markdown content  |

Header-based splitting worked best because my git-book articles use consistent heading structure. Each section under an H2 or H3 is a self-contained unit of knowledge.
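
For reference, the metric behind those numbers, Recall@5, is simple to compute once you have a handful of queries labeled with their relevant chunk IDs (the function name here is mine):

```python
def recall_at_k(
    retrieved_ids: list[list[int]],
    relevant_ids: list[set[int]],
    k: int = 5,
) -> float:
    """Fraction of queries for which at least one relevant chunk appears in the top-k."""
    hits = sum(
        1
        for ids, relevant in zip(retrieved_ids, relevant_ids)
        if relevant & set(ids[:k])
    )
    return hits / len(retrieved_ids)
```

Even a few dozen labeled queries are enough to compare chunking strategies against each other.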

### Step 3: Embed and Store

```python
# src/ai_engineer/ingestion/pipeline.py
from ai_engineer.db.engine import async_session
from ai_engineer.db.models import Document, Chunk
from ai_engineer.embeddings.local import LocalEmbeddingProvider
from ai_engineer.ingestion.loader import load_markdown_files
from ai_engineer.ingestion.chunker import chunk_by_headers


async def ingest_directory(directory: str) -> dict[str, int]:
    """Load, chunk, embed, and store all markdown files."""
    embedder = LocalEmbeddingProvider()
    documents = load_markdown_files(directory)
    total_chunks = 0

    async with async_session() as session:
        for doc in documents:
            # Create document record
            document = Document(
                title=doc["title"],
                source_path=doc["source_path"],
            )
            session.add(document)
            await session.flush()  # Get the ID

            # Chunk the content
            chunks = chunk_by_headers(doc["content"])

            # Embed all chunks in one batch
            embeddings = await embedder.embed(chunks)

            # Store chunks with embeddings
            for i, (chunk_text, embedding) in enumerate(zip(chunks, embeddings)):
                chunk = Chunk(
                    document_id=document.id,
                    content=chunk_text,
                    chunk_index=i,
                    embedding=embedding,
                )
                session.add(chunk)

            total_chunks += len(chunks)

        await session.commit()

    return {"documents": len(documents), "chunks": total_chunks}
```

***

## Semantic Search

With documents ingested, searching is straightforward:

```python
# src/ai_engineer/retrieval/search.py
from sqlalchemy import text

from ai_engineer.db.engine import async_session
from ai_engineer.embeddings.local import LocalEmbeddingProvider


async def semantic_search(
    query: str,
    top_k: int = 5,
    min_similarity: float = 0.3,
) -> list[dict]:
    """Search for chunks similar to the query."""
    embedder = LocalEmbeddingProvider()

    # Embed the query
    query_embedding = (await embedder.embed([query]))[0]

    # Search pgvector
    async with async_session() as session:
        result = await session.execute(
            text("""
                SELECT
                    c.content,
                    c.chunk_index,
                    d.title,
                    d.source_path,
                    1 - (c.embedding <=> :embedding) AS similarity
                FROM chunks c
                JOIN documents d ON d.id = c.document_id
                WHERE 1 - (c.embedding <=> :embedding) > :min_similarity
                ORDER BY c.embedding <=> :embedding
                LIMIT :top_k
            """),
            {
                "embedding": str(query_embedding),
                "top_k": top_k,
                "min_similarity": min_similarity,
            },
        )

        return [
            {
                "content": row.content,
                "title": row.title,
                "source_path": row.source_path,
                "similarity": round(row.similarity, 4),
            }
            for row in result.fetchall()
        ]
```

### Testing the Search

```python
# scripts/test_search.py
import asyncio
from ai_engineer.retrieval.search import semantic_search


async def main():
    results = await semantic_search("how to set up pgvector with PostgreSQL")
    for r in results:
        print(f"[{r['similarity']:.3f}] {r['title']}")
        print(f"  {r['content'][:100]}...")
        print()


asyncio.run(main())
```

Output from my knowledge base:

```
[0.782] Rag 101 Pgvector Setup
  ## Setting Up pgvector with PostgreSQL 16

  I chose PostgreSQL with pgvector over dedicated vector...

[0.714] Database 101 Indexes And Performance
  ## Understanding Index Types

  PostgreSQL supports several index types. For vector...

[0.651] Vector Database 101 Part 3
  ## Installing pgvector

  pgvector is a PostgreSQL extension that adds vector...
```

The top result is exactly the article about pgvector setup, even though the query and the document use slightly different phrasing.

***

## Indexing for Performance

Without an index, pgvector does a sequential scan — it compares the query vector against every stored vector. This is fine for thousands of chunks but becomes slow at hundreds of thousands.

### IVFFlat Index

```sql
-- Create an IVFFlat index for approximate nearest neighbor search
CREATE INDEX ON chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```

IVFFlat partitions vectors into clusters (lists). At query time, it only searches a subset of clusters. The `lists` parameter controls the number of partitions — the pgvector docs recommend `rows / 1000` for up to 1M rows and `sqrt(rows)` beyond that.
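
The sizing guidance from the pgvector README (`rows / 1000` up to 1M rows, `sqrt(rows)` beyond) fits in a small helper (the function name is mine):

```python
import math


def ivfflat_lists(row_count: int) -> int:
    """Lists heuristic from the pgvector docs: rows/1000 up to 1M rows, sqrt(rows) beyond."""
    if row_count <= 1_000_000:
        return max(row_count // 1000, 1)
    return round(math.sqrt(row_count))
```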

### HNSW Index

```sql
-- Create an HNSW index (better recall, more memory)
CREATE INDEX ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```

HNSW (Hierarchical Navigable Small World) is a graph-based index. In my testing:

| Index               | Build Time | Query Time | Recall\@10 | Memory   |
| ------------------- | ---------- | ---------- | ---------- | -------- |
| None (seq scan)     | 0s         | 45ms       | 1.00       | Baseline |
| IVFFlat (100 lists) | 2s         | 3ms        | 0.95       | +15%     |
| HNSW (m=16)         | 8s         | 1ms        | 0.99       | +40%     |

For my knowledge base (\~10k chunks), either index is fast enough. I use HNSW because the recall is better and the query time is consistently low.

***

## Practical Lessons

Things I learned building embedding-based search into my projects:

1. **Normalize your embeddings.** Always set `normalize_embeddings=True` when using sentence-transformers. This makes cosine distance equivalent to dot product, which is faster and avoids bugs where magnitude affects results.
2. **Chunk size matters more than embedding model.** Switching from 500-char fixed chunks to header-based chunks improved my Recall\@5 by 50%. Switching from `all-MiniLM-L6-v2` to `text-embedding-3-small` improved it by 10%. Invest in chunking first.
3. **Set a minimum similarity threshold.** Without it, every query returns results — even completely irrelevant ones. I use 0.3 as a minimum. Below that, it's better to return "I don't have information about this" than to surface noise.
4. **Embed queries and documents with the same model.** This sounds obvious, but I once had a bug where ingestion used `all-MiniLM-L6-v2` and search used `text-embedding-3-small`. The similarity scores were meaningless because the vector spaces were different.
5. **pgvector is good enough.** For sub-million vector collections, PostgreSQL with pgvector handles everything I need. I get SQL joins (fetching document metadata alongside chunks), ACID transactions, and familiar tooling. I'd consider a dedicated vector database if I needed billions of vectors or real-time index updates at scale.
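
Lesson 4 is cheap to enforce mechanically: record which model built the index and refuse mismatched queries. A sketch (the class and field names are mine, not from the project):

```python
class EmbeddingSpace:
    """Records which model produced an index so mismatched queries fail loudly."""

    def __init__(self, model_name: str, dimension: int) -> None:
        self.model_name = model_name
        self.dimension = dimension

    def validate_query(self, model_name: str, vector: list[float]) -> None:
        """Raise if the query was embedded with a different model or dimension."""
        if model_name != self.model_name:
            raise ValueError(
                f"Query embedded with {model_name!r}, but the index "
                f"was built with {self.model_name!r}"
            )
        if len(vector) != self.dimension:
            raise ValueError(
                f"Expected a {self.dimension}-dimensional vector, got {len(vector)}"
            )
```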

***

**Previous:** [**Part 3 — How LLMs Work**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-engineer-101/part-3-how-llms-work)

**Next:** [**Part 5 — Prompt Engineering for Production Systems**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-engineer-101/part-5-prompt-engineering)
