Part 7: Evaluating and Testing AI Systems

The Hardest Part of AI Engineering

Testing traditional software is straightforward: given input X, expect output Y. Testing AI systems is fundamentally different. When I ask my RAG service "What is pgvector?", there's no single correct answer. A good answer might mention PostgreSQL extensions, vector types, and similarity search. A different good answer might focus on installation and indexing. Both are correct, but they're different strings.

This was the most frustrating part of my AI engineering journey. I'd make a change to a prompt or switch embedding models and have no systematic way to know if the system got better or worse. I was relying on "it looks about right" β€” which is fine for a personal project but terrible engineering practice.

This article documents the evaluation approach I developed for my own projects. It's not perfect, but it's dramatically better than manual spot-checking.


Two Types of Testing

AI systems have both deterministic and non-deterministic components. I test them with different strategies:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      AI System                       β”‚
β”‚                                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  Deterministic       β”‚  β”‚  Non-Deterministic   β”‚  β”‚
β”‚  β”‚                      β”‚  β”‚                      β”‚  β”‚
β”‚  β”‚  β€’ Config loading    β”‚  β”‚  β€’ LLM responses     β”‚  β”‚
β”‚  β”‚  β€’ Input validation  β”‚  β”‚  β€’ Retrieval ranking β”‚  β”‚
β”‚  β”‚  β€’ Prompt building   β”‚  β”‚  β€’ Answer quality    β”‚  β”‚
β”‚  β”‚  β€’ Token counting    β”‚  β”‚  β€’ Relevance scores  β”‚  β”‚
β”‚  β”‚  β€’ Output parsing    β”‚  β”‚                      β”‚  β”‚
β”‚  β”‚                      β”‚  β”‚                      β”‚  β”‚
β”‚  β”‚  β†’ Unit tests        β”‚  β”‚  β†’ Evaluations       β”‚  β”‚
β”‚  β”‚    (pytest)          β”‚  β”‚    (eval framework)  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Unit Testing the Deterministic Parts

These tests are fast, free (no API calls), and run in CI:
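A sketch of what these tests look like. The helpers `build_prompt` and `count_tokens` are simplified stand-ins I wrote for illustration, not the real service code:

```python
# Hypothetical deterministic helpers from a RAG service, plus their tests.

def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the final prompt from retrieved chunks (illustrative)."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def count_tokens(text: str) -> int:
    """Crude whitespace count; a real service would use a tokenizer like tiktoken."""
    return len(text.split())

def test_prompt_includes_all_chunks():
    prompt = build_prompt("What is pgvector?", ["chunk A", "chunk B"])
    assert "chunk A" in prompt and "chunk B" in prompt

def test_prompt_numbers_chunks_in_order():
    prompt = build_prompt("q", ["first", "second"])
    assert prompt.index("[1] first") < prompt.index("[2] second")

def test_token_count():
    assert count_tokens("hello world") == 2
```

Every assertion here is exact: same input, same output, every run. That's what makes these tests cheap enough to run on every push.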


Building an Evaluation Dataset

The foundation of AI evaluation is a dataset of questions with expected answers. I built mine from my own usage of the RAG system:

How I Build Evaluation Datasets

I don't invent test cases. I collect them from real usage:

  1. Log every question. My RAG service logs every question it receives. After a week of usage, I had 50+ real questions.

  2. Write reference answers for a subset. I manually wrote reference answers for 20-30 diverse questions. This takes time but is essential.

  3. Categorize by difficulty. Some questions are straightforward retrieval ("How do I install X?"), some are conceptual ("What's the difference between X and Y?"), and some are out-of-scope. Each category reveals different failure modes.

  4. Include negative examples. Questions the system should refuse to answer are just as important as questions it should answer.
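The steps above could produce a dataset shaped like this sketch. The field names and schema are my own; the article doesn't prescribe a format:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str          # logged from real usage
    reference_answer: str  # hand-written for a subset
    category: str          # e.g. "retrieval", "conceptual", "out-of-scope"
    should_answer: bool    # False for negative examples

dataset = [
    EvalCase(
        question="How do I install pgvector?",
        reference_answer="Build the extension, then run CREATE EXTENSION vector;",
        category="retrieval",
        should_answer=True,
    ),
    EvalCase(
        question="What's the weather tomorrow?",
        reference_answer="",
        category="out-of-scope",
        should_answer=False,  # the system should refuse this one
    ),
]
```

Keeping the category and `should_answer` flag on every case is what lets the evaluation report break results down by failure mode later.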


Retrieval Evaluation

Before evaluating the full system, I evaluate retrieval separately. If retrieval returns irrelevant chunks, no prompt engineering will save the answer quality.

The core metric I track is recall@k: for each question, do the top-k retrieved chunks include the ones that actually contain the answer?
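As a sketch, recall@k can be computed like this. The `retrieve` function here is a toy keyword matcher standing in for the real vector search, so the example is self-contained and runnable:

```python
def retrieve(question: str, corpus: list[str], k: int = 3) -> list[str]:
    """Toy retriever: rank chunks by word overlap with the question.
    A real system would use embeddings + vector similarity search."""
    words = set(question.lower().split())
    scored = sorted(corpus, key=lambda c: -len(words & set(c.lower().split())))
    return scored[:k]

def recall_at_k(question: str, relevant: set[str], corpus: list[str], k: int = 3) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    hits = set(retrieve(question, corpus, k)) & relevant
    return len(hits) / len(relevant)

corpus = [
    "pgvector is a PostgreSQL extension for vector similarity search",
    "FastAPI is a Python web framework",
    "install pgvector with CREATE EXTENSION vector",
]
recall_at_k("install pgvector", {corpus[2]}, corpus, k=2)  # 1.0: answer chunk is in top 2
```

Averaging this over the whole dataset, per category, gives a single number to watch across changes.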

When recall drops, I investigate: is it a chunking problem (relevant content split across chunks)? An embedding quality problem? Or a question that's genuinely hard to match?


LLM-as-Judge Evaluation

For evaluating answer quality, I use a technique called LLM-as-judge: I ask a second LLM to evaluate the output of the first.
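The pattern, sketched minimally: build a judging prompt from the question, reference answer, and system answer, then parse a structured verdict out of the judge's reply. The prompt wording and JSON schema below are my own choices, not a standard:

```python
import json

JUDGE_PROMPT = """You are grading a RAG system's answer.

Question: {question}
Reference answer: {reference}
System answer: {answer}

Reply with JSON only: {{"score": <1-5>, "reason": "<one sentence>"}}"""

def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    return JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)

def parse_verdict(raw: str) -> dict:
    """Judges sometimes wrap JSON in prose; extract the first {...} block."""
    start, end = raw.find("{"), raw.rfind("}")
    verdict = json.loads(raw[start:end + 1])
    if not 1 <= verdict["score"] <= 5:
        raise ValueError(f"score out of range: {verdict['score']}")
    return verdict
```

In practice the prompt from `build_judge_prompt` goes to a second model and `parse_verdict` handles its reply. Using a different model as judge than the one that generated the answer helps avoid the judge favoring its own style.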

Full Evaluation Pipeline

The pipeline ties the pieces together: each dataset question goes through retrieval and generation, the judge scores the answer against the reference, and scores are aggregated per category.
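A runnable sketch of that loop, with the expensive calls stubbed out. In the real system, `answer_question` would call retrieval plus the LLM, and `judge` would call a second model; both stubs here are assumptions for illustration:

```python
from statistics import mean

def answer_question(question: str) -> str:
    return f"stub answer to: {question}"  # stand-in for retrieval + generation

def judge(question: str, reference: str, answer: str) -> int:
    """Stand-in judge: real version would call an LLM and parse its verdict."""
    return 5 if reference.lower() in answer.lower() else 2

def run_eval(dataset: list[dict]) -> dict:
    """Score every case and aggregate mean judge score per category."""
    by_category: dict[str, list[int]] = {}
    for case in dataset:
        answer = answer_question(case["question"])
        score = judge(case["question"], case["reference"], answer)
        by_category.setdefault(case["category"], []).append(score)
    return {cat: mean(scores) for cat, scores in by_category.items()}
```

The per-category breakdown matters: an aggregate score can stay flat while one category quietly regresses.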


Regression Testing

The most valuable use of evaluation is catching regressions. When I change a prompt, swap an embedding model, or update chunking logic, I run the evaluation suite and compare:

My workflow:

  1. Run evaluation before making changes β†’ save as baseline

  2. Make the change (new prompt, different model, chunking update)

  3. Run evaluation again β†’ save as current

  4. Compare: if regressions > improvements, reconsider the change
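The comparison in step 4 can be sketched as follows. I'm assuming per-question judge scores saved as simple dicts; the file format is an illustration, not prescribed by anything above:

```python
def compare(baseline: dict[str, int], current: dict[str, int]) -> dict:
    """Classify each baseline question as regression or improvement.
    Questions missing from `current` are treated as unchanged."""
    regressions = [q for q in baseline if current.get(q, baseline[q]) < baseline[q]]
    improvements = [q for q in baseline if current.get(q, baseline[q]) > baseline[q]]
    return {"regressions": regressions, "improvements": improvements}

baseline = {"q1": 5, "q2": 3, "q3": 4}
current = {"q1": 5, "q2": 5, "q3": 2}
compare(baseline, current)  # q2 improved, q3 regressed
```

Listing the specific questions, not just counts, is the useful part: a regression report that names "q3" tells you exactly which prompt-and-answer pair to inspect.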


Testing in CI

Not all evaluation needs LLM calls. I run the deterministic tests in CI on every push, and the full evaluation suite nightly:
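One way to express that split, sketched as a small helper. The directory names and the `RUN_EVALS` variable are illustrative choices of mine, not from any standard:

```python
import os

def suites_to_run(env: dict) -> list[str]:
    """Pick test suites based on CI context (names are illustrative)."""
    suites = ["tests/unit"]           # deterministic, free: every push
    if env.get("RUN_EVALS") == "1":   # set only by the nightly job
        suites.append("tests/evals")  # LLM-backed: slow and costs money
    return suites

suites_to_run(dict(os.environ))
```

With pytest, the same effect is commonly achieved by marking the evaluation tests (e.g. a custom marker) and deselecting them with `-m` on regular pushes.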


Key Takeaways

  1. Separate deterministic and non-deterministic tests. Unit test everything that doesn't touch an LLM. Evaluate everything that does.

  2. Build evaluation datasets from real usage. Don't invent test cases β€” collect them from actual questions your system receives.

  3. Evaluate retrieval separately from generation. If retrieval is broken, fixing prompts won't help. Isolate the problem.

  4. LLM-as-judge is practical and effective. It's not perfect, but it's vastly better than manual review for regression detection.

  5. Run evaluations before and after every change. Prompt changes, model swaps, and chunking updates can all cause regressions. Measure, don't guess.


Previous: Part 6 β€” Building AI-Powered APIs with FastAPI

Next: Part 8 β€” AI Engineering in Production
