# Part 7: Evaluating and Testing AI Systems

## The Hardest Part of AI Engineering

Testing traditional software is straightforward: given input X, expect output Y. Testing AI systems is fundamentally different. When I ask my RAG service "What is pgvector?", there's no single correct answer. A good answer might mention PostgreSQL extensions, vector types, and similarity search. A different good answer might focus on installation and indexing. Both are correct, but they're different strings.

This was the most frustrating part of my AI engineering journey. I'd make a change to a prompt or switch embedding models and have no systematic way to know if the system got better or worse. I was relying on "it looks about right" — which is fine for a personal project but terrible engineering practice.

This article documents the evaluation approach I developed for my own projects. It's not perfect, but it's dramatically better than manual spot-checking.

***

## Two Types of Testing

AI systems have both deterministic and non-deterministic components. I test them with different strategies:

```
┌──────────────────────────────────────────────────┐
│                    AI System                     │
│                                                  │
│  ┌────────────────────┐  ┌────────────────────┐  │
│  │   Deterministic    │  │ Non-Deterministic  │  │
│  │                    │  │                    │  │
│  │ • Config loading   │  │ • LLM responses    │  │
│  │ • Input validation │  │ • Retrieval ranking│  │
│  │ • Prompt building  │  │ • Answer quality   │  │
│  │ • Token counting   │  │ • Relevance scores │  │
│  │ • Output parsing   │  │                    │  │
│  │                    │  │                    │  │
│  │  → Unit tests      │  │  → Evaluations     │  │
│  │  (pytest)          │  │  (eval framework)  │  │
│  └────────────────────┘  └────────────────────┘  │
└──────────────────────────────────────────────────┘
```

***

## Unit Testing the Deterministic Parts

These tests are fast, free (no API calls), and run in CI:

```python
# tests/test_chunker.py
from ai_engineer.ingestion.chunker import chunk_by_headers


def test_chunk_by_headers_basic():
    content = """# Title

Introduction paragraph.

## Section One

Content for section one.

## Section Two

Content for section two.
"""
    chunks = chunk_by_headers(content)
    assert len(chunks) >= 2
    assert "Section One" in chunks[0] or "Section One" in chunks[1]


def test_chunk_by_headers_respects_max_size():
    # Create content with a very long section
    long_section = "word " * 500
    content = f"## Header\n\n{long_section}"
    chunks = chunk_by_headers(content, max_chunk_size=500)
    assert all(len(c) <= 600 for c in chunks)  # Allow slight overflow at boundaries


def test_chunk_by_headers_empty_content():
    chunks = chunk_by_headers("")
    assert chunks == []
```

```python
# tests/test_tokens.py
from ai_engineer.llm.tokens import count_tokens, truncate_to_tokens


def test_count_tokens_simple():
    count = count_tokens("Hello, world!")
    assert 2 <= count <= 5  # Exact count depends on tokenizer version


def test_truncate_preserves_short_text():
    text = "Short text"
    result = truncate_to_tokens(text, max_tokens=100)
    assert result == text


def test_truncate_cuts_long_text():
    text = "word " * 1000  # ~1000 tokens
    result = truncate_to_tokens(text, max_tokens=50)
    assert count_tokens(result) <= 50
```

```python
# tests/test_prompt_builder.py
from ai_engineer.prompts.templates import PromptBuilder, RAGPromptInput


def test_prompt_includes_context():
    input_data = RAGPromptInput(
        question="What is pgvector?",
        context_chunks=[
            {"title": "pgvector Guide", "content": "pgvector is a PostgreSQL extension for vectors."}
        ],
    )
    messages = PromptBuilder.build_rag_prompt(input_data)
    user_message = messages[-1]["content"]

    assert "pgvector" in user_message
    assert "What is pgvector?" in user_message
    assert "Source 1" in user_message


def test_prompt_respects_token_budget():
    # Create chunks that exceed the token budget
    big_chunks = [
        {"title": f"Doc {i}", "content": "word " * 2000}
        for i in range(10)
    ]
    input_data = RAGPromptInput(
        question="Test question",
        context_chunks=big_chunks,
        max_context_tokens=1000,
    )
    messages = PromptBuilder.build_rag_prompt(input_data)
    user_message = messages[-1]["content"]

    # Should not include all 10 chunks
    assert user_message.count("[Source") < 10
```

***

## Building an Evaluation Dataset

The foundation of AI evaluation is a dataset of questions with expected answers. I built mine from my own usage of the RAG system:

```python
# eval/dataset.py
from pydantic import BaseModel, Field


class EvalExample(BaseModel):
    """A single evaluation example."""

    question: str
    expected_answer: str = Field(
        ..., description="A reference answer — doesn't need to be exact"
    )
    expected_sources: list[str] = Field(
        default_factory=list,
        description="Expected source document titles",
    )
    category: str = Field(
        default="general",
        description="Category for grouped metrics: general, technical, conceptual",
    )


class EvalDataset(BaseModel):
    """A collection of evaluation examples."""

    name: str
    examples: list[EvalExample]


# My evaluation dataset — built from questions I actually asked
RAG_EVAL_DATASET = EvalDataset(
    name="rag-eval-v1",
    examples=[
        EvalExample(
            question="How do I set up pgvector with PostgreSQL?",
            expected_answer="Install the pgvector extension on PostgreSQL 16 using the pgvector/pgvector Docker image. Enable it with CREATE EXTENSION vector. Create tables with vector columns specifying the dimension.",
            expected_sources=["Rag 101 Pgvector Setup"],
            category="technical",
        ),
        EvalExample(
            question="What is the difference between cosine and L2 distance?",
            expected_answer="Cosine distance measures the angle between vectors and is magnitude-independent. L2 (Euclidean) distance measures straight-line distance. For normalized vectors, cosine distance and dot product are equivalent.",
            expected_sources=["Vector Database 101"],
            category="conceptual",
        ),
        EvalExample(
            question="How do I deploy a FastAPI service to production?",
            expected_answer="Use uvicorn with gunicorn as the process manager. Deploy with Docker. Set up health checks, structured logging, and configure CORS.",
            expected_sources=["Rag 101 Fastapi Service"],
            category="technical",
        ),
        EvalExample(
            question="What is the capital of France?",
            expected_answer="I don't have information about this in my knowledge base.",
            expected_sources=[],
            category="out-of-scope",
        ),
        EvalExample(
            question="What chunking strategy works best for markdown?",
            expected_answer="Header-based chunking that splits on H2/H3 boundaries works best for structured markdown. It preserves semantic units better than fixed-size chunking.",
            expected_sources=["Rag 101 Chunking"],
            category="technical",
        ),
    ],
)
```

### How I Build Evaluation Datasets

I don't invent test cases. I collect them from real usage:

1. **Log every question.** My RAG service logs every question it receives. After a week of usage, I had 50+ real questions.
2. **Write reference answers for a subset.** I manually wrote reference answers for 20-30 diverse questions. This takes time but is essential.
3. **Categorize by difficulty.** Some questions are straightforward retrieval ("How do I install X?"), some are conceptual ("What's the difference between X and Y?"), and some are out-of-scope. Each category reveals different failure modes.
4. **Include negative examples.** Questions the system should refuse to answer are just as important as questions it should answer.

***

## Retrieval Evaluation

Before evaluating the full system, I evaluate retrieval separately. If retrieval returns irrelevant chunks, no amount of prompt engineering will salvage the answers.

```python
# eval/retrieval_eval.py
from ai_engineer.retrieval.search import semantic_search
from eval.dataset import EvalDataset


async def evaluate_retrieval(
    dataset: EvalDataset,
    top_k: int = 5,
) -> dict:
    """Evaluate retrieval quality: are the right sources being found?"""
    results = {
        "total": len(dataset.examples),
        "hits": 0,
        "misses": 0,
        "recall_at_k": 0.0,
        "details": [],
    }

    for example in dataset.examples:
        if not example.expected_sources:
            continue  # Skip out-of-scope examples

        # Run retrieval
        chunks = await semantic_search(
            query=example.question,
            top_k=top_k,
        )

        # Check if expected sources appear in results
        retrieved_titles = {c["title"] for c in chunks}
        expected_set = set(example.expected_sources)

        found = expected_set & retrieved_titles
        hit = len(found) > 0

        if hit:
            results["hits"] += 1
        else:
            results["misses"] += 1

        results["details"].append({
            "question": example.question,
            "expected": list(expected_set),
            "retrieved": list(retrieved_titles),
            "hit": hit,
            "top_similarity": chunks[0]["similarity"] if chunks else 0,
        })

    evaluated = results["hits"] + results["misses"]
    if evaluated > 0:
        results["recall_at_k"] = results["hits"] / evaluated

    return results
```

Running this on my dataset:

```python
# scripts/run_retrieval_eval.py
import asyncio
from eval.retrieval_eval import evaluate_retrieval
from eval.dataset import RAG_EVAL_DATASET


async def main():
    results = await evaluate_retrieval(RAG_EVAL_DATASET, top_k=5)
    print(f"Recall@5: {results['recall_at_k']:.2%}")
    print(f"Hits: {results['hits']}, Misses: {results['misses']}")
    for detail in results["details"]:
        status = "✓" if detail["hit"] else "✗"
        print(f"  {status} {detail['question'][:60]}...")
        if not detail["hit"]:
            print(f"    Expected: {detail['expected']}")
            print(f"    Retrieved: {detail['retrieved']}")


asyncio.run(main())
```

Output:

```
Recall@5: 85.00%
Hits: 17, Misses: 3
  ✓ How do I set up pgvector with PostgreSQL?...
  ✓ What is the difference between cosine and L2 distance?...
  ✗ How do I configure Alembic migrations?...
    Expected: ['Rag 101 Pgvector Setup']
    Retrieved: ['Database 101 Migrations', 'Orm 101 Part 3']
```

When recall drops, I investigate: is it a chunking problem (relevant content split across chunks)? An embedding quality problem? Or a question that's genuinely hard to match?
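Hit/miss recall is also coarse: an expected source retrieved at rank 5 counts the same as one at rank 1. Mean reciprocal rank (MRR) distinguishes the two. A sketch, assuming each detail record keeps the retrieved titles in ranked order (the code above stores them as a set, so you would record the ordered list instead):

```python
def mean_reciprocal_rank(details: list[dict]) -> float:
    """Average of 1/rank of the first expected source found per question.

    Each entry is assumed to have an ordered `retrieved` list and an
    `expected` list. A complete miss contributes 0, so MRR drops both
    when sources are missing and when they rank low.
    """
    if not details:
        return 0.0
    total = 0.0
    for d in details:
        expected = set(d["expected"])
        for rank, title in enumerate(d["retrieved"], start=1):
            if title in expected:
                total += 1.0 / rank
                break
    return total / len(details)
```

An MRR well below recall@k is a signal that the right chunks are being found but ranked poorly, which points at embedding quality rather than indexing coverage.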

***

## LLM-as-Judge Evaluation

For evaluating answer quality, I use a technique called LLM-as-judge: I ask a second LLM to evaluate the output of the first.

````python
# eval/llm_judge.py
import json
from pydantic import BaseModel, Field

from ai_engineer.llm.base import LLMProvider


class JudgeScore(BaseModel):
    """Structured output from the LLM judge."""

    relevance: int = Field(..., ge=1, le=5, description="How relevant is the answer to the question?")
    accuracy: int = Field(..., ge=1, le=5, description="How accurate is the answer compared to the reference?")
    completeness: int = Field(..., ge=1, le=5, description="Does the answer cover the key points?")
    reasoning: str = Field(..., description="Brief explanation of the scores")


JUDGE_PROMPT = """You are evaluating the quality of an AI-generated answer.

Question: {question}

Reference Answer: {reference}

Generated Answer: {generated}

Rate the generated answer on three dimensions (1-5 scale):
1. Relevance: Does it address the question? (1=completely off-topic, 5=directly answers)
2. Accuracy: Is it consistent with the reference? (1=contradicts reference, 5=fully consistent)
3. Completeness: Does it cover the key points? (1=missing everything, 5=covers all key points)

Respond with ONLY a JSON object:
{{"relevance": <1-5>, "accuracy": <1-5>, "completeness": <1-5>, "reasoning": "<brief explanation>"}}"""


async def judge_answer(
    question: str,
    reference_answer: str,
    generated_answer: str,
    judge_provider: LLMProvider,
) -> JudgeScore:
    """Use an LLM to evaluate answer quality."""
    prompt = JUDGE_PROMPT.format(
        question=question,
        reference=reference_answer,
        generated=generated_answer,
    )

    raw = await judge_provider.generate(prompt, temperature=0.0, max_tokens=200)

    # Parse and validate
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        lines = cleaned.split("\n")
        cleaned = "\n".join(lines[1:-1])

    data = json.loads(cleaned)
    return JudgeScore.model_validate(data)
````
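The fence-stripping above handles the common case, but judge models occasionally wrap the JSON in prose as well. A more defensive parser, as a sketch (this `extract_json_object` helper is not part of the code above), scans for the first parseable JSON object:

```python
import json


def extract_json_object(raw: str) -> dict:
    """Pull the first JSON object out of an LLM response.

    Despite instructions, judge models sometimes prepend text like
    "Here is my evaluation:" or wrap the JSON in markdown fences.
    Starting raw_decode at each '{' tolerates both, and ignores any
    trailing text after the object.
    """
    decoder = json.JSONDecoder()
    for i, ch in enumerate(raw):
        if ch == "{":
            try:
                obj, _end = decoder.raw_decode(raw, i)
                return obj
            except json.JSONDecodeError:
                continue  # false start; keep scanning
    raise ValueError("no JSON object found in judge response")
```

If parsing still fails, retrying the judge call once at temperature 0 usually resolves it; I treat a second failure as an evaluation error rather than a score.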

### Full Evaluation Pipeline

```python
# eval/run_eval.py
import asyncio
from ai_engineer.retrieval.search import semantic_search
from ai_engineer.prompts.templates import PromptBuilder, RAGPromptInput
from ai_engineer.llm.factory import create_llm_provider
from eval.llm_judge import judge_answer
from eval.dataset import RAG_EVAL_DATASET


async def run_full_evaluation():
    """Run end-to-end evaluation on the dataset."""
    provider = create_llm_provider()
    results = []

    for example in RAG_EVAL_DATASET.examples:
        # Step 1: Retrieve
        chunks = await semantic_search(query=example.question, top_k=5)

        # Step 2: Generate
        if chunks:
            prompt_input = RAGPromptInput(
                question=example.question,
                context_chunks=chunks,
            )
            messages = PromptBuilder.build_rag_prompt(prompt_input)
            generated = await provider.generate(
                messages[-1]["content"],
                temperature=0.1,
            )
        else:
            generated = "I don't have information about this in my knowledge base."

        # Step 3: Judge
        score = await judge_answer(
            question=example.question,
            reference_answer=example.expected_answer,
            generated_answer=generated,
            judge_provider=provider,
        )

        results.append({
            "question": example.question,
            "category": example.category,
            "generated": generated[:200],
            "relevance": score.relevance,
            "accuracy": score.accuracy,
            "completeness": score.completeness,
            "avg_score": (score.relevance + score.accuracy + score.completeness) / 3,
            "reasoning": score.reasoning,
        })

    # Aggregate results
    avg_relevance = sum(r["relevance"] for r in results) / len(results)
    avg_accuracy = sum(r["accuracy"] for r in results) / len(results)
    avg_completeness = sum(r["completeness"] for r in results) / len(results)

    print(f"\n{'='*60}")
    print(f"Evaluation Results ({len(results)} examples)")
    print(f"{'='*60}")
    print(f"Avg Relevance:    {avg_relevance:.2f} / 5.0")
    print(f"Avg Accuracy:     {avg_accuracy:.2f} / 5.0")
    print(f"Avg Completeness: {avg_completeness:.2f} / 5.0")
    print(f"{'='*60}")

    # Show failures (avg score < 3.0)
    failures = [r for r in results if r["avg_score"] < 3.0]
    if failures:
        print(f"\nLow-scoring answers ({len(failures)}):")
        for f in failures:
            print(f"  [{f['avg_score']:.1f}] {f['question'][:60]}...")
            print(f"       Reason: {f['reasoning']}")

    return results


asyncio.run(run_full_evaluation())
```

Output from a typical run:

```
============================================================
Evaluation Results (5 examples)
============================================================
Avg Relevance:    4.40 / 5.0
Avg Accuracy:     4.00 / 5.0
Avg Completeness: 3.80 / 5.0
============================================================
```

***

## Regression Testing

The most valuable use of evaluation is catching regressions. When I change a prompt, swap an embedding model, or update chunking logic, I run the evaluation suite and compare:

```python
# eval/regression.py
import json
from pathlib import Path
from datetime import datetime


def save_eval_results(results: list[dict], label: str) -> str:
    """Save evaluation results with a timestamp for comparison."""
    output_dir = Path("eval/results")
    output_dir.mkdir(parents=True, exist_ok=True)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{label}_{timestamp}.json"
    filepath = output_dir / filename

    filepath.write_text(json.dumps(results, indent=2))
    return str(filepath)


def compare_results(baseline_path: str, current_path: str) -> dict:
    """Compare two evaluation runs and report regressions."""
    baseline = json.loads(Path(baseline_path).read_text())
    current = json.loads(Path(current_path).read_text())

    baseline_by_q = {r["question"]: r for r in baseline}
    current_by_q = {r["question"]: r for r in current}

    regressions = []
    improvements = []

    for question in baseline_by_q:
        if question not in current_by_q:
            continue

        b = baseline_by_q[question]
        c = current_by_q[question]

        diff = c["avg_score"] - b["avg_score"]
        if diff < -0.5:
            regressions.append({
                "question": question,
                "baseline_score": b["avg_score"],
                "current_score": c["avg_score"],
                "diff": round(diff, 2),
            })
        elif diff > 0.5:
            improvements.append({
                "question": question,
                "baseline_score": b["avg_score"],
                "current_score": c["avg_score"],
                "diff": round(diff, 2),
            })

    return {
        "regressions": regressions,
        "improvements": improvements,
        "baseline_avg": sum(r["avg_score"] for r in baseline) / len(baseline),
        "current_avg": sum(r["avg_score"] for r in current) / len(current),
    }
```

My workflow:

1. Run evaluation before making changes → save as baseline
2. Make the change (new prompt, different model, chunking update)
3. Run evaluation again → save as current
4. Compare: if regressions > improvements, reconsider the change
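Step 4 can be automated into a go/no-go gate over the dictionary that `compare_results` returns. A sketch, with thresholds that are arbitrary starting points rather than recommendations:

```python
def regression_gate(comparison: dict, avg_drop_tolerance: float = 0.1) -> tuple[bool, str]:
    """Decide whether a change is safe to keep, given compare_results() output.

    Fails when per-question regressions outnumber improvements, or when
    the overall average drops by more than the tolerance.
    """
    n_reg = len(comparison["regressions"])
    n_imp = len(comparison["improvements"])
    avg_drop = comparison["baseline_avg"] - comparison["current_avg"]

    if n_reg > n_imp:
        return False, f"{n_reg} regressions vs {n_imp} improvements"
    if avg_drop > avg_drop_tolerance:
        return False, f"average score dropped by {avg_drop:.2f}"
    return True, "no significant regressions"
```

Because LLM-as-judge scores are noisy, I treat a failing gate as "look closer", not an automatic revert.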

***

## Testing in CI

Not all evaluation needs LLM calls. I run the deterministic tests in CI on every push, and the full evaluation suite nightly:

```yaml
# .github/workflows/test.yml
name: Tests

on:
  push:
  pull_request:
  schedule:
    - cron: "0 2 * * *"  # nightly, so the evaluation job below can fire

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv sync --dev
      - run: uv run pytest tests/ -v --tb=short
      - run: uv run ruff check src/
      - run: uv run mypy src/

  # Nightly evaluation — uses LLM API
  evaluation:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv sync
      - run: uv run python eval/run_eval.py
        env:
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
```

***

## Key Takeaways

1. **Separate deterministic and non-deterministic tests.** Unit test everything that doesn't touch an LLM. Evaluate everything that does.
2. **Build evaluation datasets from real usage.** Don't invent test cases — collect them from actual questions your system receives.
3. **Evaluate retrieval separately from generation.** If retrieval is broken, fixing prompts won't help. Isolate the problem.
4. **LLM-as-judge is practical and effective.** It's not perfect, but it's vastly better than manual review for regression detection.
5. **Run evaluations before and after every change.** Prompt changes, model swaps, and chunking updates can all cause regressions. Measure, don't guess.

***

**Previous:** [**Part 6 — Building AI-Powered APIs with FastAPI**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-engineer-101/part-6-building-ai-apis)

**Next:** [**Part 8 — AI Engineering in Production**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-engineer-101/part-8-ai-engineering-in-production)
