Part 7: Evaluating and Testing AI Systems

The Hardest Part of AI Engineering

Testing traditional software is straightforward: given input X, expect output Y. Testing AI systems is fundamentally different. When I ask my RAG service "What is pgvector?", there's no single correct answer. A good answer might mention PostgreSQL extensions, vector types, and similarity search. A different good answer might focus on installation and indexing. Both are correct, but they're different strings.

This was the most frustrating part of my AI engineering journey. I'd make a change to a prompt or switch embedding models and have no systematic way to know if the system got better or worse. I was relying on "it looks about right" β€” which is fine for a personal project but terrible engineering practice.

This article documents the evaluation approach I developed for my own projects. It's not perfect, but it's dramatically better than manual spot-checking.


Two Types of Testing

AI systems have both deterministic and non-deterministic components. I test them with different strategies:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      AI System                       β”‚
β”‚                                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  Deterministic       β”‚  β”‚  Non-Deterministic   β”‚  β”‚
β”‚  β”‚                      β”‚  β”‚                      β”‚  β”‚
β”‚  β”‚  β€’ Config loading    β”‚  β”‚  β€’ LLM responses     β”‚  β”‚
β”‚  β”‚  β€’ Input validation  β”‚  β”‚  β€’ Retrieval ranking β”‚  β”‚
β”‚  β”‚  β€’ Prompt building   β”‚  β”‚  β€’ Answer quality    β”‚  β”‚
β”‚  β”‚  β€’ Token counting    β”‚  β”‚  β€’ Relevance scores  β”‚  β”‚
β”‚  β”‚  β€’ Output parsing    β”‚  β”‚                      β”‚  β”‚
β”‚  β”‚                      β”‚  β”‚                      β”‚  β”‚
β”‚  β”‚  β†’ Unit tests        β”‚  β”‚  β†’ Evaluations       β”‚  β”‚
β”‚  β”‚    (pytest)          β”‚  β”‚    (eval framework)  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Unit Testing the Deterministic Parts

These tests are fast, free (no API calls), and run in CI:
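A sketch of what these tests look like. The helpers `build_prompt` and `count_tokens` are simplified stand-ins I wrote for illustration, not the real service code:

```python
# Hypothetical deterministic helpers from a RAG service, plus their tests.

def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the final prompt from retrieved chunks (illustrative)."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def count_tokens(text: str) -> int:
    """Crude whitespace count; a real service would use a tokenizer like tiktoken."""
    return len(text.split())

def test_prompt_includes_all_chunks():
    prompt = build_prompt("What is pgvector?", ["chunk A", "chunk B"])
    assert "chunk A" in prompt and "chunk B" in prompt

def test_prompt_numbers_chunks_in_order():
    prompt = build_prompt("q", ["first", "second"])
    assert prompt.index("[1] first") < prompt.index("[2] second")

def test_token_count():
    assert count_tokens("hello world") == 2
```

Every assertion here is exact: same input, same output, every run. That's what makes these tests cheap enough to run on every push.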


Building an Evaluation Dataset

The foundation of AI evaluation is a dataset of questions with expected answers. I built mine from my own usage of the RAG system:

How I Build Evaluation Datasets

I don't invent test cases. I collect them from real usage:

  1. Log every question. My RAG service logs every question it receives. After a week of usage, I had 50+ real questions.

  2. Write reference answers for a subset. I manually wrote reference answers for 20-30 diverse questions. This takes time but is essential.

  3. Categorize by difficulty. Some questions are straightforward retrieval ("How do I install X?"), some are conceptual ("What's the difference between X and Y?"), and some are out-of-scope. Each category reveals different failure modes.

  4. Include negative examples. Questions the system should refuse to answer are just as important as questions it should answer.
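The steps above could produce a dataset shaped like this sketch. The field names and schema are my own; the article doesn't prescribe a format:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str          # logged from real usage
    reference_answer: str  # hand-written for a subset
    category: str          # e.g. "retrieval", "conceptual", "out-of-scope"
    should_answer: bool    # False for negative examples

dataset = [
    EvalCase(
        question="How do I install pgvector?",
        reference_answer="Build the extension, then run CREATE EXTENSION vector;",
        category="retrieval",
        should_answer=True,
    ),
    EvalCase(
        question="What's the weather tomorrow?",
        reference_answer="",
        category="out-of-scope",
        should_answer=False,  # the system should refuse this one
    ),
]
```

Keeping the category and `should_answer` flag on every case is what lets the evaluation report break results down by failure mode later.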


Retrieval Evaluation

Before evaluating the full system, I evaluate retrieval separately. If retrieval returns irrelevant chunks, no prompt engineering will save the answer quality.

The core metric I track is recall@k: for each question, do the top-k retrieved chunks include the ones that actually contain the answer?
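As a sketch, recall@k can be computed like this. The `retrieve` function here is a toy keyword matcher standing in for the real vector search, so the example is self-contained and runnable:

```python
def retrieve(question: str, corpus: list[str], k: int = 3) -> list[str]:
    """Toy retriever: rank chunks by word overlap with the question.
    A real system would use embeddings + vector similarity search."""
    words = set(question.lower().split())
    scored = sorted(corpus, key=lambda c: -len(words & set(c.lower().split())))
    return scored[:k]

def recall_at_k(question: str, relevant: set[str], corpus: list[str], k: int = 3) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    hits = set(retrieve(question, corpus, k)) & relevant
    return len(hits) / len(relevant)

corpus = [
    "pgvector is a PostgreSQL extension for vector similarity search",
    "FastAPI is a Python web framework",
    "install pgvector with CREATE EXTENSION vector",
]
recall_at_k("install pgvector", {corpus[2]}, corpus, k=2)  # 1.0: answer chunk is in top 2
```

Averaging this over the whole dataset, per category, gives a single number to watch across changes.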

When recall drops, I investigate: is it a chunking problem (relevant content split across chunks)? An embedding quality problem? Or a question that's genuinely hard to match?


LLM-as-Judge Evaluation

For evaluating answer quality, I use a technique called LLM-as-judge: I ask a second LLM to evaluate the output of the first.
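The pattern, sketched minimally: build a judging prompt from the question, reference answer, and system answer, then parse a structured verdict out of the judge's reply. The prompt wording and JSON schema below are my own choices, not a standard:

```python
import json

JUDGE_PROMPT = """You are grading a RAG system's answer.

Question: {question}
Reference answer: {reference}
System answer: {answer}

Reply with JSON only: {{"score": <1-5>, "reason": "<one sentence>"}}"""

def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    return JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)

def parse_verdict(raw: str) -> dict:
    """Judges sometimes wrap JSON in prose; extract the first {...} block."""
    start, end = raw.find("{"), raw.rfind("}")
    verdict = json.loads(raw[start:end + 1])
    if not 1 <= verdict["score"] <= 5:
        raise ValueError(f"score out of range: {verdict['score']}")
    return verdict
```

In practice the prompt from `build_judge_prompt` goes to a second model and `parse_verdict` handles its reply. Using a different model as judge than the one that generated the answer helps avoid the judge favoring its own style.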

Full Evaluation Pipeline

The pipeline ties the pieces together: each dataset question goes through retrieval and generation, the judge scores the answer against the reference, and scores are aggregated per category.
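A runnable sketch of that loop, with the expensive calls stubbed out. In the real system, `answer_question` would call retrieval plus the LLM, and `judge` would call a second model; both stubs here are assumptions for illustration:

```python
from statistics import mean

def answer_question(question: str) -> str:
    return f"stub answer to: {question}"  # stand-in for retrieval + generation

def judge(question: str, reference: str, answer: str) -> int:
    """Stand-in judge: real version would call an LLM and parse its verdict."""
    return 5 if reference.lower() in answer.lower() else 2

def run_eval(dataset: list[dict]) -> dict:
    """Score every case and aggregate mean judge score per category."""
    by_category: dict[str, list[int]] = {}
    for case in dataset:
        answer = answer_question(case["question"])
        score = judge(case["question"], case["reference"], answer)
        by_category.setdefault(case["category"], []).append(score)
    return {cat: mean(scores) for cat, scores in by_category.items()}
```

The per-category breakdown matters: an aggregate score can stay flat while one category quietly regresses.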


Regression Testing

The most valuable use of evaluation is catching regressions. When I change a prompt, swap an embedding model, or update chunking logic, I run the evaluation suite and compare:

My workflow:

  1. Run evaluation before making changes β†’ save as baseline

  2. Make the change (new prompt, different model, chunking update)

  3. Run evaluation again β†’ save as current

  4. Compare: if regressions > improvements, reconsider the change
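The comparison in step 4 can be sketched as follows. I'm assuming per-question judge scores saved as simple dicts; the file format is an illustration, not prescribed by anything above:

```python
def compare(baseline: dict[str, int], current: dict[str, int]) -> dict:
    """Classify each baseline question as regression or improvement.
    Questions missing from `current` are treated as unchanged."""
    regressions = [q for q in baseline if current.get(q, baseline[q]) < baseline[q]]
    improvements = [q for q in baseline if current.get(q, baseline[q]) > baseline[q]]
    return {"regressions": regressions, "improvements": improvements}

baseline = {"q1": 5, "q2": 3, "q3": 4}
current = {"q1": 5, "q2": 5, "q3": 2}
compare(baseline, current)  # q2 improved, q3 regressed
```

Listing the specific questions, not just counts, is the useful part: a regression report that names "q3" tells you exactly which prompt-and-answer pair to inspect.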


Testing in CI

Not all evaluation needs LLM calls. I run the deterministic tests in CI on every push, and the full evaluation suite nightly:
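One way to express that split, sketched as a small helper. The directory names and the `RUN_EVALS` variable are illustrative choices of mine, not from any standard:

```python
import os

def suites_to_run(env: dict) -> list[str]:
    """Pick test suites based on CI context (names are illustrative)."""
    suites = ["tests/unit"]           # deterministic, free: every push
    if env.get("RUN_EVALS") == "1":   # set only by the nightly job
        suites.append("tests/evals")  # LLM-backed: slow and costs money
    return suites

suites_to_run(dict(os.environ))
```

With pytest, the same effect is commonly achieved by marking the evaluation tests (e.g. a custom marker) and deselecting them with `-m` on regular pushes.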


Key Takeaways

  1. Separate deterministic and non-deterministic tests. Unit test everything that doesn't touch an LLM. Evaluate everything that does.

  2. Build evaluation datasets from real usage. Don't invent test cases β€” collect them from actual questions your system receives.

  3. Evaluate retrieval separately from generation. If retrieval is broken, fixing prompts won't help. Isolate the problem.

  4. LLM-as-judge is practical and effective. It's not perfect, but it's vastly better than manual review for regression detection.

  5. Run evaluations before and after every change. Prompt changes, model swaps, and chunking updates can all cause regressions. Measure, don't guess.


Previous: Part 6 β€” Building AI-Powered APIs with FastAPI

Next: Part 8 β€” AI Engineering in Production
