Part 7: Evaluating and Testing AI Systems
The Hardest Part of AI Engineering
Two Types of Testing
┌─────────────────────────────────────────────────────┐
│                      AI System                      │
│                                                     │
│  ┌────────────────────┐   ┌──────────────────────┐  │
│  │ Deterministic      │   │ Non-Deterministic    │  │
│  │                    │   │                      │  │
│  │ • Config loading   │   │ • LLM responses      │  │
│  │ • Input validation │   │ • Retrieval ranking  │  │
│  │ • Prompt building  │   │ • Answer quality     │  │
│  │ • Token counting   │   │ • Relevance scores   │  │
│  │ • Output parsing   │   │                      │  │
│  │                    │   │                      │  │
│  │ → Unit tests       │   │ → Evaluations        │  │
│  │   (pytest)         │   │   (eval framework)   │  │
│  └────────────────────┘   └──────────────────────┘  │
└─────────────────────────────────────────────────────┘
Unit Testing the Deterministic Parts
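The deterministic components in the diagram (prompt building, output parsing, and so on) return the same output for the same input, so plain assertions are enough to test them. The following is a minimal sketch of that idea, not the author's implementation: `build_prompt` and `parse_answer` are hypothetical helpers invented for illustration, and the prompt layout and JSON shape are assumptions. The `test_` functions use bare `assert` so pytest can collect them without extra imports.

```python
import json


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Hypothetical prompt builder: the same inputs always yield the same string."""
    if not question.strip():
        raise ValueError("question must not be empty")
    # Number each retrieved chunk so the layout is stable and easy to assert on.
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(context_chunks))
    return f"Context:\n{context}\n\nQuestion: {question.strip()}\nAnswer:"


def parse_answer(raw: str) -> dict:
    """Hypothetical output parser: expects a JSON object with an 'answer' key."""
    data = json.loads(raw)  # malformed JSON raises JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict) or "answer" not in data:
        raise ValueError("model output is missing the 'answer' field")
    return data


def test_build_prompt_numbers_chunks():
    prompt = build_prompt("  What is RAG? ", ["first chunk", "second chunk"])
    assert "[1] first chunk" in prompt
    assert "[2] second chunk" in prompt
    assert prompt.endswith("Question: What is RAG?\nAnswer:")


def test_build_prompt_rejects_empty_question():
    try:
        build_prompt("   ", [])
    except ValueError:
        return
    raise AssertionError("expected ValueError for an empty question")


def test_parse_answer_requires_answer_field():
    assert parse_answer('{"answer": "42"}')["answer"] == "42"
    try:
        parse_answer('{"text": "42"}')
    except ValueError:
        return
    raise AssertionError("expected ValueError when 'answer' is missing")
```

None of these tests calls a model, so they run quickly and give the same result on every run. The non-deterministic side of the diagram (LLM responses, answer quality) is deliberately left out, since it needs scored evaluations rather than exact assertions.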
Building an Evaluation Dataset
How I Build Evaluation Datasets
Retrieval Evaluation
LLM-as-Judge Evaluation
Full Evaluation Pipeline
Regression Testing
Testing in CI
Key Takeaways