Part 8: AI Engineering in Production

Production Is Where the Real Work Starts

Getting an AI system to work on my laptop was the easy part. Getting it to work reliably in production — with real users, real costs, and real failure modes — required a different mindset.

In my experience, the gap between a working prototype and a production system is larger for AI applications than for traditional software. A traditional API either returns the right data or it doesn't. An AI API can return confident-sounding nonsense, consume unpredictable amounts of money, and fail in ways that are invisible unless you're actively looking for them.

This article covers what I learned putting my RAG service and monitoring agent into production: observability, guardrails, caching, and the framework-versus-build-it-yourself decision.


Observability for LLM Systems

Standard application observability (request count, latency, error rate) is necessary but not sufficient for AI systems. I track three additional dimensions:

Token Usage and Cost

Every LLM call has a cost, and that cost varies by model, input length, and output length. I track it per-request:

# src/ai_engineer/observability/metrics.py
import time
import logging
from dataclasses import dataclass, field
from collections import defaultdict

logger = logging.getLogger(__name__)


@dataclass
class LLMCallMetrics:
    """Metrics for a single LLM call."""

    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    timestamp: float = field(default_factory=time.time)
    prompt_version: str = ""
    endpoint: str = ""
    success: bool = True
    error: str | None = None


class MetricsCollector:
    """Collect and aggregate metrics for LLM calls."""

    def __init__(self) -> None:
        self._calls: list[LLMCallMetrics] = []
        self._hourly_costs: dict[str, float] = defaultdict(float)

    def record(self, metrics: LLMCallMetrics) -> None:
        self._calls.append(metrics)

        # Track hourly costs
        hour_key = time.strftime("%Y-%m-%d-%H", time.localtime(metrics.timestamp))
        cost = self._estimate_cost(metrics)
        self._hourly_costs[hour_key] += cost

        # Log for structured logging / log aggregation
        logger.info(
            "llm_call",
            extra={
                "model": metrics.model,
                "input_tokens": metrics.input_tokens,
                "output_tokens": metrics.output_tokens,
                "latency_ms": metrics.latency_ms,
                "cost_usd": cost,
                "prompt_version": metrics.prompt_version,
                "endpoint": metrics.endpoint,
                "success": metrics.success,
            },
        )

    def _estimate_cost(self, m: LLMCallMetrics) -> float:
        """Estimate cost in USD from per-million-token rates."""
        # (input_rate, output_rate) in USD per 1M tokens; update as pricing changes
        pricing = {
            "gpt-4o": (2.50, 10.00),
            "gpt-4o-mini": (0.15, 0.60),
        }
        # Unknown models fall back to a deliberately high estimate
        input_rate, output_rate = pricing.get(m.model, (5.0, 15.0))
        return (m.input_tokens * input_rate + m.output_tokens * output_rate) / 1_000_000

    def summary(self, last_n_hours: int = 24) -> dict:
        """Get a summary of recent metrics."""
        cutoff = time.time() - (last_n_hours * 3600)
        recent = [c for c in self._calls if c.timestamp >= cutoff]

        if not recent:
            return {"total_calls": 0, "total_cost_usd": 0}

        total_cost = sum(self._estimate_cost(c) for c in recent)
        success_count = sum(1 for c in recent if c.success)
        latencies = [c.latency_ms for c in recent]

        return {
            "total_calls": len(recent),
            "success_rate": success_count / len(recent),
            "total_cost_usd": round(total_cost, 4),
            "avg_latency_ms": round(sum(latencies) / len(latencies), 1),
            "p95_latency_ms": round(sorted(latencies)[int(len(latencies) * 0.95)], 1),
            "total_input_tokens": sum(c.input_tokens for c in recent),
            "total_output_tokens": sum(c.output_tokens for c in recent),
        }

Wrapping LLM Calls with Metrics
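
The wrapper from the service isn't reproduced in this excerpt, so here is a minimal sketch of the pattern. It is deliberately self-contained: call_fn stands in for the real provider call, and record stands in for collector.record (the real version builds an LLMCallMetrics rather than a plain dict):

```python
import time
from typing import Callable


def timed_llm_call(
    record: Callable[[dict], None],
    call_fn: Callable[[str], tuple[str, int, int]],
    prompt: str,
    model: str,
) -> str:
    """Run an LLM call, recording latency and token counts even on failure."""
    start = time.perf_counter()
    text, in_tok, out_tok = "", 0, 0
    success, error = True, None
    try:
        text, in_tok, out_tok = call_fn(prompt)
        return text
    except Exception as exc:
        success, error = False, str(exc)
        raise
    finally:
        # The finally block guarantees failed calls are recorded too
        record({
            "model": model,
            "input_tokens": in_tok,
            "output_tokens": out_tok,
            "latency_ms": (time.perf_counter() - start) * 1000,
            "success": success,
            "error": error,
        })
```

The point of the try/finally shape is that failed calls still produce a metrics record, which is exactly the data you need when debugging an outage.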

What My Dashboard Shows

After a few weeks in production, the numbers I check daily come straight from the summary() method above: total cost, success rate, and P95 latency.


Guardrails and Content Filtering

LLMs can generate content that's incorrect, inappropriate, or harmful. In production, I apply guardrails at multiple levels:

Input Guardrails

Output Guardrails
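
Again a sketch rather than the production version: this one redacts email addresses and truncates runaway outputs, returning flags so the caller can log what happened. The regex and the max_chars default are illustrative:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_output(text: str, max_chars: int = 8000) -> tuple[str, list[str]]:
    """Sanitize a model response; return (text, flags) for logging."""
    flags: list[str] = []
    if EMAIL_RE.search(text):
        text = EMAIL_RE.sub("[redacted email]", text)
        flags.append("pii_redacted")
    if len(text) > max_chars:
        text = text[:max_chars]
        flags.append("truncated")
    return text, flags
```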

Applying Guardrails in the Request Pipeline


Caching Strategies

LLM calls are expensive and slow. Caching can dramatically reduce both cost and latency.

Embedding Cache

Embeddings are deterministic — the same input always produces the same output. I cache them aggressively:
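
A minimal in-memory sketch of the idea, with the embedding function passed in and keys derived from a hash of the exact input text (a persistent store is the natural extension, omitted here):

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Cache embeddings keyed by a hash of the exact input text.

    Safe because embedding is deterministic for a fixed model.
    """

    def __init__(self, embed_fn: Callable[[str], list[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Hashing the text rather than storing it as the key keeps the cache compact even when inputs are whole document chunks.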

Response Cache for Repeated Queries

Some questions are asked frequently. I cache responses with a TTL:

Cache Impact

After adding caching to my RAG service, the numbers improved significantly:

Metric            Before Cache    After Cache
Avg latency       1,230 ms        340 ms
P95 latency       2,890 ms        1,450 ms
Daily cost        $0.42           $0.18
Cache hit rate    n/a             38%
The 38% cache hit rate comes from repeated or near-identical questions. In a knowledge-base system, people tend to ask about the same topics.


When to Use a Framework vs Building From Scratch

This is a decision I've gone back and forth on. Here's where I've landed:

Frameworks I've Used

  • LangChain: Full-featured, lots of integrations. But the abstractions add complexity and make debugging harder. When my RAG pipeline broke, I spent more time understanding LangChain's internals than the actual problem.

  • LlamaIndex: Better for pure RAG use cases. The indexing abstractions are useful if your document formats are complex.

  • Plain httpx + FastAPI: Maximum control, minimum magic. Everything is explicit and debuggable.

My Decision Framework

For my personal projects, I build from scratch. The code in this series uses only httpx, FastAPI, sentence-transformers, and SQLAlchemy. No LangChain, no LlamaIndex, no orchestration frameworks. I understand every line, and when something breaks, I know where to look.

For a team project with tight deadlines, I'd start with LangChain or LlamaIndex and migrate critical paths to custom code as the system matures.


Deployment

My AI services deploy the same way as any other Python service:

Dockerfile
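
The original Dockerfile isn't in this excerpt; a representative sketch, in which the module path and port are illustrative rather than the repo's actual values:

```dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install dependencies first so this layer caches across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ src/

EXPOSE 8000
CMD ["uvicorn", "ai_engineer.api:app", "--host", "0.0.0.0", "--port", "8000"]
```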

docker-compose.yml for Production

Health Check Endpoint


What I'd Do Differently

Looking back at my AI engineering journey, here's what I'd change if I started over:

  1. Start with evaluation. I built the system first and added evaluation later. Evaluation should come first — even a simple dataset of 10 questions with expected answers would have caught issues I spent hours debugging.

  2. Log everything from day one. Every prompt, every response, every latency measurement. I lost valuable data from early usage because I didn't set up structured logging until later.

  3. Use temperature=0 by default. I wasted time debugging inconsistent outputs that were just sampling randomness. Start with deterministic outputs and add temperature only when you need variation.

  4. Build the provider abstraction early. I hardcoded the OpenAI client in my first project. When I wanted to try Claude, I had to refactor everything. The Protocol-based abstraction from Part 2 takes 10 minutes to set up and saves hours later.

  5. Don't optimize prematurely. My first RAG system embedded 50,000 chunks with a complex indexing pipeline. I could have started with 500 and learned 90% of what I needed at 1% of the complexity.


The Complete Picture

Across this series, we've built every layer of an AI-powered system, from provider clients and prompting through RAG, agents, and evaluation, to the production concerns covered here.

Every piece is something I've built, used, and debugged in my own projects. The code runs. The patterns work. The problems I discussed are problems I actually encountered.

AI engineering is still a young field. The tools, models, and best practices evolve fast. But the fundamentals — software engineering discipline, understanding your tools, testing what you build, and measuring what matters — don't change.


Previous: Part 7 — Evaluating and Testing AI Systems

Back to: AI Engineer 101 — Series Overview
