Part 8: AI Engineering in Production

Production Is Where the Real Work Starts

Getting an AI system to work on my laptop was the easy part. Getting it to work reliably in production — with real users, real costs, and real failure modes — required a different mindset.

In my experience, the gap between a working prototype and a production system is larger for AI applications than for traditional software. A traditional API either returns the right data or it doesn't. An AI API can return confident-sounding nonsense, consume unpredictable amounts of money, and fail in ways that are invisible unless you're actively looking for them.

This article covers what I learned putting my RAG service and monitoring agent into production: observability, guardrails, caching, and the framework-versus-build-it-yourself decision.

Observability for LLM Systems

Standard application observability (request count, latency, error rate) is necessary but not sufficient for AI systems. I track three additional dimensions:

Token Usage and Cost

Every LLM call has a cost, and that cost varies by model, input length, and output length. I track it per-request:

# src/ai_engineer/observability/metrics.py
import time
import logging
from dataclasses import dataclass, field
from collections import defaultdict

logger = logging.getLogger(__name__)


@dataclass
class LLMCallMetrics:
    """Metrics for a single LLM call."""

    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    timestamp: float = field(default_factory=time.time)
    prompt_version: str = ""
    endpoint: str = ""
    success: bool = True
    error: str | None = None


class MetricsCollector:
    """Collect and aggregate metrics for LLM calls."""

    def __init__(self) -> None:
        self._calls: list[LLMCallMetrics] = []
        self._hourly_costs: dict[str, float] = defaultdict(float)

    def record(self, metrics: LLMCallMetrics) -> None:
        self._calls.append(metrics)

        # Track hourly costs
        hour_key = time.strftime("%Y-%m-%d-%H", time.localtime(metrics.timestamp))
        cost = self._estimate_cost(metrics)
        self._hourly_costs[hour_key] += cost

        # Log for structured logging / log aggregation
        logger.info(
            "llm_call",
            extra={
                "model": metrics.model,
                "input_tokens": metrics.input_tokens,
                "output_tokens": metrics.output_tokens,
                "latency_ms": metrics.latency_ms,
                "cost_usd": cost,
                "prompt_version": metrics.prompt_version,
                "endpoint": metrics.endpoint,
                "success": metrics.success,
            },
        )

    def _estimate_cost(self, m: LLMCallMetrics) -> float:
        """Estimate cost in USD."""
        pricing = {
            "gpt-4o": (2.50, 10.00),
            "gpt-4o-mini": (0.15, 0.60),
        }
        input_rate, output_rate = pricing.get(m.model, (5.0, 15.0))
        return (m.input_tokens * input_rate + m.output_tokens * output_rate) / 1_000_000

    def summary(self, last_n_hours: int = 24) -> dict:
        """Get a summary of recent metrics."""
        cutoff = time.time() - (last_n_hours * 3600)
        recent = [c for c in self._calls if c.timestamp >= cutoff]

        if not recent:
            return {"total_calls": 0, "total_cost_usd": 0}

        total_cost = sum(self._estimate_cost(c) for c in recent)
        success_count = sum(1 for c in recent if c.success)
        latencies = [c.latency_ms for c in recent]

        return {
            "total_calls": len(recent),
            "success_rate": success_count / len(recent),
            "total_cost_usd": round(total_cost, 4),
            "avg_latency_ms": round(sum(latencies) / len(latencies), 1),
            "p95_latency_ms": round(sorted(latencies)[int(len(latencies) * 0.95)], 1),
            "total_input_tokens": sum(c.input_tokens for c in recent),
            "total_output_tokens": sum(c.output_tokens for c in recent),
        }

Wrapping LLM Calls with Metrics

# src/ai_engineer/llm/instrumented.py
import time

import httpx

from ai_engineer.config import settings
from ai_engineer.observability.metrics import LLMCallMetrics, MetricsCollector


class InstrumentedLLMProvider:
    """LLM provider that records metrics for every call."""

    def __init__(self, collector: MetricsCollector) -> None:
        self._collector = collector
        self._client = httpx.AsyncClient(
            base_url="https://models.inference.ai.azure.com",
            headers={"Authorization": f"Bearer {settings.llm_api_key}"},
            timeout=30.0,
        )

    async def generate(
        self,
        prompt: str,
        *,
        max_tokens: int = 512,
        temperature: float = 0.1,
        endpoint: str = "",
        prompt_version: str = "",
    ) -> str:
        start = time.monotonic()
        error_msg = None
        success = True
        input_tokens = 0
        output_tokens = 0
        content = ""

        try:
            response = await self._client.post(
                "/chat/completions",
                json={
                    "model": settings.llm_model,
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": max_tokens,
                    "temperature": temperature,
                },
            )
            response.raise_for_status()
            data = response.json()

            content = data["choices"][0]["message"]["content"]
            usage = data.get("usage", {})
            input_tokens = usage.get("prompt_tokens", 0)
            output_tokens = usage.get("completion_tokens", 0)

        except Exception as e:
            success = False
            error_msg = str(e)
            raise
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            self._collector.record(
                LLMCallMetrics(
                    model=settings.llm_model,
                    input_tokens=input_tokens,
                    output_tokens=output_tokens,
                    latency_ms=elapsed_ms,
                    prompt_version=prompt_version,
                    endpoint=endpoint,
                    success=success,
                    error=error_msg,
                )
            )

        return content

What My Dashboard Shows

After running in production for a few weeks, the metrics I check daily:

Daily Summary (2026-04-02)
─────────────────────────
Total calls:        847
Success rate:       99.4%
Total cost:         $0.42
Avg latency:        1,230 ms
P95 latency:        2,890 ms
Input tokens:       1,247,000
Output tokens:      389,000

Top endpoints:
  /ask           623 calls  $0.31
  /ask/stream    189 calls  $0.09
  /extract        35 calls  $0.02

Hourly cost trend:
  00:00  $0.01 ▎
  08:00  $0.03 ▎▎▎
  09:00  $0.05 ▎▎▎▎▎
  10:00  $0.04 ▎▎▎▎
  ...

Guardrails and Content Filtering

LLMs can generate content that's incorrect, inappropriate, or harmful. In production, I apply guardrails at multiple levels:

Input Guardrails

# src/ai_engineer/guardrails/input.py
import re
from pydantic import BaseModel


class InputCheckResult(BaseModel):
    allowed: bool
    reason: str | None = None


def check_input(text: str) -> InputCheckResult:
    """Check user input before processing."""
    # Length check
    if len(text) > 5000:
        return InputCheckResult(
            allowed=False,
            reason="Input exceeds maximum length of 5000 characters",
        )

    # Empty or whitespace-only
    if not text.strip():
        return InputCheckResult(
            allowed=False,
            reason="Input is empty",
        )

    # Basic injection pattern detection
    injection_patterns = [
        r"ignore\s+(?:all\s+)?(?:previous\s+)?instructions",
        r"you\s+are\s+now\s+(?:a|an)",
        r"<\|(?:im_start|system)\|>",
        r"```\s*system",
    ]

    for pattern in injection_patterns:
        if re.search(pattern, text, re.IGNORECASE):
            return InputCheckResult(
                allowed=False,
                reason="Input contains potentially harmful patterns",
            )

    return InputCheckResult(allowed=True)

Output Guardrails

# src/ai_engineer/guardrails/output.py
from pydantic import BaseModel


class OutputCheckResult(BaseModel):
    safe: bool
    filtered_content: str
    reason: str | None = None


def check_output(text: str) -> OutputCheckResult:
    """Check LLM output before returning to the user."""
    # Empty response
    if not text.strip():
        return OutputCheckResult(
            safe=False,
            filtered_content="I was unable to generate a response.",
            reason="Empty LLM output",
        )

    # Excessive length — model running away
    max_length = 10000
    if len(text) > max_length:
        return OutputCheckResult(
            safe=True,
            filtered_content=text[:max_length] + "\n\n[Response truncated]",
            reason="Response exceeded maximum length",
        )

    # Check for potential data leakage (e.g., model regurgitating API keys)
    # This is a basic check — production systems need more sophisticated detection
    import re
    sensitive_patterns = [
        r"(?:sk-|ghp_|AKIA)[A-Za-z0-9]{20,}",  # API key patterns
        r"\b\d{3}-\d{2}-\d{4}\b",  # SSN-like patterns
    ]

    filtered = text
    for pattern in sensitive_patterns:
        filtered = re.sub(pattern, "[REDACTED]", filtered)

    if filtered != text:
        return OutputCheckResult(
            safe=True,
            filtered_content=filtered,
            reason="Potentially sensitive content was redacted",
        )

    return OutputCheckResult(safe=True, filtered_content=text)

Applying Guardrails in the Request Pipeline

@app.post("/ask", response_model=AnswerResponse)
async def ask_question(request: QuestionRequest) -> AnswerResponse:
    # Input guardrail
    input_check = check_input(request.question)
    if not input_check.allowed:
        raise HTTPException(status_code=400, detail=input_check.reason)

    # ... retrieval and generation ...

    # Output guardrail
    output_check = check_output(raw_answer)
    if not output_check.safe:
        logger.warning("Output filtered", extra={"reason": output_check.reason})

    return AnswerResponse(
        answer=output_check.filtered_content,
        # ... other fields
    )

Caching Strategies

LLM calls are expensive and slow. Caching can dramatically reduce both cost and latency.

Embedding Cache

Embeddings are deterministic — the same input always produces the same output. I cache them aggressively:

# src/ai_engineer/cache/embedding_cache.py
import hashlib
import json
from pathlib import Path


class EmbeddingCache:
    """File-based cache for embeddings.

    Simple but effective for single-server deployments.
    For multi-server, use Redis or a shared cache.
    """

    def __init__(self, cache_dir: str = ".cache/embeddings") -> None:
        self._dir = Path(cache_dir)
        self._dir.mkdir(parents=True, exist_ok=True)

    def _key(self, text: str, model: str) -> str:
        content = f"{model}:{text}"
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, text: str, model: str) -> list[float] | None:
        path = self._dir / f"{self._key(text, model)}.json"
        if path.exists():
            return json.loads(path.read_text())
        return None

    def set(self, text: str, model: str, embedding: list[float]) -> None:
        path = self._dir / f"{self._key(text, model)}.json"
        path.write_text(json.dumps(embedding))

Response Cache for Repeated Queries

Some questions are asked frequently. I cache responses with a TTL:

# src/ai_engineer/cache/response_cache.py
import hashlib
import time
from dataclasses import dataclass


@dataclass
class CachedResponse:
    answer: str
    sources: list[dict]
    cached_at: float
    ttl_seconds: float


class ResponseCache:
    """In-memory cache for LLM responses with TTL."""

    def __init__(self, default_ttl: float = 3600) -> None:
        self._cache: dict[str, CachedResponse] = {}
        self._default_ttl = default_ttl

    def _key(self, question: str) -> str:
        # Normalize question for cache matching
        normalized = question.strip().lower()
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, question: str) -> CachedResponse | None:
        key = self._key(question)
        cached = self._cache.get(key)
        if cached is None:
            return None
        if time.time() - cached.cached_at > cached.ttl_seconds:
            del self._cache[key]
            return None
        return cached

    def set(
        self,
        question: str,
        answer: str,
        sources: list[dict],
        ttl: float | None = None,
    ) -> None:
        key = self._key(question)
        self._cache[key] = CachedResponse(
            answer=answer,
            sources=sources,
            cached_at=time.time(),
            ttl_seconds=ttl or self._default_ttl,
        )

    @property
    def size(self) -> int:
        return len(self._cache)

Cache Impact

After adding caching to my RAG service, the numbers improved significantly:

Metric

Before Cache

After Cache

Avg latency

1,230 ms

340 ms

P95 latency

2,890 ms

1,450 ms

Daily cost

$0.42

$0.18

Cache hit rate

—

38%

The 38% cache hit rate comes from repeated or near-identical questions. In a knowledge-base system, people tend to ask about the same topics.

When to Use a Framework vs Building From Scratch

This is a decision I've gone back and forth on. Here's where I've landed:

Frameworks I've Used

LangChain: Full-featured, lots of integrations. But the abstractions add complexity and make debugging harder. When my RAG pipeline broke, I spent more time understanding LangChain's internals than the actual problem.
LlamaIndex: Better for pure RAG use cases. The indexing abstractions are useful if your document formats are complex.
Plain httpx + FastAPI: Maximum control, minimum magic. Everything is explicit and debuggable.

My Decision Framework

Do you need advanced features that would take weeks to build?
  → Yes: Use a framework (fine-tuning orchestration, multi-modal, complex agents)
  → No: Continue

Do you need to deeply understand the behavior of your system?
  → Yes: Build from scratch (you'll debug it eventually anyway)
  → No: Continue

Is this a prototype or proof of concept?
  → Yes: Use a framework (speed matters)
  → No: Build from scratch (maintenance cost matters)

For my personal projects, I build from scratch. The code in this series uses only httpx, FastAPI, sentence-transformers, and SQLAlchemy. No LangChain, no LlamaIndex, no orchestration frameworks. I understand every line, and when something breaks, I know where to look.

For a team project with tight deadlines, I'd start with LangChain or LlamaIndex and migrate critical paths to custom code as the system matures.

Deployment

My AI services deploy the same way as any other Python service:

Dockerfile

FROM python:3.12-slim

WORKDIR /app

# Install uv
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Install dependencies
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev

# Copy application
COPY src/ src/

EXPOSE 8000

CMD ["uv", "run", "uvicorn", "ai_engineer.main:app", "--host", "0.0.0.0", "--port", "8000"]

docker-compose.yml for Production

services:
  app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql+asyncpg://postgres:${DB_PASSWORD}@postgres:5432/ai_engineer
      - LLM_API_KEY=${LLM_API_KEY}
      - LLM_PROVIDER=github
      - LLM_MODEL=gpt-4o
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: ai_engineer
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

volumes:
  pgdata:

Health Check Endpoint

@app.get("/health")
async def health() -> dict:
    """Health check for load balancers and container orchestration."""
    # Check database connectivity
    try:
        async with async_session() as session:
            await session.execute(text("SELECT 1"))
        db_status = "ok"
    except Exception:
        db_status = "error"

    # Check LLM provider connectivity
    try:
        await app.state.llm.generate("ping", max_tokens=5, temperature=0)
        llm_status = "ok"
    except Exception:
        llm_status = "error"

    status = "ok" if db_status == "ok" and llm_status == "ok" else "degraded"

    return {
        "status": status,
        "database": db_status,
        "llm_provider": llm_status,
        "model": settings.llm_model,
    }

What I'd Do Differently

Looking back at my AI engineering journey, here's what I'd change if I started over:

Start with evaluation. I built the system first and added evaluation later. Evaluation should come first — even a simple dataset of 10 questions with expected answers would have caught issues I spent hours debugging.
Log everything from day one. Every prompt, every response, every latency measurement. I lost valuable data from early usage because I didn't set up structured logging until later.
Use temperature=0 by default. I wasted time debugging inconsistent outputs that were just sampling randomness. Start with deterministic outputs and add temperature only when you need variation.
Build the provider abstraction early. I hardcoded the OpenAI client in my first project. When I wanted to try Claude, I had to refactor everything. The Protocol-based abstraction from Part 2 takes 10 minutes to set up and saves hours later.
Don't optimize prematurely. My first RAG system embedded 50,000 chunks with a complex indexing pipeline. I could have started with 500 and learned 90% of what I needed at 1% of the complexity.

The Complete Picture

Across this series, we've built every layer of an AI-powered system:

┌─────────────────────────────────────────────────────┐
│                   Production                         │
│  Observability │ Guardrails │ Caching │ Deployment   │  ← Part 8
├─────────────────────────────────────────────────────┤
│                   Evaluation                         │
│  Retrieval eval │ LLM-as-judge │ Regression tests   │  ← Part 7
├─────────────────────────────────────────────────────┤
│                   API Layer                          │
│  FastAPI │ Streaming │ Rate limiting │ Cost control  │  ← Part 6
├─────────────────────────────────────────────────────┤
│                   Prompt Engineering                 │
│  System prompts │ Templates │ Structured output      │  ← Part 5
├─────────────────────────────────────────────────────┤
│               Embeddings & Vector Search             │
│  sentence-transformers │ pgvector │ Semantic search  │  ← Part 4
├─────────────────────────────────────────────────────┤
│                   LLM Understanding                  │
│  Tokens │ Context windows │ Temperature │ Providers  │  ← Part 3
├─────────────────────────────────────────────────────┤
│                   Python Tooling                     │
│  uv │ FastAPI │ Pydantic │ Ruff │ mypy │ pytest     │  ← Part 2
├─────────────────────────────────────────────────────┤
│                   Foundations                        │
│  Role definition │ Skills map │ Learning path        │  ← Part 1
└─────────────────────────────────────────────────────┘

Every piece is something I've built, used, and debugged in my own projects. The code runs. The patterns work. The problems I discussed are problems I actually encountered.

AI engineering is still a young field. The tools, models, and best practices evolve fast. But the fundamentals — software engineering discipline, understanding your tools, testing what you build, and measuring what matters — don't change.

Previous: Part 7 — Evaluating and Testing AI Systems

Back to: AI Engineer 101 — Series Overview

PreviousPart 7: Evaluating and Testing AI Systems NextMulti Agent Orchestration 101

Last updated 7 hours ago

hashtagProduction Is Where the Real Work Starts

hashtagObservability for LLM Systems

hashtagToken Usage and Cost

hashtagWrapping LLM Calls with Metrics

hashtagWhat My Dashboard Shows

hashtagGuardrails and Content Filtering

hashtagInput Guardrails

hashtagOutput Guardrails

hashtagApplying Guardrails in the Request Pipeline

hashtagCaching Strategies

hashtagEmbedding Cache

hashtagResponse Cache for Repeated Queries

hashtagCache Impact

hashtagWhen to Use a Framework vs Building From Scratch

hashtagFrameworks I've Used

hashtagMy Decision Framework

hashtagDeployment

hashtagDockerfile

hashtagdocker-compose.yml for Production

hashtagHealth Check Endpoint

hashtagWhat I'd Do Differently

hashtagThe Complete Picture