# Part 8: AI Engineering in Production

## Production Is Where the Real Work Starts

Getting an AI system to work on my laptop was the easy part. Getting it to work reliably in production — with real users, real costs, and real failure modes — required a different mindset.

In my experience, the gap between a working prototype and a production system is larger for AI applications than for traditional software. A traditional API either returns the right data or it doesn't. An AI API can return confident-sounding nonsense, consume unpredictable amounts of money, and fail in ways that are invisible unless you're actively looking for them.

This article covers what I learned putting my RAG service and monitoring agent into production: observability, guardrails, caching, and the framework-versus-build-it-yourself decision.

***

## Observability for LLM Systems

Standard application observability (request count, latency, error rate) is necessary but not sufficient for AI systems. I track three additional dimensions:

### Token Usage and Cost

Every LLM call has a cost, and that cost varies by model, input length, and output length. I track it per-request:

```python
# src/ai_engineer/observability/metrics.py
import time
import logging
from dataclasses import dataclass, field
from collections import defaultdict

logger = logging.getLogger(__name__)


@dataclass
class LLMCallMetrics:
    """Metrics for a single LLM call."""

    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    timestamp: float = field(default_factory=time.time)
    prompt_version: str = ""
    endpoint: str = ""
    success: bool = True
    error: str | None = None


class MetricsCollector:
    """Collect and aggregate metrics for LLM calls."""

    def __init__(self) -> None:
        self._calls: list[LLMCallMetrics] = []
        self._hourly_costs: dict[str, float] = defaultdict(float)

    def record(self, metrics: LLMCallMetrics) -> None:
        self._calls.append(metrics)

        # Track hourly costs
        hour_key = time.strftime("%Y-%m-%d-%H", time.localtime(metrics.timestamp))
        cost = self._estimate_cost(metrics)
        self._hourly_costs[hour_key] += cost

        # Log for structured logging / log aggregation
        logger.info(
            "llm_call",
            extra={
                "model": metrics.model,
                "input_tokens": metrics.input_tokens,
                "output_tokens": metrics.output_tokens,
                "latency_ms": metrics.latency_ms,
                "cost_usd": cost,
                "prompt_version": metrics.prompt_version,
                "endpoint": metrics.endpoint,
                "success": metrics.success,
            },
        )

    def _estimate_cost(self, m: LLMCallMetrics) -> float:
        """Estimate cost in USD."""
        pricing = {
            "gpt-4o": (2.50, 10.00),
            "gpt-4o-mini": (0.15, 0.60),
        }
        input_rate, output_rate = pricing.get(m.model, (5.0, 15.0))
        return (m.input_tokens * input_rate + m.output_tokens * output_rate) / 1_000_000

    def summary(self, last_n_hours: int = 24) -> dict:
        """Get a summary of recent metrics."""
        cutoff = time.time() - (last_n_hours * 3600)
        recent = [c for c in self._calls if c.timestamp >= cutoff]

        if not recent:
            return {"total_calls": 0, "total_cost_usd": 0}

        total_cost = sum(self._estimate_cost(c) for c in recent)
        success_count = sum(1 for c in recent if c.success)
        latencies = [c.latency_ms for c in recent]

        return {
            "total_calls": len(recent),
            "success_rate": success_count / len(recent),
            "total_cost_usd": round(total_cost, 4),
            "avg_latency_ms": round(sum(latencies) / len(latencies), 1),
            "p95_latency_ms": round(sorted(latencies)[int(len(latencies) * 0.95)], 1),
            "total_input_tokens": sum(c.input_tokens for c in recent),
            "total_output_tokens": sum(c.output_tokens for c in recent),
        }
```

### Wrapping LLM Calls with Metrics

```python
# src/ai_engineer/llm/instrumented.py
import time

import httpx

from ai_engineer.config import settings
from ai_engineer.observability.metrics import LLMCallMetrics, MetricsCollector


class InstrumentedLLMProvider:
    """LLM provider that records metrics for every call."""

    def __init__(self, collector: MetricsCollector) -> None:
        self._collector = collector
        self._client = httpx.AsyncClient(
            base_url="https://models.inference.ai.azure.com",
            headers={"Authorization": f"Bearer {settings.llm_api_key}"},
            timeout=30.0,
        )

    async def generate(
        self,
        prompt: str,
        *,
        max_tokens: int = 512,
        temperature: float = 0.1,
        endpoint: str = "",
        prompt_version: str = "",
    ) -> str:
        start = time.monotonic()
        error_msg = None
        success = True
        input_tokens = 0
        output_tokens = 0
        content = ""

        try:
            response = await self._client.post(
                "/chat/completions",
                json={
                    "model": settings.llm_model,
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": max_tokens,
                    "temperature": temperature,
                },
            )
            response.raise_for_status()
            data = response.json()

            content = data["choices"][0]["message"]["content"]
            usage = data.get("usage", {})
            input_tokens = usage.get("prompt_tokens", 0)
            output_tokens = usage.get("completion_tokens", 0)

        except Exception as e:
            success = False
            error_msg = str(e)
            raise
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            self._collector.record(
                LLMCallMetrics(
                    model=settings.llm_model,
                    input_tokens=input_tokens,
                    output_tokens=output_tokens,
                    latency_ms=elapsed_ms,
                    prompt_version=prompt_version,
                    endpoint=endpoint,
                    success=success,
                    error=error_msg,
                )
            )

        return content
```

### What My Dashboard Shows

After running in production for a few weeks, the metrics I check daily:

```
Daily Summary (2026-04-02)
─────────────────────────
Total calls:        847
Success rate:       99.4%
Total cost:         $0.42
Avg latency:        1,230 ms
P95 latency:        2,890 ms
Input tokens:       1,247,000
Output tokens:      389,000

Top endpoints:
  /ask           623 calls  $0.31
  /ask/stream    189 calls  $0.09
  /extract        35 calls  $0.02

Hourly cost trend:
  00:00  $0.01 ▎
  08:00  $0.03 ▎▎▎
  09:00  $0.05 ▎▎▎▎▎
  10:00  $0.04 ▎▎▎▎
  ...
```

***

## Guardrails and Content Filtering

LLMs can generate content that's incorrect, inappropriate, or harmful. In production, I apply guardrails at multiple levels:

### Input Guardrails

````python
# src/ai_engineer/guardrails/input.py
import re
from pydantic import BaseModel


class InputCheckResult(BaseModel):
    allowed: bool
    reason: str | None = None


def check_input(text: str) -> InputCheckResult:
    """Check user input before processing."""
    # Length check
    if len(text) > 5000:
        return InputCheckResult(
            allowed=False,
            reason="Input exceeds maximum length of 5000 characters",
        )

    # Empty or whitespace-only
    if not text.strip():
        return InputCheckResult(
            allowed=False,
            reason="Input is empty",
        )

    # Basic injection pattern detection
    injection_patterns = [
        r"ignore\s+(?:all\s+)?(?:previous\s+)?instructions",
        r"you\s+are\s+now\s+(?:a|an)",
        r"<\|(?:im_start|system)\|>",
        r"```\s*system",
    ]

    for pattern in injection_patterns:
        if re.search(pattern, text, re.IGNORECASE):
            return InputCheckResult(
                allowed=False,
                reason="Input contains potentially harmful patterns",
            )

    return InputCheckResult(allowed=True)
````

### Output Guardrails

```python
# src/ai_engineer/guardrails/output.py
from pydantic import BaseModel


class OutputCheckResult(BaseModel):
    safe: bool
    filtered_content: str
    reason: str | None = None


def check_output(text: str) -> OutputCheckResult:
    """Check LLM output before returning to the user."""
    # Empty response
    if not text.strip():
        return OutputCheckResult(
            safe=False,
            filtered_content="I was unable to generate a response.",
            reason="Empty LLM output",
        )

    # Excessive length — model running away
    max_length = 10000
    if len(text) > max_length:
        return OutputCheckResult(
            safe=True,
            filtered_content=text[:max_length] + "\n\n[Response truncated]",
            reason="Response exceeded maximum length",
        )

    # Check for potential data leakage (e.g., model regurgitating API keys)
    # This is a basic check — production systems need more sophisticated detection
    import re
    sensitive_patterns = [
        r"(?:sk-|ghp_|AKIA)[A-Za-z0-9]{20,}",  # API key patterns
        r"\b\d{3}-\d{2}-\d{4}\b",  # SSN-like patterns
    ]

    filtered = text
    for pattern in sensitive_patterns:
        filtered = re.sub(pattern, "[REDACTED]", filtered)

    if filtered != text:
        return OutputCheckResult(
            safe=True,
            filtered_content=filtered,
            reason="Potentially sensitive content was redacted",
        )

    return OutputCheckResult(safe=True, filtered_content=text)
```

### Applying Guardrails in the Request Pipeline

```python
@app.post("/ask", response_model=AnswerResponse)
async def ask_question(request: QuestionRequest) -> AnswerResponse:
    # Input guardrail
    input_check = check_input(request.question)
    if not input_check.allowed:
        raise HTTPException(status_code=400, detail=input_check.reason)

    # ... retrieval and generation ...

    # Output guardrail
    output_check = check_output(raw_answer)
    if not output_check.safe:
        logger.warning("Output filtered", extra={"reason": output_check.reason})

    return AnswerResponse(
        answer=output_check.filtered_content,
        # ... other fields
    )
```

***

## Caching Strategies

LLM calls are expensive and slow. Caching can dramatically reduce both cost and latency.

### Embedding Cache

Embeddings are deterministic — the same input always produces the same output. I cache them aggressively:

```python
# src/ai_engineer/cache/embedding_cache.py
import hashlib
import json
from pathlib import Path


class EmbeddingCache:
    """File-based cache for embeddings.

    Simple but effective for single-server deployments.
    For multi-server, use Redis or a shared cache.
    """

    def __init__(self, cache_dir: str = ".cache/embeddings") -> None:
        self._dir = Path(cache_dir)
        self._dir.mkdir(parents=True, exist_ok=True)

    def _key(self, text: str, model: str) -> str:
        content = f"{model}:{text}"
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, text: str, model: str) -> list[float] | None:
        path = self._dir / f"{self._key(text, model)}.json"
        if path.exists():
            return json.loads(path.read_text())
        return None

    def set(self, text: str, model: str, embedding: list[float]) -> None:
        path = self._dir / f"{self._key(text, model)}.json"
        path.write_text(json.dumps(embedding))
```

### Response Cache for Repeated Queries

Some questions are asked frequently. I cache responses with a TTL:

```python
# src/ai_engineer/cache/response_cache.py
import hashlib
import time
from dataclasses import dataclass


@dataclass
class CachedResponse:
    answer: str
    sources: list[dict]
    cached_at: float
    ttl_seconds: float


class ResponseCache:
    """In-memory cache for LLM responses with TTL."""

    def __init__(self, default_ttl: float = 3600) -> None:
        self._cache: dict[str, CachedResponse] = {}
        self._default_ttl = default_ttl

    def _key(self, question: str) -> str:
        # Normalize question for cache matching
        normalized = question.strip().lower()
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, question: str) -> CachedResponse | None:
        key = self._key(question)
        cached = self._cache.get(key)
        if cached is None:
            return None
        if time.time() - cached.cached_at > cached.ttl_seconds:
            del self._cache[key]
            return None
        return cached

    def set(
        self,
        question: str,
        answer: str,
        sources: list[dict],
        ttl: float | None = None,
    ) -> None:
        key = self._key(question)
        self._cache[key] = CachedResponse(
            answer=answer,
            sources=sources,
            cached_at=time.time(),
            ttl_seconds=ttl or self._default_ttl,
        )

    @property
    def size(self) -> int:
        return len(self._cache)
```

### Cache Impact

After adding caching to my RAG service, the numbers improved significantly:

| Metric         | Before Cache | After Cache |
| -------------- | ------------ | ----------- |
| Avg latency    | 1,230 ms     | 340 ms      |
| P95 latency    | 2,890 ms     | 1,450 ms    |
| Daily cost     | $0.42        | $0.18       |
| Cache hit rate | —            | 38%         |

The 38% cache hit rate comes from repeated or near-identical questions. In a knowledge-base system, people tend to ask about the same topics.

***

## When to Use a Framework vs Building From Scratch

This is a decision I've gone back and forth on. Here's where I've landed:

### Frameworks I've Used

* **LangChain**: Full-featured, lots of integrations. But the abstractions add complexity and make debugging harder. When my RAG pipeline broke, I spent more time understanding LangChain's internals than the actual problem.
* **LlamaIndex**: Better for pure RAG use cases. The indexing abstractions are useful if your document formats are complex.
* **Plain httpx + FastAPI**: Maximum control, minimum magic. Everything is explicit and debuggable.

### My Decision Framework

```
Do you need advanced features that would take weeks to build?
  → Yes: Use a framework (fine-tuning orchestration, multi-modal, complex agents)
  → No: Continue

Do you need to deeply understand the behavior of your system?
  → Yes: Build from scratch (you'll debug it eventually anyway)
  → No: Continue

Is this a prototype or proof of concept?
  → Yes: Use a framework (speed matters)
  → No: Build from scratch (maintenance cost matters)
```

For my personal projects, I build from scratch. The code in this series uses only httpx, FastAPI, sentence-transformers, and SQLAlchemy. No LangChain, no LlamaIndex, no orchestration frameworks. I understand every line, and when something breaks, I know where to look.

For a team project with tight deadlines, I'd start with LangChain or LlamaIndex and migrate critical paths to custom code as the system matures.

***

## Deployment

My AI services deploy the same way as any other Python service:

### Dockerfile

```dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install uv
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Install dependencies
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev

# Copy application
COPY src/ src/

EXPOSE 8000

CMD ["uv", "run", "uvicorn", "ai_engineer.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

### docker-compose.yml for Production

```yaml
services:
  app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql+asyncpg://postgres:${DB_PASSWORD}@postgres:5432/ai_engineer
      - LLM_API_KEY=${LLM_API_KEY}
      - LLM_PROVIDER=github
      - LLM_MODEL=gpt-4o
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: ai_engineer
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

volumes:
  pgdata:
```

### Health Check Endpoint

```python
@app.get("/health")
async def health() -> dict:
    """Health check for load balancers and container orchestration."""
    # Check database connectivity
    try:
        async with async_session() as session:
            await session.execute(text("SELECT 1"))
        db_status = "ok"
    except Exception:
        db_status = "error"

    # Check LLM provider connectivity
    try:
        await app.state.llm.generate("ping", max_tokens=5, temperature=0)
        llm_status = "ok"
    except Exception:
        llm_status = "error"

    status = "ok" if db_status == "ok" and llm_status == "ok" else "degraded"

    return {
        "status": status,
        "database": db_status,
        "llm_provider": llm_status,
        "model": settings.llm_model,
    }
```

***

## What I'd Do Differently

Looking back at my AI engineering journey, here's what I'd change if I started over:

1. **Start with evaluation.** I built the system first and added evaluation later. Evaluation should come first — even a simple dataset of 10 questions with expected answers would have caught issues I spent hours debugging.
2. **Log everything from day one.** Every prompt, every response, every latency measurement. I lost valuable data from early usage because I didn't set up structured logging until later.
3. **Use `temperature=0` by default.** I wasted time debugging inconsistent outputs that were just sampling randomness. Start with deterministic outputs and add temperature only when you need variation.
4. **Build the provider abstraction early.** I hardcoded the OpenAI client in my first project. When I wanted to try Claude, I had to refactor everything. The Protocol-based abstraction from Part 2 takes 10 minutes to set up and saves hours later.
5. **Don't optimize prematurely.** My first RAG system embedded 50,000 chunks with a complex indexing pipeline. I could have started with 500 and learned 90% of what I needed at 1% of the complexity.

***

## The Complete Picture

Across this series, we've built every layer of an AI-powered system:

```
┌─────────────────────────────────────────────────────┐
│                   Production                         │
│  Observability │ Guardrails │ Caching │ Deployment   │  ← Part 8
├─────────────────────────────────────────────────────┤
│                   Evaluation                         │
│  Retrieval eval │ LLM-as-judge │ Regression tests   │  ← Part 7
├─────────────────────────────────────────────────────┤
│                   API Layer                          │
│  FastAPI │ Streaming │ Rate limiting │ Cost control  │  ← Part 6
├─────────────────────────────────────────────────────┤
│                   Prompt Engineering                 │
│  System prompts │ Templates │ Structured output      │  ← Part 5
├─────────────────────────────────────────────────────┤
│               Embeddings & Vector Search             │
│  sentence-transformers │ pgvector │ Semantic search  │  ← Part 4
├─────────────────────────────────────────────────────┤
│                   LLM Understanding                  │
│  Tokens │ Context windows │ Temperature │ Providers  │  ← Part 3
├─────────────────────────────────────────────────────┤
│                   Python Tooling                     │
│  uv │ FastAPI │ Pydantic │ Ruff │ mypy │ pytest     │  ← Part 2
├─────────────────────────────────────────────────────┤
│                   Foundations                        │
│  Role definition │ Skills map │ Learning path        │  ← Part 1
└─────────────────────────────────────────────────────┘
```

Every piece is something I've built, used, and debugged in my own projects. The code runs. The patterns work. The problems I discussed are problems I actually encountered.

AI engineering is still a young field. The tools, models, and best practices evolve fast. But the fundamentals — software engineering discipline, understanding your tools, testing what you build, and measuring what matters — don't change.

***

**Previous:** [**Part 7 — Evaluating and Testing AI Systems**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-engineer-101/part-7-evaluating-ai-systems)

**Back to:** [**AI Engineer 101 — Series Overview**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-engineer-101)
