Part 6: Building AI-Powered APIs with FastAPI

From Script to Service

Every AI project I've built started as a Python script. A few httpx.post() calls, some print() statements, and a proof of concept that runs once. The hard part was always the same: turning that script into a service that other systems (or users) can call reliably.

FastAPI is where I do that work. It gives me async by default (essential for I/O-bound LLM calls), automatic request/response validation via Pydantic, and OpenAPI documentation without extra effort. This article walks through the patterns I use to build AI-powered APIs — from endpoint design to streaming responses and cost control.


Designing AI Endpoints

The first question for any AI-powered endpoint: what's the contract? LLM responses are non-deterministic, but the API contract should be predictable.

Here's the core endpoint from my RAG service:

# src/ai_engineer/main.py
import time
from contextlib import asynccontextmanager
from collections.abc import AsyncGenerator

from fastapi import FastAPI, HTTPException

from ai_engineer.config import settings
from ai_engineer.db.engine import init_db, close_db
from ai_engineer.models import QuestionRequest, AnswerResponse
from ai_engineer.retrieval.search import semantic_search
from ai_engineer.prompts.templates import PromptBuilder, RAGPromptInput
from ai_engineer.llm.factory import create_llm_provider


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
    await init_db()
    app.state.llm = create_llm_provider()
    yield
    await close_db()


app = FastAPI(title="AI Engineer Service", lifespan=lifespan)


@app.post("/ask", response_model=AnswerResponse)
async def ask_question(request: QuestionRequest) -> AnswerResponse:
    """Answer a question using RAG: retrieve context, then generate."""
    start = time.monotonic()

    # Step 1: Retrieve relevant context
    chunks = await semantic_search(
        query=request.question,
        top_k=5,
        min_similarity=0.3,
    )

    if not chunks:
        return AnswerResponse(
            answer="I don't have information about this in my knowledge base.",
            sources=[],
            model=settings.llm_model,
            tokens_used=0,
            latency_ms=0,
        )

    # Step 2: Build prompt
    prompt_input = RAGPromptInput(
        question=request.question,
        context_chunks=chunks,
    )
    messages = PromptBuilder.build_rag_prompt(prompt_input)

    # Step 3: Generate answer
    llm = app.state.llm
    answer = await llm.generate(
        messages[-1]["content"],
        max_tokens=request.max_tokens,
        temperature=0.1,
    )

    elapsed_ms = (time.monotonic() - start) * 1000

    # Step 4: Build response
    sources = [
        {"title": c["title"], "content_preview": c["content"][:200], "similarity_score": c["similarity"]}
        for c in chunks
    ]

    return AnswerResponse(
        answer=answer,
        sources=sources,
        model=settings.llm_model,
        tokens_used=0,  # Will track properly in Part 8
        latency_ms=round(elapsed_ms, 1),
    )

The Request/Response Contract

Things I learned about AI API design:

  1. Always return metadata. model, tokens_used, and latency_ms are essential for debugging and cost tracking. Every response should tell the caller what happened behind the scenes.

  2. Return sources separately. Don't embed source references in the answer text. Returning them as structured data lets the frontend render them however it wants.

  3. Set sensible defaults. max_tokens=512 is enough for most answers. Making the caller specify it every time is busywork.
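
The models behind that contract might look like this. A sketch only: the field names mirror the endpoint above, but the validation bounds (question length, token limits) are illustrative choices, not requirements:

```python
from pydantic import BaseModel, Field


class QuestionRequest(BaseModel):
    question: str = Field(min_length=1, max_length=2000)
    max_tokens: int = Field(default=512, ge=1, le=4096)  # sensible default, caller can override


class Source(BaseModel):
    title: str
    content_preview: str
    similarity_score: float


class AnswerResponse(BaseModel):
    answer: str
    sources: list[Source]      # structured sources, not embedded in the answer text
    model: str                 # metadata for debugging and cost tracking
    tokens_used: int
    latency_ms: float
```

Because these are Pydantic models, FastAPI rejects malformed requests with a 422 before any LLM call happens, so bad input never costs tokens.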


Streaming Responses with SSE

LLM responses can take several seconds, and returning nothing until generation finishes means the user stares at a loading spinner the whole time. Streaming with Server-Sent Events (SSE) gives the user feedback immediately:

Consuming the Stream (Client Side)

Output appears token by token:


Async Patterns for Concurrent LLM Calls

When a single request needs multiple independent LLM calls (e.g., generating an answer while also generating suggested follow-up questions from the same retrieved context), I use asyncio.gather to run them concurrently:
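
A sketch of the pattern, with fake_llm_call standing in for an awaitable provider call such as llm.generate. Because both calls depend only on the question, gather lets them overlap instead of running back to back:

```python
import asyncio


async def fake_llm_call(prompt: str, delay: float) -> str:
    # Stand-in for an awaitable LLM provider call.
    await asyncio.sleep(delay)
    return f"response to: {prompt}"


async def answer_with_followups(question: str) -> tuple[str, str]:
    # Both calls are independent, so total latency is ~max(delays), not their sum.
    answer, followups = await asyncio.gather(
        fake_llm_call(f"Answer: {question}", 0.2),
        fake_llm_call(f"Suggest follow-ups for: {question}", 0.2),
    )
    return answer, followups
```

Two 0.2-second calls finish in roughly 0.2 seconds total rather than 0.4, which is exactly the win you want when each real call takes a few seconds.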

Handling Multiple Independent Requests

When you need to embed multiple texts and search across multiple indexes:
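
One way to sketch the fan-out, with embed and search_index as stand-ins for the real embedding and vector-search calls. All embeddings run concurrently, then every (vector, index) pair runs concurrently:

```python
import asyncio


async def embed(text: str) -> list[float]:
    # Stand-in for an embeddings API call.
    await asyncio.sleep(0.05)
    return [float(len(text))]


async def search_index(index: str, vector: list[float]) -> list[str]:
    # Stand-in for a vector search against one named index.
    await asyncio.sleep(0.05)
    return [f"{index}:hit"]


async def multi_search(texts: list[str], indexes: list[str]) -> list[list[str]]:
    # Stage 1: embed every text concurrently.
    vectors = await asyncio.gather(*(embed(t) for t in texts))
    # Stage 2: fan each vector out across every index, also concurrently.
    return await asyncio.gather(
        *(search_index(ix, v) for v in vectors for ix in indexes)
    )
```

gather preserves argument order, so results line up with the (text, index) pairs you submitted and can be zipped back together deterministically.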


Rate Limiting and Cost Control

LLM APIs charge per token, and a runaway loop or burst of traffic can generate a surprising bill. I build rate limiting into every AI service:

Cost Tracking

I add a /metrics endpoint that returns the cost summary:


Retry Logic

LLM APIs fail. Networks time out. Rate limits get hit. Retry logic is essential:
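
A sketch of the retry helper: exponential backoff with jitter. TransientLLMError stands in for whichever exceptions your provider raises for timeouts, 429s, and 5xx responses; anything else should fail fast rather than retry:

```python
import asyncio
import random
from collections.abc import Awaitable, Callable


class TransientLLMError(Exception):
    """Stand-in for retryable failures: timeouts, 429s, 5xx responses."""


async def with_retries(
    call: Callable[[], Awaitable[str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
) -> str:
    """Retry an async call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except TransientLLMError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # 0.5s, 1s, 2s, ... plus jitter so concurrent clients don't retry in lockstep.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)
    raise RuntimeError("unreachable")
```

Libraries like tenacity package the same idea as a decorator; I show it inline here so the backoff math is visible.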


Putting It All Together

Here's the complete application wiring:


Key Takeaways

  1. Define contracts first. Pydantic request/response models are the API contract. Design them before writing the endpoint logic.

  2. Stream long responses. Users tolerate 5 seconds of streaming tokens. They don't tolerate 5 seconds of a loading spinner.

  3. Use asyncio.gather for independent operations. When you need multiple LLM calls or searches, run them concurrently.

  4. Build rate limiting and cost tracking from day one. These are harder to add later, and a runaway script calling your API can generate real costs.

  5. Retry with exponential backoff. LLM APIs are not 100% reliable. Handle transient failures gracefully.


Previous: Part 5 — Prompt Engineering for Production Systems

Next: Part 7 — Evaluating and Testing AI Systems
