Article 7: Wrapping Everything in a FastAPI Service

Introduction

The previous six articles introduced RAG and covered each component in isolation: pgvector setup, chunking, embeddings, retrieval, and generation. This article assembles them into a running FastAPI service.

The service exposes three endpoint groups:

  • /ingest — submit files or directories for ingestion

  • /query — submit a question and receive a grounded answer

  • /health — service status and dependency checks

The service is async end to end: file loading, database access, embedding calls, and LLM calls. The application uses FastAPI's lifespan to start the embedding background worker and to shut it down cleanly.


Table of Contents

  • Application Structure
  • Configuration and Settings
  • FastAPI Lifespan and Dependency Injection
  • Ingestion Endpoints
  • Query Endpoint with Streaming
  • Health Endpoint
  • Running the Service
  • Docker Compose Setup
  • What I Learned
  • Wrapping Up the Series

Application Structure
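A plausible layout, with one module per concern mirroring the earlier articles; the directory and file names below are illustrative rather than the article's exact tree:

```
app/
├── main.py          # FastAPI app, lifespan, router wiring
├── config.py        # settings loaded from the environment
├── db.py            # async engine, AsyncSessionLocal, get_db
├── routes/
│   ├── ingest.py
│   ├── query.py
│   └── health.py
└── services/
    ├── chunking.py
    ├── embeddings.py
    ├── retrieval.py
    └── generation.py
```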


Configuration and Settings
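As a sketch, the settings can be a frozen dataclass read from the environment; the field names and defaults below are illustrative stand-ins for the service's real config (a pydantic-settings class would work the same way):

```python
import os
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Settings:
    # Async SQLAlchemy DSN for the pgvector database
    database_url: str = field(
        default_factory=lambda: os.getenv(
            "DATABASE_URL", "postgresql+asyncpg://rag:rag@localhost:5432/rag"
        )
    )
    # Local sentence-transformers model used by the embedding worker
    embedding_model: str = field(
        default_factory=lambda: os.getenv(
            "EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"
        )
    )
    # Token for the GitHub Models API used at generation time
    github_models_token: str = field(
        default_factory=lambda: os.getenv("GITHUB_TOKEN", "")
    )

settings = Settings()
```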


FastAPI Lifespan and Dependency Injection

The lifespan function creates all shared objects at startup and tears them down on shutdown:

Dependency Injection for Database Sessions

Routes use Depends(get_db) to get a session scoped to the request lifetime.
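The pattern looks like this; AsyncSessionLocal below is a stand-in for SQLAlchemy's async_sessionmaker(engine), stubbed so the sketch is self-contained:

```python
from contextlib import asynccontextmanager

class FakeSession:
    # Stand-in for an AsyncSession; only tracks whether it was closed.
    closed = False
    async def close(self) -> None:
        self.closed = True

@asynccontextmanager
async def AsyncSessionLocal():
    session = FakeSession()
    try:
        yield session
    finally:
        await session.close()  # always closed when the scope ends

async def get_db():
    # FastAPI runs this async generator once per request: the yield
    # hands the session to the route, and the context manager's cleanup
    # runs after the response, closing the session.
    async with AsyncSessionLocal() as session:
        yield session
```

Routes then declare `db = Depends(get_db)` and never manage session lifetime themselves.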


Ingestion Endpoints


Query Endpoint with Streaming

Example Query Request and Response
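A hypothetical request and the shape of the SSE stream it produces (payload fields and token contents are illustrative):

```
POST /query HTTP/1.1
Content-Type: application/json

{"question": "How does hybrid retrieval combine vector and full-text results?"}

HTTP/1.1 200 OK
content-type: text/event-stream
x-accel-buffering: no

data: {"token": "Reciprocal "}

data: {"token": "rank "}

data: {"token": "fusion..."}

data: [DONE]
```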


Health Endpoint


Running the Service

Development
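For development, uvicorn with auto-reload is enough (the module path is illustrative):

```shell
uvicorn app.main:app --reload --port 8000
```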

Environment Variables
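A .env along these lines covers the essentials; the variable names are examples, not the article's exact set:

```shell
# .env — values are examples
DATABASE_URL=postgresql+asyncpg://rag:rag@localhost:5432/rag
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
GITHUB_TOKEN=ghp_...
```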


Docker Compose Setup
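A sketch of the two-service setup; image tags and credentials are examples:

```yaml
services:
  db:
    image: pgvector/pgvector:pg16   # PostgreSQL 16 with pgvector baked in
    environment:
      POSTGRES_USER: rag
      POSTGRES_PASSWORD: rag
      POSTGRES_DB: rag
    volumes:
      - pgdata:/var/lib/postgresql/data
  app:
    build: .
    environment:
      DATABASE_URL: postgresql+asyncpg://rag:rag@db:5432/rag
    ports:
      - "8000:8000"
    depends_on:
      - db

volumes:
  pgdata:
```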

Pre-downloading the model at Docker build time avoids the 5-second download delay on the first request — important for keeping startup time predictable.
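One way to bake the model in, assuming sentence-transformers and the MiniLM model from article 4; instantiating the model once at build time caches the weights in the image:

```dockerfile
FROM python:3.12-slim
RUN pip install --no-cache-dir sentence-transformers
# Download the embedding model at build time, not on first request.
RUN python -c "from sentence_transformers import SentenceTransformer; \
    SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')"
```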


What I Learned

The streaming response header X-Accel-Buffering: no is essential behind nginx. Without it, nginx buffers SSE events and the client sees them in one burst at the end rather than as a stream. I lost two hours debugging "why doesn't streaming work" before finding this header.

Background tasks in FastAPI have a gotcha with database sessions. If a background task tries to reuse the request-scoped database session, it will fail because the session is closed when the response is sent. The run_ingestion function opens its own session via AsyncSessionLocal() instead of using the request session.
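Sketched, with AsyncSessionLocal again standing in for SQLAlchemy's async_sessionmaker(engine):

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def AsyncSessionLocal():
    # Stand-in session factory; the real one is async_sessionmaker(engine).
    yield object()

async def run_ingestion(path: str) -> None:
    # This runs after the response is sent, so the request-scoped session
    # is already closed. Open a fresh session scoped to this task instead.
    async with AsyncSessionLocal() as session:
        ...  # load, chunk, and insert rows using `session`
```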

/ingest/stats is more useful than I expected. I check pending_embeddings regularly — it tells me how far behind the embedding worker is. If the worker is struggling (high pending count that's not decreasing), it's usually because the local model is CPU-bound and I have 20+ files queued. Adding a metric for this (PENDING_EMBEDDINGS = Gauge(...)) was a natural next step.
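The gauge, sketched with prometheus_client; the metric name and the record_pending helper are mine, not the article's:

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()
PENDING_EMBEDDINGS = Gauge(
    "rag_pending_embeddings",
    "Chunks waiting for the embedding worker",
    registry=registry,
)

def record_pending(count: int) -> None:
    # The worker calls this after each poll of the pending count.
    PENDING_EMBEDDINGS.set(count)
```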

The /docs FastAPI auto-docs page is genuinely useful for a personal tool. Since it's just me using this service, the Swagger UI at /docs serves as both documentation and a testing interface. I don't need a separate frontend for most tasks.


Wrapping Up the Series

This completes the RAG 101 series. In seven articles, we've built a complete RAG system from scratch:

  1. What is RAG? — Motivation and end-to-end concept

  2. pgvector Setup — PostgreSQL extension, schema, HNSW index

  3. Chunking Strategies — Markdown-aware splitting with sentence fallback

  4. Embeddings — Local and API-based embedding, batching, background worker

  5. Retrieval — Vector search, full-text search, RRF hybrid fusion

  6. Generation — Prompt construction, grounding constraint, streaming

  7. FastAPI Service (this article) — Putting it all together

The full stack: Python 3.12, FastAPI, PostgreSQL 16, pgvector, sentence-transformers, SQLAlchemy 2 async, GitHub Models API.
