Article 7: Wrapping Everything in a FastAPI Service

Introduction

The previous six articles introduced RAG and covered each component in isolation: pgvector setup, chunking, embeddings, retrieval, and generation. This article assembles them into a running FastAPI service.

The service exposes three endpoint groups:

  • /ingest — submit files or directories for ingestion

  • /query — submit a question and receive a grounded answer

  • /health — service status and dependency checks

The service is async end to end: file loading, database access, embedding calls, and LLM calls. The application uses FastAPI's lifespan to start the embedding background worker and to shut it down cleanly.


Table of Contents

  • Application Structure
  • Configuration and Settings
  • FastAPI Lifespan and Dependency Injection
  • Ingestion Endpoints
  • Query Endpoint with Streaming
  • Health Endpoint
  • Running the Service
  • Docker Compose Setup
  • What I Learned
  • Wrapping Up the Series

Application Structure
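A plausible layout, with one module per concern mirroring the earlier articles; the directory and file names below are illustrative rather than the article's exact tree:

```
app/
├── main.py          # FastAPI app, lifespan, router wiring
├── config.py        # settings loaded from the environment
├── db.py            # async engine, AsyncSessionLocal, get_db
├── routes/
│   ├── ingest.py
│   ├── query.py
│   └── health.py
└── services/
    ├── chunking.py
    ├── embeddings.py
    ├── retrieval.py
    └── generation.py
```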


Configuration and Settings
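As a sketch, the settings can be a frozen dataclass read from the environment; the field names and defaults below are illustrative stand-ins for the service's real config (a pydantic-settings class would work the same way):

```python
import os
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Settings:
    # Async SQLAlchemy DSN for the pgvector database
    database_url: str = field(
        default_factory=lambda: os.getenv(
            "DATABASE_URL", "postgresql+asyncpg://rag:rag@localhost:5432/rag"
        )
    )
    # Local sentence-transformers model used by the embedding worker
    embedding_model: str = field(
        default_factory=lambda: os.getenv(
            "EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"
        )
    )
    # Token for the GitHub Models API used at generation time
    github_models_token: str = field(
        default_factory=lambda: os.getenv("GITHUB_TOKEN", "")
    )

settings = Settings()
```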


FastAPI Lifespan and Dependency Injection

The lifespan function creates all shared objects at startup and tears them down on shutdown:

Dependency Injection for Database Sessions

Routes use Depends(get_db) to get a session scoped to the request lifetime.
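The pattern looks like this; AsyncSessionLocal below is a stand-in for SQLAlchemy's async_sessionmaker(engine), stubbed so the sketch is self-contained:

```python
from contextlib import asynccontextmanager

class FakeSession:
    # Stand-in for an AsyncSession; only tracks whether it was closed.
    closed = False
    async def close(self) -> None:
        self.closed = True

@asynccontextmanager
async def AsyncSessionLocal():
    session = FakeSession()
    try:
        yield session
    finally:
        await session.close()  # always closed when the scope ends

async def get_db():
    # FastAPI runs this async generator once per request: the yield
    # hands the session to the route, and the context manager's cleanup
    # runs after the response, closing the session.
    async with AsyncSessionLocal() as session:
        yield session
```

Routes then declare `db = Depends(get_db)` and never manage session lifetime themselves.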


Ingestion Endpoints


Query Endpoint with Streaming

Example Query Request and Response
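A hypothetical request and the shape of the SSE stream it produces (payload fields and token contents are illustrative):

```
POST /query HTTP/1.1
Content-Type: application/json

{"question": "How does hybrid retrieval combine vector and full-text results?"}

HTTP/1.1 200 OK
content-type: text/event-stream
x-accel-buffering: no

data: {"token": "Reciprocal "}

data: {"token": "rank "}

data: {"token": "fusion..."}

data: [DONE]
```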


Health Endpoint


Running the Service

Development
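For development, uvicorn with auto-reload is enough (the module path is illustrative):

```shell
uvicorn app.main:app --reload --port 8000
```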

Environment Variables
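A .env along these lines covers the essentials; the variable names are examples, not the article's exact set:

```shell
# .env — values are examples
DATABASE_URL=postgresql+asyncpg://rag:rag@localhost:5432/rag
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
GITHUB_TOKEN=ghp_...
```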


Docker Compose Setup
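A sketch of the two-service setup; image tags and credentials are examples:

```yaml
services:
  db:
    image: pgvector/pgvector:pg16   # PostgreSQL 16 with pgvector baked in
    environment:
      POSTGRES_USER: rag
      POSTGRES_PASSWORD: rag
      POSTGRES_DB: rag
    volumes:
      - pgdata:/var/lib/postgresql/data
  app:
    build: .
    environment:
      DATABASE_URL: postgresql+asyncpg://rag:rag@db:5432/rag
    ports:
      - "8000:8000"
    depends_on:
      - db

volumes:
  pgdata:
```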

Pre-downloading the model at Docker build time avoids the 5-second download delay on the first request — important for keeping startup time predictable.
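One way to bake the model in, assuming sentence-transformers and the MiniLM model from article 4; instantiating the model once at build time caches the weights in the image:

```dockerfile
FROM python:3.12-slim
RUN pip install --no-cache-dir sentence-transformers
# Download the embedding model at build time, not on first request.
RUN python -c "from sentence_transformers import SentenceTransformer; \
    SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')"
```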


What I Learned

The streaming response header X-Accel-Buffering: no is essential behind nginx. Without it, nginx buffers SSE events and the client sees them in one burst at the end rather than as a stream. I lost two hours debugging "why doesn't streaming work" before finding this header.

Background tasks in FastAPI have a gotcha with database sessions. If a background task tries to reuse the request-scoped database session, it will fail because the session is closed when the response is sent. The run_ingestion function opens its own session via AsyncSessionLocal() instead of using the request session.
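Sketched, with AsyncSessionLocal again standing in for SQLAlchemy's async_sessionmaker(engine):

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def AsyncSessionLocal():
    # Stand-in session factory; the real one is async_sessionmaker(engine).
    yield object()

async def run_ingestion(path: str) -> None:
    # This runs after the response is sent, so the request-scoped session
    # is already closed. Open a fresh session scoped to this task instead.
    async with AsyncSessionLocal() as session:
        ...  # load, chunk, and insert rows using `session`
```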

/ingest/stats is more useful than I expected. I check pending_embeddings regularly — it tells me how far behind the embedding worker is. If the worker is struggling (high pending count that's not decreasing), it's usually because the local model is CPU-bound and I have 20+ files queued. Adding a metric for this (PENDING_EMBEDDINGS = Gauge(...)) was a natural next step.
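The gauge, sketched with prometheus_client; the metric name and the record_pending helper are mine, not the article's:

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()
PENDING_EMBEDDINGS = Gauge(
    "rag_pending_embeddings",
    "Chunks waiting for the embedding worker",
    registry=registry,
)

def record_pending(count: int) -> None:
    # The worker calls this after each poll of the pending count.
    PENDING_EMBEDDINGS.set(count)
```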

The /docs FastAPI auto-docs page is genuinely useful for a personal tool. Since it's just me using this service, the Swagger UI at /docs serves as both documentation and a testing interface. I don't need a separate frontend for most tasks.


Wrapping Up the Series

This completes the RAG 101 series. In seven articles, we've built a complete RAG system from scratch:

  1. What is RAG? — Motivation and end-to-end concept

  2. pgvector Setup — PostgreSQL extension, schema, HNSW index

  3. Chunking Strategies — Markdown-aware splitting with sentence fallback

  4. Embeddings — Local and API-based embedding, batching, background worker

  5. Retrieval — Vector search, full-text search, RRF hybrid fusion

  6. Generation — Prompt construction, grounding constraint, streaming

  7. FastAPI Service (this article) — Putting it all together

The full stack: Python 3.12, FastAPI, PostgreSQL 16, pgvector, sentence-transformers, SQLAlchemy 2 async, GitHub Models API.
