Article 6: Prompt Construction and the Generation Layer

Introduction

Retrieval gives us the relevant chunks. Generation turns those chunks into a readable answer.

The generation layer has three jobs:

  1. Assemble the retrieved chunks into a structured prompt

  2. Call the LLM with that prompt

  3. Return the response along with enough sourcing metadata for the caller to verify the answer

The difference between a RAG system that's trustworthy and one that hallucinates freely is almost entirely in how the prompt is constructed.


The Prompt Template

The prompt has four components; a sketch of how they are assembled follows the list:

  1. System prompt: Defines the LLM's role and the grounding constraint

  2. Context block: The retrieved chunks, formatted with source labels

  3. User question: The original query, exactly as typed

  4. Output instruction: How to format the response

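Putting the four components together, here is a minimal sketch of the assembly. The identifiers (SYSTEM_PROMPT, build_prompt) and the chunk dictionary keys are illustrative, not necessarily the exact names in my code:

```python
# Assumption: each retrieved chunk is a dict with "text", "source", and "score" keys.
SYSTEM_PROMPT = (
    "You are an assistant for a personal knowledge base. "
    "Answer using ONLY the information provided in the context. "
    "Do not use knowledge outside the provided context. "
    "If the context does not contain the answer, say so."
)

def build_prompt(question: str, chunks: list[dict]) -> list[dict]:
    """Assemble the chat messages from the retrieved chunks and the user question."""
    # Context block: each chunk is labelled with its source so the answer can cite it.
    context = "\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}" for c in chunks
    )
    user = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        # Output instruction: answer from the context and name the sources used.
        "Answer based on the context above and mention which sources you used."
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]
```
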
Why the Grounding Constraint Matters

Without "Answer using ONLY the information provided", the LLM blends retrieved content with its parametric knowledge (what it learned during training). For a personal knowledge base, this is a problem: the LLM might "helpfully" supplement a correct answer from the docs with outdated or incorrect general knowledge.

The explicit "do not use knowledge outside the provided context" instruction makes the system's knowledge boundary clear to both the LLM and the user.


Context Window Budget Management

GPT-4o has a 128k token context window, which is large enough that I rarely hit it. But I still manage the budget explicitly because:

  1. Cost: more tokens = higher API cost. I don't want to send 20,000 tokens of context when 3,000 suffice.

  2. LLM accuracy: very long contexts increase the chance that the LLM loses track of information in the middle.

  3. Portability: smaller models (like Llama-3-8B running locally) have 8k context limits. Staying within budget by default makes the code portable.

My budget allocation:

| Slot | Budget |
| --- | --- |
| System prompt | ~300 tokens (fixed) |
| Retrieved context | 6,000 tokens |
| User question | ~100 tokens |
| Response (output) | ~1,500 tokens |
| Total | ~8,000 tokens |

This fits within any current model's context window while leaving room for verbose answers.
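
A sketch of how the context budget can be enforced, assuming tiktoken for token counting. The encoding choice is an approximation and the helper name is mine:

```python
import tiktoken

# cl100k_base is an approximation; the exact tokenizer depends on the model.
_enc = tiktoken.get_encoding("cl100k_base")

def fit_chunks_to_budget(chunks: list[dict], budget: int = 6_000) -> list[dict]:
    """Keep the highest-ranked chunks that fit within the retrieved-context budget."""
    kept, used = [], 0
    # Chunks arrive ordered by similarity (highest first), so truncation
    # drops the least relevant ones.
    for chunk in chunks:
        tokens = len(_enc.encode(chunk["text"]))
        if used + tokens > budget:
            break
        kept.append(chunk)
        used += tokens
    return kept
```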


The LLM Client

Same GitHub Models API client pattern as in the RCA engine from the AIOps series:

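A minimal sketch of that client, assuming the OpenAI SDK pointed at the GitHub Models endpoint. The endpoint URL, token variable, and model name are placeholders based on my reading of the docs, so check them against the current GitHub Models documentation:

```python
import os
from openai import OpenAI

# Endpoint, token variable, and model name are assumptions -- adjust as needed.
client = OpenAI(
    base_url="https://models.inference.ai.azure.com",
    api_key=os.environ["GITHUB_TOKEN"],
)

def ask_llm(messages: list[dict]) -> str:
    """Single non-streaming completion call."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.1,   # deterministic, fact-based answers
        max_tokens=1500,   # matches the response slot in the budget table
    )
    return response.choices[0].message.content
```
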
temperature=0.1 is intentional for a knowledge base query: I want deterministic, fact-based answers rather than creative variation. This is a lookup tool, not a creative writing assistant.


Streaming Responses

For the HTTP API, I support streaming so the client starts seeing words before the full response is assembled. This matters noticeably at ~5–10 seconds of LLM latency: a streaming response feels interactive; a 10-second blank wait feels broken.

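Roughly, the streaming variant sets stream=True and yields text deltas as they arrive. A sketch under the same client assumptions as above:

```python
from collections.abc import Iterator

def ask_llm_streaming(messages: list[dict]) -> Iterator[str]:
    """Yield the answer piece by piece as the model produces it."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.1,
        max_tokens=1500,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk carries no content
            yield delta
```
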
The FastAPI endpoint wraps this in a StreamingResponse (covered in Article 7).


Source Attribution

The API response includes the sources that were used to generate the answer. This lets the caller (or the UI) show "based on: [link to article]".

The caller can then display something like:

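One illustrative rendering (the note paths and scores here are made up):

```
Based on:
- notes/postgres-backups.md (similarity 0.84)
- notes/pg-dump-cron.md (similarity 0.79)
```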

The Generation Response Model

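A sketch of what that model might look like as a Pydantic class. The field names are my guess at the shape, not the exact schema:

```python
from pydantic import BaseModel

class SourceRef(BaseModel):
    source: str          # path or URL of the original document
    similarity: float    # retrieval score, exposed for transparency

class GenerationResponse(BaseModel):
    answer: str
    sources: list[SourceRef]
    context_found: bool  # False on the "no context" path described below
```
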
This gets serialized directly to the JSON response body.


Full Generation Implementation

The "No Context" Path

When no relevant chunks are retrieved (all similarity scores below threshold), the generator returns an explicit "not found" response instead of proceeding without context.

Without this guard, the LLM would receive zero context but still produce a confident-sounding answer drawn entirely from its training data. For a knowledge base tool, a "not found" is more honest and more useful than a hallucinated answer.
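
A condensed sketch of the whole generation step, with the guard up front. It reuses the helpers sketched earlier, and the wording of the fallback message is illustrative:

```python
def generate(question: str, chunks: list[dict]) -> GenerationResponse:
    # The retriever has already dropped chunks below the similarity threshold,
    # so an empty list means nothing relevant was found.
    if not chunks:
        return GenerationResponse(
            answer=(
                "I couldn't find relevant information in the knowledge base. "
                "Try rephrasing the question, or check whether the topic has "
                "been documented yet."
            ),
            sources=[],
            context_found=False,
        )

    chunks = fit_chunks_to_budget(chunks)      # stay within the 6,000-token context slot
    messages = build_prompt(question, chunks)  # system prompt + context + question
    answer = ask_llm(messages)                 # temperature=0.1 call shown earlier
    return GenerationResponse(
        answer=answer,
        sources=[SourceRef(source=c["source"], similarity=c["score"]) for c in chunks],
        context_found=True,
    )
```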


What I Learned

"Do not use knowledge outside the provided context" works, but imperfectly. LLMs will still sometimes blend in training knowledge, especially when the question is about something the model knows very well (like basic Python syntax). For those cases the answer is usually correct, but it's unverifiable. I added a debug mode that includes instruction violations in the response β€” flagging when the LLM cites something not in the retrieved context.

Low temperature doesn't mean low quality. I was worried that temperature=0.1 would make answers sound robotic. In practice, for factual technical questions the answers are clear and well-phrased. The LLM is reasoning from provided text, not generating creative output; temperature has less effect when the answer is constrained by context.

The "no context" message should explain what to try next. My original "I couldn't find relevant information" response was unhelpful. Adding "try rephrasing" and "check if the topic has been documented" gives the user something actionable. I also log the failed query so I know which gaps to fill in the knowledge base.

Context ordering affects answer quality. I order chunks by similarity score descending (highest similarity first) when assembling the context block. This puts the most relevant information early in the context, where LLM attention is strongest. Reversing the order measurably degraded answer quality on my test set: the LLM weighted later context more heavily even when earlier context was more relevant.


Next: Article 7 – Wrapping Everything in a FastAPI Service
