Article 6: Prompt Construction and the Generation Layer
Introduction
Retrieval gives us the relevant chunks. Generation turns those chunks into a readable answer.
The generation layer has three jobs:
Assemble the retrieved chunks into a structured prompt
Call the LLM with that prompt
Return the response along with enough sourcing metadata for the caller to verify the answer
The difference between a RAG system that's trustworthy and one that hallucinates freely is almost entirely in how the prompt is constructed.
The Prompt Template
The prompt has four components:
System prompt: Defines the LLM's role and the grounding constraint
Context block: The retrieved chunks, formatted with source labels
User question: The original query, exactly as typed
Output instruction: How to format the response
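Assembled in order, the four components might look like this (a minimal sketch; `SYSTEM_PROMPT` and `build_prompt` are illustrative names, not the article's actual identifiers):

```python
# Sketch of the four-part prompt assembly. Names are illustrative.
SYSTEM_PROMPT = (
    "You are a personal knowledge-base assistant. Answer using ONLY "
    "the information provided in the context below. Do not use "
    "knowledge outside the provided context. If the context does not "
    "contain the answer, say so."
)

def build_prompt(chunks: list[dict], question: str) -> list[dict]:
    """Assemble system prompt, labeled context block, the user's
    question, and an output instruction into chat messages."""
    context_block = "\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}" for c in chunks
    )
    user_message = (
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\n\n"
        "Answer in concise Markdown and mention which sources you used."
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]
```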
Why the Grounding Constraint Matters
Without "Answer using ONLY the information provided", the LLM blends retrieved content with its parametric knowledge (what it learned during training). For a personal knowledge base, this is a problem: the LLM might "helpfully" supplement a correct answer from the docs with outdated or incorrect general knowledge.
The explicit "do not use knowledge outside the provided context" instruction makes the system's knowledge boundary clear to both the LLM and the user.
Context Window Budget Management
GPT-4o has a 128k token context window, which is large enough that I rarely hit it. But I still manage the budget explicitly because:
Cost: more tokens = higher API cost. I don't want to send 20,000 tokens of context when 3,000 suffice.
LLM accuracy: very long contexts increase the chance the LLM loses track of information in the middle.
Portability: smaller models (like Llama-3-8B locally) have 8k context limits. Staying within budget by default makes the code portable.
My budget allocation:
| Component | Budget |
| --- | --- |
| System prompt | ~300 tokens (fixed) |
| Retrieved context | 6,000 tokens |
| User question | ~100 tokens |
| Response (output) | ~1,500 tokens |
| **Total** | **~8,000 tokens** |
This fits within any current model's context window while leaving room for verbose answers.
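Enforcing the 6,000-token context allocation can be as simple as greedily keeping the top-ranked chunks until the budget is spent. A sketch (`trim_to_budget` is a hypothetical helper, and the 4-characters-per-token estimate stands in for a real tokenizer such as tiktoken):

```python
def trim_to_budget(chunks: list[dict], budget_tokens: int = 6000) -> list[dict]:
    """Keep the highest-similarity chunks until the context budget is
    spent. Chunks are assumed pre-sorted, best first. Token counts are
    approximated as len(text) // 4; a real implementation would use the
    model's tokenizer (e.g. tiktoken) for exact counts."""
    kept, used = [], 0
    for chunk in chunks:
        cost = len(chunk["text"]) // 4
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Stopping at the first chunk that would overflow (rather than skipping it and trying smaller later chunks) keeps the result a strict rank-order prefix, which matters for the context-ordering point discussed later.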
The LLM Client
This uses the same GitHub Models API client pattern as the RCA engine from the AIOps series.
temperature=0.1 is intentional for a knowledge base query: I want deterministic, fact-based answers rather than creative variation. This is a lookup tool, not a creative writing assistant.
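The call itself is thin. A sketch with the client injected so it can be swapped for a local model or a test double (`ask_llm` is an illustrative name; an OpenAI-style chat-completions interface is assumed):

```python
def ask_llm(client, messages: list[dict], model: str = "gpt-4o") -> str:
    """Single low-temperature completion call. `client` is any object
    exposing the OpenAI-style chat.completions.create interface, e.g.
    a client pointed at the GitHub Models endpoint."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.1,   # near-deterministic, fact-based answers
        max_tokens=1500,   # matches the ~1,500-token response budget
    )
    return response.choices[0].message.content
```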
Streaming Responses
For the HTTP API, I support streaming so the client starts seeing words before the full response is assembled. This matters noticeably at ~5-10 seconds of LLM latency: a streaming response feels interactive; a 10-second blank wait feels broken.
The FastAPI endpoint wraps this in a StreamingResponse (covered in Article 7).
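Streaming is the same completion call with `stream=True`, re-yielded as text deltas; a generator like this can be fed straight into a StreamingResponse. A sketch (illustrative names; OpenAI-style streaming events assumed):

```python
def stream_answer(client, messages: list[dict], model: str = "gpt-4o"):
    """Yield answer text incrementally as completion deltas arrive,
    so the caller can render words before the full answer exists."""
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.1,
        max_tokens=1500,
        stream=True,
    )
    for event in stream:
        delta = event.choices[0].delta.content
        if delta:  # role/usage events carry no text
            yield delta
```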
Source Attribution
The API response includes the sources that were used to generate the answer. This lets the caller (or the UI) display something like "based on: [link to article]".
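Collecting the attribution is a small step at the end of generation: de-duplicate the source labels of the chunks that made it into the context, preserving rank order (a sketch; `collect_sources` is an illustrative name):

```python
def collect_sources(chunks: list[dict]) -> list[str]:
    """Unique source labels from the used chunks, best-ranked first,
    ready to render as 'based on: ...' links."""
    sources: list[str] = []
    for chunk in chunks:
        if chunk["source"] not in sources:
            sources.append(chunk["source"])
    return sources
```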
The Generation Response Model
This gets serialized directly to the JSON response body.
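The model itself isn't shown here; below is a sketch of a response shape matching the description, using a stdlib dataclass (a FastAPI service would more likely declare this as a Pydantic model, which serializes to the same JSON):

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class GenerationResponse:
    """What the API returns: the answer plus enough sourcing metadata
    for the caller to verify it. Field names are illustrative."""
    answer: str
    sources: list[str] = field(default_factory=list)
    chunks_used: int = 0
    no_context: bool = False

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```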
Full Generation Implementation
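A condensed end-to-end sketch (illustrative names, an injected OpenAI-style client, and chunks assumed pre-sorted by similarity; the article's actual implementation is not reproduced here):

```python
NOT_FOUND = (
    "I couldn't find relevant information in the knowledge base. "
    "Try rephrasing the question, or check whether the topic has "
    "been documented yet."
)

def generate(client, chunks: list[dict], question: str) -> dict:
    """Assemble the prompt from retrieved chunks, call the LLM, and
    return the answer with sourcing metadata. Returns an explicit
    'not found' response when retrieval produced no usable context."""
    if not chunks:  # every similarity score fell below the threshold
        return {"answer": NOT_FOUND, "sources": [], "no_context": True}

    context = "\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}" for c in chunks
    )
    messages = [
        {"role": "system",
         "content": "Answer using ONLY the information provided in the context."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    response = client.chat.completions.create(
        model="gpt-4o", messages=messages, temperature=0.1, max_tokens=1500,
    )
    sources = list(dict.fromkeys(c["source"] for c in chunks))
    return {
        "answer": response.choices[0].message.content,
        "sources": sources,
        "no_context": False,
    }
```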
The "No Context" Path
When no relevant chunks are retrieved (all similarity scores below threshold), the generator returns an explicit "not found" response instead of proceeding without context.
Without this guard, the LLM would receive zero context but still produce a confident-sounding answer drawn entirely from its training data. For a knowledge base tool, a "not found" is more honest and more useful than a hallucinated answer.
What I Learned
"Do not use knowledge outside the provided context" works, but imperfectly. LLMs will still sometimes blend in training knowledge, especially when the question is about something the model knows very well (like basic Python syntax). For those cases the answer is usually correct, but it's unverifiable. I added a debug mode that includes instruction violations in the response β flagging when the LLM cites something not in the retrieved context.
Low temperature doesn't mean low quality. I was worried that temperature=0.1 would make answers sound robotic. In practice, for factual technical questions the answers are clear and well-phrased. The LLM is reasoning from provided text, not generating creative output; temperature has less effect when the answer is constrained by context.
The "no context" message should explain what to try next. My original "I couldn't find relevant information" response was unhelpful. Adding "try rephrasing" and "check if the topic has been documented" gives the user something actionable. I also log the failed query so I know which gaps to fill in the knowledge base.
Context ordering affects answer quality. I order chunks by similarity score descending (highest similarity first) when assembling the context block. This puts the most relevant information early in the context, where LLM attention is strongest. Reversing the order measurably degraded answer quality on my test set: the LLM weighted later context more heavily even when earlier context was more relevant.
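That ordering is a one-line sort during context assembly, assuming each retrieved chunk carries its similarity score (an illustrative helper, not the article's code):

```python
def order_for_context(retrieved_chunks: list[dict]) -> list[dict]:
    """Most relevant first: early context gets the strongest attention."""
    return sorted(retrieved_chunks, key=lambda c: c["score"], reverse=True)
```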
Next: Article 7 - Wrapping Everything in a FastAPI Service