Article 6: Prompt Construction and the Generation Layer

Introduction

Retrieval gives us the relevant chunks. Generation turns those chunks into a readable answer.

The generation layer has three jobs:

  1. Assemble the retrieved chunks into a structured prompt

  2. Call the LLM with that prompt

  3. Return the response along with enough sourcing metadata for the caller to verify the answer

The difference between a RAG system that's trustworthy and one that hallucinates freely is almost entirely in how the prompt is constructed.


The Prompt Template

The prompt has four components; a sketch of how they are assembled follows the list:

  1. System prompt: Defines the LLM's role and the grounding constraint

  2. Context block: The retrieved chunks, formatted with source labels

  3. User question: The original query, exactly as typed

  4. Output instruction: How to format the response

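Putting the four components together, here is a minimal sketch of the assembly. The identifiers (SYSTEM_PROMPT, build_prompt) and the chunk dictionary keys are illustrative, not necessarily the exact names in my code:

```python
# Assumption: each retrieved chunk is a dict with "text", "source", and "score" keys.
SYSTEM_PROMPT = (
    "You are an assistant for a personal knowledge base. "
    "Answer using ONLY the information provided in the context. "
    "Do not use knowledge outside the provided context. "
    "If the context does not contain the answer, say so."
)

def build_prompt(question: str, chunks: list[dict]) -> list[dict]:
    """Assemble the chat messages from the retrieved chunks and the user question."""
    # Context block: each chunk is labelled with its source so the answer can cite it.
    context = "\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}" for c in chunks
    )
    user = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        # Output instruction: answer from the context and name the sources used.
        "Answer based on the context above and mention which sources you used."
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]
```
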
Why the Grounding Constraint Matters

Without "Answer using ONLY the information provided", the LLM blends retrieved content with its parametric knowledge (what it learned during training). For a personal knowledge base, this is a problem: the LLM might "helpfully" supplement a correct answer from the docs with outdated or incorrect general knowledge.

The explicit "do not use knowledge outside the provided context" instruction makes the system's knowledge boundary clear to both the LLM and the user.


Context Window Budget Management

GPT-4o has a 128k token context window, which is large enough that I rarely hit it. But I still manage the budget explicitly because:

  1. Cost: more tokens = higher API cost. I don't want to send 20,000 tokens of context when 3,000 suffice.

  2. LLM accuracy: very long contexts increase the chance that the LLM loses track of information in the middle.

  3. Portability: smaller models (like Llama-3-8B running locally) have 8k context limits. Staying within budget by default makes the code portable.

My budget allocation:

| Slot | Budget |
| --- | --- |
| System prompt | ~300 tokens (fixed) |
| Retrieved context | 6,000 tokens |
| User question | ~100 tokens |
| Response (output) | ~1,500 tokens |
| Total | ~8,000 tokens |

This fits within any current model's context window while leaving room for verbose answers.
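
A sketch of how the context budget can be enforced, assuming tiktoken for token counting. The encoding choice is an approximation and the helper name is mine:

```python
import tiktoken

# cl100k_base is an approximation; the exact tokenizer depends on the model.
_enc = tiktoken.get_encoding("cl100k_base")

def fit_chunks_to_budget(chunks: list[dict], budget: int = 6_000) -> list[dict]:
    """Keep the highest-ranked chunks that fit within the retrieved-context budget."""
    kept, used = [], 0
    # Chunks arrive ordered by similarity (highest first), so truncation
    # drops the least relevant ones.
    for chunk in chunks:
        tokens = len(_enc.encode(chunk["text"]))
        if used + tokens > budget:
            break
        kept.append(chunk)
        used += tokens
    return kept
```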


The LLM Client

Same GitHub Models API client pattern as in the RCA engine from the AIOps series:

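A minimal sketch of that client, assuming the OpenAI SDK pointed at the GitHub Models endpoint. The endpoint URL, token variable, and model name are placeholders based on my reading of the docs, so check them against the current GitHub Models documentation:

```python
import os
from openai import OpenAI

# Endpoint, token variable, and model name are assumptions -- adjust as needed.
client = OpenAI(
    base_url="https://models.inference.ai.azure.com",
    api_key=os.environ["GITHUB_TOKEN"],
)

def ask_llm(messages: list[dict]) -> str:
    """Single non-streaming completion call."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.1,   # deterministic, fact-based answers
        max_tokens=1500,   # matches the response slot in the budget table
    )
    return response.choices[0].message.content
```
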
temperature=0.1 is intentional for a knowledge base query: I want deterministic, fact-based answers rather than creative variation. This is a lookup tool, not a creative writing assistant.


Streaming Responses

For the HTTP API, I support streaming so the client starts seeing words before the full response is assembled. This matters noticeably at ~5–10 seconds of LLM latency: a streaming response feels interactive; a 10-second blank wait feels broken.

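Roughly, the streaming variant sets stream=True and yields text deltas as they arrive. A sketch under the same client assumptions as above:

```python
from collections.abc import Iterator

def ask_llm_streaming(messages: list[dict]) -> Iterator[str]:
    """Yield the answer piece by piece as the model produces it."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.1,
        max_tokens=1500,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk carries no content
            yield delta
```
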
The FastAPI endpoint wraps this in a StreamingResponse (covered in Article 7).


Source Attribution

The API response includes the sources that were used to generate the answer. This lets the caller (or the UI) show "based on: [link to article]".

The caller can then display something like:

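One illustrative rendering (the note paths and scores here are made up):

```
Based on:
- notes/postgres-backups.md (similarity 0.84)
- notes/pg-dump-cron.md (similarity 0.79)
```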

The Generation Response Model

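A sketch of what that model might look like as a Pydantic class. The field names are my guess at the shape, not the exact schema:

```python
from pydantic import BaseModel

class SourceRef(BaseModel):
    source: str          # path or URL of the original document
    similarity: float    # retrieval score, exposed for transparency

class GenerationResponse(BaseModel):
    answer: str
    sources: list[SourceRef]
    context_found: bool  # False on the "no context" path described below
```
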
This gets serialized directly to the JSON response body.


Full Generation Implementation

The "No Context" Path

When no relevant chunks are retrieved (all similarity scores below threshold), the generator returns an explicit "not found" response instead of proceeding without context.

Without this guard, the LLM would receive zero context but still produce a confident-sounding answer drawn entirely from its training data. For a knowledge base tool, a "not found" is more honest and more useful than a hallucinated answer.
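
A condensed sketch of the whole generation step, with the guard up front. It reuses the helpers sketched earlier, and the wording of the fallback message is illustrative:

```python
def generate(question: str, chunks: list[dict]) -> GenerationResponse:
    # The retriever has already dropped chunks below the similarity threshold,
    # so an empty list means nothing relevant was found.
    if not chunks:
        return GenerationResponse(
            answer=(
                "I couldn't find relevant information in the knowledge base. "
                "Try rephrasing the question, or check whether the topic has "
                "been documented yet."
            ),
            sources=[],
            context_found=False,
        )

    chunks = fit_chunks_to_budget(chunks)      # stay within the 6,000-token context slot
    messages = build_prompt(question, chunks)  # system prompt + context + question
    answer = ask_llm(messages)                 # temperature=0.1 call shown earlier
    return GenerationResponse(
        answer=answer,
        sources=[SourceRef(source=c["source"], similarity=c["score"]) for c in chunks],
        context_found=True,
    )
```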


What I Learned

"Do not use knowledge outside the provided context" works, but imperfectly. LLMs will still sometimes blend in training knowledge, especially when the question is about something the model knows very well (like basic Python syntax). For those cases the answer is usually correct, but it's unverifiable. I added a debug mode that includes instruction violations in the response β€” flagging when the LLM cites something not in the retrieved context.

Low temperature doesn't mean low quality. I was worried that temperature=0.1 would make answers sound robotic. In practice, for factual technical questions the answers are clear and well-phrased. The LLM is reasoning from provided text, not generating creative output; temperature has less effect when the answer is constrained by context.

The "no context" message should explain what to try next. My original "I couldn't find relevant information" response was unhelpful. Adding "try rephrasing" and "check if the topic has been documented" gives the user something actionable. I also log the failed query so I know which gaps to fill in the knowledge base.

Context ordering affects answer quality. I order chunks by similarity score descending (highest similarity first) when assembling the context block. This puts the most relevant information early in the context, where LLM attention is strongest. Reversing the order measurably degraded answer quality on my test set: the LLM weighted later context more heavily even when earlier context was more relevant.


Next: Article 7 – Wrapping Everything in a FastAPI Service
