# Article 6: Prompt Construction and the Generation Layer

## Introduction

Retrieval gives us the relevant chunks. Generation turns those chunks into a readable answer.

The generation layer has three jobs:

1. Assemble the retrieved chunks into a structured prompt
2. Call the LLM with that prompt
3. Return the response along with enough sourcing metadata for the caller to verify the answer

The difference between a RAG system that's trustworthy and one that hallucinates freely is almost entirely in how the prompt is constructed.

***

## Table of Contents

1. [The Prompt Template](#prompt-template)
2. [Context Window Budget Management](#context-budget)
3. [The LLM Client](#llm-client)
4. [Streaming Responses](#streaming)
5. [Source Attribution](#source-attribution)
6. [The Generation Response Model](#response-model)
7. [Full Generation Implementation](#full-implementation)
8. [What I Learned](#what-i-learned)

***

## The Prompt Template <a href="#prompt-template" id="prompt-template"></a>

The prompt has four components:

1. **System prompt**: Defines the LLM's role and the grounding constraint
2. **Context block**: The retrieved chunks, formatted with source labels
3. **User question**: The original query, exactly as typed
4. **Output instruction**: How to format the response

```python
# src/generation/prompt_builder.py
from src.retrieval.models import RetrievalResult, RetrievedChunk

SYSTEM_PROMPT = """\
You are a technical assistant for a personal knowledge base. 
Answer the user's question using ONLY the information provided in the context sections below.

Rules:
- If the context contains the answer, provide it clearly and concisely.
- If the context does not contain enough information to answer the question fully, say so explicitly. Do not guess or use knowledge outside the provided context.
- When relevant, cite the source document (e.g. "According to [file_path]...").
- Do not invent facts, configurations, commands, or code that is not present in the context.
- If multiple context sections are relevant, synthesize them into a coherent answer.
"""

def build_prompt(
    question: str,
    retrieval: RetrievalResult,
    max_context_tokens: int = 6000,
) -> tuple[str, list[RetrievedChunk]]:
    """
    Build the user message for the LLM.
    
    Returns:
        - The user message string
        - The list of chunks actually included (may be trimmed due to budget)
    """
    included_chunks = _trim_to_budget(retrieval.chunks, max_context_tokens)
    
    context_sections = []
    for i, chunk in enumerate(included_chunks):
        label = f"[{i+1}] {chunk.file_path}"
        if chunk.title:
            label += f" — {chunk.title}"
        context_sections.append(f"### Context {label}\n{chunk.content}")
    
    context_block = "\n\n".join(context_sections)
    
    user_message = f"""\
{context_block}

---

Question: {question}

Answer based on the context above. If citing a source, reference it by its label (e.g., [1], [2]).
"""
    
    return user_message, included_chunks

def _trim_to_budget(
    chunks: list[RetrievedChunk],
    max_tokens: int,
) -> list[RetrievedChunk]:
    """Keep top-ranked chunks that fit within the token budget."""
    included = []
    used_tokens = 0
    for chunk in chunks:
        if used_tokens + chunk.tokens > max_tokens:
            break
        included.append(chunk)
        used_tokens += chunk.tokens
    return included
```
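The trimming logic can be exercised standalone (using a minimal stand-in for `RetrievedChunk`):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    # minimal stand-in for RetrievedChunk — just the fields trimming needs
    content: str
    tokens: int

def trim_to_budget(chunks: list[Chunk], max_tokens: int) -> list[Chunk]:
    # same logic as _trim_to_budget above
    included, used = [], 0
    for chunk in chunks:
        if used + chunk.tokens > max_tokens:
            break
        included.append(chunk)
        used += chunk.tokens
    return included

chunks = [Chunk("a", 3000), Chunk("b", 2500), Chunk("c", 1000)]
kept = trim_to_budget(chunks, max_tokens=6000)
# "a" and "b" fit (5500 tokens); "c" would push the total to 6500, so it's cut
```

Note the greedy cut: the loop stops at the first chunk that doesn't fit rather than skipping it to try later, smaller ones. That preserves the retrieval ranking instead of bin-packing lower-ranked chunks into the budget.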

### Why the Grounding Constraint Matters

Without `"Answer using ONLY the information provided"`, the LLM blends retrieved content with its parametric knowledge (what it learned during training). For a personal knowledge base, this is a problem: the LLM might "helpfully" supplement a correct answer from the docs with outdated or incorrect general knowledge.

The explicit "do not use knowledge outside the provided context" instruction makes the system's knowledge boundary clear to both the LLM and the user.

***

## Context Window Budget Management <a href="#context-budget" id="context-budget"></a>

GPT-4o has a 128k token context window, which is large enough that I rarely hit it. But I still manage the budget explicitly because:

1. Cost — more tokens = higher API cost. I don't want to send 20,000 tokens of context when 3,000 suffice.
2. LLM accuracy — very long contexts increase the chance the LLM loses track of information in the middle.
3. Portability — smaller models (like Llama-3-8B locally) have 8k context limits. Staying within budget by default makes the code portable.

My budget allocation:

| Slot              | Budget               |
| ----------------- | -------------------- |
| System prompt     | \~300 tokens (fixed) |
| Retrieved context | 6,000 tokens         |
| User question     | \~100 tokens         |
| Response (output) | \~1,500 tokens       |
| **Total**         | **\~8,000 tokens**   |

This fits within any current model's context window while leaving room for verbose answers.
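The allocation can be sanity-checked with a rough chars-per-token heuristic (a sketch — for exact counts you'd run the model's actual tokenizer, e.g. tiktoken):

```python
def approx_tokens(text: str) -> int:
    # rough heuristic: ~4 characters per token for English text
    # (the answer_tokens estimate later in this article uses the same ratio)
    return max(1, len(text) // 4)

# the budget slots from the table above
budget = {"system": 300, "context": 6000, "question": 100, "response": 1500}
total = sum(budget.values())  # 7900 — comfortably inside an 8k window
```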

***

## The LLM Client <a href="#llm-client" id="llm-client"></a>

Same GitHub Models API client pattern as in the RCA engine from the AIOps series:

```python
# src/generation/llm_client.py
from openai import AsyncOpenAI
import os

class LLMClient:
    def __init__(self, model: str = "gpt-4o"):
        self.model = model
        self._client = AsyncOpenAI(
            base_url="https://models.inference.ai.azure.com",
            api_key=os.environ["GITHUB_TOKEN"],
        )
    
    async def complete(
        self,
        system_prompt: str,
        user_message: str,
        temperature: float = 0.1,
        max_tokens: int = 1500,
    ) -> str:
        response = await self._client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user",   "content": user_message},
            ],
            temperature=temperature,
            max_tokens=max_tokens,
        )
        return response.choices[0].message.content or ""
```

`temperature=0.1` is intentional for a knowledge base query — I want deterministic, fact-based answers rather than creative variation. This is a lookup tool, not a creative writing assistant.

***

## Streaming Responses <a href="#streaming" id="streaming"></a>

For the HTTP API, I support streaming so the client starts seeing words before the full response is assembled. This matters noticeably at \~5–10 seconds of LLM latency — a streaming response feels interactive; a 10-second blank wait feels broken.

```python
# src/generation/llm_client.py (streaming method, added to LLMClient)
from typing import AsyncIterator

async def stream(
    self,
    system_prompt: str,
    user_message: str,
    temperature: float = 0.1,
    max_tokens: int = 1500,
) -> AsyncIterator[str]:
    """
    Yield text chunks as they arrive from the LLM.
    Caller is responsible for assembling the full response if needed.
    """
    stream = await self._client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user",   "content": user_message},
        ],
        temperature=temperature,
        max_tokens=max_tokens,
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```

The FastAPI endpoint wraps this in a `StreamingResponse` (covered in Article 7).
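On the consumer side, the caller forwards each delta as it arrives and assembles the full answer at the end. A sketch, with a fake async generator standing in for `LLMClient.stream`:

```python
import asyncio
from typing import AsyncIterator

async def fake_stream() -> AsyncIterator[str]:
    # stand-in for LLMClient.stream(...) — yields text deltas as they arrive
    for delta in ["To configure ", "TLS on an ", "ingress..."]:
        yield delta

async def consume() -> str:
    parts = []
    async for delta in fake_stream():
        parts.append(delta)   # a real caller would also forward each delta here
    return "".join(parts)     # full answer, assembled by the caller

answer = asyncio.run(consume())
```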

***

## Source Attribution <a href="#source-attribution" id="source-attribution"></a>

The API response includes the sources that were used to generate the answer. This lets the caller (or the UI) show "based on: \[link to article]".

```python
# src/generation/prompt_builder.py
from dataclasses import dataclass

@dataclass
class Source:
    index: int        # [1], [2], etc. — matches the label in the prompt
    file_path: str
    title: str | None
    similarity: float
    chunk_id: int

def extract_sources(chunks: list[RetrievedChunk]) -> list[Source]:
    return [
        Source(
            index=i + 1,
            file_path=chunk.file_path,
            title=chunk.title,
            similarity=chunk.similarity,
            chunk_id=chunk.chunk_id,
        )
        for i, chunk in enumerate(chunks)
    ]
```

The caller can then display something like:

```
Answer: To configure TLS on a Kubernetes ingress, you need a cert-manager
issuer and an ingress annotation...

Sources:
  [1] artificial-intelligence/aiops-101/README.md (similarity: 0.87)
  [2] architecture-and-patterns/software-architecture-101/... (similarity: 0.74)
```

***

## The Generation Response Model <a href="#response-model" id="response-model"></a>

```python
# src/generation/models.py
from dataclasses import dataclass, field
from src.generation.prompt_builder import Source

@dataclass
class GenerationResult:
    question:       str
    answer:         str
    sources:        list[Source]
    model_used:     str
    retrieval_strategy: str
    chunks_used:    int
    context_tokens: int
    answer_tokens:  int    # Approximate
    latency_ms:     float
```

This gets serialized directly to the JSON response body.
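Since `GenerationResult` is a plain dataclass, one route to a JSON body is `dataclasses.asdict` plus `json.dumps` (sketched here with a trimmed-down stand-in; `asdict` recurses into nested dataclasses, so the `sources` list serializes along with everything else):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Mini:
    # tiny stand-in for GenerationResult, just to show the mechanics
    question: str
    answer: str
    chunks_used: int

body = json.dumps(asdict(Mini("q?", "a.", 3)))
```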

***

## Full Generation Implementation <a href="#full-implementation" id="full-implementation"></a>

```python
# src/generation/generator.py
import asyncio
import time
from src.retrieval.models import RetrievalResult
from src.generation.prompt_builder import build_prompt, extract_sources, SYSTEM_PROMPT
from src.generation.llm_client import LLMClient
from src.generation.models import GenerationResult
import structlog

log = structlog.get_logger()

class Generator:
    def __init__(self, llm: LLMClient, max_context_tokens: int = 6000):
        self.llm = llm
        self.max_context_tokens = max_context_tokens
    
    async def generate(
        self,
        retrieval: RetrievalResult,
    ) -> GenerationResult:
        start = time.monotonic()
        
        if not retrieval.chunks:
            # No relevant context found — tell the user rather than hallucinating
            return GenerationResult(
                question=retrieval.query,
                answer="I couldn't find relevant information in the knowledge base to answer this question. Try rephrasing or checking if the topic has been documented.",
                sources=[],
                model_used=self.llm.model,
                retrieval_strategy=retrieval.strategy,
                chunks_used=0,
                context_tokens=0,
                answer_tokens=0,
                latency_ms=0.0,
            )
        
        user_message, included_chunks = build_prompt(
            question=retrieval.query,
            retrieval=retrieval,
            max_context_tokens=self.max_context_tokens,
        )
        
        log.info(
            "generation.start",
            chunks=len(included_chunks),
            context_tokens=sum(c.tokens for c in included_chunks),
        )
        
        answer = await self.llm.complete(
            system_prompt=SYSTEM_PROMPT,
            user_message=user_message,
        )
        
        latency_ms = (time.monotonic() - start) * 1000
        sources = extract_sources(included_chunks)
        
        log.info(
            "generation.complete",
            chunks_used=len(included_chunks),
            answer_len=len(answer),
            latency_ms=latency_ms,
        )
        
        return GenerationResult(
            question=retrieval.query,
            answer=answer,
            sources=sources,
            model_used=self.llm.model,
            retrieval_strategy=retrieval.strategy,
            chunks_used=len(included_chunks),
            context_tokens=sum(c.tokens for c in included_chunks),
            answer_tokens=int(len(answer) * 0.25),  # rough estimate: ~4 chars per token
            latency_ms=latency_ms,
        )
```

### The "No Context" Path

When no relevant chunks are retrieved (all similarity scores below threshold), the generator returns an explicit "not found" response instead of proceeding without context.

Without this guard, the LLM would receive zero context but still produce a confident-sounding answer — drawn entirely from its training data. For a knowledge base tool, a "not found" is more honest and more useful than a hallucinated answer.

***

## What I Learned <a href="#what-i-learned" id="what-i-learned"></a>

**"Do not use knowledge outside the provided context" works, but imperfectly.** LLMs will still sometimes blend in training knowledge, especially when the question is about something the model knows very well (like basic Python syntax). For those cases the answer is usually correct, but it's unverifiable. I added a debug mode that includes instruction violations in the response — flagging when the LLM cites something not in the retrieved context.
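That debug check can be sketched roughly as follows (a simplified version — it only catches `[n]` citation labels that don't correspond to any included source, not fabricated content inside the prose):

```python
import re

def citation_violations(answer: str, num_sources: int) -> list[int]:
    # flag [n] citations in the answer that don't match any included source
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)

citation_violations("See [1] and [4].", num_sources=2)  # → [4]
```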

**Low temperature doesn't mean low quality.** I was worried that `temperature=0.1` would make answers sound robotic. In practice, for factual technical questions the answers are clear and well-phrased. The LLM is reasoning from provided text, not generating creative output — temperature has less effect when the answer is constrained by context.

**The "no context" message should explain what to try next.** My original "I couldn't find relevant information" response was unhelpful. Adding "try rephrasing" and "check if the topic has been documented" gives the user something actionable. I also log the failed query so I know which gaps to fill in the knowledge base.

**Context ordering affects answer quality.** I order chunks by similarity score descending (highest similarity first) when assembling the context block. This puts the most relevant information early in the context, where LLM attention is strongest. Reversing the order measurably degraded answer quality on my test set — the LLM weighted later context more heavily even when earlier context was more relevant.
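The ordering itself is a one-line sort (shown with a minimal stand-in chunk type):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    # minimal stand-in — only the fields the sort needs
    chunk_id: int
    similarity: float

chunks = [Chunk(7, 0.61), Chunk(3, 0.87), Chunk(5, 0.74)]
ordered = sorted(chunks, key=lambda c: c.similarity, reverse=True)
ids = [c.chunk_id for c in ordered]  # → [3, 5, 7]
```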

***

**Next**: [Article 7 — Wrapping Everything in a FastAPI Service](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/rag-101/rag-101-fastapi-service)
