# Part 5: Prompt Engineering for Production Systems

## Prompt Engineering Is Software Engineering

When I first heard "prompt engineering," I thought it was about cleverly phrasing questions to get better answers from ChatGPT. After building production systems that depend on LLM outputs, I've come to see it differently. Prompt engineering is designing the interface between your code and the language model — and it requires the same discipline as designing any other software interface.

A prompt in a production system isn't a one-off question. It's a template that runs thousands of times with different inputs. It needs to handle edge cases, produce consistent output formats, and degrade gracefully when the input is unexpected. The skills are less "creative writing" and more "API contract design."

This article covers the prompt engineering patterns I use in my own projects — from basic structure to Pydantic-validated templates and defensive techniques.

***

## The Anatomy of a Production Prompt

Every prompt I write for a production system has three layers:

```
┌─────────────────────────────────┐
│         System Prompt            │  ← Who the model is, rules, output format
├─────────────────────────────────┤
│         Context                  │  ← Retrieved documents, user data, history
├─────────────────────────────────┤
│         User Message             │  ← The actual question or instruction
└─────────────────────────────────┘
```

Here's how that looks in code:

```python
# A complete prompt for my RAG service
SYSTEM_PROMPT = """You are a technical documentation assistant. You answer questions
based ONLY on the provided context. Follow these rules:

1. If the context contains the answer, provide it with specific details.
2. If the context does not contain the answer, say "I don't have information about this
   in my knowledge base."
3. Never make up information that isn't in the context.
4. Reference the source document when possible.
5. Keep answers concise — aim for 2-4 paragraphs maximum.

Output format:
- Start with a direct answer to the question
- Follow with supporting details from the context
- End with source references if applicable"""


def build_rag_prompt(
    question: str,
    context_chunks: list[dict[str, str]],
) -> list[dict[str, str]]:
    """Build a complete prompt for RAG question-answering."""
    # Format retrieved context
    context_parts = []
    for i, chunk in enumerate(context_chunks, 1):
        context_parts.append(
            f"[Source {i}: {chunk['title']}]\n{chunk['content']}"
        )
    context_text = "\n\n---\n\n".join(context_parts)

    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": f"Context:\n{context_text}\n\nQuestion: {question}",
        },
    ]
```

### Why This Structure Matters

* **System prompt** sets behavioral constraints that apply to every request. I put output format rules here because they don't change between requests.
* **Context** is dynamic per request — different retrieved chunks, different user data. Separating it from the system prompt means I can cache the system prompt tokens.
* **User message** is the actual input. Keeping it last gives it the strongest attention from the model (the "recency effect" from Part 3).

***

## System Prompt Design

The system prompt is the most important part of a production prompt. It's the contract between your code and the model. Here are the patterns I've found effective:

### Pattern 1: Role + Rules + Format

```python
EXTRACTION_SYSTEM_PROMPT = """You are a structured data extraction system.

Your task: Extract specific fields from the provided text and return them as JSON.

Rules:
- Extract ONLY the fields specified in the schema below
- If a field is not present in the text, use null
- Do not infer or guess values that aren't explicitly stated
- Dates should be in ISO 8601 format (YYYY-MM-DD)
- Numbers should be numeric types, not strings

Output schema:
{
    "title": "string or null",
    "date": "ISO date string or null",
    "author": "string or null",
    "tags": ["array of strings"],
    "summary": "string, max 100 words"
}

Respond with ONLY the JSON object. No explanation, no markdown fences."""
```

### Pattern 2: Explicit Boundaries

When I built my DevOps monitoring agent, I needed the LLM to analyze alerts but not take dangerous actions:

````python
MONITORING_SYSTEM_PROMPT = """You are a Kubernetes monitoring assistant that analyzes
cluster alerts and suggests remediation steps.

You CAN:
- Analyze alert content and explain likely root causes
- Suggest kubectl commands for investigation
- Recommend remediation steps
- Explain Kubernetes concepts relevant to the alert

You CANNOT:
- Execute any commands directly
- Access the cluster directly
- Modify any resources
- Suggest destructive operations (delete pods, scale to 0, drain nodes)
  without explicit warnings

When suggesting commands, always format them as:
```bash
# Explanation of what this does
kubectl command here
````

Always start with investigation commands before suggesting changes."""

````

### Pattern 3: Few-Shot Examples

For tasks where the output format is critical, I include examples directly in the system prompt:

```python
CLASSIFICATION_SYSTEM_PROMPT = """You classify support messages into categories.

Categories: billing, technical, account, general

Examples:

Input: "I was charged twice for my subscription last month"
Output: {"category": "billing", "confidence": 0.95}

Input: "The API is returning 500 errors on the /users endpoint"
Output: {"category": "technical", "confidence": 0.92}

Input: "How do I change my password?"
Output: {"category": "account", "confidence": 0.88}

Input: "What are your business hours?"
Output: {"category": "general", "confidence": 0.90}

Respond with ONLY a JSON object containing "category" and "confidence" (0.0 to 1.0)."""
````

Few-shot examples are the most reliable way I've found to control output format. The model learns the pattern from examples better than from instructions alone.

***

## Prompt Templates with Pydantic

In my projects, I never construct prompts with raw string concatenation. Instead, I use Pydantic models to define and validate prompt templates:

```python
# src/ai_engineer/prompts/templates.py
from pydantic import BaseModel, Field


class RAGPromptInput(BaseModel):
    """Validated input for the RAG prompt template."""

    question: str = Field(..., min_length=1, max_length=2000)
    context_chunks: list[dict[str, str]] = Field(
        ..., min_length=1, max_length=10
    )
    max_context_tokens: int = Field(default=3000, ge=500, le=8000)


class ExtractionPromptInput(BaseModel):
    """Validated input for data extraction."""

    text: str = Field(..., min_length=10, max_length=10000)
    fields: list[str] = Field(..., min_length=1)
    output_format: str = Field(default="json")


class PromptBuilder:
    """Build validated prompts from templates."""

    @staticmethod
    def build_rag_prompt(input_data: RAGPromptInput) -> list[dict[str, str]]:
        """Build a RAG prompt with validated inputs."""
        context_parts = []
        total_chars = 0

        for i, chunk in enumerate(input_data.context_chunks, 1):
            chunk_text = f"[Source {i}: {chunk['title']}]\n{chunk['content']}"

            # Rough token estimate: 1 token ≈ 4 chars
            estimated_tokens = len(chunk_text) / 4
            if total_chars / 4 + estimated_tokens > input_data.max_context_tokens:
                break

            context_parts.append(chunk_text)
            total_chars += len(chunk_text)

        context_text = "\n\n---\n\n".join(context_parts)

        return [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Context:\n{context_text}\n\nQuestion: {input_data.question}",
            },
        ]

    @staticmethod
    def build_extraction_prompt(
        input_data: ExtractionPromptInput,
    ) -> list[dict[str, str]]:
        """Build a data extraction prompt with validated inputs."""
        fields_str = ", ".join(input_data.fields)

        return [
            {
                "role": "system",
                "content": f"Extract the following fields from the text: {fields_str}.\n"
                f"Return as {input_data.output_format}. Use null for missing fields.",
            },
            {"role": "user", "content": input_data.text},
        ]
```

### Why Pydantic for Prompts?

1. **Validation at construction time.** If someone passes an empty question or 50 context chunks, the model catches it before the LLM call. I've had cases where upstream bugs produced empty strings that created nonsensical prompts — Pydantic catches these immediately.
2. **Self-documenting.** The model fields tell you exactly what inputs the prompt expects, with constraints.
3. **Testable.** I can write unit tests for prompt construction without calling the LLM:

```python
# tests/test_prompts.py
import pytest
from pydantic import ValidationError
from ai_engineer.prompts.templates import RAGPromptInput, PromptBuilder


def test_rag_prompt_valid():
    input_data = RAGPromptInput(
        question="What is pgvector?",
        context_chunks=[
            {"title": "pgvector Setup", "content": "pgvector is a PostgreSQL extension..."}
        ],
    )
    messages = PromptBuilder.build_rag_prompt(input_data)

    assert len(messages) == 2
    assert messages[0]["role"] == "system"
    assert "pgvector" in messages[1]["content"]


def test_rag_prompt_empty_question():
    with pytest.raises(ValidationError):
        RAGPromptInput(
            question="",
            context_chunks=[{"title": "Test", "content": "Content"}],
        )


def test_rag_prompt_no_context():
    with pytest.raises(ValidationError):
        RAGPromptInput(
            question="What is pgvector?",
            context_chunks=[],
        )
```

***

## Structured Output — Getting JSON from LLMs

One of the most common tasks in AI engineering is getting the model to return structured data. I've gone through several approaches:

### Approach 1: Prompt Instructions (Fragile)

```python
# This works ~90% of the time — not enough for production
prompt = """Extract the person's name and email from this text.
Return as JSON with "name" and "email" fields.

Text: Contact John Smith at john@example.com for details."""
```

The model might return `{"name": "John Smith", "email": "john@example.com"}` — or it might wrap it in markdown fences, add an explanation, or use different field names.

### Approach 2: JSON Mode (Better)

Most API providers now support a JSON mode:

```python
import httpx

async def extract_structured(
    prompt: str,
    api_key: str,
) -> dict:
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            "https://models.inference.ai.azure.com/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": "gpt-4o",
                "messages": [{"role": "user", "content": prompt}],
                "response_format": {"type": "json_object"},
                "temperature": 0,
            },
        )
        response.raise_for_status()
        content = response.json()["choices"][0]["message"]["content"]
        return json.loads(content)
```

JSON mode guarantees valid JSON but doesn't guarantee the schema matches what you expect.

### Approach 3: Pydantic Validation on Output (My Preferred Approach)

I combine JSON mode with Pydantic validation to get reliable structured output:

````python
# src/ai_engineer/llm/structured.py
import json
from typing import TypeVar

from pydantic import BaseModel, ValidationError

from ai_engineer.llm.base import LLMProvider

T = TypeVar("T", bound=BaseModel)


async def generate_structured(
    provider: LLMProvider,
    prompt: str,
    response_model: type[T],
    *,
    max_retries: int = 2,
) -> T:
    """Generate a structured response validated against a Pydantic model.

    Retries with error feedback if validation fails.
    """
    # Include the schema in the prompt
    schema_str = json.dumps(response_model.model_json_schema(), indent=2)
    full_prompt = (
        f"{prompt}\n\n"
        f"Respond with a JSON object matching this schema:\n{schema_str}\n\n"
        f"Return ONLY the JSON object."
    )

    last_error = None
    for attempt in range(max_retries + 1):
        if attempt > 0 and last_error:
            # Add error feedback for retry
            full_prompt += f"\n\nPrevious attempt failed with: {last_error}\nPlease fix and try again."

        raw_response = await provider.generate(
            full_prompt,
            temperature=0.0,
            max_tokens=1024,
        )

        # Clean up common issues
        cleaned = raw_response.strip()
        if cleaned.startswith("```"):
            # Strip markdown code fences
            lines = cleaned.split("\n")
            cleaned = "\n".join(lines[1:-1])

        try:
            data = json.loads(cleaned)
            return response_model.model_validate(data)
        except (json.JSONDecodeError, ValidationError) as e:
            last_error = str(e)

    raise ValueError(
        f"Failed to get valid structured output after {max_retries + 1} attempts. "
        f"Last error: {last_error}"
    )
````

Usage:

```python
from pydantic import BaseModel, Field


class ExtractedContact(BaseModel):
    name: str = Field(..., min_length=1)
    email: str = Field(..., pattern=r"^[\w.+-]+@[\w-]+\.[\w.]+$")
    phone: str | None = None


# This will retry with error feedback if the model returns invalid email format
contact = await generate_structured(
    provider=llm_provider,
    prompt="Extract contact info from: Call John Smith at john@example.com or 555-0123",
    response_model=ExtractedContact,
)
print(contact)  # ExtractedContact(name='John Smith', email='john@example.com', phone='555-0123')
```

This pattern has been reliable in my production code. The retry with error feedback handles the \~5% of cases where the model's first attempt doesn't match the schema.

***

## Defensive Prompting

Production prompts need to handle adversarial and unexpected inputs. Here are the techniques I use:

### Input Sanitization

```python
# src/ai_engineer/prompts/sanitize.py
import re


def sanitize_user_input(text: str) -> str:
    """Sanitize user input before including it in a prompt.

    Prevents basic prompt injection attempts.
    """
    # Remove potential prompt injection patterns
    # These are heuristic — not a security boundary
    sanitized = text

    # Remove attempts to override system instructions
    injection_patterns = [
        r"ignore (?:all )?(?:previous |above )?instructions",
        r"you are now",
        r"new instructions:",
        r"system:\s",
        r"<\|(?:im_start|system)\|>",
    ]

    for pattern in injection_patterns:
        sanitized = re.sub(pattern, "[filtered]", sanitized, flags=re.IGNORECASE)

    # Truncate to reasonable length
    max_length = 2000
    if len(sanitized) > max_length:
        sanitized = sanitized[:max_length] + "... [truncated]"

    return sanitized
```

### Output Validation

Never trust LLM output. Always validate:

```python
async def ask_with_validation(
    question: str,
    context_chunks: list[dict],
    provider: LLMProvider,
) -> dict:
    """Ask a question with input sanitization and output validation."""
    # Sanitize input
    clean_question = sanitize_user_input(question)

    # Build prompt
    prompt_input = RAGPromptInput(
        question=clean_question,
        context_chunks=context_chunks,
    )
    messages = PromptBuilder.build_rag_prompt(prompt_input)

    # Generate response
    raw_answer = await provider.generate(
        messages[-1]["content"],
        temperature=0.1,
        max_tokens=1024,
    )

    # Validate output
    answer = raw_answer.strip()
    if not answer:
        answer = "I was unable to generate a response. Please try rephrasing your question."

    if len(answer) > 5000:
        answer = answer[:5000] + "\n\n[Response truncated]"

    return {
        "answer": answer,
        "question": clean_question,
        "sources_used": len(context_chunks),
    }
```

### Handling Edge Cases

Through my own testing, I've identified common edge cases that need handling:

```python
# Edge cases my RAG system handles

# 1. Question outside knowledge base
# System prompt instruction: "say 'I don't have information about this'"

# 2. Ambiguous question
# Handle by asking for clarification in the response

# 3. Question about multiple topics
# Retrieve context for the primary topic, note limitations

# 4. Very short question ("what?", "help")
# Return a helpful default: "Could you provide more detail about what you're looking for?"

# 5. Non-English input
# The system prompt doesn't restrict language, so the model handles it naturally
# But embeddings may be less accurate for non-English text

MIN_QUESTION_LENGTH = 3

def validate_question(question: str) -> str | None:
    """Validate a question before processing. Returns error message or None."""
    stripped = question.strip()
    if len(stripped) < MIN_QUESTION_LENGTH:
        return "Please provide a more detailed question (at least a few words)."
    if len(stripped) > 2000:
        return "Question is too long. Please keep it under 2000 characters."
    return None
```

***

## Prompt Versioning

In my projects, I version prompts the same way I version code. When I change a system prompt, I want to track what changed and measure the impact:

```python
# src/ai_engineer/prompts/registry.py
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str
    system_prompt: str
    description: str


# Registry of all prompt versions
PROMPTS: dict[str, PromptVersion] = {
    "rag-v1": PromptVersion(
        name="rag",
        version="v1",
        system_prompt="""You are a documentation assistant. Answer based on the provided context.
If the answer is not in the context, say so.""",
        description="Initial RAG prompt — simple and direct",
    ),
    "rag-v2": PromptVersion(
        name="rag",
        version="v2",
        system_prompt="""You are a technical documentation assistant. You answer questions
based ONLY on the provided context. Follow these rules:

1. If the context contains the answer, provide it with specific details.
2. If the context does not contain the answer, say "I don't have information about this
   in my knowledge base."
3. Never make up information that isn't in the context.
4. Reference the source document when possible.
5. Keep answers concise — aim for 2-4 paragraphs maximum.

Output format:
- Start with a direct answer to the question
- Follow with supporting details from the context
- End with source references if applicable""",
        description="Added explicit rules, output format, and boundary behavior",
    ),
}


def get_prompt(name: str, version: str | None = None) -> PromptVersion:
    """Get a prompt by name and optional version."""
    if version:
        key = f"{name}-{version}"
        if key not in PROMPTS:
            raise KeyError(f"Prompt '{key}' not found")
        return PROMPTS[key]

    # Get the latest version
    matching = [
        (k, v) for k, v in PROMPTS.items() if v.name == name
    ]
    if not matching:
        raise KeyError(f"No prompts found with name '{name}'")
    return max(matching, key=lambda x: x[1].version)[1]
```

The v1 → v2 change happened after I noticed the model was sometimes making up information when the context didn't contain the answer. Adding explicit rule #2 ("say I don't have information") cut hallucinations in my testing by about 80%.

***

## Key Takeaways

1. **System prompts are interfaces.** Design them with the same care as an API contract: define inputs, outputs, constraints, and error behavior.
2. **Use Pydantic for prompt inputs AND outputs.** Validation at both ends catches bugs before they become user-visible problems.
3. **Few-shot examples are more reliable than instructions.** When output format matters, show the model what you want rather than describing it.
4. **Never trust LLM output.** Always validate, truncate, and handle malformed responses gracefully.
5. **Version your prompts.** A prompt change can affect every response your system produces. Track changes and measure impact.
6. **Temperature=0 for structured output.** Deterministic decoding dramatically improves format consistency.

***

**Previous:** [**Part 4 — Embeddings and Vector Search**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-engineer-101/part-4-embeddings-and-vector-search)

**Next:** [**Part 6 — Building AI-Powered APIs with FastAPI**](https://blog.htunnthuthu.com/ai-and-machine-learning/artificial-intelligence/ai-engineer-101/part-6-building-ai-apis)
