Part 5: Prompt Engineering for Production Systems

Prompt Engineering Is Software Engineering

When I first heard "prompt engineering," I thought it was about cleverly phrasing questions to get better answers from ChatGPT. After building production systems that depend on LLM outputs, I've come to see it differently. Prompt engineering is designing the interface between your code and the language model, and it requires the same discipline as designing any other software interface.

A prompt in a production system isn't a one-off question. It's a template that runs thousands of times with different inputs. It needs to handle edge cases, produce consistent output formats, and degrade gracefully when the input is unexpected. The skills are less "creative writing" and more "API contract design."

This article covers the prompt engineering patterns I use in my own projects, from basic structure to Pydantic-validated templates and defensive techniques.


The Anatomy of a Production Prompt

Every prompt I write for a production system has three layers:

┌──────────────────────────────────┐
│         System Prompt            │  ← Who the model is, rules, output format
├──────────────────────────────────┤
│         Context                  │  ← Retrieved documents, user data, history
├──────────────────────────────────┤
│         User Message             │  ← The actual question or instruction
└──────────────────────────────────┘

Here's how that looks in code:

# A complete prompt for my RAG service
SYSTEM_PROMPT = """You are a technical documentation assistant. You answer questions
based ONLY on the provided context. Follow these rules:

1. If the context contains the answer, provide it with specific details.
2. If the context does not contain the answer, say "I don't have information about this
   in my knowledge base."
3. Never make up information that isn't in the context.
4. Reference the source document when possible.
5. Keep answers concise: aim for 2-4 paragraphs maximum.

Output format:
- Start with a direct answer to the question
- Follow with supporting details from the context
- End with source references if applicable"""


def build_rag_prompt(
    question: str,
    context_chunks: list[dict[str, str]],
) -> list[dict[str, str]]:
    """Build a complete prompt for RAG question-answering."""
    # Format retrieved context
    context_parts = []
    for i, chunk in enumerate(context_chunks, 1):
        context_parts.append(
            f"[Source {i}: {chunk['title']}]\n{chunk['content']}"
        )
    context_text = "\n\n---\n\n".join(context_parts)

    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": f"Context:\n{context_text}\n\nQuestion: {question}",
        },
    ]

Why This Structure Matters

  • System prompt sets behavioral constraints that apply to every request. I put output format rules here because they don't change between requests.

  • Context is dynamic per request β€” different retrieved chunks, different user data. Separating it from the system prompt means I can cache the system prompt tokens.

  • User message is the actual input. Keeping it last gives it the strongest attention from the model (the "recency effect" from Part 3).


System Prompt Design

The system prompt is the most important part of a production prompt. It's the contract between your code and the model. Here are the patterns I've found effective:

Pattern 1: Role + Rules + Format
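The RAG system prompt earlier in this article follows this shape. A compact second illustration of the same ordering (the reviewer persona and rules here are illustrative, not from a real project):

```python
# Role first, then rules, then output format - in that order.
REVIEW_SYSTEM_PROMPT = """You are a senior Python code reviewer.

Rules:
1. Comment only on the diff provided.
2. Flag correctness issues before style issues.
3. If the diff looks fine, say so explicitly.

Output format:
- One bullet per finding, ordered by severity
- Each bullet names the file, the line, the issue, and a suggested fix"""
```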

Pattern 2: Explicit Boundaries

When I built my DevOps monitoring agent, I needed the LLM to analyze alerts but not take dangerous actions:

MONITORING_SYSTEM_PROMPT = """You are a DevOps monitoring assistant. You analyze
alerts and recommend next steps. Follow these boundaries:

- You may suggest read-only investigation commands (viewing logs, checking
  service status, querying metrics).
- Never suggest state-changing actions (restarts, deletions, config changes)
  without marking them as requiring human approval.

Always start with investigation commands before suggesting changes."""

Pattern 3: Few-Shot Examples

Few-shot examples are the most reliable way I've found to control output format. The model learns the pattern from examples better than from instructions alone.
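A sketch of the idea: include worked input/output pairs as prior conversation turns before the real input (the alert-classification task and example contents here are illustrative):

```python
FEW_SHOT_MESSAGES = [
    {
        "role": "system",
        "content": "Classify the alert severity. Reply with exactly one word: "
                   "low, medium, or high.",
    },
    # Worked examples teach the output format more reliably than the
    # instruction alone.
    {"role": "user", "content": "Disk usage at 85% on a build agent."},
    {"role": "assistant", "content": "medium"},
    {"role": "user", "content": "Primary database is not accepting connections."},
    {"role": "assistant", "content": "high"},
]


def build_classification_prompt(alert_text: str) -> list[dict[str, str]]:
    """Append the real alert after the worked examples."""
    return FEW_SHOT_MESSAGES + [{"role": "user", "content": alert_text}]
```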


Prompt Templates with Pydantic

In my projects, I never construct prompts with raw string concatenation. Instead, I use Pydantic models to define and validate prompt templates:
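A sketch of the idea, assuming Pydantic v2 (field names mirror the RAG builder above; the length limits are illustrative):

```python
from pydantic import BaseModel, Field

SYSTEM_PROMPT = "You are a technical documentation assistant."  # abbreviated


class ContextChunk(BaseModel):
    title: str = Field(min_length=1)
    content: str = Field(min_length=1)


class RAGPromptInput(BaseModel):
    """Validated inputs for the RAG question-answering prompt."""

    question: str = Field(min_length=3, max_length=2_000)
    context_chunks: list[ContextChunk] = Field(min_length=1, max_length=10)

    def to_messages(self) -> list[dict[str, str]]:
        # Construction only happens on inputs that passed validation.
        context_text = "\n\n---\n\n".join(
            f"[Source {i}: {chunk.title}]\n{chunk.content}"
            for i, chunk in enumerate(self.context_chunks, 1)
        )
        return [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Context:\n{context_text}\n\nQuestion: {self.question}",
            },
        ]
```

An empty question or an out-of-range chunk list now raises a ValidationError at construction time, long before any tokens are spent.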

Why Pydantic for Prompts?

  1. Validation at construction time. If someone passes an empty question or 50 context chunks, the model catches it before the LLM call. I've had cases where upstream bugs produced empty strings that created nonsensical prompts; Pydantic catches these immediately.

  2. Self-documenting. The model fields tell you exactly what inputs the prompt expects, with constraints.

  3. Testable. I can write unit tests for prompt construction without calling the LLM:
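For example (the builder is repeated in abbreviated form so the test file stands alone; in the real project it would be imported from the prompts module):

```python
# test_prompts.py - prompt construction is tested without any LLM call.
SYSTEM_PROMPT = "You are a technical documentation assistant."  # abbreviated


def build_rag_prompt(question, context_chunks):
    context_text = "\n\n---\n\n".join(
        f"[Source {i}: {chunk['title']}]\n{chunk['content']}"
        for i, chunk in enumerate(context_chunks, 1)
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context_text}\n\nQuestion: {question}"},
    ]


def test_roles_are_system_then_user():
    messages = build_rag_prompt(
        "How do I deploy?",
        [{"title": "Guide", "content": "Run make deploy."}],
    )
    assert [m["role"] for m in messages] == ["system", "user"]


def test_every_chunk_appears_in_the_context():
    chunks = [{"title": "A", "content": "alpha"}, {"title": "B", "content": "beta"}]
    body = build_rag_prompt("q?", chunks)[1]["content"]
    assert "[Source 1: A]" in body and "alpha" in body
    assert "[Source 2: B]" in body and "beta" in body
```

Run with pytest; a prompt-construction regression then fails in CI before it ever reaches the model.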


Structured Output β€” Getting JSON from LLMs

One of the most common tasks in AI engineering is getting the model to return structured data. I've gone through several approaches:

Approach 1: Prompt Instructions (Fragile)
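The instructions-only approach looks something like this (the prompt wording is illustrative):

```python
# The fragile approach: describe the JSON shape in the instructions and hope.
EXTRACTION_PROMPT = """Extract the contact information from the text below.
Return a JSON object with exactly two fields: "name" and "email".
Return ONLY the JSON object. No explanation, no markdown fences.

Text: {text}"""

prompt = EXTRACTION_PROMPT.format(text="Reach John Smith at john@example.com.")
```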

The model might return {"name": "John Smith", "email": "john@example.com"}, or it might wrap it in markdown fences, add an explanation, or use different field names.

Approach 2: JSON Mode (Better)

Most API providers now support a JSON mode:
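With the OpenAI chat completions API, for example, the flag is the response_format parameter, shown here as a plain kwargs dict (the model name is illustrative; note that JSON mode requires the word "JSON" to appear somewhere in the messages):

```python
# Keyword arguments for client.chat.completions.create(**request).
request = {
    "model": "gpt-4o-mini",  # illustrative model name
    "temperature": 0,        # deterministic decoding for structured output
    "response_format": {"type": "json_object"},
    "messages": [
        {
            "role": "system",
            "content": "Extract contact info as a JSON object with "
                       '"name" and "email" fields.',
        },
        {"role": "user", "content": "Reach John Smith at john@example.com."},
    ],
}
```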

JSON mode guarantees valid JSON but doesn't guarantee the schema matches what you expect.

Approach 3: Pydantic Validation on Output (My Preferred Approach)

I combine JSON mode with Pydantic validation to get reliable structured output:
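A sketch of the pattern, assuming Pydantic v2; the helper takes the LLM call as a plain callable so it can be exercised here with a stub (the ContactInfo schema and the retry count are illustrative):

```python
from typing import Callable

from pydantic import BaseModel, ValidationError


class ContactInfo(BaseModel):
    name: str
    email: str


def extract_structured(
    messages: list[dict[str, str]],
    schema: type[BaseModel],
    call_llm: Callable[[list[dict[str, str]]], str],
    max_retries: int = 2,
) -> BaseModel:
    """Validate the model's JSON reply against `schema`; on failure, retry
    with the validation error fed back so the model can correct itself."""
    for attempt in range(max_retries + 1):
        raw = call_llm(messages)
        try:
            # Raises ValidationError on malformed JSON or a schema mismatch.
            return schema.model_validate_json(raw)
        except ValidationError as exc:
            if attempt == max_retries:
                raise
            messages = messages + [
                {"role": "assistant", "content": raw},
                {
                    "role": "user",
                    "content": f"That response failed validation:\n{exc}\n"
                               "Return only the corrected JSON object.",
                },
            ]
    raise RuntimeError("unreachable")


# Demo with a stubbed "LLM" that returns an invalid reply first, then a valid one.
_replies = iter([
    '{"name": "John Smith"}',                               # missing "email"
    '{"name": "John Smith", "email": "john@example.com"}',  # valid
])
demo = extract_structured(
    [{"role": "user", "content": "Extract the contact info as JSON."}],
    ContactInfo,
    lambda messages: next(_replies),
)
```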

Usage:
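The validation step on its own looks like this (ContactInfo is an illustrative schema; in Pydantic v2, model_validate_json raises ValidationError on both malformed JSON and schema mismatches):

```python
from pydantic import BaseModel, ValidationError


class ContactInfo(BaseModel):
    name: str
    email: str


raw = '{"name": "John Smith", "email": "john@example.com"}'
contact = ContactInfo.model_validate_json(raw)

# A reply that parses as JSON but misses a field still fails validation,
# which is what triggers the retry-with-feedback path.
try:
    ContactInfo.model_validate_json('{"name": "John Smith"}')
    needs_retry = False
except ValidationError:
    needs_retry = True
```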

This pattern has been reliable in my production code. The retry with error feedback handles the ~5% of cases where the model's first attempt doesn't match the schema.


Defensive Prompting

Production prompts need to handle adversarial and unexpected inputs. Here are the techniques I use:

Input Sanitization
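A minimal sketch of the cleanup I apply before user text is interpolated into a prompt (the length cap is illustrative, and this is not by itself a defense against prompt injection):

```python
import re

MAX_INPUT_CHARS = 4_000  # illustrative budget for user text inside the prompt


def sanitize_input(text: str) -> str:
    """Best-effort cleanup of user text before prompt interpolation."""
    # Drop control characters that can corrupt the prompt or the logs.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Collapse runaway whitespace so padding can't blow the token budget.
    text = re.sub(r"\s+", " ", text).strip()
    # Hard cap on length: anything past this is truncated, not rejected.
    return text[:MAX_INPUT_CHARS]
```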

Output Validation

Never trust LLM output. Always validate:
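A sketch of the basic checks for a free-text reply (the length cap and fence handling are illustrative):

```python
def validate_answer(raw: str, max_chars: int = 4_000) -> str:
    """Basic checks on a free-text model reply before it reaches users."""
    if raw is None or not raw.strip():
        raise ValueError("empty model response")
    answer = raw.strip()
    # Models sometimes wrap answers in markdown fences even when told not to.
    if answer.startswith("```") and answer.endswith("```"):
        answer = answer[3:-3].strip()
    # Cap length so a runaway generation can't flood downstream consumers.
    return answer[:max_chars]
```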

Handling Edge Cases

Through my own testing, I've identified common edge cases that need handling:


Prompt Versioning

In my projects, I version prompts the same way I version code. When I change a system prompt, I want to track what changed and measure the impact:
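One lightweight way to sketch this is an in-code registry with a changelog per version (the prompt texts here are abbreviated, and the structure is illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    version: str
    changelog: str
    text: str


SYSTEM_PROMPTS = {
    "v1": PromptVersion(
        version="v1",
        changelog="Initial RAG system prompt.",
        text="You are a technical documentation assistant. "
             "Answer from the provided context.",
    ),
    "v2": PromptVersion(
        version="v2",
        changelog="Added explicit 'I don't have information' rule "
                  "to reduce hallucinations.",
        text="You are a technical documentation assistant. Answer ONLY from "
             "the provided context. If the context does not contain the "
             'answer, say "I don\'t have information about this in my '
             'knowledge base."',
    ),
}

ACTIVE_VERSION = "v2"


def get_system_prompt() -> str:
    # The active version is logged alongside each response, so any output
    # can be traced back to the exact prompt that produced it.
    return SYSTEM_PROMPTS[ACTIVE_VERSION].text
```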

The v1 → v2 change happened after I noticed the model was sometimes making up information when the context didn't contain the answer. Adding explicit rule #2 ("say I don't have information") cut hallucinations in my testing by about 80%.


Key Takeaways

  1. System prompts are interfaces. Design them with the same care as an API contract: define inputs, outputs, constraints, and error behavior.

  2. Use Pydantic for prompt inputs AND outputs. Validation at both ends catches bugs before they become user-visible problems.

  3. Few-shot examples are more reliable than instructions. When output format matters, show the model what you want rather than describing it.

  4. Never trust LLM output. Always validate, truncate, and handle malformed responses gracefully.

  5. Version your prompts. A prompt change can affect every response your system produces. Track changes and measure impact.

  6. Temperature=0 for structured output. Deterministic decoding dramatically improves format consistency.


Previous: Part 4: Embeddings and Vector Search

Next: Part 6: Building AI-Powered APIs with FastAPI
