Part 4: Production Patterns and Best Practices

Part of the LLM API Development 101 Series

My $2000 Lesson

Deployed my document analysis API on Friday. Everything tested perfectly. Went home feeling accomplished.

Monday morning: $2000 Claude API bill for the weekend.

What happened? No caching. Same documents analyzed repeatedly. Users kept refreshing the page, retriggering full analysis every time.

Implementing Redis caching reduced costs by 75%. Same functionality, better performance, way less money.

Production is different from testing. Let me share what I learned the expensive way.

Caching Strategies

Caching is the #1 cost optimization for LLM applications.

Basic Response Caching

import hashlib
import json
import logging

logger = logging.getLogger(__name__)

def generate_cache_key(messages: list, model: str, temperature: float) -> str:
    """Generate a deterministic cache key from request parameters."""

    # Create deterministic string
    cache_data = {
        "messages": messages,
        "model": model,
        "temperature": temperature
    }

    # Hash for a compact key
    data_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(data_str.encode()).hexdigest()

# Simple in-memory cache (per-process only)
response_cache: dict = {}

async def cached_claude_call(messages: list, model: str, temperature: float):
    """Call Claude with caching.

    Assumes call_claude_async (your async Claude wrapper) is defined elsewhere.
    """

    # Generate cache key
    cache_key = generate_cache_key(messages, model, temperature)

    # Check cache
    if cache_key in response_cache:
        logger.info(f"Cache hit: {cache_key[:8]}")
        return response_cache[cache_key]

    # Call API on a miss and store the result
    logger.info(f"Cache miss: {cache_key[:8]}")
    response = await call_claude_async(messages, model, temperature)
    response_cache[cache_key] = response

    return response

Simple but effective - saved me thousands of dollars.

Redis Caching (Production)

This is my production setup - works across multiple servers.
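
A stripped-down sketch of that setup, assuming redis-py's asyncio client, a local Redis URL, a one-hour TTL, and the generate_cache_key / call_claude_async helpers from above (plus the assumption that the wrapper returns something JSON-serializable):

import json

import redis.asyncio as redis

# Assumption: a local Redis instance and a one-hour TTL - adjust for your deployment
redis_client = redis.from_url("redis://localhost:6379", decode_responses=True)
CACHE_TTL_SECONDS = 3600

async def redis_cached_claude_call(messages: list, model: str, temperature: float):
    """Call Claude with a Redis-backed cache shared across servers."""

    cache_key = f"claude:{generate_cache_key(messages, model, temperature)}"

    # Check Redis first - any server in the fleet can serve the hit
    cached = await redis_client.get(cache_key)
    if cached is not None:
        logger.info(f"Redis cache hit: {cache_key[:16]}")
        return json.loads(cached)

    # Miss: call the API and store the serialized response with a TTL
    logger.info(f"Redis cache miss: {cache_key[:16]}")
    response = await call_claude_async(messages, model, temperature)
    await redis_client.set(cache_key, json.dumps(response), ex=CACHE_TTL_SECONDS)

    return response

The TTL matters: stale answers are fine for document analysis, less so for anything time-sensitive.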

Semantic Caching

Cache similar queries, not just exact matches:
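
Off-the-shelf options exist (GPTCache is one), but the core idea fits in a short sketch: embed the latest user message and reuse a cached response when a previous query is similar enough. Here, embed_text is a hypothetical stand-in for whatever embedding model or API you use, and the 0.92 threshold is something to tune per workload:

import numpy as np

SIMILARITY_THRESHOLD = 0.92

# Each entry: (embedding vector, cached response)
semantic_cache: list[tuple[np.ndarray, dict]] = []

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

async def semantic_cached_claude_call(messages: list, model: str, temperature: float):
    """Return a cached response for semantically similar queries."""

    # Embed the latest user message (embed_text = your embedding model or API)
    query = messages[-1]["content"]
    query_embedding = np.array(await embed_text(query))

    # Look for a close-enough previous query
    for cached_embedding, cached_response in semantic_cache:
        similarity = cosine_similarity(query_embedding, cached_embedding)
        if similarity >= SIMILARITY_THRESHOLD:
            logger.info(f"Semantic cache hit (similarity={similarity:.3f})")
            return cached_response

    # No similar query cached: call the API and remember the result
    response = await call_claude_async(messages, model, temperature)
    semantic_cache.append((query_embedding, response))
    return response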

Semantic caching boosted my cache hit rate from 40% to 75%.

Circuit Breakers

Protect your app when Claude API has issues.

Basic Circuit Breaker

Saved my app during Claude API outages - fails fast instead of hanging.
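
Here's a minimal sketch of the pattern; the failure threshold and recovery timeout are assumptions to tune, and call_claude_async is the wrapper used throughout:

import time

class CircuitBreaker:
    """Fail fast after repeated failures, then retry after a cooldown."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.opened_at: float | None = None

    def _is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # After the cooldown, let one trial call through (half-open)
        return time.monotonic() - self.opened_at < self.recovery_timeout

    async def call(self, func, *args, **kwargs):
        if self._is_open():
            raise RuntimeError("Circuit open: Claude API marked unavailable")

        try:
            result = await func(*args, **kwargs)
        except Exception:
            # Count the failure and open the circuit once we hit the threshold
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            # Success resets the breaker
            self.failure_count = 0
            self.opened_at = None
            return result

claude_breaker = CircuitBreaker()

# Usage:
# response = await claude_breaker.call(call_claude_async, messages, model, temperature)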

Production Circuit Breaker with Fallback

Users get graceful degradation instead of errors.
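
Building on the CircuitBreaker sketch above, here is one way to wire in the fallback - the canned response is illustrative, and in a real app you might serve a cached or truncated result instead:

FALLBACK_RESPONSE = {
    "content": "Analysis is temporarily unavailable. Please try again in a few minutes.",
    "degraded": True,
}

async def claude_call_with_fallback(messages: list, model: str, temperature: float):
    """Try the real API through the breaker; degrade gracefully on failure."""
    try:
        return await claude_breaker.call(call_claude_async, messages, model, temperature)
    except Exception as exc:
        # Circuit open or the call itself failed: serve the fallback, not a 500
        logger.warning(f"Serving fallback response: {exc}")
        return FALLBACK_RESPONSE

Clients can check the degraded flag and show a banner instead of a hard error.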

Prompt Versioning

Prompts evolve. Track versions for consistency.

Prompt Version Manager

Easy to A/B test prompts and roll back if needed.
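
A lightweight registry is usually enough to start - versioned templates keyed by name, with an "active" pointer you can move for rollbacks. The prompt names, versions, and templates below are made up for illustration:

from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str
    template: str

class PromptManager:
    """Registry of versioned prompt templates with a movable 'active' pointer."""

    def __init__(self):
        self._prompts: dict[tuple[str, str], PromptVersion] = {}
        self._active: dict[str, str] = {}  # name -> active version

    def register(self, name: str, version: str, template: str, activate: bool = False):
        self._prompts[(name, version)] = PromptVersion(name, version, template)
        if activate or name not in self._active:
            self._active[name] = version

    def activate(self, name: str, version: str):
        """Roll forward or back by pointing 'active' at a registered version."""
        if (name, version) not in self._prompts:
            raise KeyError(f"Unknown prompt {name} v{version}")
        self._active[name] = version

    def render(self, name: str, version: str | None = None, **kwargs) -> str:
        version = version or self._active[name]
        return self._prompts[(name, version)].template.format(**kwargs)

prompts = PromptManager()
prompts.register(
    "summarize_document", "1.0",
    "Summarize the following document in three bullet points:\n\n{document}",
)
prompts.register(
    "summarize_document", "1.1",
    "You are a precise analyst. Summarize the document below in three "
    "bullet points, citing section names where possible:\n\n{document}",
    activate=True,
)

# Roll back instantly if v1.1 underperforms:
# prompts.activate("summarize_document", "1.0")

Log the version alongside each request so you can attribute quality changes to prompt changes.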

Cost Optimization

Control your LLM spending.

Token Budget Manager

Prevents surprise bills - budget enforcement at API level.
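
A sketch of the idea - check a daily budget before each call and charge actual usage after it. The limit, the rough 4-characters-per-token estimate, and the response["usage"] shape are all assumptions to adapt:

from datetime import date

class TokenBudgetManager:
    """Enforce a daily token budget before any request reaches the API."""

    def __init__(self, daily_token_limit: int = 2_000_000):
        self.daily_token_limit = daily_token_limit
        self._day = date.today()
        self._tokens_used = 0

    def _roll_day(self):
        if date.today() != self._day:
            self._day = date.today()
            self._tokens_used = 0

    def check(self, estimated_tokens: int):
        """Raise before spending if the request would blow the budget."""
        self._roll_day()
        if self._tokens_used + estimated_tokens > self.daily_token_limit:
            raise RuntimeError("Daily token budget exceeded - request rejected")

    def record(self, input_tokens: int, output_tokens: int):
        self._roll_day()
        self._tokens_used += input_tokens + output_tokens

budget = TokenBudgetManager()

async def budgeted_claude_call(messages: list, model: str, temperature: float):
    # Rough estimate: ~4 characters per token for the prompt, plus headroom for output
    estimated = sum(len(m["content"]) for m in messages) // 4 + 1024
    budget.check(estimated)

    response = await call_claude_async(messages, model, temperature)

    # Assumes the wrapper exposes actual usage, e.g. response["usage"]
    usage = response.get("usage", {})
    budget.record(usage.get("input_tokens", estimated), usage.get("output_tokens", 0))
    return response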

Model Selection Strategy

Right model for right task - massive cost savings.
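
One way to encode the routing - classify the task, then map it to a model tier. The task names and model IDs below are illustrative placeholders; substitute your own tiers and current model IDs:

from enum import Enum

class TaskComplexity(Enum):
    SIMPLE = "simple"        # classification, extraction, short rewrites
    STANDARD = "standard"    # summaries, structured analysis
    COMPLEX = "complex"      # long multi-document reasoning

# Illustrative model IDs - use whatever your account offers
MODEL_BY_COMPLEXITY = {
    TaskComplexity.SIMPLE: "claude-3-5-haiku-latest",
    TaskComplexity.STANDARD: "claude-3-5-sonnet-latest",
    TaskComplexity.COMPLEX: "claude-3-opus-latest",
}

def select_model(task: str, document_chars: int) -> str:
    """Pick the cheapest model that can handle the task."""
    if task in {"classify", "extract_fields", "detect_language"}:
        complexity = TaskComplexity.SIMPLE
    elif document_chars > 50_000 or task in {"multi_doc_compare", "legal_review"}:
        complexity = TaskComplexity.COMPLEX
    else:
        complexity = TaskComplexity.STANDARD
    return MODEL_BY_COMPLEXITY[complexity]

# e.g. select_model("classify", 2_000) -> cheap tier
#      select_model("summarize", 80_000) -> most capable tier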

Monitoring and Observability

Know what's happening in production.

Metrics Collection

Monitor everything - catch issues before users complain.
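
A sketch using prometheus_client (any metrics stack works); the metric names and labels are my own choices, and the usage fields again assume the wrapper exposes token counts:

import time

from prometheus_client import Counter, Histogram

REQUESTS = Counter("llm_requests_total", "LLM API requests", ["model", "status"])
LATENCY = Histogram("llm_request_seconds", "LLM request latency", ["model"])
TOKENS = Counter("llm_tokens_total", "Tokens consumed", ["model", "direction"])

async def instrumented_claude_call(messages: list, model: str, temperature: float):
    """Wrap the cached call with request, latency, and token metrics."""
    start = time.monotonic()
    try:
        response = await cached_claude_call(messages, model, temperature)
    except Exception:
        REQUESTS.labels(model=model, status="error").inc()
        raise

    REQUESTS.labels(model=model, status="ok").inc()
    LATENCY.labels(model=model).observe(time.monotonic() - start)

    # Assumes the wrapper surfaces token usage on the response
    usage = response.get("usage", {})
    TOKENS.labels(model=model, direction="input").inc(usage.get("input_tokens", 0))
    TOKENS.labels(model=model, direction="output").inc(usage.get("output_tokens", 0))
    return response

prometheus_client's make_asgi_app() can be mounted on the FastAPI app to expose /metrics, and alerts on error rate and p95 latency catch most incidents early.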

Best Practices Summary

From production experience:

1. Cache aggressively (but intelligently)

  • Cache near-deterministic responses (temperature < 0.3)

  • Use Redis for distributed caching

  • Consider semantic caching for similar queries

2. Implement circuit breakers

  • Fail fast when API is down

  • Provide fallback responses

  • Auto-recover when service returns

3. Version your prompts

  • Track changes

  • A/B test improvements

  • Easy rollback

4. Control costs

  • Set token budgets

  • Choose appropriate models

  • Monitor usage continuously

5. Monitor everything

  • Request rates

  • Latency

  • Token usage

  • Error rates

  • Cache hit rates

What's Next?

You now have production-ready patterns for building reliable, cost-effective LLM applications. In Part 5, we'll deploy this application to production with Docker, environment management, and cloud deployment.

Next: Part 5 - Deployment and Scaling


Previous: Part 3 - Streaming Responses and Advanced Features

Series Home: LLM API Development 101

This article is part of the LLM API Development 101 series. All examples use Python 3 and FastAPI and are based on real production applications.
