Part 4: Production Patterns and Best Practices

Part of the LLM API Development 101 Series

My $2000 Lesson

Deployed my document analysis API on Friday. Everything tested perfectly. Went home feeling accomplished.

Monday morning: $2000 Claude API bill for the weekend.

What happened? No caching. Same documents analyzed repeatedly. Users kept refreshing the page, retriggering full analysis every time.

Implementing Redis caching reduced costs by 75%. Same functionality, better performance, way less money.

Production is different from testing. Let me share what I learned the expensive way.

Caching Strategies

Caching is the #1 cost optimization for LLM applications.

Basic Response Caching

import hashlib
import json
import logging

logger = logging.getLogger(__name__)

def generate_cache_key(messages: list, model: str, temperature: float) -> str:
    """Generate a deterministic cache key from request parameters."""

    # Create deterministic string
    cache_data = {
        "messages": messages,
        "model": model,
        "temperature": temperature
    }

    # Hash for a compact key
    data_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(data_str.encode()).hexdigest()

# Simple in-memory cache (per-process only)
response_cache: dict = {}

async def cached_claude_call(messages: list, model: str, temperature: float):
    """Call Claude with caching.

    Assumes call_claude_async (your async Claude wrapper) is defined elsewhere.
    """

    # Generate cache key
    cache_key = generate_cache_key(messages, model, temperature)

    # Check cache
    if cache_key in response_cache:
        logger.info(f"Cache hit: {cache_key[:8]}")
        return response_cache[cache_key]

    # Call API on a miss and store the result
    logger.info(f"Cache miss: {cache_key[:8]}")
    response = await call_claude_async(messages, model, temperature)
    response_cache[cache_key] = response

    return response

Simple but effective - saved me thousands of dollars.

Redis Caching (Production)

This is my production setup - works across multiple servers.
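
A stripped-down sketch of that setup, assuming redis-py's asyncio client, a local Redis URL, a one-hour TTL, and the generate_cache_key / call_claude_async helpers from above (plus the assumption that the wrapper returns something JSON-serializable):

import json

import redis.asyncio as redis

# Assumption: a local Redis instance and a one-hour TTL - adjust for your deployment
redis_client = redis.from_url("redis://localhost:6379", decode_responses=True)
CACHE_TTL_SECONDS = 3600

async def redis_cached_claude_call(messages: list, model: str, temperature: float):
    """Call Claude with a Redis-backed cache shared across servers."""

    cache_key = f"claude:{generate_cache_key(messages, model, temperature)}"

    # Check Redis first - any server in the fleet can serve the hit
    cached = await redis_client.get(cache_key)
    if cached is not None:
        logger.info(f"Redis cache hit: {cache_key[:16]}")
        return json.loads(cached)

    # Miss: call the API and store the serialized response with a TTL
    logger.info(f"Redis cache miss: {cache_key[:16]}")
    response = await call_claude_async(messages, model, temperature)
    await redis_client.set(cache_key, json.dumps(response), ex=CACHE_TTL_SECONDS)

    return response

The TTL matters: stale answers are fine for document analysis, less so for anything time-sensitive.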

Semantic Caching

Cache similar queries, not just exact matches:
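
Off-the-shelf options exist (GPTCache is one), but the core idea fits in a short sketch: embed the latest user message and reuse a cached response when a previous query is similar enough. Here, embed_text is a hypothetical stand-in for whatever embedding model or API you use, and the 0.92 threshold is something to tune per workload:

import numpy as np

SIMILARITY_THRESHOLD = 0.92

# Each entry: (embedding vector, cached response)
semantic_cache: list[tuple[np.ndarray, dict]] = []

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

async def semantic_cached_claude_call(messages: list, model: str, temperature: float):
    """Return a cached response for semantically similar queries."""

    # Embed the latest user message (embed_text = your embedding model or API)
    query = messages[-1]["content"]
    query_embedding = np.array(await embed_text(query))

    # Look for a close-enough previous query
    for cached_embedding, cached_response in semantic_cache:
        similarity = cosine_similarity(query_embedding, cached_embedding)
        if similarity >= SIMILARITY_THRESHOLD:
            logger.info(f"Semantic cache hit (similarity={similarity:.3f})")
            return cached_response

    # No similar query cached: call the API and remember the result
    response = await call_claude_async(messages, model, temperature)
    semantic_cache.append((query_embedding, response))
    return response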

Semantic caching boosted my cache hit rate from 40% to 75%.

Circuit Breakers

Protect your app when Claude API has issues.

Basic Circuit Breaker

Saved my app during Claude API outages - fails fast instead of hanging.
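
Here's a minimal sketch of the pattern; the failure threshold and recovery timeout are assumptions to tune, and call_claude_async is the wrapper used throughout:

import time

class CircuitBreaker:
    """Fail fast after repeated failures, then retry after a cooldown."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.opened_at: float | None = None

    def _is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # After the cooldown, let one trial call through (half-open)
        return time.monotonic() - self.opened_at < self.recovery_timeout

    async def call(self, func, *args, **kwargs):
        if self._is_open():
            raise RuntimeError("Circuit open: Claude API marked unavailable")

        try:
            result = await func(*args, **kwargs)
        except Exception:
            # Count the failure and open the circuit once we hit the threshold
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            # Success resets the breaker
            self.failure_count = 0
            self.opened_at = None
            return result

claude_breaker = CircuitBreaker()

# Usage:
# response = await claude_breaker.call(call_claude_async, messages, model, temperature)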

Production Circuit Breaker with Fallback

Users get graceful degradation instead of errors.
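
Building on the CircuitBreaker sketch above, here is one way to wire in the fallback - the canned response is illustrative, and in a real app you might serve a cached or truncated result instead:

FALLBACK_RESPONSE = {
    "content": "Analysis is temporarily unavailable. Please try again in a few minutes.",
    "degraded": True,
}

async def claude_call_with_fallback(messages: list, model: str, temperature: float):
    """Try the real API through the breaker; degrade gracefully on failure."""
    try:
        return await claude_breaker.call(call_claude_async, messages, model, temperature)
    except Exception as exc:
        # Circuit open or the call itself failed: serve the fallback, not a 500
        logger.warning(f"Serving fallback response: {exc}")
        return FALLBACK_RESPONSE

Clients can check the degraded flag and show a banner instead of a hard error.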

Prompt Versioning

Prompts evolve. Track versions for consistency.

Prompt Version Manager

Easy to A/B test prompts and roll back if needed.
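
A lightweight registry is usually enough to start - versioned templates keyed by name, with an "active" pointer you can move for rollbacks. The prompt names, versions, and templates below are made up for illustration:

from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str
    template: str

class PromptManager:
    """Registry of versioned prompt templates with a movable 'active' pointer."""

    def __init__(self):
        self._prompts: dict[tuple[str, str], PromptVersion] = {}
        self._active: dict[str, str] = {}  # name -> active version

    def register(self, name: str, version: str, template: str, activate: bool = False):
        self._prompts[(name, version)] = PromptVersion(name, version, template)
        if activate or name not in self._active:
            self._active[name] = version

    def activate(self, name: str, version: str):
        """Roll forward or back by pointing 'active' at a registered version."""
        if (name, version) not in self._prompts:
            raise KeyError(f"Unknown prompt {name} v{version}")
        self._active[name] = version

    def render(self, name: str, version: str | None = None, **kwargs) -> str:
        version = version or self._active[name]
        return self._prompts[(name, version)].template.format(**kwargs)

prompts = PromptManager()
prompts.register(
    "summarize_document", "1.0",
    "Summarize the following document in three bullet points:\n\n{document}",
)
prompts.register(
    "summarize_document", "1.1",
    "You are a precise analyst. Summarize the document below in three "
    "bullet points, citing section names where possible:\n\n{document}",
    activate=True,
)

# Roll back instantly if v1.1 underperforms:
# prompts.activate("summarize_document", "1.0")

Log the version alongside each request so you can attribute quality changes to prompt changes.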

Cost Optimization

Control your LLM spending.

Token Budget Manager

Prevents surprise bills - budget enforcement at API level.
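
A sketch of the idea - check a daily budget before each call and charge actual usage after it. The limit, the rough 4-characters-per-token estimate, and the response["usage"] shape are all assumptions to adapt:

from datetime import date

class TokenBudgetManager:
    """Enforce a daily token budget before any request reaches the API."""

    def __init__(self, daily_token_limit: int = 2_000_000):
        self.daily_token_limit = daily_token_limit
        self._day = date.today()
        self._tokens_used = 0

    def _roll_day(self):
        if date.today() != self._day:
            self._day = date.today()
            self._tokens_used = 0

    def check(self, estimated_tokens: int):
        """Raise before spending if the request would blow the budget."""
        self._roll_day()
        if self._tokens_used + estimated_tokens > self.daily_token_limit:
            raise RuntimeError("Daily token budget exceeded - request rejected")

    def record(self, input_tokens: int, output_tokens: int):
        self._roll_day()
        self._tokens_used += input_tokens + output_tokens

budget = TokenBudgetManager()

async def budgeted_claude_call(messages: list, model: str, temperature: float):
    # Rough estimate: ~4 characters per token for the prompt, plus headroom for output
    estimated = sum(len(m["content"]) for m in messages) // 4 + 1024
    budget.check(estimated)

    response = await call_claude_async(messages, model, temperature)

    # Assumes the wrapper exposes actual usage, e.g. response["usage"]
    usage = response.get("usage", {})
    budget.record(usage.get("input_tokens", estimated), usage.get("output_tokens", 0))
    return response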

Model Selection Strategy

Right model for right task - massive cost savings.
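
One way to encode the routing - classify the task, then map it to a model tier. The task names and model IDs below are illustrative placeholders; substitute your own tiers and current model IDs:

from enum import Enum

class TaskComplexity(Enum):
    SIMPLE = "simple"        # classification, extraction, short rewrites
    STANDARD = "standard"    # summaries, structured analysis
    COMPLEX = "complex"      # long multi-document reasoning

# Illustrative model IDs - use whatever your account offers
MODEL_BY_COMPLEXITY = {
    TaskComplexity.SIMPLE: "claude-3-5-haiku-latest",
    TaskComplexity.STANDARD: "claude-3-5-sonnet-latest",
    TaskComplexity.COMPLEX: "claude-3-opus-latest",
}

def select_model(task: str, document_chars: int) -> str:
    """Pick the cheapest model that can handle the task."""
    if task in {"classify", "extract_fields", "detect_language"}:
        complexity = TaskComplexity.SIMPLE
    elif document_chars > 50_000 or task in {"multi_doc_compare", "legal_review"}:
        complexity = TaskComplexity.COMPLEX
    else:
        complexity = TaskComplexity.STANDARD
    return MODEL_BY_COMPLEXITY[complexity]

# e.g. select_model("classify", 2_000) -> cheap tier
#      select_model("summarize", 80_000) -> most capable tier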

Monitoring and Observability

Know what's happening in production.

Metrics Collection

Monitor everything - catch issues before users complain.
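
A sketch using prometheus_client (any metrics stack works); the metric names and labels are my own choices, and the usage fields again assume the wrapper exposes token counts:

import time

from prometheus_client import Counter, Histogram

REQUESTS = Counter("llm_requests_total", "LLM API requests", ["model", "status"])
LATENCY = Histogram("llm_request_seconds", "LLM request latency", ["model"])
TOKENS = Counter("llm_tokens_total", "Tokens consumed", ["model", "direction"])

async def instrumented_claude_call(messages: list, model: str, temperature: float):
    """Wrap the cached call with request, latency, and token metrics."""
    start = time.monotonic()
    try:
        response = await cached_claude_call(messages, model, temperature)
    except Exception:
        REQUESTS.labels(model=model, status="error").inc()
        raise

    REQUESTS.labels(model=model, status="ok").inc()
    LATENCY.labels(model=model).observe(time.monotonic() - start)

    # Assumes the wrapper surfaces token usage on the response
    usage = response.get("usage", {})
    TOKENS.labels(model=model, direction="input").inc(usage.get("input_tokens", 0))
    TOKENS.labels(model=model, direction="output").inc(usage.get("output_tokens", 0))
    return response

prometheus_client's make_asgi_app() can be mounted on the FastAPI app to expose /metrics, and alerts on error rate and p95 latency catch most incidents early.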

Best Practices Summary

From production experience:

1. Cache aggressively (but intelligently)

  • Cache near-deterministic responses (temperature < 0.3)

  • Use Redis for distributed caching

  • Consider semantic caching for similar queries

2. Implement circuit breakers

  • Fail fast when API is down

  • Provide fallback responses

  • Auto-recover when service returns

3. Version your prompts

  • Track changes

  • A/B test improvements

  • Easy rollback

4. Control costs

  • Set token budgets

  • Choose appropriate models

  • Monitor usage continuously

5. Monitor everything

  • Request rates

  • Latency

  • Token usage

  • Error rates

  • Cache hit rates

What's Next?

You now have production-ready patterns for building reliable, cost-effective LLM applications. In Part 5, we'll deploy this application to production with Docker, environment management, and cloud deployment.

Next: Part 5 - Deployment and Scaling


Previous: Part 3 - Streaming Responses and Advanced Features

Series Home: LLM API Development 101

This article is part of the LLM API Development 101 series. All examples use Python 3 and FastAPI and are based on real production applications.
