Part 3: Streaming Responses and Advanced Features

Part of the LLM API Development 101 Series

My Streaming Epiphany

Built a document summarization API. Users uploaded 50-page PDFs, waited...and waited...and waited 45 seconds for a complete response.

Users thought the app was broken. Support tickets piled up. "Is it working?" "Did it crash?"

Added streaming responses: Users saw text appearing word-by-word in real-time. Same 45-second processing time, but zero complaints.

Perception matters. Streaming transforms user experience from "is this broken?" to "wow, it's working!"

Let me show you how to implement it.

Understanding Streaming

Why Stream?

Traditional (non-streaming):

User sends request → Wait 30 seconds → Get complete response

User experience: Connection timeout, frustration, uncertainty.

Streaming:

User sends request → Immediate first token → Continuous word-by-word response

User experience: Instant feedback, perceived speed, engagement.

I use streaming for anything taking >2 seconds.

How Streaming Works

Server-Sent Events (SSE):

  • Server pushes data to client over HTTP

  • Connection stays open

  • Client receives events as they arrive

  • Standard text/event-stream format
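On the wire, each event is just a "data:" line followed by a blank line. A minimal example of what a streamed response can look like (the JSON payload shape here is illustrative):

  data: {"text": "Hello"}

  data: {"text": " world"}

  data: [DONE]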

Claude streaming:

  • Call API with stream=True

  • Receive events as model generates

  • Accumulate tokens into complete response

Basic Streaming with Claude

Simple Streaming Example
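A minimal example with the official anthropic Python SDK; the model name and prompt are placeholders:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The SDK's streaming helper manages the event stream and exposes plain text deltas.
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",  # placeholder: use whichever model you run
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain streaming in one paragraph."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # flush so each chunk shows up immediately

print()  # final newline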

Output appears word-by-word as Claude generates it.

Understanding Stream Events
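A sketch using the lower-level stream=True form, which yields typed events you can handle one by one (model name is a placeholder):

import anthropic

client = anthropic.Anthropic()

stream = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a haiku about rivers."}],
    stream=True,  # yields typed events instead of one finished message
)

chunks = []
for event in stream:
    if event.type == "message_start":
        # Carries metadata such as the model and input token count.
        pass
    elif event.type == "content_block_delta" and event.delta.type == "text_delta":
        chunks.append(event.delta.text)  # a piece of generated text
    elif event.type == "message_stop":
        break  # generation finished

print("".join(chunks))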

Handling the raw events gives you fine-grained control over the streaming lifecycle.

Streaming in FastAPI

Server-Sent Events Implementation
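A basic version built on FastAPI's StreamingResponse; the endpoint path, request model, and model name are my own placeholder choices:

import json

import anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
client = anthropic.Anthropic()


class ChatRequest(BaseModel):
    message: str


@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    def generate():
        # Open a streaming call to Claude and forward each text delta as an SSE frame.
        with client.messages.stream(
            model="claude-3-5-sonnet-20241022",  # placeholder
            max_tokens=1024,
            messages=[{"role": "user", "content": request.message}],
        ) as stream:
            for text in stream.text_stream:
                # SSE frame: a "data:" line followed by a blank line.
                yield f"data: {json.dumps({'text': text})}\n\n"
        yield "data: [DONE]\n\n"  # simple end-of-stream marker

    return StreamingResponse(generate(), media_type="text/event-stream")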

Test with curl:
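Assuming the endpoint above is running locally:

curl -N -X POST http://localhost:8000/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message": "Tell me a short story"}'

The -N flag turns off curl's output buffering so events print as they arrive.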

Production Streaming Implementation

My production streaming endpoint handles errors, tracks token usage, and emits proper SSE framing.
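A sketch of that general shape; the endpoint path, event names, and logging hook are illustrative rather than a fixed pattern:

import json
import logging

import anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
client = anthropic.Anthropic()
logger = logging.getLogger("chat")


class ChatRequest(BaseModel):
    message: str
    max_tokens: int = 1024


def sse(event: str, data: dict) -> str:
    """Format one Server-Sent Event with an explicit event name."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


@app.post("/v1/chat/stream")
async def chat_stream(request: ChatRequest):
    def generate():
        try:
            with client.messages.stream(
                model="claude-3-5-sonnet-20241022",  # placeholder
                max_tokens=request.max_tokens,
                messages=[{"role": "user", "content": request.message}],
            ) as stream:
                for text in stream.text_stream:
                    yield sse("delta", {"text": text})

                # After the stream ends, the SDK exposes the final message,
                # including token usage, which is useful for cost tracking.
                final = stream.get_final_message()
                usage = {
                    "input_tokens": final.usage.input_tokens,
                    "output_tokens": final.usage.output_tokens,
                }
                logger.info("stream finished: %s", usage)
                yield sse("done", usage)
        except anthropic.APIError as exc:
            # Surface API failures as an SSE error event instead of
            # silently dropping the connection.
            logger.exception("streaming failed")
            yield sse("error", {"message": str(exc)})

    return StreamingResponse(generate(), media_type="text/event-stream")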

Client-Side Streaming

JavaScript/TypeScript Client
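In the browser, a stream like this can be consumed with fetch() and response.body.getReader(), decoding and rendering chunks as they arrive. The built-in EventSource API also understands text/event-stream, but it only supports GET requests, so POST-based chat endpoints usually go through fetch.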

Python Client
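A small client using httpx that reads the SSE stream line by line; the URL and payload match the production sketch above:

import json

import httpx

# Stream the response instead of waiting for the whole body.
with httpx.stream(
    "POST",
    "http://localhost:8000/v1/chat/stream",
    json={"message": "Summarize SSE in two sentences."},
    timeout=None,  # the stream can stay open longer than the default timeout
) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            payload = json.loads(line[len("data: "):])
            if "text" in payload:
                print(payload["text"], end="", flush=True)

print()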

Context Window Management

Claude has a 200K token context window. Managing it properly is crucial.

Counting Tokens

I approximate in my apps since exact counting requires an API call.
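A rough character-based heuristic is a common stand-in, on the order of 4 characters per token for English text; treat the numbers as estimates only:

def estimate_tokens(text: str) -> int:
    """Rough token estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)


def estimate_conversation_tokens(messages: list[dict]) -> int:
    """Estimate tokens for a list of {"role": ..., "content": ...} messages."""
    # The small per-message constant is a fudge factor for role/formatting overhead.
    return sum(estimate_tokens(m["content"]) for m in messages) + 3 * len(messages)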

Sliding Window for Long Conversations

I use this for chatbots - keeps conversation within context limits.
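A sketch of the sliding-window idea; the default budget is arbitrary and the token estimator is the rough heuristic from above:

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic from the previous section


def sliding_window(messages: list[dict], max_tokens: int = 150_000) -> list[dict]:
    """Keep only the most recent messages that fit inside the token budget."""
    kept: list[dict] = []
    budget = max_tokens
    # Walk backwards from the newest message and stop once the budget is spent.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost

    window = list(reversed(kept))
    # The Messages API expects the conversation to start with a user turn.
    while window and window[0]["role"] != "user":
        window.pop(0)
    return window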

Summarization for Long Contexts

Useful for multi-hour conversations where the full history exceeds the context window.
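A sketch of one way to do it: summarize the older part of the history with a cheap model and keep the recent turns verbatim; the split point, model name, and prompt wording are all arbitrary choices:

import anthropic

client = anthropic.Anthropic()


def summarize_older_turns(messages: list[dict], keep_recent: int = 10) -> tuple[str, list[dict]]:
    """Summarize everything except the last keep_recent messages.

    Returns (summary_text, recent_messages).
    """
    if len(messages) <= keep_recent:
        return "", messages

    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)

    summary = client.messages.create(
        model="claude-3-5-haiku-20241022",  # placeholder: a small, cheap model
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": "Summarize this conversation concisely so it can serve as "
                       f"context for continuing it later:\n\n{transcript}",
        }],
    )
    return summary.content[0].text, recent

On the next request, the summary can go into the system parameter while only the recent turns are sent as messages, which keeps the role alternation intact.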

Advanced Prompt Engineering

Prompts make or break LLM applications.

Structured Output

Claude follows a requested structure very reliably when the instructions are explicit.
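A sketch of the kind of prompt I mean: ask for JSON and nothing else, then parse it; the schema and example text are made up:

import json

import anthropic

client = anthropic.Anthropic()

prompt = """Extract the key facts from the text below.

Respond with only a JSON object in exactly this shape:
{
  "title": "...",
  "summary": "...",
  "sentiment": "positive | neutral | negative"
}

Text:
The new release shipped two weeks early and customers love the faster exports.
"""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder
    max_tokens=500,
    messages=[{"role": "user", "content": prompt}],
)

# In practice you may want to strip code fences before parsing.
data = json.loads(response.content[0].text)
print(data["sentiment"])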

Few-Shot Prompting

I use this for classification tasks - 2-3 examples dramatically improve accuracy.
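A sketch of a few-shot classification prompt; the labels and example tickets are invented:

import anthropic

client = anthropic.Anthropic()

prompt = """Classify the support ticket as one of: billing, bug, feature_request.

Ticket: "I was charged twice this month."
Category: billing

Ticket: "The export button crashes the app on mobile."
Category: bug

Ticket: "It would be great to have dark mode."
Category: feature_request

Ticket: "My invoice shows the wrong company name."
Category:"""

response = client.messages.create(
    model="claude-3-5-haiku-20241022",  # placeholder: a small model handles this well
    max_tokens=10,
    messages=[{"role": "user", "content": prompt}],
)

print(response.content[0].text.strip())  # most likely: billing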

Chain of Thought

Asking the model to reason step by step works better for complex logic and reduces errors.
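A sketch of a chain-of-thought prompt that asks for the reasoning first and a clearly marked final answer last; the scenario is made up:

import anthropic

client = anthropic.Anthropic()

prompt = """A customer bought 3 licenses at $49 each and has a 15% discount code.
Shipping is a flat $10 and is not discounted.

Think through the calculation step by step, then give the final total on a
separate last line in the form: TOTAL: $<amount>"""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder
    max_tokens=500,
    messages=[{"role": "user", "content": prompt}],
)

answer = response.content[0].text
print(answer.strip().splitlines()[-1])  # e.g. "TOTAL: $134.95"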

Production Prompt Template

My document analysis API uses a reusable prompt template.
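A sketch of what such a template can look like; the fields, rules, and wording are illustrative:

ANALYSIS_PROMPT = """You are analyzing a document for a {audience} audience.

Document:
<document>
{document_text}
</document>

Task: {task}

Rules:
- Base every claim only on the document above.
- If the document does not contain the answer, say so explicitly.
- Keep the response under {max_words} words.
"""


def build_analysis_prompt(document_text: str, task: str,
                          audience: str = "technical", max_words: int = 300) -> str:
    """Fill the template so every caller produces the same prompt structure."""
    return ANALYSIS_PROMPT.format(
        audience=audience,
        document_text=document_text,
        task=task,
        max_words=max_words,
    )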

Templates ensure consistency across different parts of my application.

Conversation State Management

In-Memory State (Simple)

Good for single-server deployments, lost on restart.
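A minimal sketch: a dict keyed by conversation ID, with no locking or eviction, purely to show the shape:

from collections import defaultdict

# conversation_id -> list of {"role": ..., "content": ...} messages
conversations: dict[str, list[dict]] = defaultdict(list)


def add_message(conversation_id: str, role: str, content: str) -> None:
    conversations[conversation_id].append({"role": role, "content": content})


def get_history(conversation_id: str) -> list[dict]:
    return conversations[conversation_id]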

Redis State (Production)

I use Redis in production - persistent, scalable across servers.
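A sketch using redis-py with JSON-serialized messages and a TTL so stale conversations expire; the key prefix and TTL are arbitrary choices:

import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

TTL_SECONDS = 60 * 60 * 24  # expire conversations after 24 hours


def save_history(conversation_id: str, messages: list[dict]) -> None:
    r.setex(f"conv:{conversation_id}", TTL_SECONDS, json.dumps(messages))


def load_history(conversation_id: str) -> list[dict]:
    raw = r.get(f"conv:{conversation_id}")
    return json.loads(raw) if raw else []


def append_message(conversation_id: str, role: str, content: str) -> None:
    messages = load_history(conversation_id)
    messages.append({"role": role, "content": content})
    save_history(conversation_id, messages)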

Best Practices

From building streaming LLM apps:

1. Always implement streaming for >2 second tasks

2. Handle stream interruptions gracefully

3. Send heartbeat events for long pauses

4. Track and limit conversation token usage

5. Use prompt templates for consistency

Common Issues

Problems I encountered:

1. Buffering delays streaming - Use flush=True

2. Connection timeouts - Set appropriate timeout values

3. Memory leaks from unclosed streams - Always use context managers

4. Token limit exceeded - Implement sliding window

5. Lost connection state - Use Redis or database

What's Next?

You now know how to implement streaming and manage advanced conversational features. In Part 4, we'll cover production patterns: caching, circuit breakers, prompt versioning, and cost optimization.

Next: Part 4 - Production Patterns and Best Practices


Previous: Part 2 - Building FastAPI Applications with Claude

Series Home: LLM API Development 101

This article is part of the LLM API Development 101 series. All examples use Python 3 and FastAPI based on real production applications.
