Part 3: Streaming Responses and Advanced Features

Part of the LLM API Development 101 Series

My Streaming Epiphany

Built a document summarization API. Users uploaded 50-page PDFs, waited...and waited...and waited 45 seconds for a complete response.

Users thought the app was broken. Support tickets piled up. "Is it working?" "Did it crash?"

Added streaming responses: Users saw text appearing word-by-word in real-time. Same 45-second processing time, but zero complaints.

Perception matters. Streaming transforms user experience from "is this broken?" to "wow, it's working!"

Let me show you how to implement it.

Understanding Streaming

Why Stream?

Traditional (non-streaming):

User sends request → Wait 30 seconds → Get complete response

User experience: Connection timeout, frustration, uncertainty.

Streaming:

User sends request → Immediate first token → Continuous word-by-word response

User experience: Instant feedback, perceived speed, engagement.

I use streaming for anything taking >2 seconds.

How Streaming Works

Server-Sent Events (SSE):

  • Server pushes data to client over HTTP

  • Connection stays open

  • Client receives events as they arrive

  • Standard text/event-stream format
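On the wire, each event is just a "data:" line followed by a blank line. A minimal example of what a streamed response can look like (the JSON payload shape here is illustrative):

  data: {"text": "Hello"}

  data: {"text": " world"}

  data: [DONE]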

Claude streaming:

  • Call API with stream=True

  • Receive events as model generates

  • Accumulate tokens into complete response

Basic Streaming with Claude

Simple Streaming Example
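A minimal example with the official anthropic Python SDK; the model name and prompt are placeholders:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The SDK's streaming helper manages the event stream and exposes plain text deltas.
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",  # placeholder: use whichever model you run
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain streaming in one paragraph."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # flush so each chunk shows up immediately

print()  # final newline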

Output appears word-by-word as Claude generates it.

Understanding Stream Events
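A sketch using the lower-level stream=True form, which yields typed events you can handle one by one (model name is a placeholder):

import anthropic

client = anthropic.Anthropic()

stream = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a haiku about rivers."}],
    stream=True,  # yields typed events instead of one finished message
)

chunks = []
for event in stream:
    if event.type == "message_start":
        # Carries metadata such as the model and input token count.
        pass
    elif event.type == "content_block_delta" and event.delta.type == "text_delta":
        chunks.append(event.delta.text)  # a piece of generated text
    elif event.type == "message_stop":
        break  # generation finished

print("".join(chunks))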

Handling the raw events gives you fine-grained control over the streaming lifecycle.

Streaming in FastAPI

Server-Sent Events Implementation
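A basic version built on FastAPI's StreamingResponse; the endpoint path, request model, and model name are my own placeholder choices:

import json

import anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
client = anthropic.Anthropic()


class ChatRequest(BaseModel):
    message: str


@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    def generate():
        # Open a streaming call to Claude and forward each text delta as an SSE frame.
        with client.messages.stream(
            model="claude-3-5-sonnet-20241022",  # placeholder
            max_tokens=1024,
            messages=[{"role": "user", "content": request.message}],
        ) as stream:
            for text in stream.text_stream:
                # SSE frame: a "data:" line followed by a blank line.
                yield f"data: {json.dumps({'text': text})}\n\n"
        yield "data: [DONE]\n\n"  # simple end-of-stream marker

    return StreamingResponse(generate(), media_type="text/event-stream")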

Test with curl:
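Assuming the endpoint above is running locally:

curl -N -X POST http://localhost:8000/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message": "Tell me a short story"}'

The -N flag turns off curl's output buffering so events print as they arrive.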

Production Streaming Implementation

My production streaming endpoint handles errors, tracks token usage, and emits proper SSE framing.
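A sketch of that general shape; the endpoint path, event names, and logging hook are illustrative rather than a fixed pattern:

import json
import logging

import anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
client = anthropic.Anthropic()
logger = logging.getLogger("chat")


class ChatRequest(BaseModel):
    message: str
    max_tokens: int = 1024


def sse(event: str, data: dict) -> str:
    """Format one Server-Sent Event with an explicit event name."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


@app.post("/v1/chat/stream")
async def chat_stream(request: ChatRequest):
    def generate():
        try:
            with client.messages.stream(
                model="claude-3-5-sonnet-20241022",  # placeholder
                max_tokens=request.max_tokens,
                messages=[{"role": "user", "content": request.message}],
            ) as stream:
                for text in stream.text_stream:
                    yield sse("delta", {"text": text})

                # After the stream ends, the SDK exposes the final message,
                # including token usage, which is useful for cost tracking.
                final = stream.get_final_message()
                usage = {
                    "input_tokens": final.usage.input_tokens,
                    "output_tokens": final.usage.output_tokens,
                }
                logger.info("stream finished: %s", usage)
                yield sse("done", usage)
        except anthropic.APIError as exc:
            # Surface API failures as an SSE error event instead of
            # silently dropping the connection.
            logger.exception("streaming failed")
            yield sse("error", {"message": str(exc)})

    return StreamingResponse(generate(), media_type="text/event-stream")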

Client-Side Streaming

JavaScript/TypeScript Client
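In the browser, a stream like this can be consumed with fetch() and response.body.getReader(), decoding and rendering chunks as they arrive. The built-in EventSource API also understands text/event-stream, but it only supports GET requests, so POST-based chat endpoints usually go through fetch.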

Python Client
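A small client using httpx that reads the SSE stream line by line; the URL and payload match the production sketch above:

import json

import httpx

# Stream the response instead of waiting for the whole body.
with httpx.stream(
    "POST",
    "http://localhost:8000/v1/chat/stream",
    json={"message": "Summarize SSE in two sentences."},
    timeout=None,  # the stream can stay open longer than the default timeout
) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            payload = json.loads(line[len("data: "):])
            if "text" in payload:
                print(payload["text"], end="", flush=True)

print()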

Context Window Management

Claude has a 200K token context window. Managing it properly is crucial.

Counting Tokens

I approximate in my apps since exact counting requires an API call.
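A rough character-based heuristic is a common stand-in, on the order of 4 characters per token for English text; treat the numbers as estimates only:

def estimate_tokens(text: str) -> int:
    """Rough token estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)


def estimate_conversation_tokens(messages: list[dict]) -> int:
    """Estimate tokens for a list of {"role": ..., "content": ...} messages."""
    # The small per-message constant is a fudge factor for role/formatting overhead.
    return sum(estimate_tokens(m["content"]) for m in messages) + 3 * len(messages)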

Sliding Window for Long Conversations

I use this for chatbots - keeps conversation within context limits.
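A sketch of the sliding-window idea; the default budget is arbitrary and the token estimator is the rough heuristic from above:

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic from the previous section


def sliding_window(messages: list[dict], max_tokens: int = 150_000) -> list[dict]:
    """Keep only the most recent messages that fit inside the token budget."""
    kept: list[dict] = []
    budget = max_tokens
    # Walk backwards from the newest message and stop once the budget is spent.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost

    window = list(reversed(kept))
    # The Messages API expects the conversation to start with a user turn.
    while window and window[0]["role"] != "user":
        window.pop(0)
    return window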

Summarization for Long Contexts

Useful for multi-hour conversations where the full history exceeds the context window.
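A sketch of one way to do it: summarize the older part of the history with a cheap model and keep the recent turns verbatim; the split point, model name, and prompt wording are all arbitrary choices:

import anthropic

client = anthropic.Anthropic()


def summarize_older_turns(messages: list[dict], keep_recent: int = 10) -> tuple[str, list[dict]]:
    """Summarize everything except the last keep_recent messages.

    Returns (summary_text, recent_messages).
    """
    if len(messages) <= keep_recent:
        return "", messages

    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)

    summary = client.messages.create(
        model="claude-3-5-haiku-20241022",  # placeholder: a small, cheap model
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": "Summarize this conversation concisely so it can serve as "
                       f"context for continuing it later:\n\n{transcript}",
        }],
    )
    return summary.content[0].text, recent

On the next request, the summary can go into the system parameter while only the recent turns are sent as messages, which keeps the role alternation intact.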

Advanced Prompt Engineering

Prompts make or break LLM applications.

Structured Output

Claude follows a requested structure very reliably when the instructions are explicit.
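A sketch of the kind of prompt I mean: ask for JSON and nothing else, then parse it; the schema and example text are made up:

import json

import anthropic

client = anthropic.Anthropic()

prompt = """Extract the key facts from the text below.

Respond with only a JSON object in exactly this shape:
{
  "title": "...",
  "summary": "...",
  "sentiment": "positive | neutral | negative"
}

Text:
The new release shipped two weeks early and customers love the faster exports.
"""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder
    max_tokens=500,
    messages=[{"role": "user", "content": prompt}],
)

# In practice you may want to strip code fences before parsing.
data = json.loads(response.content[0].text)
print(data["sentiment"])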

Few-Shot Prompting

I use this for classification tasks - 2-3 examples dramatically improve accuracy.
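A sketch of a few-shot classification prompt; the labels and example tickets are invented:

import anthropic

client = anthropic.Anthropic()

prompt = """Classify the support ticket as one of: billing, bug, feature_request.

Ticket: "I was charged twice this month."
Category: billing

Ticket: "The export button crashes the app on mobile."
Category: bug

Ticket: "It would be great to have dark mode."
Category: feature_request

Ticket: "My invoice shows the wrong company name."
Category:"""

response = client.messages.create(
    model="claude-3-5-haiku-20241022",  # placeholder: a small model handles this well
    max_tokens=10,
    messages=[{"role": "user", "content": prompt}],
)

print(response.content[0].text.strip())  # most likely: billing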

Chain of Thought

Asking the model to reason step by step works better for complex logic and reduces errors.
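A sketch of a chain-of-thought prompt that asks for the reasoning first and a clearly marked final answer last; the scenario is made up:

import anthropic

client = anthropic.Anthropic()

prompt = """A customer bought 3 licenses at $49 each and has a 15% discount code.
Shipping is a flat $10 and is not discounted.

Think through the calculation step by step, then give the final total on a
separate last line in the form: TOTAL: $<amount>"""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder
    max_tokens=500,
    messages=[{"role": "user", "content": prompt}],
)

answer = response.content[0].text
print(answer.strip().splitlines()[-1])  # e.g. "TOTAL: $134.95"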

Production Prompt Template

My document analysis API uses a reusable prompt template.
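A sketch of what such a template can look like; the fields, rules, and wording are illustrative:

ANALYSIS_PROMPT = """You are analyzing a document for a {audience} audience.

Document:
<document>
{document_text}
</document>

Task: {task}

Rules:
- Base every claim only on the document above.
- If the document does not contain the answer, say so explicitly.
- Keep the response under {max_words} words.
"""


def build_analysis_prompt(document_text: str, task: str,
                          audience: str = "technical", max_words: int = 300) -> str:
    """Fill the template so every caller produces the same prompt structure."""
    return ANALYSIS_PROMPT.format(
        audience=audience,
        document_text=document_text,
        task=task,
        max_words=max_words,
    )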

Templates ensure consistency across different parts of my application.

Conversation State Management

In-Memory State (Simple)

Good for single-server deployments, lost on restart.
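A minimal sketch: a dict keyed by conversation ID, with no locking or eviction, purely to show the shape:

from collections import defaultdict

# conversation_id -> list of {"role": ..., "content": ...} messages
conversations: dict[str, list[dict]] = defaultdict(list)


def add_message(conversation_id: str, role: str, content: str) -> None:
    conversations[conversation_id].append({"role": role, "content": content})


def get_history(conversation_id: str) -> list[dict]:
    return conversations[conversation_id]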

Redis State (Production)

I use Redis in production - persistent, scalable across servers.
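A sketch using redis-py with JSON-serialized messages and a TTL so stale conversations expire; the key prefix and TTL are arbitrary choices:

import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

TTL_SECONDS = 60 * 60 * 24  # expire conversations after 24 hours


def save_history(conversation_id: str, messages: list[dict]) -> None:
    r.setex(f"conv:{conversation_id}", TTL_SECONDS, json.dumps(messages))


def load_history(conversation_id: str) -> list[dict]:
    raw = r.get(f"conv:{conversation_id}")
    return json.loads(raw) if raw else []


def append_message(conversation_id: str, role: str, content: str) -> None:
    messages = load_history(conversation_id)
    messages.append({"role": role, "content": content})
    save_history(conversation_id, messages)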

Best Practices

From building streaming LLM apps:

1. Always implement streaming for >2 second tasks

2. Handle stream interruptions gracefully

3. Send heartbeat events for long pauses

4. Track and limit conversation token usage

5. Use prompt templates for consistency

Common Issues

Problems I encountered:

1. Buffering delays streaming - Use flush=True

2. Connection timeouts - Set appropriate timeout values

3. Memory leaks from unclosed streams - Always use context managers

4. Token limit exceeded - Implement sliding window

5. Lost connection state - Use Redis or database

What's Next?

You now know how to implement streaming and manage advanced conversational features. In Part 4, we'll cover production patterns: caching, circuit breakers, prompt versioning, and cost optimization.

Next: Part 4 - Production Patterns and Best Practices


Previous: Part 2 - Building FastAPI Applications with Claude

Series Home: LLM API Development 101

This article is part of the LLM API Development 101 series. All examples use Python 3 and FastAPI based on real production applications.
