Streaming & Real-Time Data

Introduction

Batch processing is great for historical analysis, but modern applications often need real-time insights. Whether it's fraud detection, live dashboards, or personalized recommendations, streaming data processing has become essential.

I've worked extensively with Apache Kafka for event streaming and real-time pipelines. This article covers practical streaming patterns, Kafka fundamentals, and real production examples using Python 3.12. I'll focus on what actually works in production, not just theoretical concepts.

Batch vs Streaming Processing

Understanding the Trade-Offs

Batch Processing:

  • Processes data in scheduled intervals (hourly, daily)

  • Higher throughput, lower cost

  • Simpler to implement and debug

  • Acceptable latency (minutes to hours)

  • Examples: Daily reports, monthly aggregations, ML model training

Streaming Processing:

  • Processes data in real time (milliseconds to seconds)

  • Lower latency, immediate insights

  • More complex infrastructure

  • Higher operational cost

  • Examples: Fraud detection, live dashboards, real-time recommendations

When to Use Streaming

From my experience, use streaming when:

  • Low latency is critical: Fraud detection, trading systems

  • Event-driven architecture: Microservices communication

  • Real-time analytics: Live dashboards, monitoring

  • Continuous processing: IoT sensor data, clickstream analysis

Don't use streaming when:

  • Batch processing meets your latency requirements

  • Simple transformations on complete datasets

  • Cost is a primary concern

  • Team lacks streaming expertise

Apache Kafka Fundamentals

Kafka is the industry-standard distributed event streaming platform. I use it for:

  • Event streaming: Publish-subscribe messaging

  • Data integration: Connect disparate systems

  • Stream processing: Real-time transformations

Core Concepts

  • Topic: Category for messages (like a database table)

  • Partition: Ordered, immutable sequence of messages within a topic

  • Producer: Publishes messages to topics

  • Consumer: Subscribes to topics and processes messages

  • Consumer Group: Load balancing across multiple consumers

  • Broker: Kafka server storing data

  • ZooKeeper/KRaft: Cluster coordination (KRaft is the newer built-in mode that removes the ZooKeeper dependency)

Setting Up Kafka with Docker
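
A minimal single-node sketch, assuming the official apache/kafka image running in KRaft mode; the topic name user-events is purely illustrative and is reused in the examples below:

```bash
# Single-node Kafka broker in KRaft mode (no ZooKeeper required).
docker run -d --name kafka -p 9092:9092 apache/kafka:3.7.0

# Create a topic for the examples below ("user-events" is an illustrative name).
docker exec kafka /opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --create --topic user-events --partitions 3 --replication-factor 1
```

For local development a single broker is enough; in production you would run multiple brokers and a replication factor greater than 1.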

Kafka Producer in Python
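
A minimal producer sketch, assuming the confluent-kafka client and JSON-encoded values (the topic name and event fields are illustrative):

```python
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",                 # wait for all in-sync replicas to acknowledge
    "enable.idempotence": True,    # avoid duplicates when the producer retries
})

def delivery_report(err, msg):
    """Called once per message to report delivery success or failure."""
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()}[{msg.partition()}] at offset {msg.offset()}")

event = {"user_id": "u-123", "action": "page_view", "ts": "2024-01-01T12:00:05+00:00"}

# Key by user_id so all of a user's events land in the same partition (preserves ordering).
producer.produce(
    "user-events",
    key=event["user_id"],
    value=json.dumps(event).encode("utf-8"),
    callback=delivery_report,
)
producer.poll(0)   # serve delivery callbacks
producer.flush()   # block until all outstanding messages are delivered
```

Keying by user_id keeps each user's events ordered within one partition, which matters for the stateful examples later in this article.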

Kafka Consumer in Python
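
A matching consumer sketch, again assuming confluent-kafka, with manual offset commits so a message is only acknowledged after it has been processed:

```python
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics-service",   # consumers sharing this id split the partitions
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,       # commit manually after processing succeeds
})
consumer.subscribe(["user-events"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue

        event = json.loads(msg.value())
        print(f"key={msg.key()} event={event}")

        consumer.commit(message=msg)   # commit the offset for this message
finally:
    consumer.close()
```

Running several copies of this script with the same group.id spreads the topic's partitions across them, which is what the consumer group concept above refers to.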

Stream Processing Patterns

Windowing

Windowing aggregates streaming data over time intervals:
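
A minimal sketch of tumbling windows in plain Python: events are bucketed into fixed, non-overlapping intervals and counted per bucket (the event shape and the 60-second window size are assumptions):

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 60

def window_start(ts: datetime, size: int = WINDOW_SECONDS) -> datetime:
    """Align a timestamp to the start of its tumbling window."""
    epoch = ts.timestamp()
    return datetime.fromtimestamp(epoch - (epoch % size), tz=timezone.utc)

# window start -> number of events seen in that window
counts: dict[datetime, int] = defaultdict(int)

def process(event: dict) -> None:
    ts = datetime.fromisoformat(event["ts"])
    counts[window_start(ts)] += 1

# Example: two events in the 12:00 window, one in the 12:01 window.
for ts in ["2024-01-01T12:00:05+00:00", "2024-01-01T12:00:40+00:00", "2024-01-01T12:01:10+00:00"]:
    process({"ts": ts})

for start, n in sorted(counts.items()):
    print(f"{start:%H:%M} -> {n}")
```

Stream frameworks add sliding and session windows on top of this idea, plus watermarks for handling late-arriving events, which this sketch ignores.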

Stateful Processing

Maintaining state across events:
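
A sketch of per-key state: a running spend profile per user, used to flag unusually large transactions. The threshold rule and event fields are made up for illustration, and in production this state would live in a changelog topic or an external store rather than an in-memory dict:

```python
from collections import defaultdict

# user_id -> (transaction count, total amount spent so far)
state: dict[str, tuple[int, float]] = defaultdict(lambda: (0, 0.0))

def process_transaction(event: dict) -> bool:
    """Update per-user state and return True if the transaction looks suspicious."""
    user = event["user_id"]
    count, total = state[user]

    # Arbitrary rule: need some history, and the amount is 5x the user's average.
    suspicious = count >= 3 and event["amount"] > 5 * (total / count)

    state[user] = (count + 1, total + event["amount"])
    return suspicious

print(process_transaction({"user_id": "u-1", "amount": 500.0}))  # False: no history yet
```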

Kafka Streams Processing

For more complex stream processing, use Kafka Streams or similar frameworks. Here's a conceptual example:
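
Kafka Streams itself is a Java library; as a rough Python equivalent, here's a sketch using the community-maintained faust-streaming package, which offers a similar agents-and-tables model (the app name, topic, and record fields are assumptions, not requirements of the framework):

```python
import faust

app = faust.App("spend-aggregator", broker="kafka://localhost:9092")

class Transaction(faust.Record, serializer="json"):
    user_id: str
    amount: float

transactions = app.topic("transactions", value_type=Transaction)

# A table is durable, partitioned state backed by a changelog topic in Kafka.
user_totals = app.Table("user-totals", default=float)

@app.agent(transactions)
async def aggregate(stream):
    # Repartition by user_id so each user's events are handled by one worker.
    async for tx in stream.group_by(Transaction.user_id):
        user_totals[tx.user_id] += tx.amount

if __name__ == "__main__":
    app.main()   # start with: python app.py worker
```

Because the table is backed by a changelog topic, the aggregated totals survive worker restarts and rebalances, which is the main thing these frameworks buy you over hand-rolled consumer loops.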

Best Practices for Streaming

From my production experience:

  1. Idempotency: Design consumers to handle duplicate messages

  2. Exactly-once semantics: Use Kafka transactions when needed

  3. Backpressure: Handle slow consumers gracefully

  4. Monitoring: Track lag, throughput, error rates

  5. Schema evolution: Use schema registry (Avro/Protobuf)

  6. Partitioning strategy: Choose partition keys carefully for parallelism

  7. Error handling: Route messages that repeatedly fail to a dead letter queue (see the sketch after this list)

  8. State management: Persist state for recovery

  9. Testing: Test with realistic data volumes
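
To make the dead-letter-queue item concrete, here's a sketch, again assuming confluent-kafka; messages whose processing raises an error are republished to a separate topic (user-events-dlq is a hypothetical name) instead of blocking the partition:

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-processor",
    "enable.auto.commit": False,
})
consumer.subscribe(["user-events"])

dlq_producer = Producer({"bootstrap.servers": "localhost:9092"})

def handle(event: dict) -> None:
    ...  # business logic that may raise

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        handle(json.loads(msg.value()))
    except Exception as exc:
        # Park the bad message (plus the error) on the DLQ and move on.
        dlq_producer.produce(
            "user-events-dlq",
            key=msg.key(),
            value=msg.value(),
            headers={"error": str(exc)},
        )
        dlq_producer.flush()
    consumer.commit(message=msg)
```

A separate consumer (or a human) can then inspect, fix, and replay whatever lands on the DLQ.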

Key Takeaways

  • Kafka: Industry standard for event streaming

  • Windowing: Essential for time-based aggregations

  • Stateful processing: Maintain context across events

  • Consumer groups: Enable parallel processing

  • Exactly-once: Use transactions for critical applications

  • Monitoring: Track lag and throughput in production

  • Schema registry: Manage schema evolution
