Data Ingestion Sources

← Previous: Python for Data Engineering | Back to Index | Next: Data Cleaning & Transformation →

Introduction

Data ingestion is where every data pipeline begins. In my experience, roughly 40% of data engineering work goes into getting data out of source systems reliably. This article covers the ingestion patterns I use in production.

Reading Files

CSV Files

# Python 3.12 - Production CSV reading
import logging
from pathlib import Path
from typing import Optional

import pandas as pd

class CSVIngestion:
    """
    CSV ingestion with error handling.
    Handles files from MB to GB in size.
    """

    # Shared parser options so every read path treats nulls the same way.
    # Pass parse_dates=[...] with explicit column names when date parsing is
    # needed; parse_dates=True only applies to the index, and
    # infer_datetime_format is deprecated in pandas 2.x.
    READ_KWARGS = {
        'na_values': ['NULL', 'null', 'N/A', '#N/A', 'nan'],
        'keep_default_na': True,
    }

    @staticmethod
    def read_csv_robust(
        file_path: Path,
        chunk_size: Optional[int] = None
    ) -> pd.DataFrame:
        """Read CSV with comprehensive error handling."""
        try:
            # For small files
            if chunk_size is None:
                return pd.read_csv(
                    file_path,
                    encoding='utf-8',
                    **CSVIngestion.READ_KWARGS
                )

            # For large files - parse in chunks to limit peak parser memory
            # (the concatenated result still ends up fully in memory)
            chunks = []
            for chunk in pd.read_csv(
                file_path,
                encoding='utf-8',
                chunksize=chunk_size,
                **CSVIngestion.READ_KWARGS
            ):
                chunks.append(chunk)

            return pd.concat(chunks, ignore_index=True)

        except UnicodeDecodeError:
            # Try a more forgiving encoding before giving up
            logging.warning(f"UTF-8 failed, trying latin-1 for {file_path}")
            return pd.read_csv(file_path, encoding='latin-1', **CSVIngestion.READ_KWARGS)

        except Exception as e:
            logging.error(f"Failed to read {file_path}: {e}")
            raise

# Usage
df = CSVIngestion.read_csv_robust(Path('data/transactions.csv'))

Excel Files
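
Excel files usually come from business teams rather than systems, so I treat sheet and column names as untrusted input. Below is a minimal sketch using pandas, assuming the openpyxl engine is installed; the sheet name and path in the usage line are illustrative.

# Python 3.12 - Excel reading (sketch)
from pathlib import Path

import pandas as pd

def read_excel_sheet(file_path: Path, sheet_name: str | int = 0) -> pd.DataFrame:
    """Read a single worksheet and normalise column names for downstream steps."""
    df = pd.read_excel(file_path, sheet_name=sheet_name, engine='openpyxl')
    # Excel headers often carry stray whitespace and inconsistent casing
    df.columns = [str(c).strip().lower().replace(' ', '_') for c in df.columns]
    return df

# Usage (illustrative sheet and path)
# monthly = read_excel_sheet(Path('data/monthly_report.xlsx'), sheet_name='Summary')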

JSON Files
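
JSON sources range from flat record dumps to deeply nested API exports. The minimal sketch below covers both cases with pandas; the record_path value and file names are illustrative assumptions.

# Python 3.12 - JSON reading (sketch)
import json
from pathlib import Path

import pandas as pd

def read_json_records(file_path: Path, lines: bool = False) -> pd.DataFrame:
    """Flat JSON arrays or newline-delimited JSON straight into a DataFrame."""
    return pd.read_json(file_path, lines=lines)

def read_nested_json(file_path: Path, record_path: str) -> pd.DataFrame:
    """Flatten a nested payload such as {"results": [{"user": {...}}, ...]}."""
    with open(file_path, encoding='utf-8') as f:
        payload = json.load(f)
    # json_normalize expands nested dicts into flat, separator-joined columns
    return pd.json_normalize(payload, record_path=record_path, sep='_')

# Usage (illustrative structure)
# orders = read_nested_json(Path('data/orders.json'), record_path='results')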

Database Connections

PostgreSQL
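
For relational sources I pull data through SQLAlchemy so that pandas, connection pooling, and parameterised queries work together. The sketch below is a minimal version of that pattern; the POSTGRES_URL environment variable, the psycopg2 driver, and the table name are assumptions, and credentials should come from the environment rather than code.

# Python 3.12 - PostgreSQL extraction via SQLAlchemy (sketch)
import os

import pandas as pd
from sqlalchemy import create_engine, text

def read_postgres_query(query: str, params: dict | None = None) -> pd.DataFrame:
    """Run a parameterised query and return the result as a DataFrame."""
    # Hypothetical env var holding e.g. postgresql+psycopg2://user:pass@host:5432/db
    engine = create_engine(os.environ['POSTGRES_URL'], pool_pre_ping=True)
    with engine.connect() as conn:
        return pd.read_sql(text(query), conn, params=params)

# Usage (illustrative table and filter)
# df = read_postgres_query(
#     'SELECT * FROM transactions WHERE created_at >= :since',
#     params={'since': '2024-01-01'},
# )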

API Integration

REST APIs
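
Most REST ingestion problems come down to transient failures and pagination. The sketch below wires urllib3's Retry into a requests session and walks page-numbered results; the endpoint, the pagination parameter names, and the API_TOKEN variable are assumptions about the target API.

# Python 3.12 - REST extraction with retries (sketch)
import os

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session() -> requests.Session:
    """Session that retries transient HTTP errors with exponential backoff."""
    retry = Retry(
        total=5,
        backoff_factor=1,            # 1s, 2s, 4s, ... between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

def fetch_paginated(url: str, page_size: int = 100) -> list[dict]:
    """Walk page/per_page style pagination until an empty page comes back."""
    session = build_session()
    headers = {"Authorization": f"Bearer {os.environ.get('API_TOKEN', '')}"}
    records, page = [], 1
    while True:
        resp = session.get(
            url,
            headers=headers,
            params={"page": page, "per_page": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return records
        records.extend(batch)
        page += 1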

GraphQL APIs
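
GraphQL endpoints are just POST requests carrying a query document, so plain requests is enough. A minimal sketch follows; the endpoint URL, the query, and its field names are illustrative.

# Python 3.12 - GraphQL extraction (sketch)
import requests

def run_graphql_query(endpoint: str, query: str, variables: dict | None = None) -> dict:
    """POST a GraphQL query and surface GraphQL-level errors explicitly."""
    resp = requests.post(
        endpoint,
        json={"query": query, "variables": variables or {}},
        timeout=30,
    )
    resp.raise_for_status()
    payload = resp.json()
    # GraphQL reports errors in the body even on HTTP 200
    if "errors" in payload:
        raise RuntimeError(f"GraphQL errors: {payload['errors']}")
    return payload["data"]

# Usage (illustrative schema)
# data = run_graphql_query(
#     "https://api.example.com/graphql",
#     "query($n: Int!) { orders(first: $n) { id total } }",
#     variables={"n": 100},
# )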

Cloud Storage

AWS S3
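
A minimal boto3 sketch, assuming credentials resolve through the usual chain (environment variables, profile, or instance role); the bucket, prefix, and key names are illustrative.

# Python 3.12 - S3 ingestion with boto3 (sketch)
import io

import boto3
import pandas as pd

def read_csv_from_s3(bucket: str, key: str) -> pd.DataFrame:
    """Download an object into memory and parse it as CSV."""
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))

def list_keys(bucket: str, prefix: str) -> list[str]:
    """List object keys under a prefix, paginating past the 1000-key page limit."""
    s3 = boto3.client("s3")
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(item["Key"] for item in page.get("Contents", []))
    return keys

# Usage (illustrative bucket and key)
# df = read_csv_from_s3("my-data-lake", "raw/transactions/2024-01-01.csv")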

Streaming Data

Kafka Consumer
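
For streaming sources the main decisions are deserialization and offset management. The sketch below uses the kafka-python client with manual commits so offsets only advance after a record has been safely landed; the brokers, topic, and group id are illustrative, and JSON-encoded message values are an assumption.

# Python 3.12 - Kafka consumer with manual commits (sketch)
import json

from kafka import KafkaConsumer

def consume_events(topic: str, bootstrap_servers: list[str]) -> None:
    """Consume JSON messages and commit offsets only after processing succeeds."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        group_id="ingestion-service",          # illustrative group id
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
        auto_offset_reset="earliest",
        enable_auto_commit=False,              # commit manually after work succeeds
    )
    for message in consumer:
        record = message.value
        # ... write the record to the landing zone here ...
        consumer.commit()

# Usage (illustrative brokers and topic)
# consume_events("transactions", ["broker1:9092", "broker2:9092"])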

Data Ingestion Patterns

Full Load
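
A full load replaces the target with a complete copy of the source on every run; it is simple and idempotent, and fine while the dataset stays small. Below is a minimal sketch via pandas and SQLAlchemy; the WAREHOUSE_URL variable and table name are assumptions.

# Python 3.12 - full load (sketch)
import os

import pandas as pd
from sqlalchemy import create_engine

def full_load(df: pd.DataFrame, table: str) -> None:
    """Drop-and-recreate style load: the target always mirrors the source."""
    engine = create_engine(os.environ['WAREHOUSE_URL'])  # hypothetical env var
    df.to_sql(table, engine, if_exists='replace', index=False, chunksize=10_000)

# Usage (illustrative table)
# full_load(df, 'staging_transactions')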

Incremental Load
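
An incremental load pulls only rows changed since the last run, tracked by a watermark such as an updated_at timestamp. The sketch below keeps the watermark in a small JSON state file; the file location, the POSTGRES_URL variable, and the column names are assumptions.

# Python 3.12 - watermark-based incremental load (sketch)
import json
import os
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

STATE_FILE = Path('state/transactions_watermark.json')   # hypothetical state location

def incremental_load(table: str, watermark_col: str = 'updated_at') -> pd.DataFrame:
    """Fetch rows newer than the stored watermark, then advance the watermark."""
    engine = create_engine(os.environ['POSTGRES_URL'])    # hypothetical env var
    last = (json.loads(STATE_FILE.read_text())['watermark']
            if STATE_FILE.exists() else '1970-01-01')

    query = text(f'SELECT * FROM {table} WHERE {watermark_col} > :wm ORDER BY {watermark_col}')
    with engine.connect() as conn:
        df = pd.read_sql(query, conn, params={'wm': last})

    if not df.empty:
        STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
        STATE_FILE.write_text(json.dumps({'watermark': str(df[watermark_col].max())}))
    return df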

Best Practices

1. Always use connection pooling

2. Implement retry logic

3. Use appropriate formats (a quick comparison sketch follows this list)

  • CSV: Human-readable, slow

  • Parquet: Columnar, fast, compressed (my preference)

  • JSON: Nested data, APIs

4. Monitor data freshness
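
To make the format point concrete, here is a quick round-trip sketch; pyarrow is assumed to be installed, and the paths are illustrative. Parquet lets downstream readers pull only the columns they need, which CSV cannot do.

# Python 3.12 - CSV vs Parquet round trip (sketch)
import pandas as pd

def save_both(df: pd.DataFrame, stem: str) -> None:
    """Write the same frame as CSV and Parquet for a size/speed comparison."""
    df.to_csv(f"{stem}.csv", index=False)
    df.to_parquet(f"{stem}.parquet", engine="pyarrow", compression="snappy")

def load_columns(stem: str, columns: list[str]) -> pd.DataFrame:
    """Parquet reads just the requested columns; CSV always parses every row in full."""
    return pd.read_parquet(f"{stem}.parquet", columns=columns)

# Usage (illustrative paths and columns)
# save_both(df, "data/transactions")
# amounts = load_columns("data/transactions", ["transaction_id", "amount"])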

Conclusion

Data ingestion is the foundation of every pipeline. The patterns shown here handle 90% of real-world scenarios I encounter.

Key takeaways:

  • Use incremental loads for large datasets

  • Implement retry logic for API calls

  • Monitor data freshness

  • Choose the right format (Parquet > CSV)

  • Always handle errors gracefully

