Data Ingestion Sources
Introduction
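This guide walks through the most common sources a pipeline ingests from: flat files, relational databases, HTTP APIs, cloud object storage, and streaming platforms, with a short Python example for each, followed by the load patterns that apply across all of them.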
Reading Files
CSV Files
# Python 3.12 - Production CSV reading
import logging
from pathlib import Path
from typing import Optional

import pandas as pd


class CSVIngestion:
    """
    CSV ingestion with error handling.
    Handles files from megabytes to gigabytes in size.
    """

    @staticmethod
    def read_csv_robust(
        file_path: Path,
        chunk_size: Optional[int] = None
    ) -> pd.DataFrame:
        """Read a CSV with comprehensive error handling."""
        try:
            # For small files: read in a single pass.
            # Date columns are best parsed downstream with pd.to_datetime
            # once the schema is known; parse_dates=True only affects the
            # index, and infer_datetime_format is deprecated in pandas 2.x.
            if chunk_size is None:
                return pd.read_csv(
                    file_path,
                    encoding='utf-8',
                    na_values=['NULL', 'null', 'N/A', '#N/A', 'nan'],
                    keep_default_na=True,
                )
            # For large files: parse in chunks to bound peak memory during
            # parsing, then concatenate (the final frame must still fit).
            chunks = []
            for chunk in pd.read_csv(file_path, chunksize=chunk_size):
                chunks.append(chunk)
            return pd.concat(chunks, ignore_index=True)
        except UnicodeDecodeError:
            # Fall back to a more permissive single-byte encoding
            logging.warning(f"UTF-8 failed, trying latin-1 for {file_path}")
            return pd.read_csv(file_path, encoding='latin-1')
        except Exception as e:
            logging.error(f"Failed to read {file_path}: {e}")
            raise
# Usage
df = CSVIngestion.read_csv_robust(Path('data/transactions.csv'))

Excel Files
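Excel ingestion follows the same shape. Below is a minimal sketch using pandas.read_excel; it assumes the openpyxl engine is installed for .xlsx files, and the helper name, file path, and sheet name are illustrative.

# Excel reading - requires openpyxl for .xlsx files (pip install openpyxl)
from pathlib import Path

import pandas as pd

def read_excel_robust(file_path: Path, sheet_name: str | int = 0) -> pd.DataFrame:
    """Read one worksheet, mapping common null markers to NaN."""
    return pd.read_excel(
        file_path,
        sheet_name=sheet_name,
        na_values=['NULL', 'null', 'N/A', '#N/A'],
        engine='openpyxl',
    )

# Usage (path is illustrative)
df = read_excel_robust(Path('data/accounts.xlsx'))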
JSON Files
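JSON arrives either as a single document or as JSON Lines (one object per line). Here is a minimal sketch covering both cases; the helper name and file paths are illustrative. pd.json_normalize flattens nested objects into dotted column names.

import json
from pathlib import Path

import pandas as pd

def read_json_robust(file_path: Path, lines: bool = False) -> pd.DataFrame:
    """Read a JSON document or JSON Lines file into a DataFrame."""
    if lines:
        # JSON Lines / NDJSON: one object per line
        return pd.read_json(file_path, lines=True)
    # Single document: flatten nested objects into columns
    with open(file_path, encoding='utf-8') as f:
        payload = json.load(f)
    return pd.json_normalize(payload)

# Usage (path is illustrative)
events = read_json_robust(Path('data/events.jsonl'), lines=True)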
Database Connections
PostgreSQL
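A minimal sketch using SQLAlchemy with the psycopg2 driver and pandas.read_sql_query. The connection string, query, and chunk size are illustrative; in practice, credentials belong in the environment or a secrets manager, not in source code. Chunked reads mirror the CSV approach for large result sets.

# Requires: pip install sqlalchemy psycopg2-binary
import pandas as pd
from sqlalchemy import create_engine, text

# Illustrative DSN - load credentials from the environment in practice
engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/warehouse')

def read_query(query: str, chunk_size: int | None = None) -> pd.DataFrame:
    """Run a query and return the result set as a DataFrame."""
    with engine.connect() as conn:
        if chunk_size is None:
            return pd.read_sql_query(text(query), conn)
        # Stream large result sets in chunks to bound memory while fetching
        chunks = pd.read_sql_query(text(query), conn, chunksize=chunk_size)
        return pd.concat(chunks, ignore_index=True)

# Usage (table name is illustrative)
df = read_query("SELECT * FROM transactions", chunk_size=50_000)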
API Integration
REST APIs
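A minimal sketch with the requests library, assuming bearer-token auth and a simple page-number pagination scheme that returns a JSON array; adapt both to the API you are calling. The retry setup handles transient 429/5xx responses with exponential backoff.

# Requires: pip install requests
import requests
from requests.adapters import HTTPAdapter, Retry

def fetch_all_pages(base_url: str, token: str) -> list[dict]:
    """Fetch every page from an endpoint, retrying transient failures."""
    session = requests.Session()
    retries = Retry(
        total=5,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session.mount('https://', HTTPAdapter(max_retries=retries))
    session.headers['Authorization'] = f'Bearer {token}'

    records: list[dict] = []
    page = 1
    while True:
        resp = session.get(base_url, params={'page': page}, timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:  # an empty page signals the end
            break
        records.extend(batch)
        page += 1
    return records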
GraphQL APIs
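GraphQL rides over plain HTTP POST, so requests is enough; the endpoint, query, and auth header below are illustrative. Note that GraphQL servers return HTTP 200 even when the query itself fails, so the errors key must be checked explicitly.

import requests

def graphql_query(endpoint: str, query: str, variables: dict, token: str) -> dict:
    """POST a GraphQL query and return its data payload."""
    resp = requests.post(
        endpoint,
        json={'query': query, 'variables': variables},
        headers={'Authorization': f'Bearer {token}'},
        timeout=30,
    )
    resp.raise_for_status()
    body = resp.json()
    if 'errors' in body:  # query-level failures still arrive as HTTP 200
        raise RuntimeError(f"GraphQL errors: {body['errors']}")
    return body['data']

# Usage (endpoint and query are illustrative)
orders = graphql_query(
    'https://api.example.com/graphql',
    'query($n: Int!) { orders(first: $n) { id total } }',
    {'n': 100},
    token='...',
)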
Cloud Storage
AWS S3
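A minimal sketch with boto3; the bucket and key names are illustrative, and credentials are assumed to resolve from environment variables or an instance profile. (With s3fs installed, pandas can also read s3:// URLs directly.)

# Requires: pip install boto3
from io import BytesIO

import boto3
import pandas as pd

s3 = boto3.client('s3')  # credentials from env vars / instance profile

def read_csv_from_s3(bucket: str, key: str) -> pd.DataFrame:
    """Download one object and parse it as CSV."""
    obj = s3.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(BytesIO(obj['Body'].read()))

def list_keys(bucket: str, prefix: str) -> list[str]:
    """List all object keys under a prefix, following pagination."""
    paginator = s3.get_paginator('list_objects_v2')
    keys: list[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(item['Key'] for item in page.get('Contents', []))
    return keys

# Usage (bucket and key are illustrative)
df = read_csv_from_s3('my-data-lake', 'raw/transactions/2024-01-01.csv')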
Streaming Data
Kafka Consumer
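A minimal sketch using the kafka-python client; the topic, group id, broker address, and the process() handler are all illustrative. Auto-commit is disabled so offsets are committed only after a record is safely handled, which gives at-least-once delivery.

# Requires: pip install kafka-python
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'transactions',                      # illustrative topic name
    bootstrap_servers=['localhost:9092'],
    group_id='ingestion-service',
    auto_offset_reset='earliest',        # start from the oldest message if no offset exists
    enable_auto_commit=False,            # commit manually after processing
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in consumer:
    record = message.value
    process(record)       # hypothetical downstream handler
    consumer.commit()     # at-least-once: commit only after success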
Data Ingestion Patterns
Full Load
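In a full load, every run replaces the target with a complete snapshot of the source: simple and self-healing, but it rereads everything each time. A minimal sketch with pandas and SQLAlchemy; the engine and table name are illustrative.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/warehouse')

def full_load(df: pd.DataFrame, table: str) -> None:
    """Replace the target table with the current snapshot."""
    # if_exists='replace' drops and recreates the table on every run
    df.to_sql(table, engine, if_exists='replace', index=False, chunksize=10_000)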
Incremental Load
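An incremental load appends only rows newer than a watermark (typically a timestamp or monotonically increasing id) recorded from the previous run. A minimal sketch, assuming the target table already exists, source and target share the watermark column, and the source query carries no WHERE clause of its own; all names are illustrative.

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/warehouse')

def incremental_load(source_query: str, table: str, watermark_col: str) -> None:
    """Append only rows newer than the last loaded watermark."""
    with engine.connect() as conn:
        # Highest watermark already loaded; falls back to epoch on the first run
        last = conn.execute(
            text(f"SELECT COALESCE(MAX({watermark_col}), '1970-01-01') FROM {table}")
        ).scalar()
        new_rows = pd.read_sql_query(
            text(f"{source_query} WHERE {watermark_col} > :wm"),
            conn,
            params={'wm': last},
        )
    if not new_rows.empty:
        new_rows.to_sql(table, engine, if_exists='append', index=False)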
Best Practices
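The same habits recur across every source above: log context on every read and fail loudly, stream or chunk anything that may not fit in memory, retry transient network errors with backoff, keep credentials in the environment or a secrets manager rather than in code, and make loads idempotent so a failed run can simply be repeated.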
Conclusion