Data Engineering Fundamentals


Introduction

When I tell people I'm a data engineer, the most common response is: "So... you do data science?" Not quite. While data scientists focus on extracting insights from data, data engineers build and maintain the infrastructure that makes that analysis possible.

This article covers what data engineering really is, based on my experience building systems that process millions of records daily.

What is Data Engineering?

My definition: Data engineering is the practice of designing, building, and maintaining systems that collect, store, process, and serve data reliably and at scale.

The plumbing analogy: If data is water and analytics are the faucets, data engineers build the pipes, pumps, filtration systems, and monitoring infrastructure that ensure clean water flows reliably when you turn on the tap.

The Data Engineering Lifecycle

From my experience, every data engineering project follows this lifecycle:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Source    │───▶│  Ingestion  │───▶│ Transform   │───▶│   Storage   │───▶│   Serving   │
│   Systems   │    │  (Extract)  │    │  (Process)  │    │   (Load)    │    │ (Analytics) │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
       │                   │                   │                   │                   │
       └───────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                         Orchestration
                                      Quality & Monitoring

1. Source Systems

  • Databases (PostgreSQL, MySQL, MongoDB)

  • APIs (REST, GraphQL)

  • Files (CSV, JSON, Parquet)

  • Streaming platforms (Kafka, Kinesis)

  • Third-party services (Salesforce, Stripe, Google Analytics)

2. Ingestion

  • Extracting data from sources

  • Handling different formats and protocols

  • Managing authentication and rate limits

  • Implementing incremental loads vs full refreshes
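
To make the last point concrete, here is a minimal sketch of an incremental load driven by a watermark column. The table and column names (`orders`, `updated_at`) are hypothetical; the pattern is what matters: pull only rows newer than the last watermark, then advance it.

```python
import sqlite3

def extract_incremental(conn, watermark):
    """Pull only rows changed since the last run, then advance the watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # New watermark is the max updated_at we saw; keep the old one if no rows.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# Demo with an in-memory source table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "a", "2024-01-01"), (2, "b", "2024-01-02"), (3, "c", "2024-01-03")],
)

rows, wm = extract_incremental(conn, "2024-01-01")  # only rows 2 and 3 qualify
```

A full refresh would simply drop the `WHERE` clause; the trade-off is simplicity versus reprocessing everything on every run.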

3. Transformation

  • Cleaning and validating data

  • Joining data from multiple sources

  • Aggregating and computing metrics

  • Applying business logic
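
A toy transformation step covering these bullets might look like the following. The field names (`country`, `amount`) and the "per-country revenue" metric are invented for illustration:

```python
from collections import defaultdict

def transform(records):
    """Clean, validate, and aggregate raw records into per-country revenue."""
    totals = defaultdict(float)
    for rec in records:
        # Cleaning: normalize casing and whitespace
        country = rec.get("country", "").strip().upper()
        # Validation: skip records missing required fields
        if not country or rec.get("amount") is None:
            continue
        # Aggregation / business logic
        totals[country] += float(rec["amount"])
    return dict(totals)

raw = [
    {"country": " us ", "amount": "10.5"},
    {"country": "US", "amount": 4.5},
    {"country": "", "amount": 99},      # invalid: dropped
    {"country": "DE", "amount": None},  # invalid: dropped
]
result = transform(raw)  # {"US": 15.0}
```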

4. Storage

  • Data warehouses (Snowflake, Redshift, BigQuery)

  • Data lakes (S3, Azure Data Lake)

  • OLTP databases (for serving layers)

  • Caching layers (Redis, Memcached)

5. Serving

  • Exposing data via APIs

  • Powering dashboards and reports

  • Feeding ML models

  • Supporting real-time applications

The Role of a Data Engineer

Here's what I actually do day-to-day (not the job description version):

Core Responsibilities

Building Data Pipelines
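
At its simplest, a pipeline is extract, transform, and load composed together. The sketch below uses in-memory stand-ins for the source and the warehouse; real pipelines swap in database reads, API calls, and warehouse writes:

```python
def extract():
    # Stand-in for reading from a database, API, or file
    return [{"user": "alice", "clicks": 3}, {"user": "bob", "clicks": 7}]

def transform(rows):
    # Apply business logic: flag heavy users
    return [{**r, "heavy_user": r["clicks"] > 5} for r in rows]

def load(rows, sink):
    # Stand-in for writing to a warehouse table
    sink.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(extract()), warehouse)  # runs the full pipeline
```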

Data Modeling & Architecture

Designing schemas and storage layouts for performance and scalability.
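
One concrete piece of that work is adding indexes so the most common query patterns don't scan whole tables. A sketch with SQLite (the `events` schema is hypothetical); `EXPLAIN QUERY PLAN` confirms the index is actually used:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT, ts TEXT)")
# Composite index matching the most common filter pattern
conn.execute("CREATE INDEX idx_events_user_ts ON events (user_id, ts)")

# EXPLAIN QUERY PLAN shows whether SQLite can use the index
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = ? AND ts > ?",
    (1, "2024-01-01"),
).fetchall()
plan_text = " ".join(str(row) for row in plan)
```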

Data Quality & Monitoring

What I've learned: If you're not monitoring your data quality, you're shipping bad data to production.
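
A lightweight version of such a check, run on a batch before it is published (field names and thresholds are illustrative):

```python
def quality_report(rows, required=("id", "email")):
    """Count basic quality problems in a batch before publishing it."""
    seen_ids = set()
    report = {"rows": len(rows), "missing_fields": 0, "duplicate_ids": 0}
    for row in rows:
        if any(row.get(f) in (None, "") for f in required):
            report["missing_fields"] += 1
        if row.get("id") in seen_ids:
            report["duplicate_ids"] += 1
        seen_ids.add(row.get("id"))
    return report

batch = [
    {"id": 1, "email": "a@x.com"},
    {"id": 1, "email": "b@x.com"},  # duplicate id
    {"id": 2, "email": ""},         # missing email
]
report = quality_report(batch)
# In production, ship these counts to a metrics system and alert on thresholds.
```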

Data Engineer vs Data Scientist vs Data Analyst

From my experience working with all three roles:

| Aspect | Data Engineer | Data Scientist | Data Analyst |
|---|---|---|---|
| Focus | Infrastructure & pipelines | Models & insights | Reporting & analysis |
| Primary language | Python, SQL, Scala | Python, R | SQL, Excel, BI tools |
| Daily tasks | Build ETL pipelines, optimize queries, monitor systems | Train models, run experiments, deploy ML | Create dashboards, analyze trends, answer business questions |
| Success metric | Pipeline reliability, query performance, data freshness | Model accuracy, business impact | Actionable insights, stakeholder satisfaction |
| Pain points | 3 AM pipeline failures, data quality issues, scaling challenges | Dirty data, production deployment, model drift | Data unavailability, unclear requirements |

The reality: These roles overlap significantly. I write SQL queries (analyst work) and deploy ML models (data scientist work) regularly. The best teams have cross-functional skills.

Key Skills for Data Engineers

Based on what I use daily:

Technical Skills

1. Programming (Python)

  • Data manipulation: pandas, numpy

  • Database interaction: SQLAlchemy, psycopg2

  • API development: FastAPI, Flask

  • Testing: pytest, unittest

2. SQL & Databases

  • Advanced SQL (CTEs, window functions, query optimization)

  • Database design (normalization, indexing, partitioning)

  • Multiple database systems (PostgreSQL, MySQL, MongoDB)

  • Data warehouses (Snowflake, Redshift, BigQuery)
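
To make the first bullet concrete, here is a CTE combined with a window function, run through Python's built-in sqlite3 module (the `sales` table is invented for the example; window functions need SQLite 3.25+, which ships with any recent Python):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("east", 300), ("west", 200)],
)

# CTE computes per-region totals; the window function ranks regions by total.
query = """
WITH region_totals AS (
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
)
SELECT region, total,
       RANK() OVER (ORDER BY total DESC) AS rnk
FROM region_totals
ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
```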

3. Data Tools

  • Orchestration: Apache Airflow, Prefect

  • Processing: Apache Spark, Pandas

  • Streaming: Kafka, Kinesis

  • Version control: Git

4. Cloud Platforms

  • AWS: S3, Redshift, Glue, Lambda

  • Azure: Data Factory, Synapse, Blob Storage

  • GCP: BigQuery, Dataflow, Cloud Storage

Soft Skills

Communication

  • Explaining technical concepts to non-technical stakeholders

  • Writing clear documentation

  • Collaborating with data scientists and analysts

Problem-solving

  • Debugging production issues under pressure

  • Optimizing slow queries

  • Designing scalable solutions

Business acumen

  • Understanding data requirements

  • Prioritizing features by impact

  • Balancing technical debt vs new features

The Data Engineering Mindset

After years in this field, these principles guide my work:

1. Think in pipelines. Every data flow is a pipeline. Design for:

  • Idempotency (re-running doesn't cause issues)

  • Incremental processing (don't reprocess everything)

  • Failure recovery (graceful degradation)
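
Idempotency in particular is often achieved with an upsert keyed on a natural id, so re-running a load leaves the table in the same final state. A sketch using SQLite's ON CONFLICT clause (the `users` table is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def load(rows):
    # Upsert: re-running the same batch produces the same final state
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        rows,
    )

batch = [(1, "alice"), (2, "bob")]
load(batch)
load(batch)  # second run changes nothing: still exactly two rows
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```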

2. Data quality is paramount. Bad data is worse than no data. Always:

  • Validate inputs

  • Test transformations

  • Monitor outputs

3. Optimize for maintainability. Code is read more than written. Prioritize:

  • Clear naming

  • Comprehensive documentation

  • Modular design

4. Automate everything. If you do it twice, automate it:

  • Testing

  • Deployment

  • Monitoring

5. Monitor and alert. You can't fix what you can't see:

  • Log everything

  • Track metrics

  • Set up alerts
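
A minimal version of "log everything, track metrics, alert" using only the standard library. The processing step and the 10% alert threshold are stand-ins; in production the counters would go to a metrics system and the error log would page someone:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

metrics = {"rows_processed": 0, "rows_failed": 0}

def process(rows):
    for row in rows:
        try:
            int(row["value"])  # stand-in for real processing
            metrics["rows_processed"] += 1
        except (KeyError, ValueError):
            metrics["rows_failed"] += 1
            log.warning("bad row: %r", row)
    # Alerting hook: flag the run if the failure rate is too high
    failure_rate = metrics["rows_failed"] / max(len(rows), 1)
    if failure_rate > 0.1:
        log.error("failure rate %.0f%% exceeds threshold", failure_rate * 100)
    return failure_rate

rate = process([{"value": "1"}, {"value": "oops"}, {}])
```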

Common Challenges

Real problems I've faced:

1. Data quality issues

  • Missing fields, incorrect formats, duplicate records

  • Solution: Comprehensive validation at ingestion

2. Scalability

  • Queries timing out, pipelines taking too long

  • Solution: Partitioning, indexing, incremental loads

3. Pipeline failures

  • Network issues, API rate limits, database locks

  • Solution: Retry logic, error handling, monitoring
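
Retry logic usually ends up as a small decorator wrapped around flaky calls. A sketch with exponential backoff (the delay values and the `flaky_fetch` function are illustrative):

```python
import functools
import time

def retry(attempts=3, base_delay=0.01):
    """Retry a flaky call with exponential backoff between attempts."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts - 1:
                        raise  # out of retries: surface the error
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator

calls = {"n": 0}

@retry(attempts=3)
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return "ok"

result = flaky_fetch()  # succeeds on the third attempt
```

In real pipelines you would catch only the exception types that are genuinely transient (timeouts, rate limits), not every `Exception`.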

4. Changing requirements

  • Schema changes, new data sources, evolving metrics

  • Solution: Flexible architecture, version control, testing

5. Data freshness

  • Delays in data availability, long-running jobs

  • Solution: Incremental processing, parallelization

Career Path

My journey:

  1. Started as data analyst (SQL, Excel, dashboards)

  2. Learned Python and automation

  3. Built first ETL pipeline (learned about failures the hard way)

  4. Architected data warehouse

  5. Now: Lead data engineer managing infrastructure for multiple teams

Typical progression:

  • Junior Data Engineer → Data Engineer → Senior Data Engineer → Staff/Principal Engineer

  • Or: Data Engineer → Lead Data Engineer → Engineering Manager → Director of Data Engineering

Conclusion

Data engineering is challenging, rewarding, and constantly evolving. The fundamentals of pipelines, quality, and scalability remain constant even as tools change.

What I love:

  • Building systems that enable data-driven decisions

  • Solving complex technical challenges

  • Seeing the impact of reliable data infrastructure

What's hard:

  • On-call rotations and production incidents

  • Balancing technical debt with new features

  • Keeping up with rapidly evolving technology

My advice: Start simple, focus on fundamentals, learn from failures, and never stop building.

