ETL vs ELT Pipeline: Understanding the Differences

When I started out in data engineering five years ago, I struggled to understand when to use an ETL pipeline versus an ELT pipeline. After building dozens of data pipelines with AWS, Databricks, and Python, I've learned the critical differences through both successes and painful failures.

In this post, I'll share my hands-on experience with both approaches and provide practical examples using the tools that have become essential in my data engineering toolkit: AWS S3, AWS Lambda, Python, Databricks, and PySpark.

Understanding ETL and ELT: Beyond the Acronyms

Let's start with the basics. Both ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are data integration patterns, but they differ fundamentally in where the transformation happens.

ETL: The Traditional Approach I Started With

In ETL, the workflow follows these steps:

  1. Extract: Pull data from source systems (databases, APIs, files)

  2. Transform: Process the data in a separate transformation layer

  3. Load: Move the clean, transformed data to its destination

When I first built ETL pipelines, I was working with sensitive healthcare data that required significant cleansing, validation, and masking before it could be stored in our data warehouse. ETL was perfect because it allowed us to handle these transformations before data ever reached the warehouse.
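
To make the ordering concrete, here's a minimal pandas sketch of that Extract → Transform → Load sequence (the file paths and column names are made up for illustration; the real pipeline I built appears later in this post):

import pandas as pd

def run_etl(source_path: str, destination_path: str) -> None:
    # Extract: pull raw records from the source system
    df = pd.read_csv(source_path)

    # Transform: cleanse and standardize in a separate processing step,
    # before the data ever touches the destination
    df["patient_name"] = df["patient_name"].str.strip().str.upper()
    df = df.dropna(subset=["patient_name", "date_of_birth"])

    # Load: only the transformed result reaches the destination
    df.to_parquet(destination_path, index=False)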

ELT: The Modern Approach That Changed My Work

In ELT, the sequence is:

  1. Extract: Pull raw data from source systems

  2. Load: Store this raw data in the target system immediately

  3. Transform: Leverage the target system's processing power to transform data

My transition to ELT happened when I joined a tech startup building analytics on massive clickstream datasets. Trying to transform terabytes of data before loading would have been prohibitively expensive and slow.
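
The same idea expressed as a minimal PySpark sketch (paths and column names are again hypothetical): the raw data lands first, untouched, and the reshaping happens later inside the platform that stores it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract + Load: land the raw events in the lake exactly as they arrive
raw_df = spark.read.json("s3://example-bucket/raw/events/")
raw_df.write.mode("append").saveAsTable("raw_events")

# Transform: reshape on demand, using the target platform's own compute
spark.sql("""
    CREATE OR REPLACE VIEW daily_active_users AS
    SELECT DATE(event_timestamp) AS event_date,
           COUNT(DISTINCT user_id) AS dau
    FROM raw_events
    GROUP BY DATE(event_timestamp)
""")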

My ETL Implementation: Solving a Real HIPAA Compliance Challenge

Last year, I faced a challenging project: building a pipeline to process sensitive healthcare data that needed to follow strict HIPAA compliance rules. The data arrived in encrypted CSV files and needed significant cleansing and transformation before being made available for analytics.

This was a perfect scenario for ETL since we needed to:

  • Decrypt the data securely

  • Validate and transform fields

  • Mask sensitive information

  • Apply standard formatting

Here's how I built this ETL pipeline using AWS Lambda and S3:

The Architecture I Designed

  1. Source: Encrypted CSV files arriving in a secure S3 bucket

  2. Trigger: S3 event notifications triggering AWS Lambda (see the wiring sketch after this list)

  3. Processing: Lambda function decrypting and transforming the data

  4. Destination: Transformed data loaded to our analytics database
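
For completeness, this is roughly how the trigger in step 2 can be wired up with boto3 (the bucket name, prefix, and function ARN are placeholders; in practice this usually lives in infrastructure-as-code rather than a script). The Lambda function also needs a resource policy allowing S3 to invoke it, which I'm leaving out here.

import boto3

s3 = boto3.client("s3")

# Notify the ETL Lambda whenever a new CSV lands in the bucket.
# Bucket name, prefix, and function ARN below are placeholders.
s3.put_bucket_notification_configuration(
    Bucket="my-secure-landing-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:etl-processor",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "incoming/"},
                            {"Name": "suffix", "Value": ".csv"},
                        ]
                    }
                },
            }
        ]
    },
)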

My AWS Lambda Function for the ETL Process

This is the actual code I wrote (with sensitive details changed) that handled our ETL process:

import boto3
import pandas as pd
import io
import base64
from cryptography.fernet import Fernet
import re

# Initialize AWS clients
s3 = boto3.client('s3')
ssm = boto3.client('ssm')

def lambda_handler(event, context):
    """
    My ETL function that:
    1. Extracts encrypted data from S3
    2. Decrypts it using keys from Parameter Store
    3. Transforms and cleanses the data
    4. Loads it to the destination database
    """
    # Get bucket and file info from the S3 event
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    source_key = event['Records'][0]['s3']['object']['key']
    
    print(f"Processing file {source_key} from bucket {source_bucket}")
    
    # Extract - Get the encrypted file from S3
    response = s3.get_object(Bucket=source_bucket, Key=source_key)
    encrypted_data = response['Body'].read()
    
    # Get decryption key from Parameter Store (more secure than hardcoding)
    decrypt_key_response = ssm.get_parameter(
        Name='/etl/decrypt-key',
        WithDecryption=True
    )
    decrypt_key = decrypt_key_response['Parameter']['Value']
    
    # Decrypt the data
    fernet = Fernet(decrypt_key.encode())
    decrypted_data = fernet.decrypt(encrypted_data)
    
    # Convert to pandas DataFrame for easier transformation
    df = pd.read_csv(io.BytesIO(decrypted_data))
    
    # Transform - Here's where the ETL magic happens
    # 1. Normalize patient names to uppercase
    df['patient_name'] = df['patient_name'].str.upper()
    
    # 2. Standardize phone number formats
    df['phone_number'] = df['phone_number'].apply(
        lambda x: re.sub(r'[^0-9]', '', str(x)) if pd.notnull(x) else None
    )
    
    # 3. Mask SSNs for HIPAA compliance (show only last 4 digits)
    df['ssn'] = df['ssn'].apply(
        lambda x: f"XXX-XX-{str(x)[-4:]}" if pd.notnull(x) and len(str(x)) >= 4 else None
    )
    
    # 4. Validate date formats and convert to standard format
    df['date_of_birth'] = pd.to_datetime(df['date_of_birth'], errors='coerce').dt.strftime('%Y-%m-%d')
    
    # 5. Add data quality flags
    df['data_quality_score'] = df.apply(
        lambda row: calculate_quality_score(row), axis=1
    )
    
    # Load - Save transformed data to destination
    # For this example, I'll just save to another S3 bucket
    # In production, I'd typically load to a database
    output_buffer = io.StringIO()
    df.to_csv(output_buffer, index=False)
    
    s3.put_object(
        Body=output_buffer.getvalue(),
        Bucket='my-processed-data-bucket',
        Key=f"processed/{source_key.split('/')[-1]}"
    )
    
    return {
        'statusCode': 200,
        'body': f"Successfully processed {source_key}"
    }

def calculate_quality_score(row):
    """Helper function to calculate a simple data quality score"""
    score = 100
    
    # Reduce score for missing critical fields
    for field in ['patient_name', 'date_of_birth', 'diagnosis_code']:
        if pd.isnull(row[field]):
            score -= 25
    
    # Reduce score for potentially invalid data
    if pd.notnull(row['phone_number']) and len(row['phone_number']) != 10:
        score -= 10
        
    return max(0, score)  # Don't go below zero

My ELT Implementation: When Scale and Flexibility Matter

When I joined a fast-growing streaming service, our data needs were vastly different. We had:

  • Terabytes of user behavior data flowing in daily

  • Constantly changing business requirements for analytics

  • A team of data scientists who needed flexible access to raw data

This time, ELT made much more sense. Here's the ELT architecture I built using AWS S3, Databricks, and PySpark:

The Architecture I Created

  1. Extract: Capture raw logs and events from our platforms

  2. Load: Store everything in its raw form in our data lake on S3

  3. Transform: Use Databricks to transform data on-demand for different needs

My Databricks ELT Implementation

Here's a simplified version of the PySpark code I wrote for our ELT pipeline:

from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Streaming Service ELT Pipeline") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# EXTRACT - Just read the raw data
# Notice we're not doing any transformation yet - that's the ELT difference!
raw_events_df = spark.read.format("json") \
    .option("inferSchema", "true") \
    .load("s3://my-data-lake/raw/user-events/year=2025/month=06/day=*/")

# LOAD - Save the data immediately in its raw form to our processed zone
# We're using Delta Lake format for ACID transactions and time travel capabilities
raw_events_df.write \
    .format("delta") \
    .mode("append") \
    .partitionBy("event_date") \
    .save("s3://my-data-lake/processed/user-events/")

# Register as a table for SQL access
spark.sql("""
CREATE TABLE IF NOT EXISTS raw_user_events
USING DELTA
LOCATION 's3://my-data-lake/processed/user-events/'
""")

# TRANSFORM - Now we can create different views for different business needs
# All these transformations happen in Databricks, not before loading

# Example 1: Create a view for content recommendation team
spark.sql("""
CREATE OR REPLACE VIEW recommendation_team_view AS
SELECT 
    user_id,
    content_id,
    TO_TIMESTAMP(event_timestamp) as event_time,
    watch_duration_seconds,
    device_type,
    platform
FROM 
    raw_user_events
WHERE 
    event_type = 'content_watched'
    AND event_date >= CURRENT_DATE - INTERVAL 30 DAYS
""")

# Example 2: Create aggregated dataset for executive dashboards
exec_dashboard_df = spark.sql("""
SELECT
    DATE_TRUNC('day', TO_TIMESTAMP(event_timestamp)) as date,
    COUNT(DISTINCT user_id) as daily_active_users,
    COUNT(DISTINCT CASE WHEN event_type = 'subscription_started' THEN user_id END) as new_subscribers,
    COUNT(DISTINCT CASE WHEN event_type = 'subscription_canceled' THEN user_id END) as churned_subscribers,
    SUM(CASE WHEN event_type = 'content_watched' THEN watch_duration_seconds ELSE 0 END) / 3600 as total_watch_hours
FROM
    raw_user_events
WHERE
    event_date >= CURRENT_DATE - INTERVAL 90 DAYS
GROUP BY
    DATE_TRUNC('day', TO_TIMESTAMP(event_timestamp))
ORDER BY
    date DESC
""")

# Save the executive dashboard dataset to a more optimized format
exec_dashboard_df.write \
    .format("delta") \
    .mode("overwrite") \
    .save("s3://my-data-lake/gold/executive_dashboard/")

When to Use ETL vs. ELT: Lessons from My Experience

After working with both patterns extensively, here's what I've learned about when to use each approach:

Choose ETL When:

  1. Data privacy is paramount - When I need to mask PII or apply encryption before storage

  2. Source data quality is poor - When data needs significant cleansing before it's usable

  3. Transformation is complex but data volume is manageable - When complex business rules must be applied to relatively small datasets

  4. Target systems have limited computing power - When the destination system isn't designed for heavy processing

Choose ELT When:

  1. Data volumes are massive - When processing terabytes before loading would be inefficient

  2. Requirements change frequently - When business needs evolve and you need flexibility to reprocess raw data

  3. You have a modern data platform - When using tools like Databricks, Snowflake, or BigQuery that excel at in-platform transformations

  4. You need to support diverse analytical needs - When different teams need different views of the same raw data

The Hybrid Approach That Saved My Project

In my most recent project, I actually implemented a hybrid approach:

  1. Light ETL - Using AWS Lambda to handle initial validation and sensitive data masking

  2. ELT core - Loading semi-processed data to Databricks and using PySpark for most transformations

# Example Lambda code for the light ETL step
import boto3

s3 = boto3.client('s3')
# databricks_client is a pre-configured client for the Databricks Jobs API;
# mask_pii_fields() and ELT_JOB_ID are defined elsewhere in the module.

def lambda_handler(event, context):
    # Extract from S3 event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    
    # Get the data
    response = s3.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read().decode('utf-8')
    
    # Minimal transformation - just mask sensitive PII
    masked_content = mask_pii_fields(content)
    
    # Load to "silver" area for further processing in Databricks
    output_key = key.replace('bronze/', 'silver/')
    s3.put_object(Body=masked_content, Bucket=bucket, Key=output_key)
    
    # Trigger Databricks job for the ELT portion
    databricks_client.run_now(job_id=ELT_JOB_ID, notebook_params={
        "input_path": f"s3://{bucket}/{output_key}",
        "output_path": f"s3://{bucket}/{output_key.replace('silver/', 'gold/')}"
    })
    
    return {'statusCode': 200, 'body': 'Pipeline started successfully'}
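
In the snippet above, databricks_client is just a thin wrapper around the Databricks Jobs API. With the official databricks-sdk package, the equivalent trigger call looks roughly like this (the job ID and paths are placeholders):

from databricks.sdk import WorkspaceClient

# Authenticates via environment variables or a Databricks config profile
w = WorkspaceClient()

# Kick off the ELT job, passing input/output locations as notebook parameters
w.jobs.run_now(
    job_id=123456789,
    notebook_params={
        "input_path": "s3://my-bucket/silver/events.csv",
        "output_path": "s3://my-bucket/gold/events.csv",
    },
)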

The Future: How I'm Evolving My Data Pipelines

The line between ETL and ELT continues to blur. With tools like AWS Glue, Databricks, and modern cloud data warehouses, the decision isn't always binary. Here's what I'm focusing on for my next generation of data pipelines:

  1. Data contracts to ensure quality at the source (a minimal sketch follows this list)

  2. Streaming architectures that blur the line between batch ETL and ELT

  3. Automated testing and quality monitoring regardless of approach

  4. Metadata-driven pipelines that can adapt their behavior based on the data itself
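
As a small illustration of the first item, a data contract can start as nothing more than an explicit schema that incoming data must satisfy before it enters the pipeline. Here's a minimal PySpark sketch (field names are hypothetical, and spark is the usual Databricks session):

from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

# The "contract": the fields and types the pipeline expects from the source team
user_event_contract = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_timestamp", TimestampType()),
    StructField("watch_duration_seconds", LongType()),
])

# Enforce the contract at read time instead of inferring the schema;
# FAILFAST makes the read error out on records that don't parse against it
events_df = (
    spark.read
    .schema(user_event_contract)
    .option("mode", "FAILFAST")
    .json("s3://my-data-lake/raw/user-events/")
)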

Both ETL and ELT have their place in the modern data engineering toolkit. The key is understanding your specific requirements around data volume, quality, privacy, and analytical needs.

What's your experience with ETL vs ELT? I'd love to hear about your challenges and solutions in the comments!
