ETL vs ELT Pipeline: Understanding the Differences
When I first started working in data engineering five years ago, I struggled to understand when to use an ETL pipeline versus an ELT one. After building dozens of pipelines with AWS, Databricks, and Python, I've learned the critical differences through both successes and painful failures.
In this post, I'll share my hands-on experience with both approaches and provide practical examples using the tools that have become essential in my data engineering toolkit: AWS S3, AWS Lambda, Python, Databricks, and PySpark.
Understanding ETL and ELT: Beyond the Acronyms
Let's start with the basics. Both ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are data integration patterns, but they differ fundamentally in where the transformation happens.
ETL: The Traditional Approach I Started With
In ETL, the workflow follows these steps (sketched in code just after the list):
Extract: Pull data from source systems (databases, APIs, files)
Transform: Process the data in a separate transformation layer
Load: Move the clean, transformed data to its destination
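To make the ordering concrete, here's a minimal sketch in plain Python. The extract/transform/load functions are hypothetical in-memory stand-ins, not code from a real pipeline; the point is that data is transformed before it ever reaches the destination.

# Minimal ETL sketch with in-memory stand-ins for real systems.
# The ordering is the whole point: transform happens *before* the load.

def extract():
    # Stand-in for pulling rows from a database, API, or file
    return [{"name": "alice smith", "ssn": "123456789"}]

def transform(rows):
    # Cleanse and mask in a separate layer, before anything is loaded
    return [{"name": r["name"].upper(), "ssn": f"XXX-XX-{r['ssn'][-4:]}"} for r in rows]

def load(rows):
    # Stand-in for writing to the warehouse; only clean rows ever land here
    print(f"Loading {len(rows)} transformed rows")

load(transform(extract()))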
When I first built ETL pipelines, I was working with sensitive healthcare data that required significant cleansing, validation, and masking before it could be stored in our data warehouse. ETL was perfect because it allowed us to handle these transformations before data ever reached the warehouse.
ELT: The Modern Approach That Changed My Work
In ELT, the sequence is (again sketched in code after the list):
Extract: Pull raw data from source systems
Load: Store this raw data in the target system immediately
Transform: Leverage the target system's processing power to transform data
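And here's the same idea flipped around, with the same kind of hypothetical stand-ins. The only change is where the transformation runs: raw data lands first, and the heavy lifting happens later, inside the target platform.

# Minimal ELT sketch: load raw data first, transform inside the target later.

def extract():
    return [{"user": "u1", "event": "page_view"}, {"user": "u2", "event": "click"}]

def load_raw(rows):
    # Stand-in for landing raw records in the lake/warehouse exactly as received
    print(f"Landed {len(rows)} raw rows")
    return rows

def transform_in_target(rows):
    # Stand-in for a SQL or Spark job running inside the target system, on demand
    return [r for r in rows if r["event"] == "page_view"]

page_views = transform_in_target(load_raw(extract()))
print(f"{len(page_views)} page views after in-warehouse transformation")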
My transition to ELT happened when I joined a tech startup building analytics on massive clickstream datasets. Trying to transform terabytes of data before loading would have been prohibitively expensive and slow.
My ETL Implementation: Solving a Real HIPAA Compliance Challenge
Last year, I faced a challenging project: building a pipeline to process sensitive healthcare data that needed to follow strict HIPAA compliance rules. The data arrived in encrypted CSV files and needed significant cleansing and transformation before being made available for analytics.
This was a perfect scenario for ETL since we needed to:
Decrypt the data securely
Validate and transform fields
Mask sensitive information
Apply standard formatting
Here's how I built this ETL pipeline using AWS Lambda and S3:
The Architecture I Designed
Source: Encrypted CSV files arriving in a secure S3 bucket
Trigger: S3 event notifications triggering AWS Lambda (the trigger wiring is sketched right after this list)
Processing: Lambda function decrypting and transforming the data
Destination: Transformed data loaded to our analytics database
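The part of this architecture that tends to trip people up is the trigger wiring. Here's a sketch of how the S3 event notification could be attached to the Lambda with boto3; the bucket name, Lambda ARN, and prefix are placeholders, and in practice I'd define this in infrastructure-as-code rather than a one-off script.

import boto3

s3 = boto3.client('s3')

# Placeholder bucket, Lambda ARN, and prefix -- substitute your own.
s3.put_bucket_notification_configuration(
    Bucket='my-secure-inbound-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'Id': 'etl-on-new-encrypted-csv',
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:etl-decrypt-transform',
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [{'Name': 'prefix', 'Value': 'incoming/'}]}},
        }]
    }
)

One gotcha: S3 can only invoke the function once the Lambda has a resource-based policy allowing it (lambda add-permission), which is easy to forget the first time.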
My AWS Lambda Function for the ETL Process
Here's the actual code I wrote (with sensitive details changed) to handle our ETL process:
import boto3
import pandas as pd
import io
import base64
from cryptography.fernet import Fernet
import re
# Initialize AWS clients
s3 = boto3.client('s3')
ssm = boto3.client('ssm')
def lambda_handler(event, context):
"""
My ETL function that:
1. Extracts encrypted data from S3
2. Decrypts it using keys from Parameter Store
3. Transforms and cleanses the data
4. Loads it to the destination database
"""
# Get bucket and file info from the S3 event
source_bucket = event['Records'][0]['s3']['bucket']['name']
source_key = event['Records'][0]['s3']['object']['key']
print(f"Processing file {source_key} from bucket {source_bucket}")
# Extract - Get the encrypted file from S3
response = s3.get_object(Bucket=source_bucket, Key=source_key)
encrypted_data = response['Body'].read()
# Get decryption key from Parameter Store (more secure than hardcoding)
decrypt_key_response = ssm.get_parameter(
Name='/etl/decrypt-key',
WithDecryption=True
)
decrypt_key = decrypt_key_response['Parameter']['Value']
# Decrypt the data
fernet = Fernet(decrypt_key.encode())
decrypted_data = fernet.decrypt(encrypted_data)
# Convert to pandas DataFrame for easier transformation
df = pd.read_csv(io.BytesIO(decrypted_data))
# Transform - Here's where the ETL magic happens
# 1. Normalize patient names to uppercase
df['patient_name'] = df['patient_name'].str.upper()
# 2. Standardize phone number formats
df['phone_number'] = df['phone_number'].apply(
lambda x: re.sub(r'[^0-9]', '', str(x)) if pd.notnull(x) else None
)
# 3. Mask SSNs for HIPAA compliance (show only last 4 digits)
df['ssn'] = df['ssn'].apply(
lambda x: f"XXX-XX-{str(x)[-4:]}" if pd.notnull(x) and len(str(x)) >= 4 else None
)
# 4. Validate date formats and convert to standard format
df['date_of_birth'] = pd.to_datetime(df['date_of_birth'], errors='coerce').dt.strftime('%Y-%m-%d')
# 5. Add data quality flags
df['data_quality_score'] = df.apply(
lambda row: calculate_quality_score(row), axis=1
)
# Load - Save transformed data to destination
# For this example, I'll just save to another S3 bucket
# In production, I'd typically load to a database
output_buffer = io.StringIO()
df.to_csv(output_buffer, index=False)
s3.put_object(
Body=output_buffer.getvalue(),
Bucket='my-processed-data-bucket',
Key=f"processed/{source_key.split('/')[-1]}"
)
return {
'statusCode': 200,
'body': f"Successfully processed {source_key}"
}
def calculate_quality_score(row):
"""Helper function to calculate a simple data quality score"""
score = 100
# Reduce score for missing critical fields
for field in ['patient_name', 'date_of_birth', 'diagnosis_code']:
if pd.isnull(row[field]):
score -= 25
# Reduce score for potentially invalid data
if pd.notnull(row['phone_number']) and len(row['phone_number']) != 10:
score -= 10
return max(0, score) # Don't go below zero
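Before wiring this up to real S3 events, I like to exercise the handler locally with a synthetic event. A minimal smoke test, assuming AWS credentials are configured and the (placeholder) object below actually exists and holds a Fernet-encrypted CSV:

# Hand-built S3 event for a local smoke test; bucket and key are placeholders.
if __name__ == "__main__":
    fake_event = {
        "Records": [{
            "s3": {
                "bucket": {"name": "my-secure-inbound-bucket"},
                "object": {"key": "incoming/sample_patients.csv"}
            }
        }]
    }
    print(lambda_handler(fake_event, context=None))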
My ELT Implementation: When Scale and Flexibility Matter
When I joined a fast-growing streaming service, our data needs were vastly different. We had:
Terabytes of user behavior data flowing in daily
Constantly changing business requirements for analytics
A team of data scientists who needed flexible access to raw data
This time, ELT made much more sense. Here's the ELT architecture I built using AWS S3, Databricks, and PySpark:
The Architecture I Created
Extract: Capture raw logs and events from our platforms
Load: Store everything in its raw form in our data lake on S3
Transform: Use Databricks to transform data on-demand for different needs
My Databricks ELT Implementation
Here's a simplified version of the PySpark code I wrote for our ELT pipeline:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, json_tuple, to_timestamp, unix_timestamp
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, ArrayType
# Initialize Spark Session
spark = SparkSession.builder \
.appName("Streaming Service ELT Pipeline") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
# EXTRACT - Just read the raw data
# Notice we're not doing any transformation yet - that's the ELT difference!
raw_events_df = spark.read.format("json") \
.option("inferSchema", "true") \
.load("s3://my-data-lake/raw/user-events/year=2025/month=06/day=*/")
# LOAD - Save the data immediately in its raw form to our processed zone
# We're using Delta Lake format for ACID transactions and time travel capabilities
raw_events_df.write \
.format("delta") \
.mode("append") \
.partitionBy("event_date") \
.save("s3://my-data-lake/processed/user-events/")
# Register as a table for SQL access
spark.sql("""
CREATE TABLE IF NOT EXISTS raw_user_events
USING DELTA
LOCATION 's3://my-data-lake/processed/user-events/'
""")
# TRANSFORM - Now we can create different views for different business needs
# All these transformations happen in Databricks, not before loading
# Example 1: Create a view for content recommendation team
spark.sql("""
CREATE OR REPLACE VIEW recommendation_team_view AS
SELECT
user_id,
content_id,
TO_TIMESTAMP(event_timestamp) as event_time,
watch_duration_seconds,
device_type,
platform
FROM
raw_user_events
WHERE
event_type = 'content_watched'
AND event_date >= CURRENT_DATE - INTERVAL 30 DAYS
""")
# Example 2: Create aggregated dataset for executive dashboards
exec_dashboard_df = spark.sql("""
SELECT
DATE_TRUNC('day', TO_TIMESTAMP(event_timestamp)) as date,
COUNT(DISTINCT user_id) as daily_active_users,
COUNT(DISTINCT CASE WHEN event_type = 'subscription_started' THEN user_id END) as new_subscribers,
COUNT(DISTINCT CASE WHEN event_type = 'subscription_canceled' THEN user_id END) as churned_subscribers,
SUM(CASE WHEN event_type = 'content_watched' THEN watch_duration_seconds ELSE 0 END) / 3600 as total_watch_hours
FROM
raw_user_events
WHERE
event_date >= CURRENT_DATE - INTERVAL 90 DAYS
GROUP BY
DATE_TRUNC('day', TO_TIMESTAMP(event_timestamp))
ORDER BY
date DESC
""")
# Save the executive dashboard dataset to a more optimized format
exec_dashboard_df.write \
.format("delta") \
.mode("overwrite") \
.save("s3://my-data-lake/gold/executive_dashboard/")
When to Use ETL vs. ELT: Lessons from My Experience
After working with both patterns extensively, here's what I've learned about when to use each approach:
Choose ETL When:
Data privacy is paramount - When I need to mask PII or apply encryption before storage
Source data quality is poor - When data needs significant cleansing before it's usable
Transformation is complex but data volume is manageable - When complex business rules must be applied to relatively small datasets
Target systems have limited computing power - When the destination system isn't designed for heavy processing
Choose ELT When:
Data volumes are massive - When processing terabytes before loading would be inefficient
Requirements change frequently - When business needs evolve and you need flexibility to reprocess raw data
You have a modern data platform - When using tools like Databricks, Snowflake, or BigQuery that excel at in-platform transformations
You need to support diverse analytical needs - When different teams need different views of the same raw data
The Hybrid Approach That Saved My Project
In my most recent project, I actually implemented a hybrid approach:
Light ETL - Using AWS Lambda to handle initial validation and sensitive data masking
ELT core - Loading semi-processed data to Databricks and using PySpark for most transformations
# Example Lambda code for the light ETL step.
# Simplified for the post: mask_pii_fields, databricks_client, and ELT_JOB_ID are
# assumed to be defined elsewhere (the masking helper, a Databricks Jobs API client,
# and the job ID of the ELT notebook).
import boto3

s3 = boto3.client('s3')
def lambda_handler(event, context):
# Extract from S3 event
bucket = event['Records'][0]['s3']['bucket']['name']
key = event['Records'][0]['s3']['object']['key']
# Get the data
response = s3.get_object(Bucket=bucket, Key=key)
content = response['Body'].read().decode('utf-8')
# Minimal transformation - just mask sensitive PII
masked_content = mask_pii_fields(content)
# Load to "silver" area for further processing in Databricks
output_key = key.replace('bronze/', 'silver/')
s3.put_object(Body=masked_content, Bucket=bucket, Key=output_key)
# Trigger Databricks job for the ELT portion
databricks_client.run_now(job_id=ELT_JOB_ID, notebook_params={
"input_path": f"s3://{bucket}/{output_key}",
"output_path": f"s3://{bucket}/{output_key.replace('silver/', 'gold/')}"
})
return {'statusCode': 200, 'body': 'Pipeline started successfully'}
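The mask_pii_fields helper above is where the "light" in light ETL lives. I won't reproduce the real implementation, but here's a rough sketch of what such a helper could look like for raw CSV text, using regex-based masking; this is illustrative only, and a production version would be column-aware rather than pattern-matching the whole file.

import re

def mask_pii_fields(content: str) -> str:
    """Illustrative sketch: masks SSN-like and email-like patterns in raw CSV text."""
    # Keep only the last four digits of anything shaped like an SSN
    content = re.sub(r'\b\d{3}-\d{2}-(\d{4})\b', r'XXX-XX-\1', content)
    # Hide the local part of email addresses
    content = re.sub(r'\b[\w.+-]+@([\w-]+\.[\w.-]+)\b', r'***@\1', content)
    return content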
The Future: How I'm Evolving My Data Pipelines
The line between ETL and ELT continues to blur. With tools like AWS Glue, Databricks, and modern cloud data warehouses, the decision isn't always binary. Here's what I'm focusing on for my next generation of data pipelines:
Data contracts to ensure quality at the source (a small contract-check sketch follows this list)
Streaming architectures that blur the line between batch ETL and ELT
Automated testing and quality monitoring regardless of approach
Metadata-driven pipelines that can adapt their behavior based on the data itself
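To make the first of those ideas concrete, here's a small sketch of a data contract check at ingestion time. The contract fields and types are invented for illustration; in practice I'd generate the contract from a shared schema definition rather than hardcode it.

# Hypothetical data-contract check run at ingestion: quarantine records that
# don't match the agreed schema before they enter the pipeline.
CONTRACT = {
    "user_id": str,
    "event_type": str,
    "event_timestamp": str,
    "watch_duration_seconds": int,
}

def contract_violations(record: dict) -> list:
    problems = []
    for field, expected_type in CONTRACT.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}: got {type(record[field]).__name__}")
    return problems

sample = {"user_id": "u1", "event_type": "content_watched",
          "event_timestamp": "2025-06-01T12:00:00Z", "watch_duration_seconds": 900}
print(contract_violations(sample))  # an empty list means the record honors the contract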
Both ETL and ELT have their place in the modern data engineering toolkit. The key is understanding your specific requirements around data volume, quality, privacy, and analytical needs.
What's your experience with ETL vs ELT? I'd love to hear about your challenges and solutions in the comments!