ETL vs ELT Pipeline: Understanding the Differences

When I started out in data engineering five years ago, I struggled to understand when to use an ETL pipeline versus an ELT pipeline. After building dozens of data pipelines with AWS, Databricks, and Python, I've learned the critical differences through both successes and painful failures.

In this post, I'll share my hands-on experience with both approaches and provide practical examples using the tools that have become essential in my data engineering toolkit: AWS S3, AWS Lambda, Python, Databricks, and PySpark.

Understanding ETL and ELT: Beyond the Acronyms

Let's start with the basics. Both ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are data integration patterns, but they differ fundamentally in where the transformation happens.

ETL: The Traditional Approach I Started With

In ETL, the workflow follows these steps:

  1. Extract: Pull data from source systems (databases, APIs, files)

  2. Transform: Process the data in a separate transformation layer

  3. Load: Move the clean, transformed data to its destination

When I first built ETL pipelines, I was working with sensitive healthcare data that required significant cleansing, validation, and masking before it could be stored in our data warehouse. ETL was perfect because it allowed us to handle these transformations before data ever reached the warehouse.
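
To make the ordering concrete, here's a minimal pandas sketch of that Extract → Transform → Load sequence (the file paths and column names are made up for illustration; the real pipeline I built appears later in this post):

import pandas as pd

def run_etl(source_path: str, destination_path: str) -> None:
    # Extract: pull raw records from the source system
    df = pd.read_csv(source_path)

    # Transform: cleanse and standardize in a separate processing step,
    # before the data ever touches the destination
    df["patient_name"] = df["patient_name"].str.strip().str.upper()
    df = df.dropna(subset=["patient_name", "date_of_birth"])

    # Load: only the transformed result reaches the destination
    df.to_parquet(destination_path, index=False)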

ELT: The Modern Approach That Changed My Work

In ELT, the sequence is:

  1. Extract: Pull raw data from source systems

  2. Load: Store this raw data in the target system immediately

  3. Transform: Leverage the target system's processing power to transform data

My transition to ELT happened when I joined a tech startup building analytics on massive clickstream datasets. Trying to transform terabytes of data before loading would have been prohibitively expensive and slow.
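
The same idea expressed as a minimal PySpark sketch (paths and column names are again hypothetical): the raw data lands first, untouched, and the reshaping happens later inside the platform that stores it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract + Load: land the raw events in the lake exactly as they arrive
raw_df = spark.read.json("s3://example-bucket/raw/events/")
raw_df.write.mode("append").saveAsTable("raw_events")

# Transform: reshape on demand, using the target platform's own compute
spark.sql("""
    CREATE OR REPLACE VIEW daily_active_users AS
    SELECT DATE(event_timestamp) AS event_date,
           COUNT(DISTINCT user_id) AS dau
    FROM raw_events
    GROUP BY DATE(event_timestamp)
""")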

My ETL Implementation: Solving a Real HIPAA Compliance Challenge

Last year, I faced a challenging project: building a pipeline to process sensitive healthcare data that needed to follow strict HIPAA compliance rules. The data arrived in encrypted CSV files and needed significant cleansing and transformation before being made available for analytics.

This was a perfect scenario for ETL since we needed to:

  • Decrypt the data securely

  • Validate and transform fields

  • Mask sensitive information

  • Apply standard formatting

Here's how I built this ETL pipeline using AWS Lambda and S3:

The Architecture I Designed

  1. Source: Encrypted CSV files arriving in a secure S3 bucket

  2. Trigger: S3 event notifications triggering AWS Lambda (see the wiring sketch after this list)

  3. Processing: Lambda function decrypting and transforming the data

  4. Destination: Transformed data loaded to our analytics database
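
For completeness, this is roughly how the trigger in step 2 can be wired up with boto3 (the bucket name, prefix, and function ARN are placeholders; in practice this usually lives in infrastructure-as-code rather than a script). The Lambda function also needs a resource policy allowing S3 to invoke it, which I'm leaving out here.

import boto3

s3 = boto3.client("s3")

# Notify the ETL Lambda whenever a new CSV lands in the bucket.
# Bucket name, prefix, and function ARN below are placeholders.
s3.put_bucket_notification_configuration(
    Bucket="my-secure-landing-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:etl-processor",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "incoming/"},
                            {"Name": "suffix", "Value": ".csv"},
                        ]
                    }
                },
            }
        ]
    },
)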

My AWS Lambda Function for the ETL Process

This is the actual code I wrote (with sensitive details changed) that handled our ETL process:

import boto3
import pandas as pd
import io
import base64
from cryptography.fernet import Fernet
import re

# Initialize AWS clients
s3 = boto3.client('s3')
ssm = boto3.client('ssm')

def lambda_handler(event, context):
    """
    My ETL function that:
    1. Extracts encrypted data from S3
    2. Decrypts it using keys from Parameter Store
    3. Transforms and cleanses the data
    4. Loads it to the destination database
    """
    # Get bucket and file info from the S3 event
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    source_key = event['Records'][0]['s3']['object']['key']
    
    print(f"Processing file {source_key} from bucket {source_bucket}")
    
    # Extract - Get the encrypted file from S3
    response = s3.get_object(Bucket=source_bucket, Key=source_key)
    encrypted_data = response['Body'].read()
    
    # Get decryption key from Parameter Store (more secure than hardcoding)
    decrypt_key_response = ssm.get_parameter(
        Name='/etl/decrypt-key',
        WithDecryption=True
    )
    decrypt_key = decrypt_key_response['Parameter']['Value']
    
    # Decrypt the data
    fernet = Fernet(decrypt_key.encode())
    decrypted_data = fernet.decrypt(encrypted_data)
    
    # Convert to pandas DataFrame for easier transformation
    df = pd.read_csv(io.BytesIO(decrypted_data))
    
    # Transform - Here's where the ETL magic happens
    # 1. Normalize patient names to uppercase
    df['patient_name'] = df['patient_name'].str.upper()
    
    # 2. Standardize phone number formats
    df['phone_number'] = df['phone_number'].apply(
        lambda x: re.sub(r'[^0-9]', '', str(x)) if pd.notnull(x) else None
    )
    
    # 3. Mask SSNs for HIPAA compliance (show only last 4 digits)
    df['ssn'] = df['ssn'].apply(
        lambda x: f"XXX-XX-{str(x)[-4:]}" if pd.notnull(x) and len(str(x)) >= 4 else None
    )
    
    # 4. Validate date formats and convert to standard format
    df['date_of_birth'] = pd.to_datetime(df['date_of_birth'], errors='coerce').dt.strftime('%Y-%m-%d')
    
    # 5. Add data quality flags
    df['data_quality_score'] = df.apply(
        lambda row: calculate_quality_score(row), axis=1
    )
    
    # Load - Save transformed data to destination
    # For this example, I'll just save to another S3 bucket
    # In production, I'd typically load to a database
    output_buffer = io.StringIO()
    df.to_csv(output_buffer, index=False)
    
    s3.put_object(
        Body=output_buffer.getvalue(),
        Bucket='my-processed-data-bucket',
        Key=f"processed/{source_key.split('/')[-1]}"
    )
    
    return {
        'statusCode': 200,
        'body': f"Successfully processed {source_key}"
    }

def calculate_quality_score(row):
    """Helper function to calculate a simple data quality score"""
    score = 100
    
    # Reduce score for missing critical fields
    for field in ['patient_name', 'date_of_birth', 'diagnosis_code']:
        if pd.isnull(row[field]):
            score -= 25
    
    # Reduce score for potentially invalid data
    if pd.notnull(row['phone_number']) and len(row['phone_number']) != 10:
        score -= 10
        
    return max(0, score)  # Don't go below zero

My ELT Implementation: When Scale and Flexibility Matter

When I joined a fast-growing streaming service, our data needs were vastly different. We had:

  • Terabytes of user behavior data flowing in daily

  • Constantly changing business requirements for analytics

  • A team of data scientists who needed flexible access to raw data

This time, ELT made much more sense. Here's the ELT architecture I built using AWS S3, Databricks, and PySpark:

The Architecture I Created

  1. Extract: Capture raw logs and events from our platforms

  2. Load: Store everything in its raw form in our data lake on S3

  3. Transform: Use Databricks to transform data on-demand for different needs

My Databricks ELT Implementation

Here's a simplified version of the PySpark code I wrote for our ELT pipeline:

from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Streaming Service ELT Pipeline") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# EXTRACT - Just read the raw data
# Notice we're not doing any transformation yet - that's the ELT difference!
raw_events_df = spark.read.format("json") \
    .option("inferSchema", "true") \
    .load("s3://my-data-lake/raw/user-events/year=2025/month=06/day=*/")

# LOAD - Save the data immediately in its raw form to our processed zone
# We're using Delta Lake format for ACID transactions and time travel capabilities
raw_events_df.write \
    .format("delta") \
    .mode("append") \
    .partitionBy("event_date") \
    .save("s3://my-data-lake/processed/user-events/")

# Register as a table for SQL access
spark.sql("""
CREATE TABLE IF NOT EXISTS raw_user_events
USING DELTA
LOCATION 's3://my-data-lake/processed/user-events/'
""")

# TRANSFORM - Now we can create different views for different business needs
# All these transformations happen in Databricks, not before loading

# Example 1: Create a view for content recommendation team
spark.sql("""
CREATE OR REPLACE VIEW recommendation_team_view AS
SELECT 
    user_id,
    content_id,
    TO_TIMESTAMP(event_timestamp) as event_time,
    watch_duration_seconds,
    device_type,
    platform
FROM 
    raw_user_events
WHERE 
    event_type = 'content_watched'
    AND event_date >= CURRENT_DATE - INTERVAL 30 DAYS
""")

# Example 2: Create aggregated dataset for executive dashboards
exec_dashboard_df = spark.sql("""
SELECT
    DATE_TRUNC('day', TO_TIMESTAMP(event_timestamp)) as date,
    COUNT(DISTINCT user_id) as daily_active_users,
    COUNT(DISTINCT CASE WHEN event_type = 'subscription_started' THEN user_id END) as new_subscribers,
    COUNT(DISTINCT CASE WHEN event_type = 'subscription_canceled' THEN user_id END) as churned_subscribers,
    SUM(CASE WHEN event_type = 'content_watched' THEN watch_duration_seconds ELSE 0 END) / 3600 as total_watch_hours
FROM
    raw_user_events
WHERE
    event_date >= CURRENT_DATE - INTERVAL 90 DAYS
GROUP BY
    DATE_TRUNC('day', TO_TIMESTAMP(event_timestamp))
ORDER BY
    date DESC
""")

# Save the executive dashboard dataset to a more optimized format
exec_dashboard_df.write \
    .format("delta") \
    .mode("overwrite") \
    .save("s3://my-data-lake/gold/executive_dashboard/")

When to Use ETL vs. ELT: Lessons from My Experience

After working with both patterns extensively, here's what I've learned about when to use each approach:

Choose ETL When:

  1. Data privacy is paramount - When I need to mask PII or apply encryption before storage

  2. Source data quality is poor - When data needs significant cleansing before it's usable

  3. Transformation is complex but data volume is manageable - When complex business rules must be applied to relatively small datasets

  4. Target systems have limited computing power - When the destination system isn't designed for heavy processing

Choose ELT When:

  1. Data volumes are massive - When processing terabytes before loading would be inefficient

  2. Requirements change frequently - When business needs evolve and you need flexibility to reprocess raw data

  3. You have a modern data platform - When using tools like Databricks, Snowflake, or BigQuery that excel at in-platform transformations

  4. You need to support diverse analytical needs - When different teams need different views of the same raw data

The Hybrid Approach That Saved My Project

In my most recent project, I actually implemented a hybrid approach:

  1. Light ETL - Using AWS Lambda to handle initial validation and sensitive data masking

  2. ELT core - Loading semi-processed data to Databricks and using PySpark for most transformations

# Example Lambda code for the light ETL step
import boto3

s3 = boto3.client('s3')
# databricks_client is a pre-configured client for the Databricks Jobs API;
# mask_pii_fields() and ELT_JOB_ID are defined elsewhere in the module.

def lambda_handler(event, context):
    # Extract from S3 event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    
    # Get the data
    response = s3.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read().decode('utf-8')
    
    # Minimal transformation - just mask sensitive PII
    masked_content = mask_pii_fields(content)
    
    # Load to "silver" area for further processing in Databricks
    output_key = key.replace('bronze/', 'silver/')
    s3.put_object(Body=masked_content, Bucket=bucket, Key=output_key)
    
    # Trigger Databricks job for the ELT portion
    databricks_client.run_now(job_id=ELT_JOB_ID, notebook_params={
        "input_path": f"s3://{bucket}/{output_key}",
        "output_path": f"s3://{bucket}/{output_key.replace('silver/', 'gold/')}"
    })
    
    return {'statusCode': 200, 'body': 'Pipeline started successfully'}
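
In the snippet above, databricks_client is just a thin wrapper around the Databricks Jobs API. With the official databricks-sdk package, the equivalent trigger call looks roughly like this (the job ID and paths are placeholders):

from databricks.sdk import WorkspaceClient

# Authenticates via environment variables or a Databricks config profile
w = WorkspaceClient()

# Kick off the ELT job, passing input/output locations as notebook parameters
w.jobs.run_now(
    job_id=123456789,
    notebook_params={
        "input_path": "s3://my-bucket/silver/events.csv",
        "output_path": "s3://my-bucket/gold/events.csv",
    },
)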

The Future: How I'm Evolving My Data Pipelines

The line between ETL and ELT continues to blur. With tools like AWS Glue, Databricks, and modern cloud data warehouses, the decision isn't always binary. Here's what I'm focusing on for my next generation of data pipelines:

  1. Data contracts to ensure quality at the source (a minimal sketch follows this list)

  2. Streaming architectures that blur the line between batch ETL and ELT

  3. Automated testing and quality monitoring regardless of approach

  4. Metadata-driven pipelines that can adapt their behavior based on the data itself
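
As a small illustration of the first item, a data contract can start as nothing more than an explicit schema that incoming data must satisfy before it enters the pipeline. Here's a minimal PySpark sketch (field names are hypothetical, and spark is the usual Databricks session):

from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

# The "contract": the fields and types the pipeline expects from the source team
user_event_contract = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_timestamp", TimestampType()),
    StructField("watch_duration_seconds", LongType()),
])

# Enforce the contract at read time instead of inferring the schema;
# FAILFAST makes the read error out on records that don't parse against it
events_df = (
    spark.read
    .schema(user_event_contract)
    .option("mode", "FAILFAST")
    .json("s3://my-data-lake/raw/user-events/")
)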

Both ETL and ELT have their place in the modern data engineering toolkit. The key is understanding your specific requirements around data volume, quality, privacy, and analytical needs.

What's your experience with ETL vs ELT? I'd love to hear about your challenges and solutions in the comments!
