SQL for Data Engineering


Introduction

SQL is the foundation of data engineering. Despite the rise of Python-based processing frameworks, I still use SQL daily—whether it's querying data warehouses, performing transformations in dbt, or optimizing slow queries. In this article, I'll share advanced SQL patterns and techniques I've found essential in production data engineering work.

This isn't a SQL basics tutorial. I'm assuming you know SELECT, WHERE, JOIN, and GROUP BY. Instead, I'll focus on advanced techniques like window functions, CTEs, query optimization, and working with large datasets—the skills that separate beginners from professionals.

Common Table Expressions (CTEs)

CTEs make complex queries readable and maintainable. I use them extensively to break down multi-step transformations:

Basic CTEs

-- calculate_user_metrics.sql
-- Calculate user engagement metrics using CTEs

WITH active_users AS (
    -- Step 1: Find users active in the last 30 days
    SELECT 
        user_id,
        COUNT(*) as activity_count,
        MAX(activity_date) as last_activity
    FROM user_activities
    WHERE activity_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY user_id
),
user_purchases AS (
    -- Step 2: Calculate total purchases per user
    SELECT 
        user_id,
        COUNT(*) as purchase_count,
        SUM(amount) as total_spent
    FROM orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY user_id
),
user_segments AS (
    -- Step 3: Segment users based on activity and purchases
    SELECT 
        au.user_id,
        au.activity_count,
        COALESCE(up.purchase_count, 0) as purchase_count,
        COALESCE(up.total_spent, 0) as total_spent,
        CASE 
            WHEN COALESCE(up.purchase_count, 0) > 5 THEN 'high_value'
            WHEN COALESCE(up.purchase_count, 0) > 0 THEN 'medium_value'
            ELSE 'low_value'
        END as user_segment
    FROM active_users au
    LEFT JOIN user_purchases up ON au.user_id = up.user_id
)
-- Step 4: Generate final metrics
SELECT 
    user_segment,
    COUNT(*) as user_count,
    AVG(activity_count) as avg_activity,
    AVG(purchase_count) as avg_purchases,
    AVG(total_spent) as avg_spent
FROM user_segments
GROUP BY user_segment
ORDER BY avg_spent DESC;

Recursive CTEs

I've used recursive CTEs for hierarchical data like organization charts or category trees:
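The sketch below walks a hypothetical categories table with id, parent_id, and name columns; the table and column names are illustrative, but the anchor/recursive shape is the standard pattern.

-- Walk a hypothetical categories(id, parent_id, name) tree from the roots down
WITH RECURSIVE category_tree AS (
    -- Anchor: top-level categories with no parent
    SELECT
        id,
        name,
        parent_id,
        1 as depth,
        name::text as path
    FROM categories
    WHERE parent_id IS NULL

    UNION ALL

    -- Recursive step: attach each child to its parent's row
    SELECT
        c.id,
        c.name,
        c.parent_id,
        ct.depth + 1,
        ct.path || ' > ' || c.name
    FROM categories c
    JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT id, name, depth, path
FROM category_tree
ORDER BY path;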

Window Functions

Window functions are game-changers for analytics. They allow calculations across related rows without collapsing the result set like GROUP BY does.

ROW_NUMBER, RANK, and DENSE_RANK
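As a minimal sketch of how the three ranking functions differ, assume a hypothetical daily_sales table with sale_date, product_id, and revenue columns.

-- Rank products by revenue within each day; the functions differ only on ties
SELECT
    sale_date,
    product_id,
    revenue,
    ROW_NUMBER() OVER (PARTITION BY sale_date ORDER BY revenue DESC) as row_num,   -- 1, 2, 3, ... even on ties
    RANK()       OVER (PARTITION BY sale_date ORDER BY revenue DESC) as rnk,       -- ties share a rank, then a gap
    DENSE_RANK() OVER (PARTITION BY sale_date ORDER BY revenue DESC) as dense_rnk  -- ties share a rank, no gap
FROM daily_sales;

-- Typical use: keep only the top 3 products per day
SELECT sale_date, product_id, revenue
FROM (
    SELECT
        sale_date,
        product_id,
        revenue,
        ROW_NUMBER() OVER (PARTITION BY sale_date ORDER BY revenue DESC) as rn
    FROM daily_sales
) ranked
WHERE rn <= 3;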

Running Totals and Moving Averages
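A sketch of a running total and a 7-day moving average, assuming a hypothetical daily_revenue table with revenue_date and revenue columns.

SELECT
    revenue_date,
    revenue,
    -- Running total from the start of the result set up to the current row
    SUM(revenue) OVER (ORDER BY revenue_date) as running_total,
    -- Moving average over the current row and the 6 preceding days
    AVG(revenue) OVER (
        ORDER BY revenue_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as moving_avg_7d
FROM daily_revenue
ORDER BY revenue_date;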

LAG and LEAD for Time-Series Analysis
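LAG looks backward and LEAD looks forward within the window. A sketch of day-over-day change on the same hypothetical daily_revenue table:

SELECT
    revenue_date,
    revenue,
    LAG(revenue)  OVER (ORDER BY revenue_date) as prev_day_revenue,
    LEAD(revenue) OVER (ORDER BY revenue_date) as next_day_revenue,
    -- Day-over-day change; NULL on the first row because there is no previous value
    revenue - LAG(revenue) OVER (ORDER BY revenue_date) as day_over_day_change
FROM daily_revenue
ORDER BY revenue_date;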

FIRST_VALUE and LAST_VALUE
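A sketch comparing each order to the customer's first and most recent order, assuming a hypothetical orders table with customer_id, order_date, and amount. Note that LAST_VALUE needs an explicit frame, because the default frame stops at the current row.

SELECT
    customer_id,
    order_date,
    amount,
    FIRST_VALUE(amount) OVER (
        PARTITION BY customer_id ORDER BY order_date
    ) as first_order_amount,
    -- Without the explicit frame, LAST_VALUE would just return the current row's amount
    LAST_VALUE(amount) OVER (
        PARTITION BY customer_id ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as latest_order_amount
FROM orders;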

Advanced JOIN Patterns

Self-Joins for Comparisons
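One sketch of the pattern: pair each order with later orders from the same customer placed within an hour, e.g. to flag possible duplicates. The orders table and its order_id/created_at columns are hypothetical.

-- Find pairs of orders from the same customer placed within an hour of each other
SELECT
    o1.customer_id,
    o1.order_id as first_order,
    o2.order_id as later_order,
    o2.created_at - o1.created_at as time_between
FROM orders o1
JOIN orders o2
  ON o1.customer_id = o2.customer_id
 AND o2.order_id > o1.order_id            -- avoid pairing a row with itself or double-counting pairs
 AND o2.created_at BETWEEN o1.created_at
                       AND o1.created_at + INTERVAL '1 hour'
ORDER BY o1.customer_id, o1.created_at;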

LATERAL JOINs (PostgreSQL)

LATERAL joins are incredibly useful when a subquery in the FROM clause needs to reference columns from earlier FROM items:
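The sketch below pulls each customer's three most recent orders; the customers and orders tables and their columns are hypothetical.

-- For each customer, fetch their three most recent orders
SELECT
    c.customer_id,
    c.name,
    recent.order_id,
    recent.order_date,
    recent.amount
FROM customers c
CROSS JOIN LATERAL (
    SELECT order_id, order_date, amount
    FROM orders o
    WHERE o.customer_id = c.customer_id   -- the subquery can reference c because of LATERAL
    ORDER BY order_date DESC
    LIMIT 3
) recent;
-- Use LEFT JOIN LATERAL ... ON TRUE instead to keep customers with no orders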

Slowly Changing Dimensions (SCD) in SQL

SCD Type 2 - Historical Tracking
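A simplified sketch of the pattern, assuming a hypothetical dim_customers table with valid_from, valid_to, and is_current columns and a staging table of incoming records:

-- Step 1: close out the current version of any customer whose attributes changed
UPDATE dim_customers d
SET valid_to = CURRENT_DATE,
    is_current = FALSE
FROM stg_customers s
WHERE d.customer_id = s.customer_id
  AND d.is_current = TRUE
  AND (d.email IS DISTINCT FROM s.email OR d.tier IS DISTINCT FROM s.tier);

-- Step 2: insert a new current version for changed and brand-new customers
INSERT INTO dim_customers (customer_id, email, tier, valid_from, valid_to, is_current)
SELECT s.customer_id, s.email, s.tier, CURRENT_DATE, NULL, TRUE
FROM stg_customers s
LEFT JOIN dim_customers d
       ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;  -- no current row remains, so a new version is needed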

Point-in-Time Lookups
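With those validity columns in place, a point-in-time lookup is just a range check. A sketch against the same hypothetical dimension table:

-- What did each customer look like on 2024-01-15?
SELECT customer_id, email, tier
FROM dim_customers
WHERE valid_from <= DATE '2024-01-15'
  AND (valid_to > DATE '2024-01-15' OR valid_to IS NULL);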

Query Optimization Techniques

Using EXPLAIN ANALYZE

I always use EXPLAIN ANALYZE to understand query performance:
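The sketch below runs a hypothetical users/orders join with ANALYZE and BUFFERS so the plan shows actual row counts, timings, and I/O rather than just estimates.

-- Run the query and report the actual plan, timings, and buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.user_id, COUNT(*) as order_count
FROM users u
JOIN orders o ON o.user_id = u.user_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY u.user_id;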

Key things I look for in EXPLAIN output:

  • Sequential Scans on large tables (bad - need indexes)

  • Index Scans (good - using indexes)

  • Nested Loops with large datasets (can be slow)

  • Hash Joins or Merge Joins (better for large datasets)

  • Actual rows vs Estimated rows (big differences indicate statistics need updating)

Index Strategy
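A few representative choices, sketched against the hypothetical orders and users tables used above (the status, created_at, and email columns are illustrative):

-- Single-column index for a frequent filter
CREATE INDEX idx_orders_order_date ON orders (order_date);

-- Composite index: leading column matches the equality filter, second supports the range/sort
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);

-- Partial index: only index the rows the hot query actually touches
CREATE INDEX idx_orders_pending ON orders (created_at)
WHERE status = 'pending';

-- Expression index when queries filter on a transformed value
CREATE INDEX idx_users_email_lower ON users (LOWER(email));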

Avoiding Common Performance Pitfalls
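One classic example, sketched against the hypothetical orders table: applying a function to an indexed column in the WHERE clause prevents an ordinary index on that column from being used.

-- Anti-pattern: the function call hides order_date from an index on that column
SELECT order_id, amount
FROM orders
WHERE DATE_TRUNC('month', order_date) = DATE '2024-01-01';

-- Better: express the same filter as a range so the index can be used
SELECT order_id, amount
FROM orders
WHERE order_date >= DATE '2024-01-01'
  AND order_date <  DATE '2024-02-01';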

Working with Large Datasets

Partitioning

For very large tables, partitioning improves query performance by allowing the database to scan only relevant partitions:
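A sketch of declarative range partitioning in PostgreSQL, using a hypothetical events table partitioned by month:

-- Parent table declares the partition key but holds no data itself
CREATE TABLE events (
    event_id    BIGINT,
    user_id     BIGINT,
    event_type  TEXT,
    event_time  TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (event_time);

-- One partition per month; queries filtering on event_time only scan matching partitions
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- This query can prune everything except events_2024_01
SELECT event_type, COUNT(*) as event_count
FROM events
WHERE event_time >= '2024-01-01' AND event_time < '2024-02-01'
GROUP BY event_type;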

Incremental Processing
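A sketch of a simple high-watermark pattern: only pull rows newer than what the target already holds. The source_orders and fact_orders tables and the updated_at column are hypothetical.

-- Load only rows that arrived since the last successful run
INSERT INTO fact_orders (order_id, customer_id, amount, updated_at)
SELECT s.order_id, s.customer_id, s.amount, s.updated_at
FROM source_orders s
WHERE s.updated_at > (
    SELECT COALESCE(MAX(updated_at), TIMESTAMP '1970-01-01') FROM fact_orders
);
-- This handles appends only; rows that change after loading need an upsert/MERGE instead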

Advanced Aggregations

ROLLUP and CUBE
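A sketch of subtotals with ROLLUP, assuming a hypothetical sales table with region, product, and amount columns:

-- ROLLUP adds subtotal rows per region and a grand total (NULLs mark the rolled-up levels)
SELECT
    region,
    product,
    SUM(amount) as total_amount
FROM sales
GROUP BY ROLLUP (region, product)
ORDER BY region NULLS LAST, product NULLS LAST;

-- CUBE would additionally add per-product totals across all regions:
-- GROUP BY CUBE (region, product)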

FILTER Clause
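A sketch of conditional aggregation with FILTER (standard SQL, supported by PostgreSQL), using the hypothetical orders table with a status column:

-- Count and sum different order statuses in one pass instead of separate queries
SELECT
    customer_id,
    COUNT(*)                                         as total_orders,
    COUNT(*)    FILTER (WHERE status = 'cancelled')  as cancelled_orders,
    SUM(amount) FILTER (WHERE status = 'completed')  as completed_revenue
FROM orders
GROUP BY customer_id;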

JSON and Semi-Structured Data

Querying JSON (PostgreSQL)
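A sketch of pulling fields out of a jsonb column, assuming a hypothetical events table with a payload jsonb column:

-- ->> extracts a value as text; -> keeps it as jsonb for further traversal
SELECT
    event_id,
    payload->>'user_id'              as user_id,
    payload->'device'->>'os'         as device_os,
    (payload->>'duration_ms')::int   as duration_ms
FROM events
WHERE payload->>'event_type' = 'page_view'
  AND payload ? 'duration_ms';       -- only rows where the key exists
-- A GIN index on payload speeds up containment (@>) and key-existence queries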

SQL Integration with Python

Here's how I typically use SQL from Python in data pipelines:
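A minimal sketch using SQLAlchemy (mentioned in the takeaways below) with a parameterized query; the connection string, table, and column names are placeholders.

# run_daily_metrics.py
# Minimal sketch: run a parameterized query with SQLAlchemy.
# The DSN, table, and columns are placeholders -- adapt to your warehouse.
from datetime import date, timedelta

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/analytics")

# Named parameters (:start_date) are bound by the driver, which prevents SQL injection
query = text("""
    SELECT user_id, COUNT(*) AS order_count, SUM(amount) AS total_spent
    FROM orders
    WHERE order_date >= :start_date
    GROUP BY user_id
    ORDER BY total_spent DESC
""")

with engine.connect() as conn:
    result = conn.execute(query, {"start_date": date.today() - timedelta(days=30)})
    for row in result:
        print(row.user_id, row.order_count, row.total_spent)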

Best Practices

From my production SQL experience:

  1. Use CTEs for readability: Break complex queries into logical steps

  2. Parameterize queries: Always use parameters to prevent SQL injection

  3. Index strategically: Index columns used in WHERE, JOIN, and ORDER BY

  4. Analyze execution plans: Use EXPLAIN to identify bottlenecks

  5. Partition large tables: Use table partitioning for time-series or high-volume data

  6. Avoid SELECT *: Only select columns you need

  7. Use appropriate data types: Store dates as DATE, not VARCHAR

  8. Update statistics: Keep table statistics current for optimal query plans

  9. Test on production-sized data: Query performance differs greatly with scale

Key Takeaways

  • Window functions: Essential for analytics without collapsing rows

  • CTEs: Make complex queries maintainable and readable

  • Indexing: Critical for query performance at scale

  • Partitioning: Necessary for tables with billions of rows

  • Query optimization: Always use EXPLAIN to understand performance

  • JSON support: Modern databases handle semi-structured data well

  • Integration with Python: SQLAlchemy provides excellent database abstraction

