SQL for Data Engineering


Introduction

SQL is the foundation of data engineering. Despite the rise of Python-based processing frameworks, I still use SQL daily—whether it's querying data warehouses, performing transformations in dbt, or optimizing slow queries. In this article, I'll share advanced SQL patterns and techniques I've found essential in production data engineering work.

This isn't a SQL basics tutorial. I'm assuming you know SELECT, WHERE, JOIN, and GROUP BY. Instead, I'll focus on advanced techniques like window functions, CTEs, query optimization, and working with large datasets—the skills that separate beginners from professionals.

Common Table Expressions (CTEs)

CTEs make complex queries readable and maintainable. I use them extensively to break down multi-step transformations:

Basic CTEs

-- calculate_user_metrics.sql
-- Calculate user engagement metrics using CTEs

WITH active_users AS (
    -- Step 1: Find users active in the last 30 days
    SELECT 
        user_id,
        COUNT(*) as activity_count,
        MAX(activity_date) as last_activity
    FROM user_activities
    WHERE activity_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY user_id
),
user_purchases AS (
    -- Step 2: Calculate total purchases per user
    SELECT 
        user_id,
        COUNT(*) as purchase_count,
        SUM(amount) as total_spent
    FROM orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY user_id
),
user_segments AS (
    -- Step 3: Segment users based on activity and purchases
    SELECT 
        au.user_id,
        au.activity_count,
        COALESCE(up.purchase_count, 0) as purchase_count,
        COALESCE(up.total_spent, 0) as total_spent,
        CASE 
            WHEN COALESCE(up.purchase_count, 0) > 5 THEN 'high_value'
            WHEN COALESCE(up.purchase_count, 0) > 0 THEN 'medium_value'
            ELSE 'low_value'
        END as user_segment
    FROM active_users au
    LEFT JOIN user_purchases up ON au.user_id = up.user_id
)
-- Step 4: Generate final metrics
SELECT 
    user_segment,
    COUNT(*) as user_count,
    AVG(activity_count) as avg_activity,
    AVG(purchase_count) as avg_purchases,
    AVG(total_spent) as avg_spent
FROM user_segments
GROUP BY user_segment
ORDER BY avg_spent DESC;

Recursive CTEs

I've used recursive CTEs for hierarchical data like organization charts or category trees:
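The sketch below walks a hypothetical categories table with id, parent_id, and name columns; the table and column names are illustrative, but the anchor/recursive shape is the standard pattern.

-- Walk a hypothetical categories(id, parent_id, name) tree from the roots down
WITH RECURSIVE category_tree AS (
    -- Anchor: top-level categories with no parent
    SELECT
        id,
        name,
        parent_id,
        1 as depth,
        name::text as path
    FROM categories
    WHERE parent_id IS NULL

    UNION ALL

    -- Recursive step: attach each child to its parent's row
    SELECT
        c.id,
        c.name,
        c.parent_id,
        ct.depth + 1,
        ct.path || ' > ' || c.name
    FROM categories c
    JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT id, name, depth, path
FROM category_tree
ORDER BY path;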

Window Functions

Window functions are game-changers for analytics. They allow calculations across related rows without collapsing the result set like GROUP BY does.

ROW_NUMBER, RANK, and DENSE_RANK
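As a minimal sketch of how the three ranking functions differ, assume a hypothetical daily_sales table with sale_date, product_id, and revenue columns.

-- Rank products by revenue within each day; the functions differ only on ties
SELECT
    sale_date,
    product_id,
    revenue,
    ROW_NUMBER() OVER (PARTITION BY sale_date ORDER BY revenue DESC) as row_num,   -- 1, 2, 3, ... even on ties
    RANK()       OVER (PARTITION BY sale_date ORDER BY revenue DESC) as rnk,       -- ties share a rank, then a gap
    DENSE_RANK() OVER (PARTITION BY sale_date ORDER BY revenue DESC) as dense_rnk  -- ties share a rank, no gap
FROM daily_sales;

-- Typical use: keep only the top 3 products per day
SELECT sale_date, product_id, revenue
FROM (
    SELECT
        sale_date,
        product_id,
        revenue,
        ROW_NUMBER() OVER (PARTITION BY sale_date ORDER BY revenue DESC) as rn
    FROM daily_sales
) ranked
WHERE rn <= 3;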

Running Totals and Moving Averages
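A sketch of a running total and a 7-day moving average, assuming a hypothetical daily_revenue table with revenue_date and revenue columns.

SELECT
    revenue_date,
    revenue,
    -- Running total from the start of the result set up to the current row
    SUM(revenue) OVER (ORDER BY revenue_date) as running_total,
    -- Moving average over the current row and the 6 preceding days
    AVG(revenue) OVER (
        ORDER BY revenue_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as moving_avg_7d
FROM daily_revenue
ORDER BY revenue_date;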

LAG and LEAD for Time-Series Analysis
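LAG looks backward and LEAD looks forward within the window. A sketch of day-over-day change on the same hypothetical daily_revenue table:

SELECT
    revenue_date,
    revenue,
    LAG(revenue)  OVER (ORDER BY revenue_date) as prev_day_revenue,
    LEAD(revenue) OVER (ORDER BY revenue_date) as next_day_revenue,
    -- Day-over-day change; NULL on the first row because there is no previous value
    revenue - LAG(revenue) OVER (ORDER BY revenue_date) as day_over_day_change
FROM daily_revenue
ORDER BY revenue_date;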

FIRST_VALUE and LAST_VALUE
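A sketch comparing each order to the customer's first and most recent order, assuming a hypothetical orders table with customer_id, order_date, and amount. Note that LAST_VALUE needs an explicit frame, because the default frame stops at the current row.

SELECT
    customer_id,
    order_date,
    amount,
    FIRST_VALUE(amount) OVER (
        PARTITION BY customer_id ORDER BY order_date
    ) as first_order_amount,
    -- Without the explicit frame, LAST_VALUE would just return the current row's amount
    LAST_VALUE(amount) OVER (
        PARTITION BY customer_id ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as latest_order_amount
FROM orders;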

Advanced JOIN Patterns

Self-Joins for Comparisons
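One sketch of the pattern: pair each order with later orders from the same customer placed within an hour, e.g. to flag possible duplicates. The orders table and its order_id/created_at columns are hypothetical.

-- Find pairs of orders from the same customer placed within an hour of each other
SELECT
    o1.customer_id,
    o1.order_id as first_order,
    o2.order_id as later_order,
    o2.created_at - o1.created_at as time_between
FROM orders o1
JOIN orders o2
  ON o1.customer_id = o2.customer_id
 AND o2.order_id > o1.order_id            -- avoid pairing a row with itself or double-counting pairs
 AND o2.created_at BETWEEN o1.created_at
                       AND o1.created_at + INTERVAL '1 hour'
ORDER BY o1.customer_id, o1.created_at;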

LATERAL JOINs (PostgreSQL)

LATERAL joins are incredibly useful when a subquery in the FROM clause needs to reference columns from earlier FROM items:
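The sketch below pulls each customer's three most recent orders; the customers and orders tables and their columns are hypothetical.

-- For each customer, fetch their three most recent orders
SELECT
    c.customer_id,
    c.name,
    recent.order_id,
    recent.order_date,
    recent.amount
FROM customers c
CROSS JOIN LATERAL (
    SELECT order_id, order_date, amount
    FROM orders o
    WHERE o.customer_id = c.customer_id   -- the subquery can reference c because of LATERAL
    ORDER BY order_date DESC
    LIMIT 3
) recent;
-- Use LEFT JOIN LATERAL ... ON TRUE instead to keep customers with no orders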

Slowly Changing Dimensions (SCD) in SQL

SCD Type 2 - Historical Tracking
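A simplified sketch of the pattern, assuming a hypothetical dim_customers table with valid_from, valid_to, and is_current columns and a staging table of incoming records:

-- Step 1: close out the current version of any customer whose attributes changed
UPDATE dim_customers d
SET valid_to = CURRENT_DATE,
    is_current = FALSE
FROM stg_customers s
WHERE d.customer_id = s.customer_id
  AND d.is_current = TRUE
  AND (d.email IS DISTINCT FROM s.email OR d.tier IS DISTINCT FROM s.tier);

-- Step 2: insert a new current version for changed and brand-new customers
INSERT INTO dim_customers (customer_id, email, tier, valid_from, valid_to, is_current)
SELECT s.customer_id, s.email, s.tier, CURRENT_DATE, NULL, TRUE
FROM stg_customers s
LEFT JOIN dim_customers d
       ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL;  -- no current row remains, so a new version is needed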

Point-in-Time Lookups
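With those validity columns in place, a point-in-time lookup is just a range check. A sketch against the same hypothetical dimension table:

-- What did each customer look like on 2024-01-15?
SELECT customer_id, email, tier
FROM dim_customers
WHERE valid_from <= DATE '2024-01-15'
  AND (valid_to > DATE '2024-01-15' OR valid_to IS NULL);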

Query Optimization Techniques

Using EXPLAIN ANALYZE

I always use EXPLAIN ANALYZE to understand query performance:
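The sketch below runs a hypothetical users/orders join with ANALYZE and BUFFERS so the plan shows actual row counts, timings, and I/O rather than just estimates.

-- Run the query and report the actual plan, timings, and buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.user_id, COUNT(*) as order_count
FROM users u
JOIN orders o ON o.user_id = u.user_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY u.user_id;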

Key things I look for in EXPLAIN output:

  • Sequential Scans on large tables (bad - need indexes)

  • Index Scans (good - using indexes)

  • Nested Loops with large datasets (can be slow)

  • Hash Joins or Merge Joins (better for large datasets)

  • Actual rows vs Estimated rows (big differences indicate statistics need updating)

Index Strategy
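A few representative choices, sketched against the hypothetical orders and users tables used above (the status, created_at, and email columns are illustrative):

-- Single-column index for a frequent filter
CREATE INDEX idx_orders_order_date ON orders (order_date);

-- Composite index: leading column matches the equality filter, second supports the range/sort
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);

-- Partial index: only index the rows the hot query actually touches
CREATE INDEX idx_orders_pending ON orders (created_at)
WHERE status = 'pending';

-- Expression index when queries filter on a transformed value
CREATE INDEX idx_users_email_lower ON users (LOWER(email));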

Avoiding Common Performance Pitfalls
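One classic example, sketched against the hypothetical orders table: applying a function to an indexed column in the WHERE clause prevents an ordinary index on that column from being used.

-- Anti-pattern: the function call hides order_date from an index on that column
SELECT order_id, amount
FROM orders
WHERE DATE_TRUNC('month', order_date) = DATE '2024-01-01';

-- Better: express the same filter as a range so the index can be used
SELECT order_id, amount
FROM orders
WHERE order_date >= DATE '2024-01-01'
  AND order_date <  DATE '2024-02-01';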

Working with Large Datasets

Partitioning

For very large tables, partitioning improves query performance by allowing the database to scan only relevant partitions:
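A sketch of declarative range partitioning in PostgreSQL, using a hypothetical events table partitioned by month:

-- Parent table declares the partition key but holds no data itself
CREATE TABLE events (
    event_id    BIGINT,
    user_id     BIGINT,
    event_type  TEXT,
    event_time  TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (event_time);

-- One partition per month; queries filtering on event_time only scan matching partitions
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- This query can prune everything except events_2024_01
SELECT event_type, COUNT(*) as event_count
FROM events
WHERE event_time >= '2024-01-01' AND event_time < '2024-02-01'
GROUP BY event_type;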

Incremental Processing
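A sketch of a simple high-watermark pattern: only pull rows newer than what the target already holds. The source_orders and fact_orders tables and the updated_at column are hypothetical.

-- Load only rows that arrived since the last successful run
INSERT INTO fact_orders (order_id, customer_id, amount, updated_at)
SELECT s.order_id, s.customer_id, s.amount, s.updated_at
FROM source_orders s
WHERE s.updated_at > (
    SELECT COALESCE(MAX(updated_at), TIMESTAMP '1970-01-01') FROM fact_orders
);
-- This handles appends only; rows that change after loading need an upsert/MERGE instead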

Advanced Aggregations

ROLLUP and CUBE
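A sketch of subtotals with ROLLUP, assuming a hypothetical sales table with region, product, and amount columns:

-- ROLLUP adds subtotal rows per region and a grand total (NULLs mark the rolled-up levels)
SELECT
    region,
    product,
    SUM(amount) as total_amount
FROM sales
GROUP BY ROLLUP (region, product)
ORDER BY region NULLS LAST, product NULLS LAST;

-- CUBE would additionally add per-product totals across all regions:
-- GROUP BY CUBE (region, product)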

FILTER Clause
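A sketch of conditional aggregation with FILTER (standard SQL, supported by PostgreSQL), using the hypothetical orders table with a status column:

-- Count and sum different order statuses in one pass instead of separate queries
SELECT
    customer_id,
    COUNT(*)                                         as total_orders,
    COUNT(*)    FILTER (WHERE status = 'cancelled')  as cancelled_orders,
    SUM(amount) FILTER (WHERE status = 'completed')  as completed_revenue
FROM orders
GROUP BY customer_id;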

JSON and Semi-Structured Data

Querying JSON (PostgreSQL)
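A sketch of pulling fields out of a jsonb column, assuming a hypothetical events table with a payload jsonb column:

-- ->> extracts a value as text; -> keeps it as jsonb for further traversal
SELECT
    event_id,
    payload->>'user_id'              as user_id,
    payload->'device'->>'os'         as device_os,
    (payload->>'duration_ms')::int   as duration_ms
FROM events
WHERE payload->>'event_type' = 'page_view'
  AND payload ? 'duration_ms';       -- only rows where the key exists
-- A GIN index on payload speeds up containment (@>) and key-existence queries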

SQL Integration with Python

Here's how I typically use SQL from Python in data pipelines:
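A minimal sketch using SQLAlchemy (mentioned in the takeaways below) with a parameterized query; the connection string, table, and column names are placeholders.

# run_daily_metrics.py
# Minimal sketch: run a parameterized query with SQLAlchemy.
# The DSN, table, and columns are placeholders -- adapt to your warehouse.
from datetime import date, timedelta

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/analytics")

# Named parameters (:start_date) are bound by the driver, which prevents SQL injection
query = text("""
    SELECT user_id, COUNT(*) AS order_count, SUM(amount) AS total_spent
    FROM orders
    WHERE order_date >= :start_date
    GROUP BY user_id
    ORDER BY total_spent DESC
""")

with engine.connect() as conn:
    result = conn.execute(query, {"start_date": date.today() - timedelta(days=30)})
    for row in result:
        print(row.user_id, row.order_count, row.total_spent)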

Best Practices

From my production SQL experience:

  1. Use CTEs for readability: Break complex queries into logical steps

  2. Parameterize queries: Always use parameters to prevent SQL injection

  3. Index strategically: Index columns used in WHERE, JOIN, and ORDER BY

  4. Analyze execution plans: Use EXPLAIN to identify bottlenecks

  5. Partition large tables: Use table partitioning for time-series or high-volume data

  6. Avoid SELECT *: Only select columns you need

  7. Use appropriate data types: Store dates as DATE, not VARCHAR

  8. Update statistics: Keep table statistics current for optimal query plans

  9. Test on production-sized data: Query performance differs greatly with scale

Key Takeaways

  • Window functions: Essential for analytics without collapsing rows

  • CTEs: Make complex queries maintainable and readable

  • Indexing: Critical for query performance at scale

  • Partitioning: Necessary for tables with billions of rows

  • Query optimization: Always use EXPLAIN to understand performance

  • JSON support: Modern databases handle semi-structured data well

  • Integration with Python: SQLAlchemy provides excellent database abstraction

