Data Modeling & Warehousing

← Previous: Data Cleaning & Transformation | Back to Index | Next: ETL/ELT Pipelines β†’

Introduction

Data modeling determines how efficiently you can query your data warehouse. I've redesigned data models that reduced query times from minutes to seconds. This article covers the modeling patterns that work in production.

Star Schema vs Snowflake Schema

Star Schema

The star schema is my go-to pattern for most data warehouses.

        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚  Dim_Customer   β”‚
        β”‚  - customer_key β”‚
        β”‚  - name         β”‚
        β”‚  - email        β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚   Fact_Transactions      β”‚
        β”‚   - transaction_id       β”‚
        β”‚   - customer_key (FK)    β”‚
        β”‚   - product_key (FK)     β”‚
        β”‚   - date_key (FK)        β”‚
        β”‚   - amount              β”‚
        β”‚   - quantity            β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚                   β”‚              β”‚
    β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”
    β”‚Dim_     β”‚       β”‚  Dim_Date   β”‚  β”‚Dim_      β”‚
    β”‚Product  β”‚       β”‚  - date_key β”‚  β”‚Location  β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Advantages:

  • Simple joins (fact β†’ dimension)

  • Fast query performance

  • Easy to understand

When I use it:

  • 95% of my data warehouses

  • When query performance is critical

  • When business users self-serve

Slowly Changing Dimensions (SCD)

SCD Type 1 - Overwrite

SCD Type 2 - Track History

Data Warehouse vs Data Lake

Aspect
Data Warehouse
Data Lake

Data

Structured, processed

Raw, all formats

Schema

Schema-on-write

Schema-on-read

Users

Business analysts

Data scientists, engineers

Cost

Higher (storage + compute)

Lower (storage only)

Query

Fast (optimized)

Slower (scan files)

Use case

BI, reporting

ML, data exploration

My approach: Use both!

  • Data Lake: Store raw data (S3, Azure Data Lake)

  • Data Warehouse: Process and serve analytics (Snowflake, Redshift)

Dimensional Modeling Best Practices

Date Dimension Generation

Fact Table Design

Partitioning Strategies

Indexing Strategies

Conclusion

Good data modeling is the foundation of performant data warehouses. Star schemas work for 95% of use cases. SCD Type 2 preserves history. Partitioning and indexing optimize queries.

Key takeaways:

  • Star schema for most warehouses

  • Use SCD Type 2 for history tracking

  • Partition large fact tables by date

  • Index foreign keys and common query patterns

  • Combine data lake (raw) + data warehouse (processed)


Navigation:

Last updated