Exploratory Data Analysis with CRISP-DM

Published: June 23, 2025

Throughout my journey as a data engineering practitioner, I've found that a structured approach is crucial to the success of data analysis projects. In this post, I'll share my personal experience with Exploratory Data Analysis (EDA) using the CRISP-DM framework, showing practical examples with Python libraries like pandas, numpy, and matplotlib.

What is CRISP-DM?

CRISP-DM (Cross Industry Standard Process for Data Mining) has been my go-to methodology for tackling data projects. It's a structured approach that breaks down the data analysis process into six phases:

  1. Business Understanding

  2. Data Understanding

  3. Data Preparation

  4. Modeling

  5. Evaluation

  6. Deployment

While the entire framework is valuable, I've found that EDA plays a critical role during the Data Understanding phase. Let me share how I typically approach this in my work.

The CRISP-DM Process Flow

Before diving deeper into EDA specifically, let's visualize the CRISP-DM workflow and where EDA fits within it:

[Diagram: the six CRISP-DM phases in sequence, with a feedback loop from Evaluation back to Business Understanding]

This diagram illustrates how EDA (in the Data Understanding phase) connects with the other phases of the CRISP-DM framework. Notice how the process isn't strictly linear—the evaluation phase can lead back to business understanding, creating an iterative cycle of improvement. In my experience, this iterative nature is what makes CRISP-DM so effective for real-world data projects.

Business Understanding: Setting the Stage

Before diving into any data analysis, I always start with business understanding. In a recent project for a retail client, we needed to optimize inventory levels across multiple stores. The key business questions were:

  • What products were consistently understocked or overstocked?

  • Were there seasonal patterns affecting inventory needs?

  • How did regional differences impact product demand?

This clear business context allowed me to plan an EDA that would specifically address these questions.

Data Understanding: The EDA Process

Once I have the business context, I begin my EDA process. Here's my typical approach with code examples:

1. Initial Data Exploration

First, I load the data and get a general overview:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_theme(font_scale=1.2)  # modern replacement for the older sns.set()
plt.style.use('ggplot')        # applied after set_theme so the ggplot style sticks

# Load the dataset
df = pd.read_csv('retail_inventory.csv')

# Get a quick overview
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
print(df.head())

# Summary statistics
print("\nSummary statistics:")
print(df.describe())

# Check data types and missing values
print("\nData types and missing values:")
df.info()  # prints its report directly and returns None, so no print() wrapper
print("\nMissing values per column:")
print(df.isna().sum())

This initial exploration gives me a sense of data size, structure, and quality. I've found that simply understanding the shape of your data can often reveal immediate insights or problems.
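Beyond these built-in summaries, I often wrap the checks into one reusable table. A minimal sketch: the `data_quality_summary` helper and the tiny sample frame are my own illustration, standing in for `retail_inventory.csv`:

```python
import pandas as pd
import numpy as np

def data_quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count/percent, distinct values."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isna().sum(),
        'missing_pct': (df.isna().mean() * 100).round(1),
        'unique': df.nunique(),
    })

# Tiny illustrative frame standing in for the retail dataset
sample = pd.DataFrame({
    'product_category': ['toys', 'food', 'toys', None],
    'daily_sales': [10.0, np.nan, 7.5, 3.0],
})
print(data_quality_summary(sample))
```

One table like this per dataset is also handy to archive, so later phases can check whether the raw data's quality has drifted.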

2. Univariate Analysis

Next, I examine individual variables to understand their distributions:

# Categorical variables: distribution
plt.figure(figsize=(10, 6))
df['product_category'].value_counts().plot(kind='bar')
plt.title('Distribution of Product Categories')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Numerical variables: distribution
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(df['daily_sales'], kde=True)
plt.title('Distribution of Daily Sales')

plt.subplot(1, 2, 2)
sns.boxplot(y=df['inventory_level'])
plt.title('Boxplot of Inventory Levels')
plt.tight_layout()
plt.show()

In the retail project, this revealed that certain product categories had significantly more variance in their inventory levels, immediately suggesting areas that needed special attention.
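That visual impression is worth quantifying. A quick sketch, assuming the `product_category` and `inventory_level` columns from the examples above, compares spread per category via the coefficient of variation:

```python
import pandas as pd

# Hypothetical rows mirroring the retail dataset's columns
df = pd.DataFrame({
    'product_category': ['toys', 'toys', 'food', 'food', 'food'],
    'inventory_level': [120, 40, 55, 50, 60],
})

# Spread of inventory per category; a high coefficient of
# variation (std / mean) flags categories needing attention
stats_by_cat = df.groupby('product_category')['inventory_level'].agg(['mean', 'std'])
stats_by_cat['cv'] = stats_by_cat['std'] / stats_by_cat['mean']
print(stats_by_cat.sort_values('cv', ascending=False))
```

Dividing by the mean matters here: categories with large inventories would otherwise dominate a plain standard-deviation ranking.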

3. Temporal Analysis

For time-based patterns, which were crucial for our seasonal analysis:

# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek

# Monthly sales trends
monthly_sales = df.groupby('month')['daily_sales'].mean().reset_index()

plt.figure(figsize=(10, 6))
sns.lineplot(x='month', y='daily_sales', data=monthly_sales, marker='o')
plt.title('Average Sales by Month')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(range(1, 13))
plt.grid(True)
plt.show()

# Day of week patterns
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_week', y='daily_sales', data=df)
plt.title('Sales by Day of Week')
plt.xlabel('Day of Week (0=Monday, 6=Sunday)')
plt.ylabel('Average Sales')
plt.tight_layout()
plt.show()

This analysis revealed clear seasonal patterns in our retail data, with significant spikes during holiday seasons and weekends – information that became critical for our inventory planning.
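The weekend effect can also be condensed into a single number for stakeholders. A hedged sketch, with a tiny illustrative frame reusing the `date` and `daily_sales` column names from this post:

```python
import pandas as pd

# Illustrative daily records; real data would span many stores and years
df = pd.DataFrame({
    'date': pd.to_datetime(['2024-11-01', '2024-11-02', '2024-11-03',
                            '2024-11-04', '2024-11-05']),
    'daily_sales': [100, 180, 170, 90, 95],
})
df['is_weekend'] = df['date'].dt.dayofweek >= 5  # 5=Saturday, 6=Sunday

# Average sales on weekends vs weekdays quantifies the spike
weekend_effect = df.groupby('is_weekend')['daily_sales'].mean()
uplift = weekend_effect.loc[True] / weekend_effect.loc[False] - 1
print(f"Weekend uplift: {uplift:.0%}")
```

A single uplift figure like this often lands better in a stakeholder review than the underlying chart.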

4. Bivariate and Multivariate Analysis

Understanding relationships between variables:

# Correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

# Scatterplot with regression line
plt.figure(figsize=(10, 6))
sns.regplot(x='inventory_level', y='stockout_frequency', data=df)
plt.title('Relationship Between Inventory Level and Stockout Frequency')
plt.xlabel('Average Inventory Level')
plt.ylabel('Stockout Frequency')
plt.grid(True)
plt.show()

# Category-based analysis
plt.figure(figsize=(12, 6))
sns.boxplot(x='region', y='daily_sales', hue='product_category', data=df)
plt.title('Sales by Region and Product Category')
plt.xlabel('Region')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.legend(title='Product Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

These visualizations helped us discover that certain product categories had very different inventory-to-sales relationships across regions, suggesting we needed region-specific inventory policies.
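One way to make that region-by-category difference concrete is a days-of-cover table: inventory divided by the daily sales rate. A sketch on hypothetical rows, reusing the column names from above:

```python
import pandas as pd

# Hypothetical slice of the retail data
df = pd.DataFrame({
    'region': ['north', 'north', 'south', 'south'],
    'product_category': ['toys', 'food', 'toys', 'food'],
    'inventory_level': [200, 80, 90, 85],
    'daily_sales': [20, 40, 30, 40],
})

# Days of cover per region and category: how long current
# inventory lasts at the observed sales rate
cover = (df.assign(days_of_cover=df['inventory_level'] / df['daily_sales'])
           .pivot_table(index='region', columns='product_category',
                        values='days_of_cover'))
print(cover)
```

Rows with wildly different cover for the same category are exactly the cases that argue for region-specific inventory policies.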

5. Anomaly Detection

Identifying outliers and unusual patterns:

# Z-score based outlier detection
from scipy import stats

# nan_policy='omit' keeps a single missing value from turning
# every z-score into NaN (missing values aren't filled until later)
z_scores = stats.zscore(df['inventory_level'], nan_policy='omit')
outliers = np.abs(z_scores) > 3

plt.figure(figsize=(10, 6))
plt.scatter(df.index, df['inventory_level'], c=outliers, cmap='viridis')
plt.colorbar(label='Is Outlier')
plt.title('Outlier Detection in Inventory Levels')
plt.xlabel('Index')
plt.ylabel('Inventory Level')
plt.grid(True)
plt.show()

# Print some outlier examples
print("Outlier examples:")
print(df[outliers].head())

This analysis identified several stores with unusually high inventory levels relative to their sales, which turned out to be due to a system error in our inventory management software.
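A complementary, non-visual check I use for cases like this compares each store's inventory-to-sales ratio against the median, which the outliers themselves can't distort the way they distort a mean. A sketch on hypothetical store-level aggregates:

```python
import pandas as pd

# Hypothetical store-level aggregates; store 3 mimics the system error
stores = pd.DataFrame({
    'store_id': [1, 2, 3, 4],
    'inventory_level': [500, 480, 5000, 510],
    'daily_sales': [50, 45, 48, 52],
})
stores['ratio'] = stores['inventory_level'] / stores['daily_sales']

# Flag stores whose ratio is far above the median, a robust
# baseline unaffected by the very outliers we want to find
median = stores['ratio'].median()
stores['suspect'] = stores['ratio'] > 3 * median
print(stores[stores['suspect']])
```

The `3 *` multiplier is a judgment call; in practice I tune it with the business side so the flagged list stays short enough to investigate manually.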

Transitioning to Data Preparation

After completing the EDA, I use the insights gained to guide my data preparation steps. In our retail project, for instance, the EDA surfaced missing inventory values, promising derived features, and outliers, which translated into these preparation steps:

# Handle missing values based on EDA insights
df['inventory_level'] = df.groupby(['product_category', 'region'])['inventory_level'].transform(
    lambda x: x.fillna(x.median())
)

# Create new features based on patterns discovered
df['inventory_to_sales_ratio'] = df['inventory_level'] / df['daily_sales']
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['is_holiday_season'] = df['month'].isin([11, 12]).astype(int)

# Remove outliers identified during EDA
df = df[abs(stats.zscore(df['inventory_level'])) <= 3]

print("Data after preparation:")
print(df.head())
print(f"Shape after cleaning: {df.shape}")

The EDA Workflow in Detail

Let me share the specific workflow I follow for EDA within a data engineering context:

[Diagram: iterative EDA workflow, from initial exploration through univariate, temporal, and multivariate analysis to anomaly detection and insight review]

This diagram represents my typical workflow when conducting EDA. Notice that it's also iterative—if the insights aren't sufficient to address the business questions, I refine my analysis and cycle through the process again. From my experience, this iterative approach has been crucial for successful data engineering projects where requirements often evolve.

Lessons Learned

Throughout my years of applying CRISP-DM and EDA in data engineering projects, I've learned several valuable lessons:

  1. Never skip the business understanding phase: Without clear business questions, EDA can become unfocused and inefficient.

  2. Visualization is communication: The best insights are worthless if stakeholders can't understand them. I invest time in creating clear, labeled visualizations.

  3. Be methodical but flexible: While I follow a structured approach, I'm always ready to explore unexpected patterns that emerge during analysis.

  4. Document as you go: I document all insights, including "negative" findings, as they often become valuable later in the project.

  5. Iterate with stakeholders: I regularly share EDA results with business stakeholders to refine our understanding and adjust the analysis direction.

From EDA Insights to Data Engineering Decisions

One of the most valuable aspects of EDA for data engineers is how insights directly influence our data engineering decisions:

[Diagram: mapping EDA insights (seasonality, outliers, regional variation) to pipeline and storage decisions]

In my experience, this translation from EDA insights to concrete data engineering decisions is what ultimately delivers value to the business. For example, discovering seasonal patterns in our retail data led us to implement time-based partitioning in our data warehouse, significantly improving query performance for seasonal analysis.
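That partitioning decision can be prototyped cheaply before touching the warehouse. A minimal sketch with plain pandas, where the file layout and the tiny frame are illustrative rather than our production setup:

```python
import pandas as pd
import pathlib
import tempfile

# Hypothetical sales rows; partitioning by year-month mirrors the
# time-based warehouse partitioning described above
df = pd.DataFrame({
    'date': pd.to_datetime(['2024-11-03', '2024-11-20', '2024-12-05']),
    'daily_sales': [120, 95, 240],
})
df['year_month'] = df['date'].dt.strftime('%Y-%m')

out = pathlib.Path(tempfile.mkdtemp())
for ym, part in df.groupby('year_month'):
    # One file per month: seasonal queries only touch their partition
    part.to_csv(out / f'sales_{ym}.csv', index=False)

print(sorted(p.name for p in out.iterdir()))
```

Running the seasonal queries against a layout like this first gives a quick read on whether the partition key actually matches the access pattern.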

Conclusion

Exploratory Data Analysis within the CRISP-DM framework has been the backbone of my data engineering practice. By systematically exploring data while keeping business goals in focus, I've been able to deliver insights that drive real business value.

The examples shared here reflect just a portion of the techniques I use daily. As data engineering continues to evolve, the fundamental importance of structured, thorough data exploration remains constant – a lesson I've learned through years of practice.

What EDA techniques have you found most valuable in your data engineering work? I'd love to hear your experiences in the comments below.


About the Author: A passionate data engineering practitioner with years of experience implementing data solutions across various industries.
