What is Data Engineering?

Five years ago, I started tinkering with data as a weekend hobbyist. What began with simple Python scripts to track my fitness stats, organize personal finances, and analyze my reading habits evolved into something much bigger. Today, I maintain several open-source data projects processing gigabytes of public datasets using AWS S3, Databricks, and PySpark. This post shares what I've learned building these personal data projects, with practical examples you can apply to your own experiments.

What Data Engineering Means in My Projects

To me, data engineering is about creating systems that transform raw, chaotic data into valuable insights. When I wanted to answer questions like "How consistent is my workout routine?" or "Which genres dominate my reading habits?", I needed reliable data pipelines to provide those answers quickly.

In my personal projects, data engineering involves designing and maintaining systems that collect, transform, and deliver data in usable formats. It's where my software development skills meet my passion for data analysis.

My Personal Data Engineering Toolkit

After experimenting with various technologies for my home projects, I've settled on this core stack:

- AWS S3 as the storage foundation for my personal data lake
- Databricks for running analysis and transformations
- PySpark as my go-to processing framework

Here's how I use each tool in my personal data projects.

Building My Own Data Lake with AWS S3

My first attempts at organizing data were disastrous. I dumped everything into a single S3 bucket with no structure, making it impossible to find anything after just a few months.

Now my personal projects follow this carefully designed structure:

s3://personal-data-lake/
├── raw/
│   ├── source=fitness-tracker/
│   │   ├── year=2025/
│   │   │   ├── month=06/
│   │   │   │   ├── day=18/
│   │   │   │   │   ├── workout_data_20250618_1.parquet
│   │   │   │   │   ├── workout_data_20250618_2.parquet
│   ├── source=goodreads-exports/
│   │   ├── ...
├── processed/
│   ├── ...
├── curated/
│   ├── ...

This organization by source and date using partitioning has dramatically improved my ability to manage personal datasets. Here's how I automate this structure:

import boto3
from datetime import datetime

def upload_to_s3(data_frame, source):
   """Upload personal data to S3 using an organized partition structure"""
   today = datetime.now()
   s3_client = boto3.client('s3')
   
   # Create the S3 path with proper partitioning
   s3_path = f"raw/source={source}/year={today.year}/month={today.month:02d}/day={today.day:02d}/"
   
   # Generate a unique filename
   filename = f"data_{today.strftime('%Y%m%d_%H%M%S')}.parquet"
   
   # Save the DataFrame as a Parquet file locally first
   local_path = f"/tmp/{filename}"
   data_frame.to_parquet(local_path)
   
   # Upload to S3
   s3_client.upload_file(
      local_path, 
      "personal-data-lake", 
      s3_path + filename
   )
   
   print(f"Successfully uploaded data to s3://personal-data-lake/{s3_path}{filename}")
   return f"s3://personal-data-lake/{s3_path}{filename}"

This approach keeps my personal data organized and also speeds up analysis: when a query filters on source, year, month, or day, the engine can skip every partition that doesn't match (partition pruning).
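
To make the pruning idea concrete, here is a minimal sketch in plain Python that builds the same key layout and then keeps only the keys under one partition. The helper names `partition_prefix` and `prune_keys` are hypothetical, and so are the May and Goodreads file names. A real query engine does this filtering for you.

```python
from datetime import date


def partition_prefix(source, day):
    """Return the raw-zone key prefix for one source and one calendar day."""
    return (
        f"raw/source={source}/year={day.year}/"
        f"month={day.month:02d}/day={day.day:02d}/"
    )


def prune_keys(keys, source, year, month=None):
    """Keep only the keys under the requested partitions."""
    prefix = f"raw/source={source}/year={year}/"
    if month is not None:
        prefix += f"month={month:02d}/"
    return [key for key in keys if key.startswith(prefix)]


# Hypothetical object keys following the layout shown earlier
keys = [
    partition_prefix("fitness-tracker", date(2025, 6, 18)) + "workout_data_20250618_1.parquet",
    partition_prefix("fitness-tracker", date(2025, 5, 2)) + "workout_data_20250502_1.parquet",
    partition_prefix("goodreads-exports", date(2025, 6, 18)) + "export.csv",
]

june_workouts = prune_keys(keys, "fitness-tracker", 2025, month=6)
print(june_workouts)  # only the June fitness-tracker file remains
```

Because the filter is a simple prefix match on the key, the other two files are never opened, which is where the speedup comes from.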

Databricks: My Personal Analytics Playground

After struggling to manage my own local Spark setup (and nearly burning out my laptop), I switched to Databricks Community Edition for my personal projects. It's become my go-to environment for data analysis.

What makes Databricks perfect for personal projects:

- Interactive notebooks for exploration
- Scheduled jobs for regular data updates
- Version control for my code
- Delta Lake for reliable storage

Here's how I organize my personal Databricks workspace:

Create separate folders for:
- Exploration notebooks
- Production pipeline code
- Utility functions I reuse
Set up appropriate clusters:
- Small clusters that auto-terminate after an hour
- Right-sized resources to balance performance and cost
Use Delta Lake for personal projects:

# From my workout data analysis project
# Namespaced import, so Spark's sum() and max() don't shadow Python's built-ins
from pyspark.sql import functions as F

# Read raw fitness tracker data (`spark` is predefined in Databricks notebooks)
raw_workouts = spark.read.format("parquet").load(
   "s3://personal-data-lake/raw/source=fitness-tracker/year=2025/month=06/"
)

# Process the data - analyze workout patterns
workout_stats = raw_workouts \
   .withColumn("workout_date", F.to_date(F.col("timestamp"))) \
   .withColumn("workout_duration_minutes", F.col("duration_seconds") / 60) \
   .groupBy("workout_date", "workout_type") \
   .agg(
      F.sum("calories_burned").alias("total_calories"),
      F.sum("workout_duration_minutes").alias("total_minutes"),
      F.max("heart_rate_max").alias("max_heart_rate")
   )

# Write to a Delta table (append mode: re-running the same month adds duplicate rows)
workout_stats.write \
   .format("delta") \
   .partitionBy("workout_date") \
   .mode("append") \
   .option("mergeSchema", "true") \
   .save("s3://personal-data-lake/processed/workout_statistics/")

Delta Lake has been essential for my personal projects, allowing me to update data and fix mistakes without complicated workarounds.
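
Fixing mistakes in place usually comes down to an upsert with Delta Lake's MERGE. The following is a minimal sketch, not my exact pipeline code: it assumes the workout_statistics table from above, and it assumes that workout_date plus workout_type uniquely identify a row. The helper names `build_merge_condition` and `upsert_workout_stats` are hypothetical.

```python
def build_merge_condition(key_columns, target_alias="t", source_alias="s"):
    """Build the condition MERGE uses to match incoming rows to existing ones."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    return " AND ".join(
        f"{target_alias}.{name} = {source_alias}.{name}" for name in key_columns
    )


def upsert_workout_stats(spark, corrected_df, table_path):
    """Update rows that already exist and insert the rest in one commit."""
    # Imported lazily so the helper above also works where Delta isn't installed
    from delta.tables import DeltaTable

    # Assumption: these two columns form the table's natural key
    condition = build_merge_condition(["workout_date", "workout_type"])
    (
        DeltaTable.forPath(spark, table_path).alias("t")
        .merge(corrected_df.alias("s"), condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )


print(build_merge_condition(["workout_date", "workout_type"]))
```

Unlike the append-mode write, running this twice with the same corrected rows leaves the table unchanged the second time.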

PySpark: Processing My Personal Datasets

PySpark helps me analyze datasets that would crush my laptop's memory. I've developed patterns that make my personal data processing efficient and maintainable.

Here's an example from my book reading habits project:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, datediff, to_date
from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Initialize Spark session
spark = SparkSession.builder \
   .appName("Reading Habits Analysis") \
   .config("spark.sql.adaptive.enabled", "true") \
   .getOrCreate()

# Read book data exported from Goodreads
books_df = spark.read.format("csv") \
   .option("header", "true") \
   .load("s3://personal-data-lake/raw/source=goodreads-exports/")

# Extract and transform reading data
reading_stats = books_df \
   .withColumn("completion_date", to_date(col("Date Read"), "yyyy/MM/dd")) \
   .withColumn("start_date", to_date(col("Date Added"), "yyyy/MM/dd")) \
   .withColumn("reading_days", datediff(col("completion_date"), col("start_date"))) \
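   .withColumn(
      "pages_per_day",
      # Null when reading_days is zero or negative (same-day reads, books shelved after finishing)
      F.when(col("reading_days") > 0, col("Number of Pages").cast("int") / col("reading_days"))
   )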

# Analyze reading patterns by genre
genre_stats = reading_stats \
   .groupBy("Exclusive Shelf", "Bookshelves") \
   .agg(
      F.count("*").alias("books_read"),
      F.sum("Number of Pages").alias("total_pages"),
      F.avg("My Rating").alias("average_rating"),
      F.avg("pages_per_day").alias("avg_reading_speed")
   )

# Find favorite authors using window functions
window_spec = Window.orderBy(F.desc("books_read"))

favorite_authors = reading_stats \
   .groupBy("Author") \
   .agg(F.count("*").alias("books_read")) \
   .withColumn("rank", F.rank().over(window_spec)) \
   .filter(col("rank") <= 10)

# Save the results
genre_stats.write \
   .format("delta") \
   .mode("overwrite") \
   .save("s3://personal-data-lake/curated/reading_patterns/")

From experience with my personal projects, I've learned to:

- Partition data smartly: usually by date for my time-series fitness data
- Use efficient file formats: Parquet or Delta Lake to save space and speed up queries
- Leverage Spark's optimizations: even for personal projects, they make a big difference
- Monitor performance: the Spark UI helps me understand why some queries are slow

Integrating Tools: My End-to-End Personal Data Platform

The real magic happens when I connect these tools into a complete system. Here's how I built a personal health analytics platform:

Data Collection Layer:
- Fitness tracker data → CSV exports → Python script → S3 raw zone
- Food tracking app → API → AWS Lambda → S3 raw zone
- Sleep tracking data → JSON exports → Python script → S3 raw zone
Processing Layer (Databricks + PySpark):

# My weekly health data update job
# Each call runs one notebook and waits for it; the second argument is the timeout in seconds
dbutils.notebook.run("/Users/me/health-pipeline/01-import-tracker-data", 300)
dbutils.notebook.run("/Users/me/health-pipeline/02-calculate-health-metrics", 600)
dbutils.notebook.run("/Users/me/health-pipeline/03-generate-weekly-report", 300)

Visualization Layer:
- Processed data → Delta tables in S3
- Personal dashboard → Streamlit app reading from Delta tables
- Weekly summary → Automated email with key stats
Automation:
- Databricks jobs running on a weekly schedule
- Simple alerts when data looks unusual
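
The "simple alerts" step can be as small as comparing the newest value against recent history. Here is a minimal sketch under assumptions: the function name `is_unusual`, the three-standard-deviation threshold, and the sample numbers are all illustrative rather than taken from my real job.

```python
from statistics import mean, stdev


def is_unusual(history, latest, threshold=3.0):
    """Flag `latest` when it is more than `threshold` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    spread = stdev(history)
    if spread == 0:
        # Every past value is identical, so any change counts as unusual
        return latest != history[0]
    return abs(latest - mean(history)) / spread > threshold


# Made-up weekly workout minutes
weekly_minutes = [210, 195, 220, 205, 215]

print(is_unusual(weekly_minutes, 208))  # False: close to the usual range
print(is_unusual(weekly_minutes, 20))   # True: likely a broken import
```

A check like this catches broken imports and unit mix-ups early, though it will also fire on genuine outliers such as a week off, so I treat an alert as a prompt to look rather than as proof of a bug.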

Hard-Earned Lessons from Personal Projects

Building data pipelines for my own projects has taught me valuable lessons:

- Garbage data produces garbage insights: I now validate data as soon as it enters my system
- Plan for future growth: my initial scripts couldn't handle a year's worth of data
- Add monitoring: I track simple metrics to know when something's wrong
- Keep history: Delta Lake's time travel feature helps me recover when I make mistakes
- Automate repetitive tasks: my early manual imports wasted countless hours
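
For the first lesson, validating at the door can start as a plain function that lists what is wrong with each record. This is a minimal sketch: the field names come from the workout code earlier in this post, while the function name `validate_workout` and the assumption that timestamps arrive as ISO 8601 strings are mine for illustration.

```python
from datetime import datetime

REQUIRED_FIELDS = ("timestamp", "workout_type", "duration_seconds", "calories_burned")


def validate_workout(record):
    """Return a list of problems found in one raw workout record (empty if clean)."""
    problems = [
        f"missing {name}" for name in REQUIRED_FIELDS if record.get(name) in (None, "")
    ]
    if problems:
        return problems  # no point checking values that aren't there

    # Assumption: timestamps are ISO 8601 strings such as 2025-06-18T07:30:00
    try:
        datetime.fromisoformat(record["timestamp"])
    except (TypeError, ValueError):
        problems.append("timestamp is not ISO 8601")

    for name in ("duration_seconds", "calories_burned"):
        try:
            if float(record[name]) < 0:
                problems.append(f"{name} is negative")
        except (TypeError, ValueError):
            problems.append(f"{name} is not numeric")
    return problems


good = {"timestamp": "2025-06-18T07:30:00", "workout_type": "run",
        "duration_seconds": "1800", "calories_burned": "320"}
bad = {"timestamp": "yesterday", "workout_type": "run",
       "duration_seconds": "-5", "calories_burned": "n/a"}

print(validate_workout(good))  # []
print(validate_workout(bad))
```

Records with a non-empty problem list can go to a separate quarantine location instead of the raw zone, so one bad export doesn't poison every downstream table.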

Start Your Own Data Engineering Journey

If you want to experiment with personal data engineering, here's how to begin:

Sign up for Databricks Community Edition - It's free and powerful
Create a minimal AWS setup for S3 storage (small personal datasets usually fit within the free tier, but check the current limits)
Start with these beginner projects:
- Track and analyze personal fitness data
- Build a media consumption dashboard (books, movies, music)
- Create a personal finance tracker

My journey into data engineering began with curiosity about my own data and has grown into a passion for building systems that extract meaning from raw information. I hope sharing these personal projects inspires you to explore what you can build with your own data.

In my next post, I'll show how I used these same tools to build a recommendation system for my personal library. Stay tuned!

