MLOps Journey: A Data Engineer's Perspective with Databricks and GitLab
Published: June 30, 2025
As a data engineering practitioner, I've witnessed the evolution of machine learning operations from experimental notebooks to production-ready systems. My journey with MLOps has been filled with challenges, learnings, and transformative experiences. In this post, I'll share my personal experiences implementing MLOps practices using Databricks Community Edition, Python, and GitLab as my technology stack of choice.
The Data Engineer's Dilemma in ML Projects
When I first transitioned from traditional data engineering to machine learning projects, I quickly realized that my existing toolkit and processes were insufficient. Traditional data pipelines were deterministic and relatively straightforward to test and deploy. Machine learning pipelines, however, introduced new complexities:
Models that behaved differently with varying data distributions
Experiments that needed careful tracking and reproducibility
Model drift that required continuous monitoring
Increased collaboration needs between data scientists and engineers
I found myself asking: How do I bring the same level of rigor and automation to ML workflows that I've established for data processing pipelines?
My MLOps Architecture Journey
After numerous iterations, I developed an MLOps architecture that balanced flexibility with governance. Here's a sequence diagram showing the end-to-end workflow I established:
This workflow helped establish clear handoffs between roles while maintaining the flexibility data scientists needed for experimentation.
Setting Up the Infrastructure
Databricks Community Edition: The Experimentation Platform
Databricks Community Edition became the foundation of my MLOps practice for several reasons:
It provided a collaborative notebook environment that data scientists loved
It included built-in MLflow for experiment tracking
It let me prototype Spark workloads that could later scale on a full Databricks workspace
It was accessible without enterprise-level budgets
Setting up Databricks for MLOps wasn't trivial. I needed to:
Configure workspace permissions
Create cluster configurations that balanced cost with performance
Set up MLflow experiment tracking
Establish connections with GitLab
The most critical part was wiring up MLflow experiment tracking (on Databricks the tracking server itself is managed for you):
GitLab: Version Control and CI/CD Pipeline
While Databricks handled experimentation, I needed a robust system for version control, collaboration, and automated testing. GitLab became my platform of choice because:
It provided comprehensive CI/CD capabilities
It had excellent support for merge requests and code reviews
It integrated well with Python ecosystems
It facilitated collaboration between data scientists and engineers
I structured my GitLab repository to accommodate both the code and ML artifacts:
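An illustrative layout (the directory names are my own sketch of this split, not a prescribed standard):

```
mlops-project/
├── .gitlab-ci.yml          # CI/CD pipeline definition
├── notebooks/              # Databricks notebooks exported as source
├── src/
│   ├── data/               # ingestion, validation, preprocessing
│   ├── features/           # feature engineering
│   ├── models/             # training, evaluation, deployment
│   └── monitoring/         # drift and data-quality checks
├── tests/                  # unit and data-quality tests
├── conf/                   # environment and pipeline configs
└── requirements.txt
```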
The CI/CD pipeline was configured to:
Run tests on code changes
Validate data quality
Train and validate models
Register models if they met performance criteria
Deploy models to production
Here's the GitLab CI/CD pipeline configuration that tied it all together:
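A sketch of those stages as a `.gitlab-ci.yml`. Job names, script paths, and the threshold flag are illustrative, not the exact configuration I ran:

```yaml
stages:
  - test
  - validate-data
  - train
  - register
  - deploy

unit-tests:
  stage: test
  image: python:3.10
  script:
    - pip install -r requirements.txt
    - pytest tests/

data-validation:
  stage: validate-data
  image: python:3.10
  script:
    - python src/data/validate.py --config conf/data.yml

train-model:
  stage: train
  script:
    # Long-running training is delegated to a Databricks job via its REST API,
    # so this CI job only triggers and monitors it.
    - python src/models/trigger_training.py
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'

register-model:
  stage: register
  script:
    - python src/models/register.py --min-auc 0.85

deploy-model:
  stage: deploy
  script:
    - python src/models/deploy.py --stage Production
  when: manual
```

Keeping deployment a manual step gave a human gate before production while everything upstream stayed automatic.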
The Data Processing and Model Training Workflow
One of my biggest challenges was establishing a repeatable process for data processing and model training. Here's a sequence diagram of the workflow I implemented:
The code to implement this workflow was designed to be both robust and maintainable:
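A condensed sketch of that pattern: schema declared up front, each step a pure function so failures surface early and steps can be tested in isolation. The column names and rules are illustrative:

```python
import pandas as pd

# Expected schema declared up front so validation failures are explicit.
# Columns and rules here are illustrative examples.
SCHEMA = {"customer_id": "int64", "tenure_months": "int64", "monthly_spend": "float64"}

def validate(df: pd.DataFrame) -> pd.DataFrame:
    missing = set(SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    df = df.astype(SCHEMA)
    if (df["monthly_spend"] < 0).any():
        raise ValueError("Negative spend values found")
    return df

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Each step is a pure function, so steps can be tested and swapped independently.
    df = df.dropna(subset=["monthly_spend"])
    df["spend_per_month_of_tenure"] = df["monthly_spend"] / df["tenure_months"].clip(lower=1)
    return df

def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    return preprocess(validate(df))
```

Failing loudly at the validation step, rather than letting bad rows flow into training, is what drove the drop in data-related model failures described later in this post.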
The model training code followed a similar pattern of configuration-driven, trackable processes:
The Model Deployment and Monitoring Workflow
The final piece of my MLOps puzzle was model deployment and monitoring. This was perhaps the most challenging part, requiring careful orchestration between GitLab CI/CD, Databricks, and production systems. Here's a sequence diagram showing the deployment process:
The code for model deployment looked like this:
Lessons Learned as a Data Engineering Practitioner
Throughout my MLOps journey, I've learned several critical lessons that have shaped my practice as a data engineer:
1. Start with Strong Data Foundations
As a data engineer, I found that ML projects amplify data quality issues. My biggest success factor was investing heavily in data validation and quality controls. Prior to implementing MLOps, nearly 40% of model failures could be traced to data issues. After implementing robust data validation in the pipeline, this dropped to less than 10%.
2. Embrace Modularity
Making ML pipelines modular helps isolate issues and enables incremental improvements. I separate my pipelines into discrete steps:
Data ingestion
Validation
Preprocessing
Feature engineering
Model training
Evaluation
Deployment
This approach has reduced debugging time by 60% and made it easier to identify bottlenecks.
3. Automate Thoughtfully
Not everything should be automated immediately. I've found a phased approach works best:
Start with automating the most error-prone manual tasks
Add monitoring and alerting next
Finally, implement automated retraining and deployment
4. Version Everything
In ML systems, versioning goes beyond code. I track:
Data versions (using DVC)
Model versions (using MLflow)
Environment configurations
Experiment parameters
This comprehensive versioning has been crucial for reproducing results and debugging production issues.
5. Monitor Not Just Performance, But Data Too
My most valuable lesson was learning to monitor input data distributions in production. By monitoring the following, we caught data drift issues several times before they impacted model performance:
Feature distributions
Input data schema changes
Data quality metrics
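A sketch of what those checks can look like for a scoring batch. The reference statistics, columns, and thresholds are illustrative, and the mean-shift check is a deliberately crude heuristic rather than a formal statistical test:

```python
import pandas as pd

# Reference statistics captured at training time; values are illustrative.
REFERENCE = {"monthly_spend": {"mean": 45.0, "std": 12.0}}
EXPECTED_COLUMNS = {"customer_id", "tenure_months", "monthly_spend"}

def check_batch(df: pd.DataFrame, z_threshold: float = 3.0) -> list[str]:
    """Return human-readable alerts for a production scoring batch."""
    alerts = []
    if set(df.columns) != EXPECTED_COLUMNS:
        alerts.append(f"schema change: {sorted(set(df.columns) ^ EXPECTED_COLUMNS)}")
    for col, ref in REFERENCE.items():
        if col in df.columns:
            null_rate = df[col].isna().mean()
            if null_rate > 0.05:
                alerts.append(f"{col}: null rate {null_rate:.1%}")
            # Crude drift heuristic: how far the batch mean sits from the
            # training-time mean, in units of training-time std.
            z = abs(df[col].mean() - ref["mean"]) / ref["std"]
            if z > z_threshold:
                alerts.append(f"{col}: mean shifted {z:.1f} std from reference")
    return alerts
```

Piping these alerts into the same channel as infrastructure alarms is what let us react to upstream data changes before model metrics moved.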
6. Collaboration Is Key
The most successful ML projects I've worked on involved close collaboration between data scientists and engineers, supported by practices like:
Shared GitLab repositories
Clear documentation
Standardized notebooks in Databricks
Regular sync meetings
These practices have dramatically improved the transition from experiment to production.
Challenges I Faced and How I Overcame Them
Challenge 1: Environment Inconsistency
Problem: Models would work in development but fail in production due to environment differences.
Solution: I implemented Docker containers for consistent environments and created detailed environment.yml files for Databricks clusters. This reduced environment-related failures by 90%.
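An illustrative environment.yml along those lines; the package set and pinned versions are examples, and the point is that the same pinned file drives both the Docker image and the cluster libraries:

```yaml
name: churn-model-env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - pip:
      - mlflow==2.9.2
      - scikit-learn==1.3.2
      - pandas==2.1.4
```

Pinning exact versions is what makes the environment reproducible; loose constraints like `pandas>=2.0` reintroduce exactly the dev/prod skew this was meant to fix.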
Challenge 2: Long-Running Training Jobs Breaking CI/CD
Problem: Model training could take hours, making CI/CD pipelines impractical.
Solution: I separated the CI/CD pipeline into stages and used Databricks Jobs API to handle long-running training processes asynchronously. This kept most CI jobs under 10 minutes while still ensuring quality.
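A sketch of that asynchronous hand-off, using the Databricks Jobs API's `run-now` endpoint. The host, token handling, and parameter names are illustrative; a real pipeline would then poll the returned run ID for completion rather than blocking the CI runner:

```python
import requests

def trigger_training(host, token, job_id, params=None):
    """Kick off a pre-defined Databricks job asynchronously; returns its run_id.

    The CI job returns immediately after triggering, keeping the pipeline fast;
    a later stage (or a scheduled check) polls the run's status.
    """
    resp = requests.post(
        f"{host}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {token}"},
        json={"job_id": job_id, "notebook_params": params or {}},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]
```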
Challenge 3: Model Drift in Production
Problem: Models would silently degrade over time as data patterns shifted.
Solution: I implemented:
Statistical monitoring of input feature distributions
Performance monitoring with sliding windows
Automated retraining triggers when metrics dropped below thresholds
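The three checks above can be sketched as follows. The two-sample Kolmogorov-Smirnov test here stands in for "statistical monitoring of input feature distributions"; the thresholds and metric floor are illustrative:

```python
import numpy as np
from scipy import stats

def feature_drifted(reference, current, p_threshold=0.01):
    # Two-sample Kolmogorov-Smirnov test: a small p-value means the recent
    # production window is unlikely to come from the training distribution.
    _, p_value = stats.ks_2samp(reference, current)
    return p_value < p_threshold

def performance_degraded(window_metrics, floor):
    # Sliding-window check: trigger when the recent average drops below the floor.
    return float(np.mean(window_metrics)) < floor

def should_retrain(reference, current, window_metrics, metric_floor=0.80):
    return feature_drifted(reference, current) or performance_degraded(
        window_metrics, metric_floor
    )
```

Feeding `should_retrain` from a scheduled monitoring job is what turns drift detection into an automated retraining trigger rather than a dashboard someone has to watch.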
This approach has caught drift issues weeks before they would have impacted business metrics.
Conclusion: The Continuous MLOps Journey
My MLOps journey as a data engineer has transformed how I approach machine learning projects. The integration of Databricks for experimentation, GitLab for CI/CD, and MLflow for experiment tracking has created a robust, reproducible ML pipeline that balances flexibility with governance.
The key takeaway from my experience is that MLOps is not a destination but a continuous journey of improvement. Start small, focus on the highest-value problems, and gradually expand your MLOps capabilities.
For data engineers taking their first steps into MLOps, I recommend:
Start with experiment tracking and model versioning
Focus on data quality and validation
Build modular pipelines that can evolve
Implement monitoring early
Collaborate closely with data scientists
I hope sharing my personal MLOps journey helps you on yours. What challenges are you facing in implementing MLOps in your organization? What tools and practices have you found most valuable? I'd love to hear about your experiences in the comments.
About the Author: A passionate data engineering practitioner with years of experience implementing MLOps solutions across various industries.