Kubeflow Pipelines

From Notebooks to Production Pipelines

Here's a scenario I've lived through: You build a model in a notebook, it works great, and then someone asks, "Can you retrain this with updated data?" You realize you have to:

  1. Remember which notebook cells to run

  2. Run them in the right order

  3. Hope you remember all the parameters

  4. Pray nothing breaks

Kubeflow Pipelines solves this by turning your ML workflow into code—reproducible, version-controlled, and automated.

What is a Kubeflow Pipeline?

A pipeline is a description of an ML workflow, including:

  • Components: Reusable steps (data loading, preprocessing, training, evaluation)

  • Dependencies: Which steps run in what order

  • Parameters: Configurable inputs

  • Artifacts: Outputs passed between steps

Key Insight: Pipelines are defined in Python but run as containerized steps in Kubernetes.

Installing the Kubeflow Pipelines SDK

For Python 3.12:
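
This assumes a recent 2.x release of the kfp SDK, since older releases predate Python 3.12 support:

```bash
pip install kfp
```

You can verify the install with `python -c "import kfp; print(kfp.__version__)"`.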

Your First Pipeline

Let's build a simple ML pipeline from scratch.

Step 1: Define Components

Components are the building blocks. Each component is a containerized step.

Simple Component Example:
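
Here's a minimal sketch using the KFP v2 SDK. The `@dsl.component` decorator turns a plain Python function into a containerized step; the base image is an assumption you can swap for your own:

```python
from kfp import dsl

@dsl.component(base_image="python:3.12")
def add(a: float, b: float) -> float:
    """Runs as its own container; inputs and the return value are typed."""
    return a + b
```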

Step 2: Create a Pipeline

Connect components into a workflow:
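
Continuing the sketch above, the `@dsl.pipeline` decorator wires components together. KFP infers the execution order from data flow: the second step consumes the first step's output, so it runs afterward:

```python
@dsl.pipeline(name="add-pipeline", description="Chains two add steps.")
def add_pipeline(x: float = 1.0, y: float = 2.0) -> float:
    first = add(a=x, b=y)
    second = add(a=first.output, b=3.0)  # depends on `first` via its output
    return second.output
```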

Step 3: Compile and Run
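
Compile the pipeline to a YAML package, then submit it through the client. The host URL and experiment name are placeholders; point them at your own KFP deployment:

```python
import kfp
from kfp import compiler

compiler.Compiler().compile(
    pipeline_func=add_pipeline,
    package_path="add_pipeline.yaml",
)

client = kfp.Client(host="http://localhost:8080")  # replace with your endpoint
run = client.create_run_from_pipeline_package(
    "add_pipeline.yaml",
    arguments={"x": 5.0, "y": 7.0},
    experiment_name="tutorials",  # placeholder experiment name
)
```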

Pipeline Patterns

Pattern 1: Data Preprocessing Pipeline
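
A sketch of a load-then-clean flow. Passing `Dataset` artifacts means the data moves by file path through backing storage rather than through the pipeline's parameter channel:

```python
from kfp import dsl
from kfp.dsl import Dataset, Input, Output

@dsl.component(base_image="python:3.12", packages_to_install=["pandas"])
def load_data(url: str, raw: Output[Dataset]):
    import pandas as pd
    pd.read_csv(url).to_csv(raw.path, index=False)

@dsl.component(base_image="python:3.12", packages_to_install=["pandas"])
def clean_data(raw: Input[Dataset], clean: Output[Dataset]):
    import pandas as pd
    pd.read_csv(raw.path).dropna().to_csv(clean.path, index=False)

@dsl.pipeline(name="preprocess")
def preprocess_pipeline(url: str):
    raw_task = load_data(url=url)
    clean_data(raw=raw_task.outputs["raw"])
```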

Pattern 2: Training with Hyperparameter Tuning
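
For a simple grid sweep (full automated tuning is Katib's job, covered in the next chapter), `dsl.ParallelFor` fans out one task per value; `train_model` here is a hypothetical training component:

```python
@dsl.pipeline(name="hparam-sweep")
def sweep_pipeline():
    # One parallel training task per candidate learning rate.
    with dsl.ParallelFor(items=[0.1, 0.01, 0.001]) as lr:
        train_model(learning_rate=lr)  # hypothetical component
```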

Pattern 3: Conditional Execution
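
With a recent KFP v2 SDK, `dsl.If` (formerly `dsl.Condition`) runs a branch only when a condition on an upstream output holds; `evaluate_model` and `deploy_model` are hypothetical components:

```python
@dsl.pipeline(name="conditional-deploy")
def conditional_pipeline(threshold: float = 0.9):
    eval_task = evaluate_model()  # hypothetical; returns accuracy as a float
    # The deploy step is skipped entirely when the condition is false.
    with dsl.If(eval_task.output >= threshold):
        deploy_model()  # hypothetical component
```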

Working with Data in Pipelines

Using Cloud Storage

For S3:
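
A sketch of a component that reads straight from S3, assuming pandas with s3fs and that credentials come from the pod environment (an IAM role or a mounted secret); the URI is supplied at run time:

```python
from kfp import dsl
from kfp.dsl import Dataset, Output

@dsl.component(base_image="python:3.12", packages_to_install=["pandas", "s3fs"])
def load_from_s3(s3_uri: str, data: Output[Dataset]):
    import pandas as pd
    # pandas resolves s3:// URIs through s3fs; auth comes from the environment.
    pd.read_csv(s3_uri).to_csv(data.path, index=False)
```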

Pipeline Caching

Avoid re-running expensive steps:
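
In KFP v2, caching is toggled per task. When enabled, a step whose code and inputs are unchanged since a previous run is skipped and its outputs reused. Reusing the components sketched under Pattern 1:

```python
@dsl.pipeline(name="cached-preprocess")
def cached_pipeline(url: str):
    load_task = load_data(url=url)
    load_task.set_caching_options(True)    # reuse results for identical inputs
    clean_task = clean_data(raw=load_task.outputs["raw"])
    clean_task.set_caching_options(False)  # always recompute this step
```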

Scheduling Pipelines

One-Time Run
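
Submitting a compiled package once, as in Step 3 above; `run_name` is an optional label shown in the UI:

```python
run = client.create_run_from_pipeline_package(
    "add_pipeline.yaml",
    arguments={"x": 5.0, "y": 7.0},
    run_name="one-off-retrain",
)
print(run.run_id)  # handy for monitoring later
```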

Recurring Schedule
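
A sketch using `create_recurring_run`. Note that KFP cron expressions have six fields, the first being seconds; the experiment and job names are placeholders:

```python
experiment = client.create_experiment(name="nightly-training")
client.create_recurring_run(
    experiment_id=experiment.experiment_id,
    job_name="nightly-retrain",
    pipeline_package_path="add_pipeline.yaml",
    cron_expression="0 0 2 * * *",  # every day at 02:00
    enabled=True,
)
```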

Monitoring Pipelines

View in UI

  1. Open Kubeflow Dashboard

  2. Navigate to Pipelines

  3. Select Experiments

  4. Click on your experiment

  5. View run details, logs, and metrics

Programmatically
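
A sketch using the same client, experiment, and run objects as above; exact response field names can vary between SDK versions:

```python
# List recent runs in an experiment and print their states.
for r in client.list_runs(experiment_id=experiment.experiment_id).runs or []:
    print(r.display_name, r.state)

# Block until a specific run finishes (or the timeout in seconds expires).
client.wait_for_run_completion(run_id=run.run_id, timeout=3600)
```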

Best Practices

1. Keep Components Small and Focused

❌ One component that does everything

✅ Multiple components, each with a single responsibility

2. Version Your Pipelines
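
One approach is to upload each compiled package as a new version of the same named pipeline, so every run records which version produced it; the names here are placeholders:

```python
client.upload_pipeline(
    pipeline_package_path="add_pipeline.yaml",
    pipeline_name="training-pipeline",
)
client.upload_pipeline_version(
    pipeline_package_path="add_pipeline.yaml",
    pipeline_version_name="v2-feature-scaling",
    pipeline_name="training-pipeline",
)
```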

3. Parameterize Everything
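
Expose anything you might want to change between runs as a pipeline parameter with a sensible default, rather than hard-coding it inside a component:

```python
@dsl.pipeline(name="train")
def train_pipeline(
    data_url: str,                # no default: must be supplied per run
    learning_rate: float = 0.01,
    epochs: int = 10,
):
    train_model(url=data_url, lr=learning_rate, epochs=epochs)  # hypothetical
```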

4. Use Type Annotations
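
Annotations are how the compiler distinguishes small values from file-backed artifacts and catches wiring mistakes before anything runs:

```python
from kfp.dsl import Dataset, Input, Model, Output

@dsl.component(base_image="python:3.12")
def train(data: Input[Dataset], learning_rate: float, model: Output[Model]):
    # `data` and `model` are passed by path; `learning_rate` by value.
    with open(model.path, "w") as f:
        f.write("placeholder model")
```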

5. Log Abundantly
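
Anything a component writes to stdout or stderr lands in that step's pod logs, so the standard `logging` module is enough:

```python
@dsl.component(base_image="python:3.12")
def train(epochs: int):
    import logging
    logging.basicConfig(level=logging.INFO)
    for epoch in range(epochs):
        logging.info("epoch %d/%d complete", epoch + 1, epochs)
```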

Logs appear in the Kubeflow UI for debugging.

Troubleshooting

Component Fails with a Missing Module

Error: ModuleNotFoundError: No module named 'pandas'

Solution: Add to packages_to_install:
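
Declare every third-party package in the decorator and import it inside the function body, since that body is what runs in the container:

```python
@dsl.component(
    base_image="python:3.12",
    packages_to_install=["pandas"],  # installed before the step executes
)
def preprocess(url: str):
    import pandas as pd  # imported inside: this code runs in the container
    print(pd.read_csv(url).shape)
```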

Pipeline Won't Compile

Error: Syntax or type errors

Solution: Test components individually first:
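
In the v2 SDK, a decorated component keeps its original function on `python_func`, so you can unit-test the logic locally and compile early to surface type errors:

```python
# Call the underlying Python function directly, outside Kubernetes.
assert add.python_func(a=2.0, b=3.0) == 5.0

# Compiling catches most wiring and type mistakes before submission.
from kfp import compiler
compiler.Compiler().compile(add_pipeline, "add_pipeline.yaml")
```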

Slow Pipeline Execution

Causes:

  • Large data being passed between components

  • No caching

  • Unnecessary recomputation

Solutions:

  • Use cloud storage URIs instead of passing data

  • Enable caching for stable components

  • Parallelize independent steps

Key Takeaways

  1. Pipelines make ML workflows reproducible and automated

  2. Components are containerized—each gets clean dependencies

  3. Use parameters for flexibility and experimentation

  4. Leverage caching for expensive operations

  5. Monitor and log everything for debugging

Next Steps

Now you can build automated training pipelines. In Model Training with Katib, we'll explore automated hyperparameter tuning to optimize your models.

