Kubeflow Pipelines

From Notebooks to Production Pipelines

Here's a scenario I've lived through: You build a model in a notebook, it works great, and then someone asks, "Can you retrain this with updated data?" You realize you have to:

  1. Remember which notebook cells to run

  2. Run them in the right order

  3. Hope you remember all the parameters

  4. Pray nothing breaks

Kubeflow Pipelines solves this by turning your ML workflow into code—reproducible, version-controlled, and automated.

What is a Kubeflow Pipeline?

A pipeline is a description of an ML workflow, including:

  • Components: Reusable steps (data loading, preprocessing, training, evaluation)

  • Dependencies: Which steps run in what order

  • Parameters: Configurable inputs

  • Artifacts: Outputs passed between steps

Key Insight: Pipelines are defined in Python but run as containerized steps in Kubernetes.

Installing the Kubeflow Pipelines SDK

For Python 3.12:
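
This assumes a recent 2.x release of the kfp SDK, since older releases predate Python 3.12 support:

```bash
pip install kfp
```

You can verify the install with `python -c "import kfp; print(kfp.__version__)"`.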

Your First Pipeline

Let's build a simple ML pipeline from scratch.

Step 1: Define Components

Components are the building blocks. Each component is a containerized step.

Simple Component Example:
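
Here's a minimal sketch using the KFP v2 SDK. The `@dsl.component` decorator turns a plain Python function into a containerized step; the base image is an assumption you can swap for your own:

```python
from kfp import dsl

@dsl.component(base_image="python:3.12")
def add(a: float, b: float) -> float:
    """Runs as its own container; inputs and the return value are typed."""
    return a + b
```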

Step 2: Create a Pipeline

Connect components into a workflow:
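
Continuing the sketch above, the `@dsl.pipeline` decorator wires components together. KFP infers the execution order from data flow: the second step consumes the first step's output, so it runs afterward:

```python
@dsl.pipeline(name="add-pipeline", description="Chains two add steps.")
def add_pipeline(x: float = 1.0, y: float = 2.0) -> float:
    first = add(a=x, b=y)
    second = add(a=first.output, b=3.0)  # depends on `first` via its output
    return second.output
```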

Step 3: Compile and Run
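
Compile the pipeline to a YAML package, then submit it through the client. The host URL and experiment name are placeholders; point them at your own KFP deployment:

```python
import kfp
from kfp import compiler

compiler.Compiler().compile(
    pipeline_func=add_pipeline,
    package_path="add_pipeline.yaml",
)

client = kfp.Client(host="http://localhost:8080")  # replace with your endpoint
run = client.create_run_from_pipeline_package(
    "add_pipeline.yaml",
    arguments={"x": 5.0, "y": 7.0},
    experiment_name="tutorials",  # placeholder experiment name
)
```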

Pipeline Patterns

Pattern 1: Data Preprocessing Pipeline
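
A sketch of a load-then-clean flow. Passing `Dataset` artifacts means the data moves by file path through backing storage rather than through the pipeline's parameter channel:

```python
from kfp import dsl
from kfp.dsl import Dataset, Input, Output

@dsl.component(base_image="python:3.12", packages_to_install=["pandas"])
def load_data(url: str, raw: Output[Dataset]):
    import pandas as pd
    pd.read_csv(url).to_csv(raw.path, index=False)

@dsl.component(base_image="python:3.12", packages_to_install=["pandas"])
def clean_data(raw: Input[Dataset], clean: Output[Dataset]):
    import pandas as pd
    pd.read_csv(raw.path).dropna().to_csv(clean.path, index=False)

@dsl.pipeline(name="preprocess")
def preprocess_pipeline(url: str):
    raw_task = load_data(url=url)
    clean_data(raw=raw_task.outputs["raw"])
```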

Pattern 2: Training with Hyperparameter Tuning
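
For a simple grid sweep (full automated tuning is Katib's job, covered in the next chapter), `dsl.ParallelFor` fans out one task per value; `train_model` here is a hypothetical training component:

```python
@dsl.pipeline(name="hparam-sweep")
def sweep_pipeline():
    # One parallel training task per candidate learning rate.
    with dsl.ParallelFor(items=[0.1, 0.01, 0.001]) as lr:
        train_model(learning_rate=lr)  # hypothetical component
```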

Pattern 3: Conditional Execution
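
With a recent KFP v2 SDK, `dsl.If` (formerly `dsl.Condition`) runs a branch only when a condition on an upstream output holds; `evaluate_model` and `deploy_model` are hypothetical components:

```python
@dsl.pipeline(name="conditional-deploy")
def conditional_pipeline(threshold: float = 0.9):
    eval_task = evaluate_model()  # hypothetical; returns accuracy as a float
    # The deploy step is skipped entirely when the condition is false.
    with dsl.If(eval_task.output >= threshold):
        deploy_model()  # hypothetical component
```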

Working with Data in Pipelines

Using Cloud Storage

For S3:
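
A sketch of a component that reads straight from S3, assuming pandas with s3fs and that credentials come from the pod environment (an IAM role or a mounted secret); the URI is supplied at run time:

```python
from kfp import dsl
from kfp.dsl import Dataset, Output

@dsl.component(base_image="python:3.12", packages_to_install=["pandas", "s3fs"])
def load_from_s3(s3_uri: str, data: Output[Dataset]):
    import pandas as pd
    # pandas resolves s3:// URIs through s3fs; auth comes from the environment.
    pd.read_csv(s3_uri).to_csv(data.path, index=False)
```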

Pipeline Caching

Avoid re-running expensive steps:
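
In KFP v2, caching is toggled per task. When enabled, a step whose code and inputs are unchanged since a previous run is skipped and its outputs reused. Reusing the components sketched under Pattern 1:

```python
@dsl.pipeline(name="cached-preprocess")
def cached_pipeline(url: str):
    load_task = load_data(url=url)
    load_task.set_caching_options(True)    # reuse results for identical inputs
    clean_task = clean_data(raw=load_task.outputs["raw"])
    clean_task.set_caching_options(False)  # always recompute this step
```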

Scheduling Pipelines

One-Time Run
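
Submitting a compiled package once, as in Step 3 above; `run_name` is an optional label shown in the UI:

```python
run = client.create_run_from_pipeline_package(
    "add_pipeline.yaml",
    arguments={"x": 5.0, "y": 7.0},
    run_name="one-off-retrain",
)
print(run.run_id)  # handy for monitoring later
```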

Recurring Schedule
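
A sketch using `create_recurring_run`. Note that KFP cron expressions have six fields, the first being seconds; the experiment and job names are placeholders:

```python
experiment = client.create_experiment(name="nightly-training")
client.create_recurring_run(
    experiment_id=experiment.experiment_id,
    job_name="nightly-retrain",
    pipeline_package_path="add_pipeline.yaml",
    cron_expression="0 0 2 * * *",  # every day at 02:00
    enabled=True,
)
```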

Monitoring Pipelines

View in UI

  1. Open Kubeflow Dashboard

  2. Navigate to Pipelines

  3. Select Experiments

  4. Click on your experiment

  5. View run details, logs, and metrics

Programmatically
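
A sketch using the same client, experiment, and run objects as above; exact response field names can vary between SDK versions:

```python
# List recent runs in an experiment and print their states.
for r in client.list_runs(experiment_id=experiment.experiment_id).runs or []:
    print(r.display_name, r.state)

# Block until a specific run finishes (or the timeout in seconds expires).
client.wait_for_run_completion(run_id=run.run_id, timeout=3600)
```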

Best Practices

1. Keep Components Small and Focused

❌ One component that does everything

✅ Multiple components, each with a single responsibility

2. Version Your Pipelines
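
One approach is to upload each compiled package as a new version of the same named pipeline, so every run records which version produced it; the names here are placeholders:

```python
client.upload_pipeline(
    pipeline_package_path="add_pipeline.yaml",
    pipeline_name="training-pipeline",
)
client.upload_pipeline_version(
    pipeline_package_path="add_pipeline.yaml",
    pipeline_version_name="v2-feature-scaling",
    pipeline_name="training-pipeline",
)
```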

3. Parameterize Everything
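
Expose anything you might want to change between runs as a pipeline parameter with a sensible default, rather than hard-coding it inside a component:

```python
@dsl.pipeline(name="train")
def train_pipeline(
    data_url: str,                # no default: must be supplied per run
    learning_rate: float = 0.01,
    epochs: int = 10,
):
    train_model(url=data_url, lr=learning_rate, epochs=epochs)  # hypothetical
```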

4. Use Type Annotations
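
Annotations are how the compiler distinguishes small values from file-backed artifacts and catches wiring mistakes before anything runs:

```python
from kfp.dsl import Dataset, Input, Model, Output

@dsl.component(base_image="python:3.12")
def train(data: Input[Dataset], learning_rate: float, model: Output[Model]):
    # `data` and `model` are passed by path; `learning_rate` by value.
    with open(model.path, "w") as f:
        f.write("placeholder model")
```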

5. Log Abundantly
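
Anything a component writes to stdout or stderr lands in that step's pod logs, so the standard `logging` module is enough:

```python
@dsl.component(base_image="python:3.12")
def train(epochs: int):
    import logging
    logging.basicConfig(level=logging.INFO)
    for epoch in range(epochs):
        logging.info("epoch %d/%d complete", epoch + 1, epochs)
```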

Logs appear in the Kubeflow UI for debugging.

Troubleshooting

Component Fails with a Missing Module

Error: ModuleNotFoundError: No module named 'pandas'

Solution: Add to packages_to_install:
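
Declare every third-party package in the decorator and import it inside the function body, since that body is what runs in the container:

```python
@dsl.component(
    base_image="python:3.12",
    packages_to_install=["pandas"],  # installed before the step executes
)
def preprocess(url: str):
    import pandas as pd  # imported inside: this code runs in the container
    print(pd.read_csv(url).shape)
```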

Pipeline Won't Compile

Error: Syntax or type errors

Solution: Test components individually first:
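
In the v2 SDK, a decorated component keeps its original function on `python_func`, so you can unit-test the logic locally and compile early to surface type errors:

```python
# Call the underlying Python function directly, outside Kubernetes.
assert add.python_func(a=2.0, b=3.0) == 5.0

# Compiling catches most wiring and type mistakes before submission.
from kfp import compiler
compiler.Compiler().compile(add_pipeline, "add_pipeline.yaml")
```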

Slow Pipeline Execution

Causes:

  • Large data being passed between components

  • No caching

  • Unnecessary recomputation

Solutions:

  • Use cloud storage URIs instead of passing data

  • Enable caching for stable components

  • Parallelize independent steps

Key Takeaways

  1. Pipelines make ML workflows reproducible and automated

  2. Components are containerized—each gets clean dependencies

  3. Use parameters for flexibility and experimentation

  4. Leverage caching for expensive operations

  5. Monitor and log everything for debugging

Next Steps

Now you can build automated training pipelines. In Model Training with Katib, we'll explore automated hyperparameter tuning to optimize your models.

