Model Training with Katib

The Hyperparameter Problem

You've probably been here: training a model, manually tweaking n_estimators, trying different learning_rate values, and running the same code over and over with slight variations.

# The manual approach (we've all done this)
from sklearn.ensemble import GradientBoostingClassifier

# Note: RandomForestClassifier has no learning_rate parameter, so gradient
# boosting is used here; X_train, y_train, X_test, y_test are assumed to exist.
for lr in [0.001, 0.01, 0.1]:
    for n_est in [50, 100, 200]:
        model = GradientBoostingClassifier(learning_rate=lr, n_estimators=n_est)
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        print(f"lr={lr}, n_est={n_est}, score={score}")

Katib automates this entire process with sophisticated optimization algorithms, parallel execution, and automatic tracking.

What is Katib?

Katib is Kubeflow's hyperparameter tuning and neural architecture search component. It:

  • Runs multiple training jobs with different hyperparameters in parallel

  • Uses optimization algorithms (grid search, random search, Bayesian optimization)

  • Tracks all experiments automatically

  • Finds optimal hyperparameters based on your objective

The win: Instead of babysitting experiments, you define the search space and let Katib optimize.

Your First Katib Experiment

Step 1: Create a Training Script

First, create a training script that accepts hyperparameters as command-line arguments and prints its metrics to stdout, where Katib's metrics collector can read them:
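A minimal sketch (the script name, dataset, and flag names are illustrative; the important part is printing the objective metric as a name=value line):

# train.py
import argparse

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=10)
    args = parser.parse_args()

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = RandomForestClassifier(
        n_estimators=args.n_estimators,
        max_depth=args.max_depth,
    )
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)

    # Katib's default stdout metrics collector parses lines of the form name=value
    print(f"accuracy={accuracy}")


if __name__ == "__main__":
    main()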

Step 2: Containerize Your Training Code

Build and push:
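A minimal sketch of both pieces; the registry and image name are placeholders:

# Dockerfile (assumes train.py from Step 1 sits in the build context)
FROM python:3.11-slim
RUN pip install scikit-learn
COPY train.py /app/train.py
ENTRYPOINT ["python", "/app/train.py"]

Then:

docker build -t <your-registry>/rf-train:v1 .
docker push <your-registry>/rf-train:v1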

Step 3: Define Katib Experiment

Create a YAML file defining the hyperparameter search:
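A sketch of an Experiment manifest matching the script above (namespace and image are placeholders; objectiveMetricName must match the name the script prints):

# random-forest-tuning.yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-forest-tuning
  namespace: <your-namespace>
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: n_estimators
      parameterType: int
      feasibleSpace:
        min: "50"
        max: "300"
    - name: max_depth
      parameterType: int
      feasibleSpace:
        min: "3"
        max: "15"
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: nEstimators
        reference: n_estimators
      - name: maxDepth
        reference: max_depth
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training
                image: <your-registry>/rf-train:v1
                command:
                  - python
                  - /app/train.py
                  - --n_estimators=${trialParameters.nEstimators}
                  - --max_depth=${trialParameters.maxDepth}
            restartPolicy: Never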

Step 4: Run the Experiment
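Apply the manifest and watch the experiment's status (filename and namespace are placeholders):

kubectl apply -f random-forest-tuning.yaml
kubectl get experiment random-forest-tuning -n <your-namespace> -w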

Step 5: View Results

Access Katib UI through Kubeflow Dashboard:

  1. Open Kubeflow Dashboard

  2. Go to Katib → Experiments

  3. Click on random-forest-tuning

  4. View trials, metrics, and best parameters

Or use CLI:
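For example (namespace is a placeholder):

# List trials and their status
kubectl get trials -n <your-namespace>

# The best parameters found so far are reported under status.currentOptimalTrial
kubectl get experiment random-forest-tuning -n <your-namespace> -o yaml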

Optimization Algorithms

Katib supports several search strategies:

1. Random Search

Randomly samples from the parameter space.

When to use: Quick exploration, many hyperparameters, limited budget

2. Grid Search

Tests all combinations systematically.

When to use: Few hyperparameters, exhaustive search needed

Warning: the number of trials is the product of the number of values per parameter, so it gets expensive fast; three parameters with five values each is already 125 trials!

3. Bayesian Optimization

Uses previous trials to inform next choices (smart exploration).

When to use: Expensive training, <20 hyperparameters, want efficiency

4. Hyperband

Adaptive algorithm that allocates more resources to promising trials.

When to use: Many hyperparameters, early stopping possible

5. TPE (Tree-structured Parzen Estimator)

Similar to Bayesian but often faster.

When to use: Alternative to Bayesian, good default choice
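Whichever strategy you choose, switching is a one-line change in the Experiment spec; the names below are the identifiers Katib registers for these algorithms:

algorithm:
  algorithmName: bayesianoptimization  # or: random, grid, hyperband, tpe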

Parameter Types

Continuous Parameters
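Defined with parameterType: double; note that the bounds are quoted strings in the manifest (range illustrative):

- name: learning_rate
  parameterType: double
  feasibleSpace:
    min: "0.0001"
    max: "0.1"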

Discrete Integer Parameters
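Defined with parameterType: int; the optional step is used by grid search (values illustrative):

- name: n_estimators
  parameterType: int
  feasibleSpace:
    min: "50"
    max: "300"
    step: "50"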

Categorical Parameters
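Defined with parameterType: categorical and an explicit list of choices (values illustrative):

- name: optimizer
  parameterType: categorical
  feasibleSpace:
    list:
      - sgd
      - adam
      - rmsprop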

Practical Example: Deep Learning with PyTorch

Training Script
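A condensed sketch (the model, dataset, and flag names are illustrative); the Katib-specific detail is again the name=value metric lines, printed once per epoch:

# train_pytorch.py
import argparse

import torch
import torch.nn as nn
from torchvision import datasets, transforms


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=5)
    args = parser.parse_args()

    train_set = datasets.MNIST(
        "/tmp/data", train=True, download=True, transform=transforms.ToTensor()
    )
    loader = torch.utils.data.DataLoader(
        train_set, batch_size=args.batch_size, shuffle=True
    )

    model = nn.Sequential(
        nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(args.epochs):
        correct, total, running_loss = 0, 0, 0.0
        for x, y in loader:
            optimizer.zero_grad()
            out = model(x)
            loss = loss_fn(out, y)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * y.size(0)
            correct += (out.argmax(dim=1) == y).sum().item()
            total += y.size(0)
        # Report once per epoch so Katib (and early stopping) can track progress
        print(f"loss={running_loss / total}")
        print(f"accuracy={correct / total}")


if __name__ == "__main__":
    main()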

Katib Experiment for PyTorch
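The manifest has the same shape as the random-forest one; only the spec fields that change are sketched here (the trialTemplate would point at your PyTorch image and pass --lr and --batch_size):

spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
    additionalMetricNames:
      - loss
  algorithm:
    algorithmName: tpe
  parallelTrialCount: 2
  maxTrialCount: 20
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "32"
        max: "256"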

Early Stopping

Stop unpromising trials early to save resources:
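Katib ships a median-stopping rule; enabling it is an addition to the Experiment spec (the setting below is illustrative):

earlyStopping:
  algorithmName: medianstop
  algorithmSettings:
    - name: min_trials_required
      value: "3"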

Your training script needs to report intermediate results:
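A sketch of the reporting loop; train_one_epoch and evaluate are hypothetical stand-ins for your own training and validation code:

# Assumes args comes from argparse, as in the scripts above
for epoch in range(args.epochs):
    train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    val_accuracy = evaluate(model, val_loader)       # hypothetical helper
    # Emit the objective after every epoch so medianstop can compare
    # this trial against the others mid-run
    print(f"accuracy={val_accuracy}")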

Integrating with Kubeflow Pipelines

Run Katib experiments as pipeline components:
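One approach is the kubeflow-katib Python SDK inside a pipeline step. A rough sketch; the method names below come from the v1beta1 SDK, so verify them against your installed version:

import yaml
from kubeflow.katib import KatibClient

client = KatibClient()

# Load the Experiment manifest from Step 3. Depending on the SDK version,
# create_experiment accepts a V1beta1Experiment object or an equivalent dict.
with open("random-forest-tuning.yaml") as f:
    experiment = yaml.safe_load(f)

client.create_experiment(experiment, namespace="<your-namespace>")

# Once the experiment finishes, read back the winning trial's parameters
best = client.get_optimal_hyperparameters(
    name="random-forest-tuning", namespace="<your-namespace>"
)
print(best)

Wrapped in a KFP component, a step like this can feed the winning parameters into downstream training or deployment steps.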

Best Practices

1. Start Simple

Don't overcomplicate initially: begin with random search and a small trial budget, as in the snippet below.
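For instance (budget values illustrative):

algorithm:
  algorithmName: random
parallelTrialCount: 2
maxTrialCount: 10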

2. Use Logarithmic Scale for Learning Rates

Learning rates matter on a multiplicative scale, so consider log-uniform sampling in your training code, as sketched below.
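One common trick, with an assumed --log_lr flag: let Katib search the exponent uniformly and convert inside the script:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--log_lr", type=float)  # e.g. a double in [-4, -1]
args = parser.parse_args()

# Uniform sampling over [-4, -1] becomes log-uniform over [1e-4, 1e-1]
lr = 10 ** args.log_lr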

3. Limit Search Space Initially

❌ Too broad:
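Six orders of magnitude, most of it wasted (values illustrative):

- name: learning_rate
  parameterType: double
  feasibleSpace:
    min: "0.000001"
    max: "1.0"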

✅ Reasonable range:
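Three orders of magnitude is usually plenty to start:

- name: learning_rate
  parameterType: double
  feasibleSpace:
    min: "0.0001"
    max: "0.1"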

4. Monitor Resource Usage
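Standard kubectl tooling works here (kubectl top requires metrics-server; the namespace is a placeholder):

kubectl top pods -n <your-namespace>
kubectl get trials -n <your-namespace>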

5. Use Early Stopping for Deep Learning

Saves significant compute time; the medianstop configuration from the Early Stopping section above is a good default.

Common Issues

Trials Keep Failing

Debug:
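Start from the trial and work down to the pod logs (names and namespace are placeholders):

kubectl get trials -n <your-namespace>
kubectl describe trial <trial-name> -n <your-namespace>
kubectl get pods -n <your-namespace> | grep <trial-name>
kubectl logs <trial-pod-name> -n <your-namespace>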

Common causes:

  • Import errors: Missing dependencies in container

  • OOM: Increase memory limits

  • Wrong hyperparameter ranges: Check logs for value errors

No Improvement After Many Trials

Possible reasons:

  • Search space doesn't include optimal values

  • Objective metric not being captured correctly

  • Model/architecture limitations (hyperparameters won't help)

Solution: review the search space, verify metric logging, and consider model changes.

Slow Experiment Progress

Speed it up:
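A few levers, roughly in order of impact:

  • Raise parallelTrialCount (if the cluster has capacity)

  • Enable early stopping so weak trials don't run to completion

  • Search on a data subset or fewer epochs, then retrain the winner fully

  • Prefer Bayesian/TPE over grid so fewer trials reach good values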

Key Takeaways

  1. Katib automates hyperparameter search at scale

  2. Start simple: random search with limited trials

  3. Use Bayesian/TPE for expensive training

  4. Early stopping saves compute for deep learning

  5. Integrate with pipelines for end-to-end automation

Next Steps

With optimized models, it's time to deploy them. In Model Serving with KServe, we'll learn how to serve models for production inference.

