Model Training with Katib

The Hyperparameter Problem

You've probably been here: training a model, manually tweaking n_estimators, trying different learning_rate values, and running the same code over and over with slight variations.

# The manual approach (we've all done this)
from sklearn.ensemble import GradientBoostingClassifier

# Note: RandomForestClassifier has no learning_rate parameter, so gradient
# boosting is used here; X_train, y_train, X_test, y_test are assumed to exist.
for lr in [0.001, 0.01, 0.1]:
    for n_est in [50, 100, 200]:
        model = GradientBoostingClassifier(learning_rate=lr, n_estimators=n_est)
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        print(f"lr={lr}, n_est={n_est}, score={score}")

Katib automates this entire process with sophisticated optimization algorithms, parallel execution, and automatic tracking.

What is Katib?

Katib is Kubeflow's hyperparameter tuning and neural architecture search component. It:

  • Runs multiple training jobs with different hyperparameters in parallel

  • Uses optimization algorithms (grid search, random search, Bayesian optimization)

  • Tracks all experiments automatically

  • Finds optimal hyperparameters based on your objective

The win: Instead of babysitting experiments, you define the search space and let Katib optimize.

Your First Katib Experiment

Step 1: Create a Training Script

First, create a training script that accepts hyperparameters as command-line arguments and prints its metrics to stdout, where Katib's metrics collector can read them:
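A minimal sketch (the script name, dataset, and flag names are illustrative; the important part is printing the objective metric as a name=value line):

# train.py
import argparse

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=10)
    args = parser.parse_args()

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = RandomForestClassifier(
        n_estimators=args.n_estimators,
        max_depth=args.max_depth,
    )
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)

    # Katib's default stdout metrics collector parses lines of the form name=value
    print(f"accuracy={accuracy}")


if __name__ == "__main__":
    main()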

Step 2: Containerize Your Training Code

Build and push:
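A minimal sketch of both pieces; the registry and image name are placeholders:

# Dockerfile (assumes train.py from Step 1 sits in the build context)
FROM python:3.11-slim
RUN pip install scikit-learn
COPY train.py /app/train.py
ENTRYPOINT ["python", "/app/train.py"]

Then:

docker build -t <your-registry>/rf-train:v1 .
docker push <your-registry>/rf-train:v1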

Step 3: Define Katib Experiment

Create a YAML file defining the hyperparameter search:
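A sketch of an Experiment manifest matching the script above (namespace and image are placeholders; objectiveMetricName must match the name the script prints):

# random-forest-tuning.yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-forest-tuning
  namespace: <your-namespace>
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: n_estimators
      parameterType: int
      feasibleSpace:
        min: "50"
        max: "300"
    - name: max_depth
      parameterType: int
      feasibleSpace:
        min: "3"
        max: "15"
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: nEstimators
        reference: n_estimators
      - name: maxDepth
        reference: max_depth
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training
                image: <your-registry>/rf-train:v1
                command:
                  - python
                  - /app/train.py
                  - --n_estimators=${trialParameters.nEstimators}
                  - --max_depth=${trialParameters.maxDepth}
            restartPolicy: Never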

Step 4: Run the Experiment
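Apply the manifest and watch the experiment's status (filename and namespace are placeholders):

kubectl apply -f random-forest-tuning.yaml
kubectl get experiment random-forest-tuning -n <your-namespace> -w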

Step 5: View Results

Access Katib UI through Kubeflow Dashboard:

  1. Open Kubeflow Dashboard

  2. Go to Katib → Experiments

  3. Click on random-forest-tuning

  4. View trials, metrics, and best parameters

Or use CLI:
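For example (namespace is a placeholder):

# List trials and their status
kubectl get trials -n <your-namespace>

# The best parameters found so far are reported under status.currentOptimalTrial
kubectl get experiment random-forest-tuning -n <your-namespace> -o yaml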

Optimization Algorithms

Katib supports several search strategies:

1. Random Search

Randomly samples from the parameter space.

When to use: Quick exploration, many hyperparameters, limited budget

2. Grid Search

Tests all combinations systematically.

When to use: Few hyperparameters, exhaustive search needed

Warning: the number of trials is the product of the number of values per parameter, so it gets expensive fast; three parameters with five values each is already 125 trials!

3. Bayesian Optimization

Uses previous trials to inform next choices (smart exploration).

When to use: Expensive training, <20 hyperparameters, want efficiency

4. Hyperband

Adaptive algorithm that allocates more resources to promising trials.

When to use: Many hyperparameters, early stopping possible

5. TPE (Tree-structured Parzen Estimator)

Similar to Bayesian but often faster.

When to use: Alternative to Bayesian, good default choice
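Whichever strategy you choose, switching is a one-line change in the Experiment spec; the names below are the identifiers Katib registers for these algorithms:

algorithm:
  algorithmName: bayesianoptimization  # or: random, grid, hyperband, tpe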

Parameter Types

Continuous Parameters
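Defined with parameterType: double; note that the bounds are quoted strings in the manifest (range illustrative):

- name: learning_rate
  parameterType: double
  feasibleSpace:
    min: "0.0001"
    max: "0.1"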

Discrete Integer Parameters
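Defined with parameterType: int; the optional step is used by grid search (values illustrative):

- name: n_estimators
  parameterType: int
  feasibleSpace:
    min: "50"
    max: "300"
    step: "50"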

Categorical Parameters
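Defined with parameterType: categorical and an explicit list of choices (values illustrative):

- name: optimizer
  parameterType: categorical
  feasibleSpace:
    list:
      - sgd
      - adam
      - rmsprop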

Practical Example: Deep Learning with PyTorch

Training Script
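A condensed sketch (the model, dataset, and flag names are illustrative); the Katib-specific detail is again the name=value metric lines, printed once per epoch:

# train_pytorch.py
import argparse

import torch
import torch.nn as nn
from torchvision import datasets, transforms


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=5)
    args = parser.parse_args()

    train_set = datasets.MNIST(
        "/tmp/data", train=True, download=True, transform=transforms.ToTensor()
    )
    loader = torch.utils.data.DataLoader(
        train_set, batch_size=args.batch_size, shuffle=True
    )

    model = nn.Sequential(
        nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(args.epochs):
        correct, total, running_loss = 0, 0, 0.0
        for x, y in loader:
            optimizer.zero_grad()
            out = model(x)
            loss = loss_fn(out, y)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * y.size(0)
            correct += (out.argmax(dim=1) == y).sum().item()
            total += y.size(0)
        # Report once per epoch so Katib (and early stopping) can track progress
        print(f"loss={running_loss / total}")
        print(f"accuracy={correct / total}")


if __name__ == "__main__":
    main()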

Katib Experiment for PyTorch
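The manifest has the same shape as the random-forest one; only the spec fields that change are sketched here (the trialTemplate would point at your PyTorch image and pass --lr and --batch_size):

spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
    additionalMetricNames:
      - loss
  algorithm:
    algorithmName: tpe
  parallelTrialCount: 2
  maxTrialCount: 20
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "32"
        max: "256"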

Early Stopping

Stop unpromising trials early to save resources:
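Katib ships a median-stopping rule; enabling it is an addition to the Experiment spec (the setting below is illustrative):

earlyStopping:
  algorithmName: medianstop
  algorithmSettings:
    - name: min_trials_required
      value: "3"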

Your training script needs to report intermediate results:
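A sketch of the reporting loop; train_one_epoch and evaluate are hypothetical stand-ins for your own training and validation code:

# Assumes args comes from argparse, as in the scripts above
for epoch in range(args.epochs):
    train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    val_accuracy = evaluate(model, val_loader)       # hypothetical helper
    # Emit the objective after every epoch so medianstop can compare
    # this trial against the others mid-run
    print(f"accuracy={val_accuracy}")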

Integrating with Kubeflow Pipelines

Run Katib experiments as pipeline components:
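One approach is the kubeflow-katib Python SDK inside a pipeline step. A rough sketch; the method names below come from the v1beta1 SDK, so verify them against your installed version:

import yaml
from kubeflow.katib import KatibClient

client = KatibClient()

# Load the Experiment manifest from Step 3. Depending on the SDK version,
# create_experiment accepts a V1beta1Experiment object or an equivalent dict.
with open("random-forest-tuning.yaml") as f:
    experiment = yaml.safe_load(f)

client.create_experiment(experiment, namespace="<your-namespace>")

# Once the experiment finishes, read back the winning trial's parameters
best = client.get_optimal_hyperparameters(
    name="random-forest-tuning", namespace="<your-namespace>"
)
print(best)

Wrapped in a KFP component, a step like this can feed the winning parameters into downstream training or deployment steps.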

Best Practices

1. Start Simple

Don't overcomplicate initially: begin with random search and a small trial budget, as in the snippet below.
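For instance (budget values illustrative):

algorithm:
  algorithmName: random
parallelTrialCount: 2
maxTrialCount: 10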

2. Use Logarithmic Scale for Learning Rates

Learning rates matter on a multiplicative scale, so consider log-uniform sampling in your training code, as sketched below.
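One common trick, with an assumed --log_lr flag: let Katib search the exponent uniformly and convert inside the script:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--log_lr", type=float)  # e.g. a double in [-4, -1]
args = parser.parse_args()

# Uniform sampling over [-4, -1] becomes log-uniform over [1e-4, 1e-1]
lr = 10 ** args.log_lr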

3. Limit Search Space Initially

❌ Too broad:
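Six orders of magnitude, most of it wasted (values illustrative):

- name: learning_rate
  parameterType: double
  feasibleSpace:
    min: "0.000001"
    max: "1.0"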

✅ Reasonable range:
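Three orders of magnitude is usually plenty to start:

- name: learning_rate
  parameterType: double
  feasibleSpace:
    min: "0.0001"
    max: "0.1"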

4. Monitor Resource Usage
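Standard kubectl tooling works here (kubectl top requires metrics-server; the namespace is a placeholder):

kubectl top pods -n <your-namespace>
kubectl get trials -n <your-namespace>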

5. Use Early Stopping for Deep Learning

Saves significant compute time; the medianstop configuration from the Early Stopping section above is a good default.

Common Issues

Trials Keep Failing

Debug:
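Start from the trial and work down to the pod logs (names and namespace are placeholders):

kubectl get trials -n <your-namespace>
kubectl describe trial <trial-name> -n <your-namespace>
kubectl get pods -n <your-namespace> | grep <trial-name>
kubectl logs <trial-pod-name> -n <your-namespace>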

Common causes:

  • Import errors: Missing dependencies in container

  • OOM: Increase memory limits

  • Wrong hyperparameter ranges: Check logs for value errors

No Improvement After Many Trials

Possible reasons:

  • Search space doesn't include optimal values

  • Objective metric not being captured correctly

  • Model/architecture limitations (hyperparameters won't help)

Solution: review the search space, verify metric logging, and consider model changes.

Slow Experiment Progress

Speed it up:
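A few levers, roughly in order of impact:

  • Raise parallelTrialCount (if the cluster has capacity)

  • Enable early stopping so weak trials don't run to completion

  • Search on a data subset or fewer epochs, then retrain the winner fully

  • Prefer Bayesian/TPE over grid so fewer trials reach good values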

Key Takeaways

  1. Katib automates hyperparameter search at scale

  2. Start simple: random search with limited trials

  3. Use Bayesian/TPE for expensive training

  4. Early stopping saves compute for deep learning

  5. Integrate with pipelines for end-to-end automation

Next Steps

With optimized models, it's time to deploy them. In Model Serving with KServe, we'll learn how to serve models for production inference.

