Kubeflow Overview & Setup

What is Kubeflow?

Kubeflow is an open-source platform that makes deploying machine learning workflows on Kubernetes simple, portable, and scalable. Think of it as a collection of tools that work together to handle the entire ML lifecycle, from experimentation to production.

The key insight: Instead of building your own MLOps infrastructure from scratch, Kubeflow provides production-ready components that integrate seamlessly.

Kubeflow Architecture

Kubeflow is modular. You can use all components or just the ones you need. Here's what's available:

Core Components

1. Kubeflow Notebooks

  • What: Managed Jupyter notebook environments

  • Why: Standardized, reproducible development environments with access to cluster resources

  • When to Use: Interactive model development, experimentation, prototyping

2. Kubeflow Pipelines

  • What: Platform for building and deploying ML workflows

  • Why: Reproducible, scalable pipeline orchestration with versioning and tracking

  • When to Use: Training pipelines, batch inference, automated workflows

3. Katib

  • What: Hyperparameter tuning and neural architecture search

  • Why: Automated optimization of model parameters

  • When to Use: Finding optimal hyperparameters, AutoML experiments

4. KServe (formerly KFServing)

  • What: Model serving platform for production inference

  • Why: Standardized serving with auto-scaling, canary deployments, and multi-framework support

  • When to Use: Deploying models for real-time or batch inference

5. Model Registry

  • What: Central repository for ML models

  • Why: Version control, metadata tracking, and model lineage

  • When to Use: Managing multiple model versions, tracking experiments

6. Kubeflow Trainer

  • What: Distributed training operators for various frameworks

  • Why: Simplifies distributed training on Kubernetes

  • When to Use: Training large models that don't fit on single GPU/node

7. Kubeflow Dashboard

  • What: Central UI for managing all Kubeflow components

  • Why: Simplified access and management

  • When to Use: Monitoring, managing notebooks, pipelines, and experiments

How Components Work Together

A typical workflow chains the components: develop interactively in Notebooks, automate training with Pipelines, tune hyperparameters with Katib, track versions in the Model Registry, and serve the result with KServe, all managed from the central Dashboard.

Local Setup Options

For learning and development, you have several options to run Kubeflow locally.

Option 1: Minikube

Minikube runs a single-node Kubernetes cluster on your machine.

Pros:

  • Easy to start and stop

  • Good for development and testing

  • Lower resource requirements

Cons:

  • Single node limits some features

  • Not suitable for production-like testing

Setup:
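A typical start command, assuming Minikube is already installed. The CPU, memory, and Kubernetes version values are illustrative; tune them to your machine, but Kubeflow generally needs at least 4 CPUs and 8 GB of RAM:

```shell
# Kubeflow is resource-hungry; give the cluster plenty of headroom
minikube start \
  --cpus=4 \
  --memory=8192 \
  --disk-size=40g

# Confirm the node came up and is Ready
kubectl get nodes
```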

Option 2: Kind (Kubernetes in Docker)

Kind runs Kubernetes clusters in Docker containers.

Pros:

  • Faster than Minikube

  • Can simulate multi-node clusters

  • Better for CI/CD testing

Cons:

  • Requires Docker

  • More complex networking

Setup:
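A minimal sketch, assuming Kind is installed. A config file is what lets you simulate a multi-node cluster; the cluster name `kubeflow` is an example:

```shell
# Define a two-node cluster: one control plane, one worker
cat <<EOF > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
EOF

kind create cluster --name kubeflow --config kind-config.yaml

# Verify kubectl is pointed at the new cluster
kubectl cluster-info --context kind-kubeflow
```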

Option 3: Docker Desktop Kubernetes

If you already have Docker Desktop, you can enable its built-in Kubernetes.

Pros:

  • Already have it if using Docker Desktop

  • Simple one-click enable

Cons:

  • Limited configuration

  • Resource sharing with Docker

Setup:

  1. Open Docker Desktop

  2. Settings → Kubernetes → Enable Kubernetes

  3. Wait for cluster to start
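Once the cluster starts, it's worth confirming that kubectl is pointed at it:

```shell
# Docker Desktop registers its cluster under this context name
kubectl config use-context docker-desktop
kubectl get nodes
```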

Installing Kubeflow

I'll show you two approaches: full installation and minimal installation.

Full Installation (All Components)

This installs all Kubeflow components.
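The standard approach uses the official kubeflow/manifests repository. The release tag below is an example; check the repository for the current release. The retry loop exists because some CRDs must register before resources that depend on them can be applied:

```shell
# Clone the official manifests and pin a release tag (example: v1.9.1)
git clone https://github.com/kubeflow/manifests.git
cd manifests
git checkout v1.9.1

# Apply everything, retrying until CRDs have registered
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying to apply resources..."
  sleep 20
done
```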

Check Installation:
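All pods in the kubeflow namespace should eventually reach Running; on a local cluster this can take ten minutes or more:

```shell
# Snapshot of pod status
kubectl get pods -n kubeflow

# Or watch until everything settles (Ctrl+C to stop)
kubectl get pods -n kubeflow -w
```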

Minimal Installation (Specific Components Only)

If you only need specific components:

Kubeflow Pipelines Only:
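Kubeflow Pipelines ships a standalone install. The version below is an example; check the pipelines repository for current releases:

```shell
# Pin a released version (example value)
export PIPELINE_VERSION=2.2.0

# Cluster-scoped resources (CRDs) first, then wait for registration
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io

# Then the platform-agnostic deployment itself
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION"
```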

Kubeflow Notebooks Only:
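Notebooks can be installed from a checkout of kubeflow/manifests. The component paths below reflect recent releases but do move between versions, so verify them against the repository you checked out:

```shell
# From the root of a kubeflow/manifests checkout:
# the web UI for creating notebooks...
kustomize build apps/jupyter/jupyter-web-app/upstream/overlays/istio | kubectl apply -f -
# ...and the controller that manages notebook pods
kustomize build apps/jupyter/notebook-controller/upstream/overlays/kubeflow | kubectl apply -f -
```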

Verifying Your Installation

Check All Components
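A full manifests install spreads pods across several namespaces, so check each one:

```shell
kubectl get pods -n kubeflow       # core components
kubectl get pods -n istio-system   # ingress gateway
kubectl get pods -n cert-manager   # certificate management
kubectl get pods -n auth           # Dex, if you used the full manifests install
```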

Access the Dashboard
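The dashboard sits behind the Istio ingress gateway; the simplest local access is a port-forward:

```shell
# Forward the ingress gateway to http://localhost:8080
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
```

Leave this running in its own terminal and open http://localhost:8080 in your browser.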

First Login: With the standard kubeflow/manifests install, the default account is user@example.com with password 12341234 (defined in the bundled Dex configuration).

Important: These are default credentials. In production, you'd configure proper authentication.

Initial Configuration

Create a Namespace

Kubeflow uses namespaces (called "profiles") to isolate user workspaces.
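A Profile resource creates the namespace plus the RBAC that ties it to one user. The profile name and email below are examples; substitute your own:

```shell
# Create a profile (and its namespace) for a single user
kubectl apply -f - <<EOF
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: my-workspace
spec:
  owner:
    kind: User
    name: user@example.com
EOF

# The controller creates a namespace with the same name
kubectl get profile my-workspace
```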

Configure Python 3.12 Notebook Image

Create a custom notebook image with Python 3.12:
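A sketch of such a Dockerfile, assuming a Jupyter docker-stacks base tag for Python 3.12 is available in your registry of choice (adjust the tag if not). Kubeflow's notebook controller expects the server to listen on port 8888 and to honor the URL prefix it injects via the NB_PREFIX environment variable:

```dockerfile
FROM jupyter/base-notebook:python-3.12

# System packages need root; drop back to the notebook user afterwards
USER root
RUN apt-get update && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*
USER ${NB_UID}

# Common ML libraries; pin versions in real use for reproducibility
RUN pip install --no-cache-dir kfp pandas scikit-learn

# Kubeflow passes the base URL via NB_PREFIX and probes port 8888
ENV NB_PREFIX=/
CMD ["sh", "-c", "jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --NotebookApp.base_url=${NB_PREFIX} --NotebookApp.token='' --NotebookApp.allow_origin='*'"]
```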

Build and push:
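The registry and tag below are placeholders; substitute a registry your cluster can pull from:

```shell
# Build from the Dockerfile in the current directory and push
docker build -t myregistry.example.com/kubeflow-notebook:py312 .
docker push myregistry.example.com/kubeflow-notebook:py312
```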

Create Your First Notebook

  1. Open Kubeflow Dashboard (http://localhost:8080)

  2. Go to "Notebooks" in the sidebar

  3. Click "New Notebook"

  4. Configure:

    • Name: my-first-notebook

    • Image: Custom image or jupyter/scipy-notebook:python-3.12

    • CPU: 1

    • Memory: 2Gi

    • Workspace Volume: 10Gi

  5. Click "Launch"

Wait for the notebook to start, then click "Connect".

Test Installation with Sample Code

In your new notebook, test that everything works:
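A minimal smoke test using only the standard library: it confirms the kernel runs the interpreter you expect and that the workspace volume (mounted at the home directory) is writable:

```python
import os
import platform
import sys

# Confirm the kernel runs the interpreter you expect
print(f"Python {sys.version.split()[0]} on {platform.system()}")
assert sys.version_info >= (3, 8), "unexpectedly old interpreter"

# Confirm the workspace volume is writable
probe = os.path.join(os.path.expanduser("~"), ".kubeflow-probe")
with open(probe, "w") as f:
    f.write("ok")
with open(probe) as f:
    writable = f.read() == "ok"
os.remove(probe)
print("Workspace writable:", writable)
```

If both checks pass, the notebook image and its persistent volume are working.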

Understanding Kubeflow Resource Management

Notebook Resources

When creating notebooks, understand resource allocation:
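The CPU and memory values you enter in the UI become Kubernetes requests and limits on the notebook pod. Illustratively, the container spec ends up looking like this (values are examples):

```yaml
# Requests are guaranteed; limits are hard caps
resources:
  requests:
    cpu: "1"         # scheduler reserves this much
    memory: 2Gi
  limits:
    cpu: "2"         # CPU beyond this is throttled
    memory: 4Gi      # exceeding this gets the container OOM-killed
```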

Best Practices:

  • Start small (0.5 CPU, 1Gi RAM)

  • Increase based on actual needs

  • Monitor resource usage in dashboard

  • Delete notebooks when not in use

Storage Considerations

Kubeflow creates PersistentVolumeClaims for notebooks:
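You can inspect them directly; the namespace name below is an example profile namespace:

```shell
# One PVC per notebook workspace volume
kubectl get pvc -n my-workspace

# Check capacity and the backing storage class of a specific claim
kubectl describe pvc <pvc-name> -n my-workspace
```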

Important: Data in the workspace volume persists, but:

  • Be mindful of storage costs

  • Back up important work to Git

  • Use object storage (S3, GCS) for large datasets

Common Setup Issues and Solutions

Issue 1: Pods Stuck in Pending

Symptom: kubectl get pods -n kubeflow shows pods in Pending state

Debug:
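The pod's events almost always name the reason:

```shell
# Events for one pod appear at the end of describe output
kubectl describe pod <pod-name> -n kubeflow | tail -n 20

# Or all recent events in the namespace, oldest first
kubectl get events -n kubeflow --sort-by=.metadata.creationTimestamp
```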

Common Causes:

  • Insufficient resources: Increase minikube memory/CPU

  • Missing storage class: Check kubectl get sc

  • Image pull errors: Check image accessibility

Solution:
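For a Minikube cluster, the usual fix is more headroom; the values below are examples:

```shell
# Restart the cluster with more resources
minikube stop
minikube start --cpus=6 --memory=12288

# If PVCs are stuck unbound, confirm a default storage class exists
kubectl get sc
```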

Issue 2: Port Forward Disconnects

Symptom: kubectl port-forward command exits

Solution:
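A simple restart loop keeps the tunnel alive across disconnects:

```shell
# Re-establish the port-forward whenever it exits
while true; do
  kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
  echo "port-forward exited; restarting in 2s"
  sleep 2
done
```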

Or use a tool like kubefwd:
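kubefwd forwards every service in a namespace at once and writes matching /etc/hosts entries, which is why it needs root:

```shell
# Forward all services in the istio-system namespace
sudo kubefwd svc -n istio-system
```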

Issue 3: Cannot Access Dashboard

Symptom: Browser shows "connection refused"

Debug:
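Work outward from the gateway: is the service there, are its pods healthy, and is anything else already using the local port?

```shell
# Gateway service and pods
kubectl get svc istio-ingressgateway -n istio-system
kubectl get pods -n istio-system

# Anything already listening on 8080 locally?
lsof -i :8080
```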

Solution:
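Re-establish the tunnel, picking a different local port if 8080 is taken:

```shell
# Forward to local port 8081 instead
kubectl port-forward svc/istio-ingressgateway -n istio-system 8081:80
```

Then browse to http://localhost:8081.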

Issue 4: Out of Disk Space

Symptom: Pods fail with disk pressure errors

Solution:
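For a local Minikube cluster, first see what is full, then reclaim space; note that deleting PVCs destroys the data on them:

```shell
# See what is consuming the node's disk
minikube ssh -- df -h

# Remove unused images and build cache from your host Docker
docker system prune -a

# Delete workspace PVCs you no longer need (data is lost!)
kubectl get pvc --all-namespaces
kubectl delete pvc <pvc-name> -n <namespace>
```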

Production vs Development Setup

Development (What We Just Set Up)

  • Single node cluster

  • Default authentication

  • Limited resources

  • Local storage

Production (What You'd Need)

  • Multi-node Kubernetes cluster (EKS, GKE, AKS)

  • Enterprise authentication (OAuth, LDAP)

  • Dedicated resources per component

  • Cloud storage integration (S3, GCS, Azure Blob)

  • Monitoring and logging (Prometheus, Grafana)

  • TLS/SSL certificates

  • Network policies and security

  • Backup and disaster recovery

Reality Check: Start with development setup. Move to production only when you have:

  1. Multiple users

  2. Production models to serve

  3. Compliance/security requirements

  4. Budget for cloud resources

Next Steps

Now that Kubeflow is running, you can:

  1. Explore the Dashboard: Familiarize yourself with the UI

  2. Create Notebooks: Start with Kubeflow Notebooks

  3. Build Pipelines: Learn Kubeflow Pipelines

  4. Experiment: The best way to learn is by doing

Key Takeaways

  1. Kubeflow is modular: Install only what you need

  2. Start local: Minikube or Kind for learning

  3. Resource management matters: Even locally, give sufficient CPU/RAM

  4. Namespaces isolate work: Use profiles for organization

  5. Persistence is automatic: Notebook data survives pod restarts

In the next article, we'll dive deep into Kubeflow Notebooks and learn how to set up productive development environments with Python 3.12.

