Kubeflow Overview & Setup

What is Kubeflow?

Kubeflow is an open-source platform that makes deploying machine learning workflows on Kubernetes simple, portable, and scalable. Think of it as a collection of tools that work together to handle the entire ML lifecycle, from experimentation to production.

The key insight: Instead of building your own MLOps infrastructure from scratch, Kubeflow provides production-ready components that integrate seamlessly.

Kubeflow Architecture

Kubeflow is modular. You can use all components or just the ones you need. Here's what's available:

Core Components

1. Kubeflow Notebooks

  • What: Managed Jupyter notebook environments

  • Why: Standardized, reproducible development environments with access to cluster resources

  • When to Use: Interactive model development, experimentation, prototyping

2. Kubeflow Pipelines

  • What: Platform for building and deploying ML workflows

  • Why: Reproducible, scalable pipeline orchestration with versioning and tracking

  • When to Use: Training pipelines, batch inference, automated workflows

3. Katib

  • What: Hyperparameter tuning and neural architecture search

  • Why: Automated optimization of model parameters

  • When to Use: Finding optimal hyperparameters, AutoML experiments

4. KServe (formerly KFServing)

  • What: Model serving platform for production inference

  • Why: Standardized serving with auto-scaling, canary deployments, and multi-framework support

  • When to Use: Deploying models for real-time or batch inference

5. Model Registry

  • What: Central repository for ML models

  • Why: Version control, metadata tracking, and model lineage

  • When to Use: Managing multiple model versions, tracking experiments

6. Kubeflow Trainer

  • What: Distributed training operators for various frameworks

  • Why: Simplifies distributed training on Kubernetes

  • When to Use: Training large models that don't fit on single GPU/node

7. Kubeflow Dashboard

  • What: Central UI for managing all Kubeflow components

  • Why: Simplified access and management

  • When to Use: Monitoring, managing notebooks, pipelines, and experiments

How Components Work Together

A typical workflow chains the components: develop interactively in Notebooks, automate training with Pipelines, tune hyperparameters with Katib, track versions in the Model Registry, and serve the result with KServe, all managed from the central Dashboard.

Local Setup Options

For learning and development, you have several options to run Kubeflow locally.

Option 1: Minikube

Minikube runs a single-node Kubernetes cluster on your machine.

Pros:

  • Easy to start and stop

  • Good for development and testing

  • Lower resource requirements

Cons:

  • Single node limits some features

  • Not suitable for production-like testing

Setup:
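A typical start command, assuming Minikube is already installed. The CPU, memory, and Kubernetes version values are illustrative; tune them to your machine, but Kubeflow generally needs at least 4 CPUs and 8 GB of RAM:

```shell
# Kubeflow is resource-hungry; give the cluster plenty of headroom
minikube start \
  --cpus=4 \
  --memory=8192 \
  --disk-size=40g

# Confirm the node came up and is Ready
kubectl get nodes
```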

Option 2: Kind (Kubernetes in Docker)

Kind runs Kubernetes clusters in Docker containers.

Pros:

  • Faster than Minikube

  • Can simulate multi-node clusters

  • Better for CI/CD testing

Cons:

  • Requires Docker

  • More complex networking

Setup:
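A minimal sketch, assuming Kind is installed. A config file is what lets you simulate a multi-node cluster; the cluster name `kubeflow` is an example:

```shell
# Define a two-node cluster: one control plane, one worker
cat <<EOF > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
EOF

kind create cluster --name kubeflow --config kind-config.yaml

# Verify kubectl is pointed at the new cluster
kubectl cluster-info --context kind-kubeflow
```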

Option 3: Docker Desktop Kubernetes

If you already have Docker Desktop, you can enable its built-in Kubernetes.

Pros:

  • Already have it if using Docker Desktop

  • Simple one-click enable

Cons:

  • Limited configuration

  • Resource sharing with Docker

Setup:

  1. Open Docker Desktop

  2. Settings → Kubernetes → Enable Kubernetes

  3. Wait for cluster to start
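Once the cluster starts, it's worth confirming that kubectl is pointed at it:

```shell
# Docker Desktop registers its cluster under this context name
kubectl config use-context docker-desktop
kubectl get nodes
```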

Installing Kubeflow

I'll show you two approaches: full installation and minimal installation.

Full Installation (All Components)

This installs all Kubeflow components.
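The standard approach uses the official kubeflow/manifests repository. The release tag below is an example; check the repository for the current release. The retry loop exists because some CRDs must register before resources that depend on them can be applied:

```shell
# Clone the official manifests and pin a release tag (example: v1.9.1)
git clone https://github.com/kubeflow/manifests.git
cd manifests
git checkout v1.9.1

# Apply everything, retrying until CRDs have registered
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying to apply resources..."
  sleep 20
done
```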

Check Installation:
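All pods in the kubeflow namespace should eventually reach Running; on a local cluster this can take ten minutes or more:

```shell
# Snapshot of pod status
kubectl get pods -n kubeflow

# Or watch until everything settles (Ctrl+C to stop)
kubectl get pods -n kubeflow -w
```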

Minimal Installation (Specific Components Only)

If you only need specific components:

Kubeflow Pipelines Only:
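Kubeflow Pipelines ships a standalone install. The version below is an example; check the pipelines repository for current releases:

```shell
# Pin a released version (example value)
export PIPELINE_VERSION=2.2.0

# Cluster-scoped resources (CRDs) first, then wait for registration
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io

# Then the platform-agnostic deployment itself
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION"
```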

Kubeflow Notebooks Only:
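Notebooks can be installed from a checkout of kubeflow/manifests. The component paths below reflect recent releases but do move between versions, so verify them against the repository you checked out:

```shell
# From the root of a kubeflow/manifests checkout:
# the web UI for creating notebooks...
kustomize build apps/jupyter/jupyter-web-app/upstream/overlays/istio | kubectl apply -f -
# ...and the controller that manages notebook pods
kustomize build apps/jupyter/notebook-controller/upstream/overlays/kubeflow | kubectl apply -f -
```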

Verifying Your Installation

Check All Components
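A full manifests install spreads pods across several namespaces, so check each one:

```shell
kubectl get pods -n kubeflow       # core components
kubectl get pods -n istio-system   # ingress gateway
kubectl get pods -n cert-manager   # certificate management
kubectl get pods -n auth           # Dex, if you used the full manifests install
```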

Access the Dashboard
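The dashboard sits behind the Istio ingress gateway; the simplest local access is a port-forward:

```shell
# Forward the ingress gateway to http://localhost:8080
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
```

Leave this running in its own terminal and open http://localhost:8080 in your browser.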

First Login: With the standard kubeflow/manifests install, the default account is user@example.com with password 12341234 (defined in the bundled Dex configuration).

Important: These are default credentials. In production, you'd configure proper authentication.

Initial Configuration

Create a Namespace

Kubeflow uses namespaces (called "profiles") to isolate user workspaces.
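A Profile resource creates the namespace plus the RBAC that ties it to one user. The profile name and email below are examples; substitute your own:

```shell
# Create a profile (and its namespace) for a single user
kubectl apply -f - <<EOF
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: my-workspace
spec:
  owner:
    kind: User
    name: user@example.com
EOF

# The controller creates a namespace with the same name
kubectl get profile my-workspace
```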

Configure Python 3.12 Notebook Image

Create a custom notebook image with Python 3.12:
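A sketch of such a Dockerfile, assuming a Jupyter docker-stacks base tag for Python 3.12 is available in your registry of choice (adjust the tag if not). Kubeflow's notebook controller expects the server to listen on port 8888 and to honor the URL prefix it injects via the NB_PREFIX environment variable:

```dockerfile
FROM jupyter/base-notebook:python-3.12

# System packages need root; drop back to the notebook user afterwards
USER root
RUN apt-get update && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*
USER ${NB_UID}

# Common ML libraries; pin versions in real use for reproducibility
RUN pip install --no-cache-dir kfp pandas scikit-learn

# Kubeflow passes the base URL via NB_PREFIX and probes port 8888
ENV NB_PREFIX=/
CMD ["sh", "-c", "jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --NotebookApp.base_url=${NB_PREFIX} --NotebookApp.token='' --NotebookApp.allow_origin='*'"]
```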

Build and push:
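The registry and tag below are placeholders; substitute a registry your cluster can pull from:

```shell
# Build from the Dockerfile in the current directory and push
docker build -t myregistry.example.com/kubeflow-notebook:py312 .
docker push myregistry.example.com/kubeflow-notebook:py312
```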

Create Your First Notebook

  1. Open Kubeflow Dashboard (http://localhost:8080)

  2. Go to "Notebooks" in the sidebar

  3. Click "New Notebook"

  4. Configure:

    • Name: my-first-notebook

    • Image: Custom image or jupyter/scipy-notebook:python-3.12

    • CPU: 1

    • Memory: 2Gi

    • Workspace Volume: 10Gi

  5. Click "Launch"

Wait for the notebook to start, then click "Connect".

Test Installation with Sample Code

In your new notebook, test that everything works:
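A minimal smoke test using only the standard library: it confirms the kernel runs the interpreter you expect and that the workspace volume (mounted at the home directory) is writable:

```python
import os
import platform
import sys

# Confirm the kernel runs the interpreter you expect
print(f"Python {sys.version.split()[0]} on {platform.system()}")
assert sys.version_info >= (3, 8), "unexpectedly old interpreter"

# Confirm the workspace volume is writable
probe = os.path.join(os.path.expanduser("~"), ".kubeflow-probe")
with open(probe, "w") as f:
    f.write("ok")
with open(probe) as f:
    writable = f.read() == "ok"
os.remove(probe)
print("Workspace writable:", writable)
```

If both checks pass, the notebook image and its persistent volume are working.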

Understanding Kubeflow Resource Management

Notebook Resources

When creating notebooks, understand resource allocation:
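The CPU and memory values you enter in the UI become Kubernetes requests and limits on the notebook pod. Illustratively, the container spec ends up looking like this (values are examples):

```yaml
# Requests are guaranteed; limits are hard caps
resources:
  requests:
    cpu: "1"         # scheduler reserves this much
    memory: 2Gi
  limits:
    cpu: "2"         # CPU beyond this is throttled
    memory: 4Gi      # exceeding this gets the container OOM-killed
```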

Best Practices:

  • Start small (0.5 CPU, 1Gi RAM)

  • Increase based on actual needs

  • Monitor resource usage in dashboard

  • Delete notebooks when not in use

Storage Considerations

Kubeflow creates PersistentVolumeClaims for notebooks:
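You can inspect them directly; the namespace name below is an example profile namespace:

```shell
# One PVC per notebook workspace volume
kubectl get pvc -n my-workspace

# Check capacity and the backing storage class of a specific claim
kubectl describe pvc <pvc-name> -n my-workspace
```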

Important: Data in the workspace volume persists, but:

  • Be mindful of storage costs

  • Back up important work to Git

  • Use object storage (S3, GCS) for large datasets

Common Setup Issues and Solutions

Issue 1: Pods Stuck in Pending

Symptom: kubectl get pods -n kubeflow shows pods in Pending state

Debug:
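The pod's events almost always name the reason:

```shell
# Events for one pod appear at the end of describe output
kubectl describe pod <pod-name> -n kubeflow | tail -n 20

# Or all recent events in the namespace, oldest first
kubectl get events -n kubeflow --sort-by=.metadata.creationTimestamp
```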

Common Causes:

  • Insufficient resources: Increase minikube memory/CPU

  • Missing storage class: Check kubectl get sc

  • Image pull errors: Check image accessibility

Solution:
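For a Minikube cluster, the usual fix is more headroom; the values below are examples:

```shell
# Restart the cluster with more resources
minikube stop
minikube start --cpus=6 --memory=12288

# If PVCs are stuck unbound, confirm a default storage class exists
kubectl get sc
```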

Issue 2: Port Forward Disconnects

Symptom: kubectl port-forward command exits

Solution:
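A simple restart loop keeps the tunnel alive across disconnects:

```shell
# Re-establish the port-forward whenever it exits
while true; do
  kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
  echo "port-forward exited; restarting in 2s"
  sleep 2
done
```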

Or use a tool like kubefwd:
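kubefwd forwards every service in a namespace at once and writes matching /etc/hosts entries, which is why it needs root:

```shell
# Forward all services in the istio-system namespace
sudo kubefwd svc -n istio-system
```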

Issue 3: Cannot Access Dashboard

Symptom: Browser shows "connection refused"

Debug:
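Work outward from the gateway: is the service there, are its pods healthy, and is anything else already using the local port?

```shell
# Gateway service and pods
kubectl get svc istio-ingressgateway -n istio-system
kubectl get pods -n istio-system

# Anything already listening on 8080 locally?
lsof -i :8080
```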

Solution:
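Re-establish the tunnel, picking a different local port if 8080 is taken:

```shell
# Forward to local port 8081 instead
kubectl port-forward svc/istio-ingressgateway -n istio-system 8081:80
```

Then browse to http://localhost:8081.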

Issue 4: Out of Disk Space

Symptom: Pods fail with disk pressure errors

Solution:
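For a local Minikube cluster, first see what is full, then reclaim space; note that deleting PVCs destroys the data on them:

```shell
# See what is consuming the node's disk
minikube ssh -- df -h

# Remove unused images and build cache from your host Docker
docker system prune -a

# Delete workspace PVCs you no longer need (data is lost!)
kubectl get pvc --all-namespaces
kubectl delete pvc <pvc-name> -n <namespace>
```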

Production vs Development Setup

Development (What We Just Set Up)

  • Single node cluster

  • Default authentication

  • Limited resources

  • Local storage

Production (What You'd Need)

  • Multi-node Kubernetes cluster (EKS, GKE, AKS)

  • Enterprise authentication (OAuth, LDAP)

  • Dedicated resources per component

  • Cloud storage integration (S3, GCS, Azure Blob)

  • Monitoring and logging (Prometheus, Grafana)

  • TLS/SSL certificates

  • Network policies and security

  • Backup and disaster recovery

Reality Check: Start with development setup. Move to production only when you have:

  1. Multiple users

  2. Production models to serve

  3. Compliance/security requirements

  4. Budget for cloud resources

Next Steps

Now that Kubeflow is running, you can:

  1. Explore the Dashboard: Familiarize yourself with the UI

  2. Create Notebooks: Start with Kubeflow Notebooks

  3. Build Pipelines: Learn Kubeflow Pipelines

  4. Experiment: The best way to learn is by doing

Key Takeaways

  1. Kubeflow is modular: Install only what you need

  2. Start local: Minikube or Kind for learning

  3. Resource management matters: Even locally, give sufficient CPU/RAM

  4. Namespaces isolate work: Use profiles for organization

  5. Persistence is automatic: Notebook data survives pod restarts

In the next article, we'll dive deep into Kubeflow Notebooks and learn how to set up productive development environments with Python 3.12.

