Kubeflow Notebooks

Why Kubeflow Notebooks?

If you've worked with Jupyter notebooks, you know the frustration:

  • "Works on my laptop" but fails elsewhere

  • Sharing notebooks means sharing environment setup instructions

  • Accessing GPUs requires complex configuration

  • Collaboration means emailing .ipynb files back and forth

Kubeflow Notebooks solves these problems by providing:

  • Standardized environments: Everyone uses the same base images

  • Resource management: Request CPUs, GPUs, and memory through the UI

  • Persistence: Your work survives pod restarts

  • Access control: Team members get their own isolated workspaces

  • Version control integration: Easy Git integration

Creating Your First Notebook Server

Via the UI

  1. Open the Kubeflow Dashboard (http://localhost:8080)

  2. Navigate to Notebooks in the left sidebar

  3. Select your namespace (e.g., ml-workspace)

  4. Click New Notebook
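
If the dashboard isn't already exposed at that address, a typical way to reach it locally (assuming a default manifests install with Istio) is:

```bash
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
```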

Configuration Options

Basic Settings

Naming Convention I Use:

  • dev-<username>-<project>: Development notebooks

  • exp-<experiment-name>: Experiment notebooks

  • shared-<team>: Shared team notebooks

Image Selection

Kubeflow provides several base images:
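
The exact catalog varies by release, but recent versions ship images along these lines (from the kubeflownotebookswg registry):

  • jupyter-scipy: Python scientific stack (NumPy, pandas, scikit-learn)

  • jupyter-pytorch-full / jupyter-pytorch-cuda-full: PyTorch, CPU or GPU

  • jupyter-tensorflow-full / jupyter-tensorflow-cuda-full: TensorFlow, CPU or GPU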

My Recommendation: Start with scipy-notebook and add what you need.

Resource Requests

Guidelines:

  • Development: 1 CPU, 2-4Gi RAM

  • Training small models: 2 CPU, 8Gi RAM

  • Training with GPU: 1-2 GPU, 16Gi+ RAM

Workspace Volume

This is where your notebooks, data, and models live.

  • 10Gi: Good for code and small datasets

  • 50Gi: Medium datasets and multiple projects

  • 100Gi+: Large experiments (but consider object storage instead)

Data Volumes (Optional)

Add additional volumes for:

  • Shared datasets (mount read-only across notebooks)

  • Shared models

  • Persistent data storage

Via YAML (Advanced)

For reproducible setups, define notebooks in YAML:
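
A minimal sketch of a Notebook resource (the name, namespace, and image tag are illustrative; check the fields against your Kubeflow version, and note the workspace PVC must already exist):

```yaml
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: dev-alice-churn
  namespace: ml-workspace
spec:
  template:
    spec:
      containers:
        - name: dev-alice-churn
          image: kubeflownotebookswg/jupyter-scipy:v1.8.0
          resources:
            requests:
              cpu: "1"
              memory: 4Gi
          volumeMounts:
            - name: workspace
              mountPath: /home/jovyan
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: dev-alice-churn-workspace
```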

Apply it:
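
Assuming the manifest above was saved as notebook.yaml:

```bash
kubectl apply -f notebook.yaml
kubectl get notebooks -n ml-workspace   # confirm the server comes up
```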

Setting Up Python 3.12 Environment

Once your notebook is running, click Connect to access JupyterLab.

Install Additional Packages

Create a new terminal in JupyterLab and install packages:
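
For example (package versions here are only illustrative):

```bash
pip install --user pandas==2.2.2 scikit-learn==1.5.0 matplotlib==3.9.0
```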

Important: Use the --user flag so packages install into your home directory (which is persisted).

Create a Conda Environment (Better Approach)

For complex dependency management:
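
A sketch, assuming conda is available in your notebook image:

```bash
conda create -n py312 python=3.12 -y
conda activate py312            # some images may need `source activate py312`
pip install ipykernel
python -m ipykernel install --user --name py312 --display-name "Python 3.12"
```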

Now you can select this kernel in your notebooks.

Verify Installation

Create a new notebook and verify:
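
Something like:

```python
import sys
print(sys.version)   # expect a 3.12.x build if you selected the py312 kernel

import pandas as pd
import sklearn
print(pd.__version__, sklearn.__version__)
```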

Working with Data in Notebooks

Option 1: Upload Small Files

Use JupyterLab's file browser:

  1. Click upload icon

  2. Select files

  3. Files go to /home/jovyan/

Good for: Small datasets (<100MB), configuration files, scripts

Option 2: Download from URLs
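
For data reachable over HTTP, download straight from the notebook (the URL below is hypothetical):

```python
import urllib.request
from pathlib import Path

dest = Path("/home/jovyan/data")
dest.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(
    "https://example.com/dataset.csv",      # hypothetical URL
    str(dest / "dataset.csv"),
)
```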

Option 3: Mount Object Storage

For larger datasets, use object storage (S3, GCS, Azure Blob).

S3 Example:
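
A sketch with boto3 (the bucket and key are hypothetical; hard-coded credentials are shown only to be improved upon below):

```python
import boto3
import pandas as pd

s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY",       # placeholder
    aws_secret_access_key="YOUR_SECRET_KEY",   # placeholder
)
s3.download_file("my-ml-datasets", "raw/train.csv", "train.csv")
df = pd.read_csv("train.csv")
```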

Better: Use environment variables for credentials

Add to notebook server configuration:
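
Roughly like this in the Notebook spec, assuming the credentials live in a Kubernetes Secret named s3-credentials:

```yaml
spec:
  template:
    spec:
      containers:
        - name: notebook
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: s3-credentials
                  key: access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: s3-credentials
                  key: secret-key
```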

Then in notebook:
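
```python
import boto3

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the
# environment, so no credentials appear in the notebook itself
s3 = boto3.client("s3")
s3.download_file("my-ml-datasets", "raw/train.csv", "train.csv")
```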

Option 4: Shared Data Volumes

For datasets shared across notebooks:

  1. Create a PVC with data (see the example manifest below)

  2. Mount it in notebooks (via UI or YAML)

  3. Access it at /data/
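
A minimal PVC manifest for step 1 (ReadWriteMany requires a storage class that supports it, such as NFS):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-datasets
  namespace: ml-workspace
spec:
  accessModes:
    - ReadWriteMany   # so multiple notebooks can mount it simultaneously
  resources:
    requests:
      storage: 50Gi
```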

One-time setup to populate:
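
One way to copy data in is a throwaway pod that mounts the PVC (pod name and paths are illustrative):

```bash
kubectl run data-loader -n ml-workspace --image=busybox --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"data-loader","image":"busybox","command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}],"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"shared-datasets"}}]}}'
kubectl cp ./datasets ml-workspace/data-loader:/data/
kubectl delete pod data-loader -n ml-workspace
```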

Version Control Integration

Git Configuration

In a notebook terminal:
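
```bash
git config --global user.name "Your Name"
git config --global user.email "you@example.com"
```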

GitHub Authentication

Option 1: Personal Access Token
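
Clone over HTTPS and paste the token when prompted for a password (the repo URL is a placeholder):

```bash
git clone https://github.com/<org>/<repo>.git
# Cache credentials for an hour so you aren't re-prompted on every push
git config --global credential.helper 'cache --timeout=3600'
```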

Option 2: SSH Keys
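
```bash
ssh-keygen -t ed25519 -C "you@example.com"
cat ~/.ssh/id_ed25519.pub    # add this key under GitHub Settings > SSH keys
ssh -T git@github.com        # verify the connection works
```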

Workflow
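
A typical loop looks like this (branch and file names are illustrative):

```bash
git checkout -b exp/churn-baseline
# ...work in JupyterLab...
git add notebooks/01_data_exploration.ipynb
git commit -m "Explore churn dataset"
git push -u origin exp/churn-baseline
```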

Pro Tip: Clear notebook outputs before committing:
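
```bash
jupyter nbconvert --clear-output --inplace notebooks/*.ipynb
```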

Or use nbstripout:
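
```bash
pip install --user nbstripout
nbstripout --install    # registers a Git filter that strips outputs on commit
```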

Development Workflow

Typical Session

  1. Start with exploration

  2. Feature engineering

  3. Model training

  4. Save model

  5. Document and commit (a condensed sketch of the full flow follows)
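
Condensed into code, the session looks roughly like this (dataset, columns, and paths are illustrative):

```python
import pandas as pd
import joblib
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("data/train.csv")                 # 1. explore
df["tenure_years"] = df["tenure_months"] / 12      # 2. engineer features
X, y = df.drop(columns=["churn"]), df["churn"]
model = RandomForestClassifier(random_state=42)    # 3. train
model.fit(X, y)
joblib.dump(model, "models/churn_rf.joblib")       # 4. save
# 5. document in markdown cells, clear outputs, commit to Git
```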

Organizing Notebooks

My structure:
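
Something along these lines (adapt to taste):

```
project/
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_training.ipynb
├── src/
│   ├── __init__.py
│   └── preprocessing.py
├── data/            # gitignored
├── models/          # gitignored
└── requirements.txt
```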

Moving from Notebooks to Scripts

When a notebook becomes stable:
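
  1. Extract the reusable cells into functions

  2. Move those functions into a module under src/

  3. Keep the notebook as a thin driver that imports the module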

Example conversion:

Notebook:
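
The in-notebook version, with everything inline (column names are illustrative):

```python
import pandas as pd

df = pd.read_csv("data/raw.csv")
df = df.dropna(subset=["customer_id"])
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["tenure_days"] = (pd.Timestamp.now() - df["signup_date"]).dt.days
```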

Script (src/preprocessing.py):
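
The same logic as a reusable, testable function:

```python
"""Preprocessing helpers extracted from the exploration notebook."""
import pandas as pd


def load_and_clean(path: str) -> pd.DataFrame:
    """Load raw data, drop rows without a customer ID, derive tenure."""
    df = pd.read_csv(path)
    df = df.dropna(subset=["customer_id"])
    df["signup_date"] = pd.to_datetime(df["signup_date"])
    df["tenure_days"] = (pd.Timestamp.now() - df["signup_date"]).dt.days
    return df
```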

Now import in notebooks:
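
```python
from src.preprocessing import load_and_clean

df = load_and_clean("data/raw.csv")
```

If the import fails, make sure the project root is on sys.path, or install the package in editable mode with pip install -e .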

Best Practices

1. Name Notebooks Descriptively

❌ Untitled1.ipynb, test.ipynb
✅ 01_data_exploration.ipynb, 02_feature_engineering.ipynb

2. Use Markdown Cells

Document your thinking:
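
For example, a markdown cell above a cleaning step might read (content is illustrative):

```markdown
## Why drop rows with a missing customer_id?

These rows can't be joined back to the label table, so we drop them
rather than impute. Revisit this choice if the fraction becomes large.
```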

3. Keep Notebooks Focused

One notebook = One purpose

  • Exploration notebook

  • Feature engineering notebook

  • Model training notebook

  • Evaluation notebook

Don't create 1000-line notebooks that do everything.

4. Clean Outputs Before Committing

Large outputs bloat Git history:
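
The same tools from the Version Control section apply; a one-off clean before committing:

```bash
jupyter nbconvert --clear-output --inplace notebooks/*.ipynb
```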

5. Pin Package Versions
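
A pinned requirements.txt keeps environments reproducible (version numbers below are only examples):

```
pandas==2.2.2
scikit-learn==1.5.0
numpy==1.26.4
```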

6. Use Functions

Even in notebooks:
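
For example (a hypothetical helper):

```python
from sklearn.ensemble import RandomForestClassifier


def train_model(X, y, n_estimators: int = 100):
    """Train and return a random forest on the given features and labels."""
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    return model.fit(X, y)
```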

Easier to test and reuse.

Troubleshooting

Notebook Won't Start

Check pod status:
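
Notebook servers run as StatefulSets, so the pod is usually named <notebook-name>-0:

```bash
kubectl get pods -n ml-workspace
kubectl describe pod <notebook-name>-0 -n ml-workspace   # check the Events section
kubectl logs <notebook-name>-0 -n ml-workspace
```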

Common issues:

  • Insufficient resources: Reduce CPU/RAM request

  • Image pull error: Check image name and registry access

  • Storage issue: Check PVC exists and has capacity

Kernel Keeps Dying

Cause: Usually the kernel is killed when it exceeds the pod's memory limit (out-of-memory).

Solution: Recreate the notebook server with a larger memory request, or reduce your working set.

Workaround: Process data in chunks
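
For example, aggregating a large CSV without loading it whole (file and columns are illustrative):

```python
import pandas as pd

# Stream the file in 100k-row chunks and combine partial aggregates
totals = None
for chunk in pd.read_csv("data/large.csv", chunksize=100_000):
    agg = chunk.groupby("category")["amount"].sum()
    totals = agg if totals is None else totals.add(agg, fill_value=0)
print(totals)
```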

Lost Work

Prevention:

  • Save frequently (Ctrl+S)

  • Commit to Git regularly

  • Workspace volume persists, but back up important work

Recovery: JupyterLab has autosave. Look for checkpoints:
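
```bash
ls .ipynb_checkpoints/    # lives in the same directory as the notebook
```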

Key Takeaways

  1. Kubeflow Notebooks provide standardized, reproducible environments

  2. Use version control from day one

  3. Organize code into modules as projects mature

  4. Request appropriate resourcesβ€”start small, scale up

  5. Leverage shared data volumes for team collaboration

Next Steps

Now that you can develop interactively, let's productionize workflows with Kubeflow Pipelines, where we'll convert notebook code into reproducible, scalable pipelines.


Resources:
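
  • Kubeflow Notebooks documentation: https://www.kubeflow.org/docs/components/notebooks/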
