Kubeflow Notebooks

Why Kubeflow Notebooks?

If you've worked with Jupyter notebooks, you know the frustration:

  • "Works on my laptop" but fails elsewhere

  • Sharing notebooks means sharing environment setup instructions

  • Accessing GPUs requires complex configuration

  • Collaboration means emailing .ipynb files back and forth

Kubeflow Notebooks solves these problems by providing:

  • Standardized environments: Everyone uses the same base images

  • Resource management: Request CPUs, GPUs, and memory through the UI

  • Persistence: Your work survives pod restarts

  • Access control: Team members get their own isolated workspaces

  • Version control integration: Easy Git integration

Creating Your First Notebook Server

Via the UI

  1. Open the Kubeflow Dashboard (http://localhost:8080)

  2. Navigate to Notebooks in the left sidebar

  3. Select your namespace (e.g., ml-workspace)

  4. Click New Notebook
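
If the dashboard isn't already exposed at that address, a typical way to reach it locally (assuming a default manifests install with Istio) is:

```bash
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
```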

Configuration Options

Basic Settings

Naming Convention I Use:

  • dev-<username>-<project>: Development notebooks

  • exp-<experiment-name>: Experiment notebooks

  • shared-<team>: Shared team notebooks

Image Selection

Kubeflow provides several base images:
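
The exact catalog varies by release, but recent versions ship images along these lines (from the kubeflownotebookswg registry):

  • jupyter-scipy: Python scientific stack (NumPy, pandas, scikit-learn)

  • jupyter-pytorch-full / jupyter-pytorch-cuda-full: PyTorch, CPU or GPU

  • jupyter-tensorflow-full / jupyter-tensorflow-cuda-full: TensorFlow, CPU or GPU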

My Recommendation: Start with scipy-notebook and add what you need.

Resource Requests

Guidelines:

  • Development: 1 CPU, 2-4Gi RAM

  • Training small models: 2 CPU, 8Gi RAM

  • Training with GPU: 1-2 GPU, 16Gi+ RAM

Workspace Volume

This is where your notebooks, data, and models live.

  • 10Gi: Good for code and small datasets

  • 50Gi: Medium datasets and multiple projects

  • 100Gi+: Large experiments (but consider object storage instead)

Data Volumes (Optional)

Add additional volumes for:

  • Shared datasets (mount read-only across notebooks)

  • Shared models

  • Persistent data storage

Via YAML (Advanced)

For reproducible setups, define notebooks in YAML:
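
A minimal sketch of a Notebook resource (the name, namespace, and image tag are illustrative; check the fields against your Kubeflow version, and note the workspace PVC must already exist):

```yaml
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: dev-alice-churn
  namespace: ml-workspace
spec:
  template:
    spec:
      containers:
        - name: dev-alice-churn
          image: kubeflownotebookswg/jupyter-scipy:v1.8.0
          resources:
            requests:
              cpu: "1"
              memory: 4Gi
          volumeMounts:
            - name: workspace
              mountPath: /home/jovyan
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: dev-alice-churn-workspace
```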

Apply it:
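
Assuming the manifest above was saved as notebook.yaml:

```bash
kubectl apply -f notebook.yaml
kubectl get notebooks -n ml-workspace   # confirm the server comes up
```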

Setting Up Python 3.12 Environment

Once your notebook is running, click Connect to access JupyterLab.

Install Additional Packages

Create a new terminal in JupyterLab and install packages:
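
For example (package versions here are only illustrative):

```bash
pip install --user pandas==2.2.2 scikit-learn==1.5.0 matplotlib==3.9.0
```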

Important: Use the --user flag so packages install into your home directory (which is persisted).

Create a Conda Environment (Better Approach)

For complex dependency management:
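
A sketch, assuming conda is available in your notebook image:

```bash
conda create -n py312 python=3.12 -y
conda activate py312            # some images may need `source activate py312`
pip install ipykernel
python -m ipykernel install --user --name py312 --display-name "Python 3.12"
```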

Now you can select this kernel in your notebooks.

Verify Installation

Create a new notebook and verify:
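
Something like:

```python
import sys
print(sys.version)   # expect a 3.12.x build if you selected the py312 kernel

import pandas as pd
import sklearn
print(pd.__version__, sklearn.__version__)
```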

Working with Data in Notebooks

Option 1: Upload Small Files

Use JupyterLab's file browser:

  1. Click upload icon

  2. Select files

  3. Files go to /home/jovyan/

Good for: Small datasets (<100MB), configuration files, scripts

Option 2: Download from URLs
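
For data reachable over HTTP, download straight from the notebook (the URL below is hypothetical):

```python
import urllib.request
from pathlib import Path

dest = Path("/home/jovyan/data")
dest.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(
    "https://example.com/dataset.csv",      # hypothetical URL
    str(dest / "dataset.csv"),
)
```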

Option 3: Mount Object Storage

For larger datasets, use object storage (S3, GCS, Azure Blob).

S3 Example:
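
A sketch with boto3 (the bucket and key are hypothetical; hard-coded credentials are shown only to be improved upon below):

```python
import boto3
import pandas as pd

s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY",       # placeholder
    aws_secret_access_key="YOUR_SECRET_KEY",   # placeholder
)
s3.download_file("my-ml-datasets", "raw/train.csv", "train.csv")
df = pd.read_csv("train.csv")
```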

Better: Use environment variables for credentials

Add to notebook server configuration:
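
Roughly like this in the Notebook spec, assuming the credentials live in a Kubernetes Secret named s3-credentials:

```yaml
spec:
  template:
    spec:
      containers:
        - name: notebook
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: s3-credentials
                  key: access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: s3-credentials
                  key: secret-key
```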

Then in notebook:
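
```python
import boto3

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the
# environment, so no credentials appear in the notebook itself
s3 = boto3.client("s3")
s3.download_file("my-ml-datasets", "raw/train.csv", "train.csv")
```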

Option 4: Shared Data Volumes

For datasets shared across notebooks:

  1. Create a PVC with data (see the example manifest below)

  2. Mount it in notebooks (via UI or YAML)

  3. Access it at /data/
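
A minimal PVC manifest for step 1 (ReadWriteMany requires a storage class that supports it, such as NFS):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-datasets
  namespace: ml-workspace
spec:
  accessModes:
    - ReadWriteMany   # so multiple notebooks can mount it simultaneously
  resources:
    requests:
      storage: 50Gi
```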

One-time setup to populate:
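
One way to copy data in is a throwaway pod that mounts the PVC (pod name and paths are illustrative):

```bash
kubectl run data-loader -n ml-workspace --image=busybox --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"data-loader","image":"busybox","command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}],"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"shared-datasets"}}]}}'
kubectl cp ./datasets ml-workspace/data-loader:/data/
kubectl delete pod data-loader -n ml-workspace
```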

Version Control Integration

Git Configuration

In a notebook terminal:
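
```bash
git config --global user.name "Your Name"
git config --global user.email "you@example.com"
```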

GitHub Authentication

Option 1: Personal Access Token
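
Clone over HTTPS and paste the token when prompted for a password (the repo URL is a placeholder):

```bash
git clone https://github.com/<org>/<repo>.git
# Cache credentials for an hour so you aren't re-prompted on every push
git config --global credential.helper 'cache --timeout=3600'
```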

Option 2: SSH Keys
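
```bash
ssh-keygen -t ed25519 -C "you@example.com"
cat ~/.ssh/id_ed25519.pub    # add this key under GitHub Settings > SSH keys
ssh -T git@github.com        # verify the connection works
```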

Workflow
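
A typical loop looks like this (branch and file names are illustrative):

```bash
git checkout -b exp/churn-baseline
# ...work in JupyterLab...
git add notebooks/01_data_exploration.ipynb
git commit -m "Explore churn dataset"
git push -u origin exp/churn-baseline
```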

Pro Tip: Clear notebook outputs before committing:
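
```bash
jupyter nbconvert --clear-output --inplace notebooks/*.ipynb
```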

Or use nbstripout:
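
```bash
pip install --user nbstripout
nbstripout --install    # registers a Git filter that strips outputs on commit
```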

Development Workflow

Typical Session

  1. Start with exploration

  2. Feature engineering

  3. Model training

  4. Save model

  5. Document and commit (a condensed sketch of the full flow follows)
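
Condensed into code, the session looks roughly like this (dataset, columns, and paths are illustrative):

```python
import pandas as pd
import joblib
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("data/train.csv")                 # 1. explore
df["tenure_years"] = df["tenure_months"] / 12      # 2. engineer features
X, y = df.drop(columns=["churn"]), df["churn"]
model = RandomForestClassifier(random_state=42)    # 3. train
model.fit(X, y)
joblib.dump(model, "models/churn_rf.joblib")       # 4. save
# 5. document in markdown cells, clear outputs, commit to Git
```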

Organizing Notebooks

My structure:
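
Something along these lines (adapt to taste):

```
project/
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_training.ipynb
├── src/
│   ├── __init__.py
│   └── preprocessing.py
├── data/            # gitignored
├── models/          # gitignored
└── requirements.txt
```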

Moving from Notebooks to Scripts

When a notebook becomes stable:
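
  1. Extract the reusable cells into functions

  2. Move those functions into a module under src/

  3. Keep the notebook as a thin driver that imports the module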

Example conversion:

Notebook:
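
The in-notebook version, with everything inline (column names are illustrative):

```python
import pandas as pd

df = pd.read_csv("data/raw.csv")
df = df.dropna(subset=["customer_id"])
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["tenure_days"] = (pd.Timestamp.now() - df["signup_date"]).dt.days
```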

Script (src/preprocessing.py):
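
The same logic as a reusable, testable function:

```python
"""Preprocessing helpers extracted from the exploration notebook."""
import pandas as pd


def load_and_clean(path: str) -> pd.DataFrame:
    """Load raw data, drop rows without a customer ID, derive tenure."""
    df = pd.read_csv(path)
    df = df.dropna(subset=["customer_id"])
    df["signup_date"] = pd.to_datetime(df["signup_date"])
    df["tenure_days"] = (pd.Timestamp.now() - df["signup_date"]).dt.days
    return df
```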

Now import in notebooks:
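
```python
from src.preprocessing import load_and_clean

df = load_and_clean("data/raw.csv")
```

If the import fails, make sure the project root is on sys.path, or install the package in editable mode with pip install -e .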

Best Practices

1. Name Notebooks Descriptively

❌ Untitled1.ipynb, test.ipynb
✅ 01_data_exploration.ipynb, 02_feature_engineering.ipynb

2. Use Markdown Cells

Document your thinking:
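
For example, a markdown cell above a cleaning step might read (content is illustrative):

```markdown
## Why drop rows with a missing customer_id?

These rows can't be joined back to the label table, so we drop them
rather than impute. Revisit this choice if the fraction becomes large.
```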

3. Keep Notebooks Focused

One notebook = One purpose

  • Exploration notebook

  • Feature engineering notebook

  • Model training notebook

  • Evaluation notebook

Don't create 1000-line notebooks that do everything.

4. Clean Outputs Before Committing

Large outputs bloat Git history:
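
The same tools from the Version Control section apply; a one-off clean before committing:

```bash
jupyter nbconvert --clear-output --inplace notebooks/*.ipynb
```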

5. Pin Package Versions
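
A pinned requirements.txt keeps environments reproducible (version numbers below are only examples):

```
pandas==2.2.2
scikit-learn==1.5.0
numpy==1.26.4
```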

6. Use Functions

Even in notebooks:
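
For example (a hypothetical helper):

```python
from sklearn.ensemble import RandomForestClassifier


def train_model(X, y, n_estimators: int = 100):
    """Train and return a random forest on the given features and labels."""
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    return model.fit(X, y)
```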

Easier to test and reuse.

Troubleshooting

Notebook Won't Start

Check pod status:
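
Notebook servers run as StatefulSets, so the pod is usually named <notebook-name>-0:

```bash
kubectl get pods -n ml-workspace
kubectl describe pod <notebook-name>-0 -n ml-workspace   # check the Events section
kubectl logs <notebook-name>-0 -n ml-workspace
```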

Common issues:

  • Insufficient resources: Reduce CPU/RAM request

  • Image pull error: Check image name and registry access

  • Storage issue: Check PVC exists and has capacity

Kernel Keeps Dying

Cause: Usually the kernel is killed when it exceeds the pod's memory limit (out-of-memory).

Solution: Recreate the notebook server with a larger memory request, or reduce your working set.

Workaround: Process data in chunks
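
For example, aggregating a large CSV without loading it whole (file and columns are illustrative):

```python
import pandas as pd

# Stream the file in 100k-row chunks and combine partial aggregates
totals = None
for chunk in pd.read_csv("data/large.csv", chunksize=100_000):
    agg = chunk.groupby("category")["amount"].sum()
    totals = agg if totals is None else totals.add(agg, fill_value=0)
print(totals)
```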

Lost Work

Prevention:

  • Save frequently (Ctrl+S)

  • Commit to Git regularly

  • Workspace volume persists, but back up important work

Recovery: JupyterLab has autosave. Look for checkpoints:
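
```bash
ls .ipynb_checkpoints/    # lives in the same directory as the notebook
```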

Key Takeaways

  1. Kubeflow Notebooks provide standardized, reproducible environments

  2. Use version control from day one

  3. Organize code into modules as projects mature

  4. Request appropriate resourcesβ€”start small, scale up

  5. Leverage shared data volumes for team collaboration

Next Steps

Now that you can develop interactively, let's productionize workflows with Kubeflow Pipelines, where we'll convert notebook code into reproducible, scalable pipelines.


Resources:
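
  • Kubeflow Notebooks documentation: https://www.kubeflow.org/docs/components/notebooks/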
