Monitoring & Observability

Why Monitoring Matters

Deployed a model? Congratulations! Now the real work begins.

Unlike traditional software, ML models can fail silently:

  • Code runs fine, but predictions are garbage

  • Input data distribution changes (drift)

  • Model accuracy degrades over time

  • Edge cases appear that weren't in training data

You won't know unless you monitor.

What to Monitor

1. Model Performance Metrics

Track the metrics you care about:

# metrics_collector.py (Python 3.12)
import functools
import time

from prometheus_client import Counter, Histogram, Gauge

# Request metrics
prediction_requests = Counter(
    'model_prediction_requests_total',
    'Total prediction requests',
    ['model_name', 'model_version']
)

prediction_errors = Counter(
    'model_prediction_errors_total',
    'Total prediction errors',
    ['model_name', 'error_type']
)

# Latency
prediction_latency = Histogram(
    'model_prediction_latency_seconds',
    'Prediction latency in seconds',
    ['model_name']
)

# Model metrics
model_accuracy = Gauge(
    'model_accuracy',
    'Current model accuracy',
    ['model_name', 'model_version']
)

def track_prediction(model_name: str, model_version: str):
    """Decorator to track predictions."""
    def decorator(func):
        @functools.wraps(func)  # preserve the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            prediction_requests.labels(
                model_name=model_name,
                model_version=model_version
            ).inc()
            
            start_time = time.time()
            try:
                result = func(*args, **kwargs)
                latency = time.time() - start_time
                prediction_latency.labels(model_name=model_name).observe(latency)
                return result
            except Exception as e:
                prediction_errors.labels(
                    model_name=model_name,
                    error_type=type(e).__name__
                ).inc()
                raise
        
        return wrapper
    return decorator

# Usage in inference service
@track_prediction('iris-classifier', 'v1.0.0')
def predict(instances):
    return model.predict(instances)

2. Data Drift Detection

Monitor whether the live input distribution has drifted away from the data the model was trained on:
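
A lightweight starting point is a two-sample statistical test per feature between a reference window (training data) and a window of recent requests. The sketch below uses SciPy's Kolmogorov-Smirnov test; the 0.05 significance threshold and the array layout (rows = samples, columns = features) are assumptions for illustration.

# drift_detector.py: per-feature drift check (sketch, not a full drift framework)
import numpy as np
from scipy import stats

def detect_drift(reference: np.ndarray, current: np.ndarray, alpha: float = 0.05) -> dict:
    """Compare each feature's recent distribution against the training reference.

    Both arrays are 2-D: rows are samples, columns are features.
    Returns {feature_index: {"p_value": ..., "drift": bool}}.
    """
    results = {}
    for i in range(reference.shape[1]):
        # Two-sample KS test: a small p-value means the two distributions differ.
        _statistic, p_value = stats.ks_2samp(reference[:, i], current[:, i])
        results[i] = {"p_value": float(p_value), "drift": p_value < alpha}
    return results

# Example usage:
# reference = np.load("training_features.npy")   # saved at training time
# current = np.vstack(recent_request_features)    # buffered from live traffic
# report = detect_drift(reference, current)

In practice you would run this check on a schedule and export the result as a Prometheus gauge (for example, one gauge per feature) so drift shows up on the same dashboards as your other metrics.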

3. Prediction Distribution

Track the distribution of your model's outputs; a sudden shift in predicted classes or confidence scores is often the first visible symptom of an upstream problem:
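
With the Prometheus client already used above, a labelled counter per predicted class (plus a histogram of confidence scores for probabilistic models) is usually enough to spot shifts. The metric names, bucket edges, and class labels below are illustrative.

# prediction_distribution.py: sketch for tracking the output distribution
from prometheus_client import Counter, Histogram

predicted_class_total = Counter(
    'model_predicted_class_total',
    'Predictions per class',
    ['model_name', 'predicted_class']
)

prediction_confidence = Histogram(
    'model_prediction_confidence',
    'Distribution of prediction confidence scores',
    ['model_name'],
    buckets=[0.1, 0.25, 0.5, 0.75, 0.9, 0.99]
)

def record_prediction(model_name: str, predicted_class: str, confidence: float) -> None:
    """Record one prediction so output-distribution shifts are visible in Grafana."""
    predicted_class_total.labels(model_name=model_name, predicted_class=predicted_class).inc()
    prediction_confidence.labels(model_name=model_name).observe(confidence)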

Setting Up Prometheus & Grafana

Install Prometheus
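
For local experiments, one option is the official Docker image with a minimal scrape config. The job name and the target below are assumptions that match the metrics server shown in the next subsection; host.docker.internal works with Docker Desktop, so on Linux substitute the host's IP or run everything in one Compose network.

# prometheus.yml: minimal scrape config (sketch)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'iris-classifier'
    static_configs:
      - targets: ['host.docker.internal:8000']

Run it locally with:

docker run -p 9090:9090 -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

For a Kubernetes cluster, the prometheus-community Helm charts (for example kube-prometheus-stack) install Prometheus and Grafana together.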

Expose Metrics from Your Model
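
Prometheus pulls metrics over HTTP, so your service needs to expose a /metrics endpoint. The simplest approach with prometheus_client is start_http_server; the port 8000 is an assumption and must match the scrape config above.

# serve_metrics.py: expose the metrics defined in metrics_collector.py (sketch)
import time
from prometheus_client import start_http_server

if __name__ == "__main__":
    # Serves every registered metric at http://localhost:8000/metrics
    start_http_server(8000)
    while True:
        time.sleep(60)  # placeholder loop; in practice the metrics server runs inside your inference service

If your inference service is already an ASGI app (FastAPI, Starlette), prometheus_client also provides make_asgi_app(), which you can mount at /metrics instead of running a separate server.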

Create Grafana Dashboards

Access Grafana:
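
If you are running locally, one option is the official Grafana Docker image; the port and credentials below are the image defaults, not project-specific settings.

docker run -d -p 3000:3000 grafana/grafana

Open http://localhost:3000, log in with the default admin / admin credentials, and add your Prometheus instance (for example http://localhost:9090) as a data source before building dashboards.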

Dashboard panels:

  • Prediction Rate: rate(model_prediction_requests_total[5m])

  • Error Rate: rate(model_prediction_errors_total[5m])

  • P95 Latency: histogram_quantile(0.95, rate(model_prediction_latency_seconds_bucket[5m]))

  • Model Accuracy: model_accuracy

Logging Best Practices

Structured Logging
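
Emit one JSON object per log line so your log aggregator can filter by model, version, or latency without regex parsing. Here is a minimal stdlib-only sketch; the field names (model_name, latency_ms, prediction) are illustrative, and libraries such as structlog offer the same idea with less boilerplate.

# structured_logging.py: stdlib JSON logging (sketch)
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    CONTEXT_FIELDS = ("model_name", "model_version", "latency_ms", "prediction")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Pick up structured context passed via the `extra=` argument.
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("inference")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Pass context as structured fields instead of formatting it into the message string.
logger.info("prediction served", extra={"model_name": "iris-classifier",
                                        "latency_ms": 12.4, "prediction": "setosa"})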

Alerting

Define Alert Rules
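
Prometheus evaluates alerting rules continuously and hands firing alerts to Alertmanager. The rules below assume the metric names from the collector above; the thresholds and durations are illustrative and should be tuned to your traffic.

# alerts.yml: example alerting rules (sketch)
groups:
  - name: model-alerts
    rules:
      - alert: HighPredictionErrorRate
        expr: |
          sum by (model_name) (rate(model_prediction_errors_total[5m]))
            / sum by (model_name) (rate(model_prediction_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.model_name }}"

      - alert: ModelAccuracyDegraded
        expr: model_accuracy < 0.90
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Accuracy of {{ $labels.model_name }} dropped below 0.90"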

Notification Channels

Route alerts to Slack, email, or PagerDuty through Alertmanager, for example:
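
The sketch below shows a Slack receiver; the webhook URL and channel name are placeholders you would replace with your own, and email or PagerDuty receivers follow the same pattern.

# alertmanager.yml: Slack routing (sketch with placeholder values)
route:
  receiver: slack-ml-alerts
  group_by: ['alertname', 'model_name']

receivers:
  - name: slack-ml-alerts
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook
        channel: '#ml-alerts'
        send_resolved: true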

Automated Retraining Triggers

Monitor live performance and trigger retraining automatically when it drops below an acceptable threshold:
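
A simple pattern is a scheduled job that reads the model_accuracy gauge from Prometheus' HTTP API and kicks off the training pipeline when it falls below a threshold. The Prometheus URL, the 0.90 threshold, and the trigger_retraining() hook below are illustrative assumptions; wire the hook to your own orchestrator (an Airflow DAG run, a Kubeflow pipeline, a CI job, and so on).

# retraining_trigger.py: threshold-based retraining trigger (sketch)
import requests

PROMETHEUS_URL = "http://localhost:9090"   # assumption: local Prometheus
ACCURACY_THRESHOLD = 0.90                  # illustrative threshold

def current_accuracy(model_name: str) -> float | None:
    """Read the latest model_accuracy value from Prometheus' query API."""
    response = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": f'model_accuracy{{model_name="{model_name}"}}'},
        timeout=10,
    )
    response.raise_for_status()
    result = response.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else None

def trigger_retraining(model_name: str) -> None:
    """Placeholder: call your orchestrator here (Airflow, Kubeflow, CI pipeline, ...)."""
    print(f"Retraining triggered for {model_name}")

if __name__ == "__main__":
    accuracy = current_accuracy("iris-classifier")
    if accuracy is not None and accuracy < ACCURACY_THRESHOLD:
        trigger_retraining("iris-classifier")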

Key Takeaways

  1. Monitor model performance, not just system metrics

  2. Detect data drift before it impacts predictions

  3. Use structured logging for debuggability

  4. Set up alerts for degraded performance

  5. Automate retraining when needed

Next Steps

With monitoring in place, let's automate the entire workflow. In CI/CD for ML, we'll build pipelines that test, validate, and deploy models automatically.

