Prometheus Best Practices: Lessons from Production

The Cardinality Explosion That Took Down Prometheus

I'll never forget the day our Prometheus server ran out of memory and crashed. We were monitoring a few thousand time series when suddenly, overnight, we had millions. Queries timed out. Dashboards wouldn't load. Alerts stopped firing. We were completely blind.

The culprit? One innocent-looking line of code:

requestCounter.inc({ user_id: req.user.id });

We had added user_id as a label. With 100,000 users and 20 endpoints, we created 2 million unique time series overnight. Prometheus couldn't handle it.

I learned an expensive lesson: label cardinality matters more than almost anything else in Prometheus.

This article is everything I wish I knew before running Prometheus in production.

The Cardinal Rule: Control Label Cardinality

What Is Cardinality?

Cardinality is the number of unique combinations of labels for a metric.

Example:

http_requests_total{method="GET", endpoint="/api/users", status_code="200"}

If you have:

  • 3 methods (GET, POST, DELETE)

  • 10 endpoints

  • 6 status codes (200, 201, 400, 401, 404, 500)

Cardinality = 3 × 10 × 6 = 180 time series

Manageable.

But if you add user_id:

  • 3 methods

  • 10 endpoints

  • 6 status codes

  • 100,000 users

Cardinality = 3 × 10 × 6 × 100,000 = 18,000,000 time series

Disaster.
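The arithmetic generalizes: total series count is the product of the value counts of every label. A quick sketch of that rule (illustrative, not tied to any real metric):

```javascript
// Cardinality is multiplicative across labels: one time series exists
// for every unique combination of label values.
function cardinality(...labelValueCounts) {
  return labelValueCounts.reduce((product, count) => product * count, 1);
}

console.log(cardinality(3, 10, 6));          // 180: manageable
console.log(cardinality(3, 10, 6, 100000));  // 18000000: disaster
```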

High-Cardinality Labels to AVOID

❌ Never use these as labels:

  • User IDs

  • Session IDs

  • Request IDs

  • Email addresses

  • IP addresses (unless aggregated)

  • Timestamps

  • UUIDs

  • File paths (full paths)

  • URLs (full URLs with query params)

Low-Cardinality Labels to USE

✅ Good labels:

  • HTTP method (GET, POST, etc.) - ~10 values

  • HTTP status code (200, 404, 500, etc.) - ~20 values

  • Endpoint/route (/api/users, /api/products) - 10-100 values

  • Environment (dev, staging, prod) - 2-5 values

  • Service name (api, worker, scheduler) - 5-20 values

  • Instance/host - 10-100 values

  • Database table name - 10-50 values

  • User role (admin, user, guest) - 3-10 values

  • Region (us-east, us-west, eu) - 3-10 values
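A common trick for keeping the route label low-cardinality is normalizing raw paths into templates before using them as label values, so /api/users/12345 and /api/users/67890 share one series. A sketch (the regexes here are illustrative assumptions, not a library API):

```javascript
// Collapse high-cardinality path segments (numeric IDs, UUIDs) into
// placeholders, and never let query strings into label values.
function normalizeRoute(path) {
  return path
    .split('?')[0]
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '/:uuid')
    .replace(/\/\d+/g, '/:id');
}

console.log(normalizeRoute('/api/users/12345/orders/678')); // "/api/users/:id/orders/:id"
```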

How to Handle High-Cardinality Data

Option 1: Don't track it at all

Often, you don't need per-user metrics. Aggregate data is usually sufficient.

Option 2: Use logs, not metrics

Metrics for aggregates, logs for specifics.

Option 3: Use exemplars (Prometheus 2.26+)

Exemplars let you link metrics to traces without increasing cardinality:
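In the OpenMetrics exposition format, an exemplar rides along with a sample as an annotation rather than a label, so it references a trace without creating new series. Roughly (values illustrative):

```text
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.25"} 1307 # {trace_id="3h29d8f2a1"} 0.178 1624823400.0
```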

Option 4: Aggregate before exposing

Instead of tracking each user, track user buckets:
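A sketch of the bucketing idea: map each user to one of a small, fixed set of label values, so cardinality stays bounded no matter how many users exist. The tier scheme here is an illustrative assumption:

```javascript
// Instead of user_id (unbounded), label by account tier (3 values total).
function userTierLabel(user) {
  if (user.isEnterprise) return 'enterprise';
  if (user.isPaid) return 'paid';
  return 'free';
}

// Usage at the instrumentation site:
// requestCounter.inc({ tier: userTierLabel(req.user) });
```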

Metric Naming Conventions

After years of running Prometheus, these conventions have saved me countless hours of debugging.

The Standard Format

A metric name should read as namespace_name_unit, with _total appended for counters: for example, http_request_duration_seconds or http_requests_total.

Examples:
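Some names that follow the convention (illustrative; process_resident_memory_bytes is a standard client-library metric):

```text
http_requests_total               # counter: snake_case, _total suffix
http_request_duration_seconds     # histogram: unit in the name, base unit
process_resident_memory_bytes     # gauge: bytes, not kilobytes
queue_jobs_in_flight              # gauge: current snapshot
```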

Rules I Always Follow

  1. Use snake_case, not camelCase

  2. Suffix counters with _total

  3. Include units in the name

  4. Use base units

    • Time: seconds (not milliseconds)

    • Data: bytes (not kilobytes)

    • Percentage: 0-1 ratio (not 0-100)

  5. Group related metrics with prefixes

My Naming Patterns

HTTP Metrics:

Database Metrics:

Business Metrics:
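As an illustration of what those groupings might look like (metric and label names are examples, not a standard):

```text
# HTTP
http_requests_total{method, route, status_code}
http_request_duration_seconds{method, route}

# Database
db_queries_total{operation, table}
db_query_duration_seconds{operation, table}
db_connections_active{pool}

# Business
orders_created_total{channel}
payments_processed_total{provider, status}
```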

Retention and Storage Management

Setting Appropriate Retention

Default retention is 15 days. I adjust based on needs:
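Retention is set with a startup flag (both flags below are real Prometheus flags; the values are illustrative):

```shell
# Time-based retention
prometheus --storage.tsdb.retention.time=30d

# Or cap by disk usage instead
prometheus --storage.tsdb.retention.size=50GB
```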

My retention strategy:

  • Production: 30 days (catch monthly trends)

  • Staging: 7 days (enough for debugging)

  • Development: 3 days (minimal needs)

Storage Sizing

Rough calculation: disk ≈ number_of_series × (1 / scrape_interval_seconds) × retention_seconds × bytes_per_sample.

Real example:

  • 10,000 time series

  • Scrape every 15s = 4 samples/minute = 0.067 samples/second

  • 30-day retention = 2,592,000 seconds

  • ~2 bytes per sample (compressed)

In practice, expect 3-6 GB for 10,000 series over 30 days.
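The arithmetic above can be sketched directly:

```javascript
// disk ≈ series × samples-per-second-per-series × retention-seconds × bytes-per-sample
const series = 10000;
const scrapeIntervalSeconds = 15;
const retentionSeconds = 30 * 24 * 3600; // 2,592,000
const bytesPerSample = 2;                // rough compressed size

const bytes = series * (1 / scrapeIntervalSeconds) * retentionSeconds * bytesPerSample;
console.log((bytes / 1e9).toFixed(2) + ' GB'); // about 3.46 GB before index/WAL overhead
```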

Monitoring Prometheus Itself

Always monitor your monitoring:

Alert when Prometheus struggles:
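A couple of hedged starting points. The metric names are real Prometheus self-metrics, but the thresholds are assumptions you should tune to your own series counts:

```yaml
groups:
  - name: prometheus-self
    rules:
      - alert: PrometheusTooManySeries
        expr: prometheus_tsdb_head_series > 1000000
        for: 15m
        labels:
          severity: warning
      - alert: PrometheusRuleEvaluationFailures
        expr: increase(prometheus_rule_evaluation_failures_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
```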

Scraping Best Practices

Scrape Intervals

Default: 15 seconds

This works for most cases. Adjust only when necessary:

Rule of thumb: Use the longest interval that meets your needs.
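In prometheus.yml, that looks like a global default plus per-job overrides (job names and targets are illustrative):

```yaml
global:
  scrape_interval: 15s   # good default for most targets

scrape_configs:
  - job_name: critical-api
    scrape_interval: 10s # faster where latency matters
    static_configs:
      - targets: ['api:9100']
  - job_name: batch-workers
    scrape_interval: 60s # slow-moving metrics don't need 15s
    static_configs:
      - targets: ['worker:9100']
```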

Scrape Timeout
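The timeout must be less than or equal to the scrape interval, or Prometheus will reject the configuration. For example:

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s  # must be <= scrape_interval
```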

Keep /metrics Fast

Your /metrics endpoint should respond in milliseconds:
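The pattern that keeps it fast: never do expensive work (database queries, filesystem walks) inside the handler; refresh a cached value in the background and let the handler only read it. A minimal sketch, assuming a hypothetical queue_depth gauge:

```javascript
// queueDepthCache is refreshed by a background job, not per scrape.
let queueDepthCache = 0;

function refreshQueueDepth(count) {
  // In production this would run on a timer, e.g. setInterval(..., 10000),
  // with `count` coming from the expensive query.
  queueDepthCache = count;
}

function metricsHandler() {
  // Fast: reads the cached value instead of querying the database per scrape.
  return `queue_depth ${queueDepthCache}\n`;
}
```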

Scrape Only What You Need

Don't expose every possible metric:
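One option is dropping noisy metric families at scrape time with metric_relabel_configs (the regex here is an illustrative example):

```yaml
scrape_configs:
  - job_name: app
    static_configs:
      - targets: ['app:9100']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_gc_duration_seconds.*'
        action: drop
```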

Recording Rules: When and How

Recording rules pre-compute expensive queries.

When to Use Recording Rules

✅ Use for:

  • Expensive aggregations used in dashboards

  • Queries used in multiple alerts

  • Complex histogram calculations

  • Frequently accessed data

❌ Don't use for:

  • Simple queries (overhead not worth it)

  • Rarely used metrics

  • Already fast queries

My Production Recording Rules
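For example, pre-computing a per-job request rate and a p95 latency (the source metric names are assumptions; the structure is standard rule-file YAML):

```yaml
groups:
  - name: http-aggregations
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_request_duration_seconds:p95_5m
        expr: >
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```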

Naming Convention for Recording Rules

Examples:
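The convention I follow is the documented level:metric:operations pattern, where level is the aggregation level (the remaining label set), metric is the source metric name, and operations are the transformations applied:

```text
job:http_requests:rate5m                   # sum by (job) of rate(...[5m])
instance_path:requests_total:rate5m       # aggregated by instance and path
job:http_request_duration_seconds:p95_5m  # 95th percentile over 5m, per job
```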

Label Best Practices

Keep Label Sets Consistent

All metrics in a family should have the same labels:
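If some observations of a metric carry a label and others omit it, queries that group by that label silently drop the unlabeled series. Illustrative:

```text
# Inconsistent: hard to aggregate by route
http_requests_total{method="GET", route="/api/users"} 42
http_requests_total{method="POST"} 7

# Consistent: use an explicit placeholder instead of omitting the label
http_requests_total{method="POST", route="unknown"} 7
```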

Use Meaningful Label Values

Prefer self-describing values: status="success" reads better in queries and dashboards than status="1".

Don't Include Label Names in Values

The label name already provides the context: use env="prod", not env="env_prod".

Performance Optimization

Limit Label Cardinality Per Metric

My thresholds:

  • < 100 series: Perfect

  • 100-1,000 series: Good

  • 1,000-10,000 series: Monitor closely

  • 10,000+ series: Needs optimization

Use rate() Correctly

Always use rate() for counters, never for gauges:
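rate() assumes counter semantics (monotonically increasing, reset-aware), so applying it to a gauge produces meaningless numbers. Illustrative:

```text
rate(http_requests_total[5m])    # good: counter
rate(memory_usage_bytes[5m])     # bad: gauge. Query the gauge directly,
                                 # or use deriv()/delta() for trends.
```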

Choose Appropriate Time Ranges

Rule: Use at least 4Γ— your scrape interval.
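With the default 15s scrape interval, that means a range of at least 1m, so each window holds roughly 4 samples and survives a missed scrape:

```text
rate(http_requests_total[1m])   # 15s scrape, ~4 samples per window: OK
rate(http_requests_total[30s])  # only ~2 samples per window: fragile
```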

Aggregate Before Histogram Quantiles
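When computing a quantile across many instances, sum the bucket rates by le first, then apply histogram_quantile. Quantiles themselves cannot be averaged. Illustrative (metric name assumed):

```text
# Good: aggregate buckets, then take the quantile
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Bad: averaging per-instance quantiles is statistically wrong
avg(histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])))
```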

Security Best Practices

Don't Expose Sensitive Data

Never put emails, tokens, API keys, or customer identifiers in metric names or label values. The /metrics endpoint is plain text and often reachable by more people than you expect.

Protect the /metrics Endpoint
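One option is bearer-token auth in front of the endpoint. A minimal Express-style middleware sketch (METRICS_TOKEN and the middleware shape are assumptions, not a specific library's API; on the Prometheus side, the matching token goes in the scrape config's authorization section):

```javascript
// Reject scrapes that don't present the shared token.
function metricsAuth(req, res, next) {
  const expected = `Bearer ${process.env.METRICS_TOKEN}`;
  if (req.headers['authorization'] !== expected) {
    res.statusCode = 401;
    res.end('unauthorized');
    return;
  }
  next();
}
```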

Or use network policies in Kubernetes to restrict access.

Common Mistakes and Fixes

Mistake 1: Using Gauges for Counters

A request count only ever increases, so model it as a Counter: rate() then works correctly and counter resets are handled automatically. Recording the same data in a Gauge breaks rate calculations entirely.

Mistake 2: Not Using Histogram Buckets Wisely
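Default buckets rarely match your latency profile; place bounds densely around your SLO (say, 250ms) and sparsely elsewhere. A sketch of how cumulative buckets classify one observation (bucket values illustrative):

```javascript
// Prometheus histogram buckets are cumulative: an observation increments
// every bucket whose upper bound (le) is >= the observed value.
const buckets = [0.05, 0.1, 0.25, 0.5, 1, 2.5]; // seconds

function incrementedBuckets(valueSeconds) {
  return buckets.filter((le) => valueSeconds <= le);
}

console.log(incrementedBuckets(0.18)); // [0.25, 0.5, 1, 2.5]
```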

Mistake 3: Instrumenting Everything

Not every function call needs a metric. Focus on:

  • User-facing operations

  • External API calls

  • Database queries

  • Critical business operations

Mistake 4: Forgetting for in Alerts
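Without for, a single noisy scrape fires the alert; with it, the condition must hold continuously before firing. Illustrative rule (metric names assumed):

```yaml
- alert: HighErrorRate
  expr: >
    sum(rate(http_requests_total{status_code=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m   # condition must hold for 5 minutes before firing
  labels:
    severity: page
```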

My Production Checklist

Before deploying Prometheus to production:

  • Estimated total series count and validated it in staging

  • No high-cardinality labels (user IDs, UUIDs, raw paths) on any metric

  • Retention and disk sizing set deliberately, not left at defaults

  • Prometheus monitoring itself, with alerts on series growth and rule failures

  • Recording rules in place for expensive dashboard queries

  • /metrics endpoints protected from public access

  • Every alert has a for duration and a clear owner

Key Takeaways

  1. Control label cardinality - This is the #1 cause of Prometheus issues

  2. Name metrics consistently - Follow conventions religiously

  3. Use appropriate metric types - Counter for cumulative, gauge for snapshots, histogram for distributions

  4. Keep /metrics fast - No expensive operations in the handler

  5. Monitor Prometheus - Your monitoring needs monitoring

  6. Use recording rules - Pre-compute expensive queries

  7. Set proper retention - Balance between storage and usefulness

  8. Test in staging - Validate cardinality before production

  9. Document everything - Future you will thank present you

  10. Start simple, iterate - Don't over-instrument on day one

Prometheus is incredibly powerful when used correctly, but it's also easy to shoot yourself in the foot. These practices come from real incidents, real outages, and real lessons learned the hard way.

Follow them, and you'll sleep better at night.


