Prometheus Best Practices: Lessons from Production

The Cardinality Explosion That Took Down Prometheus

I'll never forget the day our Prometheus server ran out of memory and crashed. We were monitoring a few thousand time series when suddenly, overnight, we had millions. Queries timed out. Dashboards wouldn't load. Alerts stopped firing. We were completely blind.

The culprit? One innocent-looking line of code:

requestCounter.inc({ user_id: req.user.id });

We had added user_id as a label. With 100,000 users and 20 endpoints, we created 2 million unique time series overnight. Prometheus couldn't handle it.

I learned an expensive lesson: label cardinality matters more than almost anything else in Prometheus.

This article is everything I wish I knew before running Prometheus in production.

The Cardinal Rule: Control Label Cardinality

What Is Cardinality?

Cardinality is the number of unique combinations of labels for a metric.

Example:

http_requests_total{method="GET", endpoint="/api/users", status_code="200"}

If you have:

  • 3 methods (GET, POST, DELETE)

  • 10 endpoints

  • 6 status codes (200, 201, 400, 401, 404, 500)

Cardinality = 3 × 10 × 6 = 180 time series

Manageable.

But if you add user_id:

  • 3 methods

  • 10 endpoints

  • 6 status codes

  • 100,000 users

Cardinality = 3 × 10 × 6 × 100,000 = 18,000,000 time series

Disaster.
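The arithmetic generalizes: total series count is the product of the value counts of every label. A quick sketch of that rule (illustrative, not tied to any real metric):

```javascript
// Cardinality is multiplicative across labels: one time series exists
// for every unique combination of label values.
function cardinality(...labelValueCounts) {
  return labelValueCounts.reduce((product, count) => product * count, 1);
}

console.log(cardinality(3, 10, 6));          // 180: manageable
console.log(cardinality(3, 10, 6, 100000));  // 18000000: disaster
```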

High-Cardinality Labels to AVOID

❌ Never use these as labels:

  • User IDs

  • Session IDs

  • Request IDs

  • Email addresses

  • IP addresses (unless aggregated)

  • Timestamps

  • UUIDs

  • File paths (full paths)

  • URLs (full URLs with query params)

Low-Cardinality Labels to USE

✅ Good labels:

  • HTTP method (GET, POST, etc.) - ~10 values

  • HTTP status code (200, 404, 500, etc.) - ~20 values

  • Endpoint/route (/api/users, /api/products) - 10-100 values

  • Environment (dev, staging, prod) - 2-5 values

  • Service name (api, worker, scheduler) - 5-20 values

  • Instance/host - 10-100 values

  • Database table name - 10-50 values

  • User role (admin, user, guest) - 3-10 values

  • Region (us-east, us-west, eu) - 3-10 values
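A common trick for keeping the route label low-cardinality is normalizing raw paths into templates before using them as label values, so /api/users/12345 and /api/users/67890 share one series. A sketch (the regexes here are illustrative assumptions, not a library API):

```javascript
// Collapse high-cardinality path segments (numeric IDs, UUIDs) into
// placeholders, and never let query strings into label values.
function normalizeRoute(path) {
  return path
    .split('?')[0]
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '/:uuid')
    .replace(/\/\d+/g, '/:id');
}

console.log(normalizeRoute('/api/users/12345/orders/678')); // "/api/users/:id/orders/:id"
```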

How to Handle High-Cardinality Data

Option 1: Don't track it at all

Often, you don't need per-user metrics. Aggregate data is usually sufficient.

Option 2: Use logs, not metrics

Metrics for aggregates, logs for specifics.

Option 3: Use exemplars (Prometheus 2.26+)

Exemplars let you link metrics to traces without increasing cardinality:
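In the OpenMetrics exposition format, an exemplar rides along with a sample as an annotation rather than a label, so it references a trace without creating new series. Roughly (values illustrative):

```text
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.25"} 1307 # {trace_id="3h29d8f2a1"} 0.178 1624823400.0
```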

Option 4: Aggregate before exposing

Instead of tracking each user, track user buckets:
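A sketch of the bucketing idea: map each user to one of a small, fixed set of label values, so cardinality stays bounded no matter how many users exist. The tier scheme here is an illustrative assumption:

```javascript
// Instead of user_id (unbounded), label by account tier (3 values total).
function userTierLabel(user) {
  if (user.isEnterprise) return 'enterprise';
  if (user.isPaid) return 'paid';
  return 'free';
}

// Usage at the instrumentation site:
// requestCounter.inc({ tier: userTierLabel(req.user) });
```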

Metric Naming Conventions

After years of running Prometheus, these conventions have saved me countless hours of debugging.

The Standard Format

A metric name should read as namespace_name_unit, with _total appended for counters: for example, http_request_duration_seconds or http_requests_total.

Examples:
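Some names that follow the convention (illustrative; process_resident_memory_bytes is a standard client-library metric):

```text
http_requests_total               # counter: snake_case, _total suffix
http_request_duration_seconds     # histogram: unit in the name, base unit
process_resident_memory_bytes     # gauge: bytes, not kilobytes
queue_jobs_in_flight              # gauge: current snapshot
```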

Rules I Always Follow

  1. Use snake_case, not camelCase

  2. Suffix counters with _total

  3. Include units in the name

  4. Use base units

    • Time: seconds (not milliseconds)

    • Data: bytes (not kilobytes)

    • Percentage: 0-1 ratio (not 0-100)

  5. Group related metrics with prefixes

My Naming Patterns

HTTP Metrics:

Database Metrics:

Business Metrics:
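As an illustration of what those groupings might look like (metric and label names are examples, not a standard):

```text
# HTTP
http_requests_total{method, route, status_code}
http_request_duration_seconds{method, route}

# Database
db_queries_total{operation, table}
db_query_duration_seconds{operation, table}
db_connections_active{pool}

# Business
orders_created_total{channel}
payments_processed_total{provider, status}
```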

Retention and Storage Management

Setting Appropriate Retention

Default retention is 15 days. I adjust based on needs:
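Retention is set with a startup flag (both flags below are real Prometheus flags; the values are illustrative):

```shell
# Time-based retention
prometheus --storage.tsdb.retention.time=30d

# Or cap by disk usage instead
prometheus --storage.tsdb.retention.size=50GB
```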

My retention strategy:

  • Production: 30 days (catch monthly trends)

  • Staging: 7 days (enough for debugging)

  • Development: 3 days (minimal needs)

Storage Sizing

Rough calculation: disk ≈ number_of_series × (1 / scrape_interval_seconds) × retention_seconds × bytes_per_sample.

Real example:

  • 10,000 time series

  • Scrape every 15s = 4 samples/minute = 0.067 samples/second

  • 30-day retention = 2,592,000 seconds

  • ~2 bytes per sample (compressed)

In practice, expect 3-6 GB for 10,000 series over 30 days.
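The arithmetic above can be sketched directly:

```javascript
// disk ≈ series × samples-per-second-per-series × retention-seconds × bytes-per-sample
const series = 10000;
const scrapeIntervalSeconds = 15;
const retentionSeconds = 30 * 24 * 3600; // 2,592,000
const bytesPerSample = 2;                // rough compressed size

const bytes = series * (1 / scrapeIntervalSeconds) * retentionSeconds * bytesPerSample;
console.log((bytes / 1e9).toFixed(2) + ' GB'); // about 3.46 GB before index/WAL overhead
```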

Monitoring Prometheus Itself

Always monitor your monitoring:

Alert when Prometheus struggles:
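A couple of hedged starting points. The metric names are real Prometheus self-metrics, but the thresholds are assumptions you should tune to your own series counts:

```yaml
groups:
  - name: prometheus-self
    rules:
      - alert: PrometheusTooManySeries
        expr: prometheus_tsdb_head_series > 1000000
        for: 15m
        labels:
          severity: warning
      - alert: PrometheusRuleEvaluationFailures
        expr: increase(prometheus_rule_evaluation_failures_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
```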

Scraping Best Practices

Scrape Intervals

Default: 15 seconds

This works for most cases. Adjust only when necessary:

Rule of thumb: Use the longest interval that meets your needs.
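In prometheus.yml, that looks like a global default plus per-job overrides (job names and targets are illustrative):

```yaml
global:
  scrape_interval: 15s   # good default for most targets

scrape_configs:
  - job_name: critical-api
    scrape_interval: 10s # faster where latency matters
    static_configs:
      - targets: ['api:9100']
  - job_name: batch-workers
    scrape_interval: 60s # slow-moving metrics don't need 15s
    static_configs:
      - targets: ['worker:9100']
```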

Scrape Timeout
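The timeout must be less than or equal to the scrape interval, or Prometheus will reject the configuration. For example:

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s  # must be <= scrape_interval
```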

Keep /metrics Fast

Your /metrics endpoint should respond in milliseconds:
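The pattern that keeps it fast: never do expensive work (database queries, filesystem walks) inside the handler; refresh a cached value in the background and let the handler only read it. A minimal sketch, assuming a hypothetical queue_depth gauge:

```javascript
// queueDepthCache is refreshed by a background job, not per scrape.
let queueDepthCache = 0;

function refreshQueueDepth(count) {
  // In production this would run on a timer, e.g. setInterval(..., 10000),
  // with `count` coming from the expensive query.
  queueDepthCache = count;
}

function metricsHandler() {
  // Fast: reads the cached value instead of querying the database per scrape.
  return `queue_depth ${queueDepthCache}\n`;
}
```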

Scrape Only What You Need

Don't expose every possible metric:
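One option is dropping noisy metric families at scrape time with metric_relabel_configs (the regex here is an illustrative example):

```yaml
scrape_configs:
  - job_name: app
    static_configs:
      - targets: ['app:9100']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_gc_duration_seconds.*'
        action: drop
```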

Recording Rules: When and How

Recording rules pre-compute expensive queries.

When to Use Recording Rules

✅ Use for:

  • Expensive aggregations used in dashboards

  • Queries used in multiple alerts

  • Complex histogram calculations

  • Frequently accessed data

❌ Don't use for:

  • Simple queries (overhead not worth it)

  • Rarely used metrics

  • Already fast queries

My Production Recording Rules
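For example, pre-computing a per-job request rate and a p95 latency (the source metric names are assumptions; the structure is standard rule-file YAML):

```yaml
groups:
  - name: http-aggregations
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_request_duration_seconds:p95_5m
        expr: >
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```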

Naming Convention for Recording Rules

Examples:
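The convention I follow is the documented level:metric:operations pattern, where level is the aggregation level (the remaining label set), metric is the source metric name, and operations are the transformations applied:

```text
job:http_requests:rate5m                   # sum by (job) of rate(...[5m])
instance_path:requests_total:rate5m       # aggregated by instance and path
job:http_request_duration_seconds:p95_5m  # 95th percentile over 5m, per job
```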

Label Best Practices

Keep Label Sets Consistent

All metrics in a family should have the same labels:
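If some observations of a metric carry a label and others omit it, queries that group by that label silently drop the unlabeled series. Illustrative:

```text
# Inconsistent: hard to aggregate by route
http_requests_total{method="GET", route="/api/users"} 42
http_requests_total{method="POST"} 7

# Consistent: use an explicit placeholder instead of omitting the label
http_requests_total{method="POST", route="unknown"} 7
```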

Use Meaningful Label Values

Prefer self-describing values: status="success" reads better in queries and dashboards than status="1".

Don't Include Label Names in Values

The label name already provides the context: use env="prod", not env="env_prod".

Performance Optimization

Limit Label Cardinality Per Metric

My thresholds:

  • < 100 series: Perfect

  • 100-1,000 series: Good

  • 1,000-10,000 series: Monitor closely

  • 10,000+ series: Needs optimization

Use rate() Correctly

Always use rate() for counters, never for gauges:
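rate() assumes counter semantics (monotonically increasing, reset-aware), so applying it to a gauge produces meaningless numbers. Illustrative:

```text
rate(http_requests_total[5m])    # good: counter
rate(memory_usage_bytes[5m])     # bad: gauge. Query the gauge directly,
                                 # or use deriv()/delta() for trends.
```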

Choose Appropriate Time Ranges

Rule: Use at least 4Γ— your scrape interval.
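With the default 15s scrape interval, that means a range of at least 1m, so each window holds roughly 4 samples and survives a missed scrape:

```text
rate(http_requests_total[1m])   # 15s scrape, ~4 samples per window: OK
rate(http_requests_total[30s])  # only ~2 samples per window: fragile
```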

Aggregate Before Histogram Quantiles
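When computing a quantile across many instances, sum the bucket rates by le first, then apply histogram_quantile. Quantiles themselves cannot be averaged. Illustrative (metric name assumed):

```text
# Good: aggregate buckets, then take the quantile
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Bad: averaging per-instance quantiles is statistically wrong
avg(histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])))
```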

Security Best Practices

Don't Expose Sensitive Data

Never put emails, tokens, API keys, or customer identifiers in metric names or label values. The /metrics endpoint is plain text and often reachable by more people than you expect.

Protect the /metrics Endpoint
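One option is bearer-token auth in front of the endpoint. A minimal Express-style middleware sketch (METRICS_TOKEN and the middleware shape are assumptions, not a specific library's API; on the Prometheus side, the matching token goes in the scrape config's authorization section):

```javascript
// Reject scrapes that don't present the shared token.
function metricsAuth(req, res, next) {
  const expected = `Bearer ${process.env.METRICS_TOKEN}`;
  if (req.headers['authorization'] !== expected) {
    res.statusCode = 401;
    res.end('unauthorized');
    return;
  }
  next();
}
```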

Or use network policies in Kubernetes to restrict access.

Common Mistakes and Fixes

Mistake 1: Using Gauges for Counters

A request count only ever increases, so model it as a Counter: rate() then works correctly and counter resets are handled automatically. Recording the same data in a Gauge breaks rate calculations entirely.

Mistake 2: Not Using Histogram Buckets Wisely
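Default buckets rarely match your latency profile; place bounds densely around your SLO (say, 250ms) and sparsely elsewhere. A sketch of how cumulative buckets classify one observation (bucket values illustrative):

```javascript
// Prometheus histogram buckets are cumulative: an observation increments
// every bucket whose upper bound (le) is >= the observed value.
const buckets = [0.05, 0.1, 0.25, 0.5, 1, 2.5]; // seconds

function incrementedBuckets(valueSeconds) {
  return buckets.filter((le) => valueSeconds <= le);
}

console.log(incrementedBuckets(0.18)); // [0.25, 0.5, 1, 2.5]
```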

Mistake 3: Instrumenting Everything

Not every function call needs a metric. Focus on:

  • User-facing operations

  • External API calls

  • Database queries

  • Critical business operations

Mistake 4: Forgetting for in Alerts
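Without for, a single noisy scrape fires the alert; with it, the condition must hold continuously before firing. Illustrative rule (metric names assumed):

```yaml
- alert: HighErrorRate
  expr: >
    sum(rate(http_requests_total{status_code=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m   # condition must hold for 5 minutes before firing
  labels:
    severity: page
```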

My Production Checklist

Before deploying Prometheus to production:

  • Estimated total series count and validated it in staging

  • No high-cardinality labels (user IDs, UUIDs, raw paths) on any metric

  • Retention and disk sizing set deliberately, not left at defaults

  • Prometheus monitoring itself, with alerts on series growth and rule failures

  • Recording rules in place for expensive dashboard queries

  • /metrics endpoints protected from public access

  • Every alert has a for duration and a clear owner

Key Takeaways

  1. Control label cardinality - This is the #1 cause of Prometheus issues

  2. Name metrics consistently - Follow conventions religiously

  3. Use appropriate metric types - Counter for cumulative, gauge for snapshots, histogram for distributions

  4. Keep /metrics fast - No expensive operations in the handler

  5. Monitor Prometheus - Your monitoring needs monitoring

  6. Use recording rules - Pre-compute expensive queries

  7. Set proper retention - Balance between storage and usefulness

  8. Test in staging - Validate cardinality before production

  9. Document everything - Future you will thank present you

  10. Start simple, iterate - Don't over-instrument on day one

Prometheus is incredibly powerful when used correctly, but it's also easy to shoot yourself in the foot. These practices come from real incidents, real outages, and real lessons learned the hard way.

Follow them, and you'll sleep better at night.


