Prometheus Configuration: From Localhost to Production

The Configuration Mistake That Cost Me Hours

When I first deployed Prometheus to production, I used the same configuration I had in development: static targets hardcoded in prometheus.yml. Everything worked fine—until we scaled to 5 API instances.

Suddenly, I had to manually update the config file every time we deployed or scaled. New instance? Edit the file, reload Prometheus. Instance goes down? Edit the file again. It was unsustainable.

That's when I discovered service discovery. One configuration change, and Prometheus automatically found all instances in Kubernetes. No more manual updates. No more stale targets.

This article will save you from making the same mistake.

The prometheus.yml File Structure

The prometheus.yml file is the heart of Prometheus configuration. Here's the basic structure:

# Global configuration
global:
  scrape_interval: 15s          # How often to scrape targets
  evaluation_interval: 15s      # How often to evaluate rules
  scrape_timeout: 10s           # Timeout for scrape requests
  
  external_labels:
    cluster: 'production'       # Attached when talking to external systems
    environment: 'prod'

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files
rule_files:
  - 'alerts/*.yml'
  - 'recording_rules/*.yml'

# Scrape configurations
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'my-api'
    static_configs:
      - targets:
          - 'api-1:3000'
          - 'api-2:3000'

Let me break down each section with real examples from my setups.

Global Configuration

global:
  # Default scrape interval (can be overridden per job)
  scrape_interval: 15s
  
  # How often to evaluate alerting/recording rules
  evaluation_interval: 15s
  
  # Maximum time to wait for a scrape request
  scrape_timeout: 10s
  
  # Attached to series and alerts sent to external systems
  # (federation, remote write, Alertmanager); not stored locally
  external_labels:
    cluster: 'production-us-east'
    datacenter: 'aws-us-east-1'
    environment: 'production'

Why 15 seconds?

Balance between granularity and overhead
Works well for most applications
Short enough to catch issues quickly
Long enough to not overwhelm targets

When to adjust:

Increase to 30s-60s: Low-priority metrics, cost reduction
Decrease to 5s-10s: Critical services, need fine granularity
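Here are some rough numbers for that trade-off. Every series gains one sample per scrape, so the interval directly sets how many samples you store. The sketch below is back-of-the-envelope arithmetic, not a sizing tool. The 100,000-series workload is hypothetical, and the 2 bytes per sample default is an assumption based on the roughly 1-2 bytes per sample that the Prometheus storage docs mention.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_series_per_day(scrape_interval_s: int) -> int:
    """How many samples one time series collects per day at this interval."""
    return SECONDS_PER_DAY // scrape_interval_s


def rough_bytes_per_day(series: int, scrape_interval_s: int,
                        bytes_per_sample: float = 2.0) -> float:
    """Very rough daily storage growth; bytes_per_sample is an assumption."""
    return series * samples_per_series_per_day(scrape_interval_s) * bytes_per_sample


# Hypothetical workload: 100,000 active series
for interval in (5, 15, 60):
    gb = rough_bytes_per_day(100_000, interval) / 1e9
    print(f"{interval:>2}s: {samples_per_series_per_day(interval):>6} samples/series/day, ~{gb:.2f} GB/day")
```

Going from 15s to 5s triples the sample count for every series. That is why I reserve short intervals for the few jobs that really need them.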

Static Configuration: Simple Setup

Perfect for development and small deployments.

scrape_configs:
  # Prometheus monitoring itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  # TypeScript API
  - job_name: 'api'
    scrape_interval: 10s  # Override global interval
    static_configs:
      - targets:
          - 'localhost:3000'
          - 'localhost:3001'
          - 'localhost:3002'
        labels:
          environment: 'development'
          tier: 'backend'
  
  # PostgreSQL exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['localhost:9187']
        labels:
          database: 'main'
  
  # Redis exporter
  - job_name: 'redis'
    static_configs:
      - targets: ['localhost:9121']
        labels:
          cache: 'session'
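Each target in a static_configs entry becomes its own set of labels. The job label comes from job_name, instance defaults to the target address, and the entry's labels block is attached to every target in it. The simplified Python model below is not Prometheus code. It ignores relabeling and the other internal __ labels, but it shows what the api job above produces.

```python
def target_label_sets(job_name, static_config):
    """Simplified model of the labels each target in one static_configs entry gets."""
    label_sets = []
    for address in static_config["targets"]:
        labels = {"__address__": address, **static_config.get("labels", {})}
        # job comes from job_name unless the entry already sets it
        labels.setdefault("job", job_name)
        # instance defaults to the target address when relabeling doesn't set it
        labels.setdefault("instance", labels["__address__"])
        # internal labels starting with "__" are discarded after relabeling
        label_sets.append({k: v for k, v in labels.items() if not k.startswith("__")})
    return label_sets


api_targets = {
    "targets": ["localhost:3000", "localhost:3001", "localhost:3002"],
    "labels": {"environment": "development", "tier": "backend"},
}
for labels in target_label_sets("api", api_targets):
    print(labels)
```

So the three api targets show up as up{job="api", instance="localhost:3000", environment="development", tier="backend"} and so on, one series per instance.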

Docker Compose Configuration

My typical development setup:

# docker-compose.yml
version: '3.8'

services:
  api:
    build: .
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=production
  
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts:/etc/prometheus/alerts
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    depends_on:
      - api

volumes:
  prometheus-data:

# prometheus.yml for Docker Compose
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'api'
    static_configs:
      - targets: ['api:3000']  # Use service name, not localhost!

Important: In Docker Compose, use service names (e.g., api:3000), not localhost.

Kubernetes Service Discovery

This changed everything for me. No more manual configuration!

# prometheus.yml for Kubernetes
scrape_configs:
  # Scrape Kubernetes pods
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    
    relabel_configs:
      # Only scrape pods with annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      
      # Get port from annotation prometheus.io/port
      # (source labels are joined as "<pod address>;<port>" before matching)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      
      # Get metrics path from annotation prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      
      # Add pod name as label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      
      # Add namespace as label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      
      # Add container name
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: replace
        target_label: container
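The port rule works because Prometheus joins all source_labels with ";" (the default separator) and matches the regex against the whole joined string. The sketch below is a simplified Python model of `action: replace`, not Prometheus code. It shows why the pod address must be listed alongside the port annotation. With only the annotation in source_labels, the value is just "3000", the regex never matches, and __address__ stays as it was. The pod IP and ports are hypothetical.

```python
import re


def expand(match, template):
    """Expand $1 / ${1} references like a relabel replacement (numeric groups only)."""
    def group_value(ref):
        index = int(ref.group(1) or ref.group(2))
        if index > match.re.groups:
            return ""  # unknown groups expand to nothing
        return match.group(index) or ""
    return re.sub(r"\$\{(\d+)\}|\$(\d+)", group_value, template)


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Simplified model of a Prometheus `action: replace` relabel rule."""
    # Missing labels contribute an empty string to the joined value.
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # relabel regexes are anchored at both ends
    if match is None:
        return dict(labels)  # no match: the rule changes nothing
    result = expand(match, replacement)
    updated = dict(labels)
    if result:
        updated[target_label] = result
    else:
        updated.pop(target_label, None)  # an empty result removes the label
    return updated


PORT_REGEX = r"([^:]+)(?::\d+)?;(\d+)"
pod = {
    "__address__": "10.1.2.3:8080",  # hypothetical pod IP and declared container port
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "3000",
}

with_address = relabel_replace(
    pod, ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    PORT_REGEX, "$1:$2", "__address__")
annotation_only = relabel_replace(
    pod, ["__meta_kubernetes_pod_annotation_prometheus_io_port"],
    PORT_REGEX, "$1:$2", "__address__")

print(with_address["__address__"])     # 10.1.2.3:3000
print(annotation_only["__address__"])  # 10.1.2.3:8080 (unchanged)
```

The optional `(?::\d+)?` group is what lets the same rule handle pods whose address has no port at all. "10.1.2.3;3000" also becomes "10.1.2.3:3000".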

Kubernetes Deployment with Annotations:

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
      annotations:
        prometheus.io/scrape: "true"   # Enable scraping
        prometheus.io/port: "3000"     # Metrics port
        prometheus.io/path: "/metrics" # Metrics path (default /metrics)
    spec:
      containers:
        - name: api
          image: my-api:latest
          ports:
            - containerPort: 3000
              name: http

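Now when you deploy pods with these annotations, Prometheus automatically discovers and scrapes them! The prometheus.io/* annotations are a convention implemented by the relabel rules above, not a built-in Prometheus feature. Prometheus also needs RBAC permission to list and watch pods for discovery to work.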
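The annotation in this manifest connects to the relabel rules through a meta label. Kubernetes service discovery exposes each pod annotation as a meta label, and it replaces characters that are not allowed in label names (such as "." and "/") with underscores. Here is a small Python sketch of that name mapping:

```python
import re


def pod_annotation_meta_label(annotation_key: str) -> str:
    """Meta label under which Kubernetes SD exposes a pod annotation."""
    # Characters not allowed in label names (such as '.' and '/') become '_'.
    return "__meta_kubernetes_pod_annotation_" + re.sub(r"[^a-zA-Z0-9_]", "_", annotation_key)


for key in ("prometheus.io/scrape", "prometheus.io/port", "prometheus.io/path"):
    print(f"{key:22} -> {pod_annotation_meta_label(key)}")
```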

Relabeling: The Power Tool

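Relabeling rewrites labels at two points. relabel_configs runs on discovered targets before they are scraped, so it can drop targets or change their address, metrics path, and labels. metric_relabel_configs runs on scraped series before they are stored, so it can drop or rewrite individual series. It's incredibly powerful.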

Common Relabeling Patterns

1. Rename Labels:

relabel_configs:
  # Copy the pod name into instance (action defaults to replace)
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: instance

2. Drop Unnecessary Labels:

metric_relabel_configs:
  # Drop high-cardinality user_id label
  - regex: 'user_id'
    action: labeldrop
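labeldrop matches label names, not values. Before you drop a label like user_id, check that series stay unique once it is gone. The Prometheus documentation explicitly warns about this for labeldrop and labelkeep. The simplified Python model below uses hypothetical series. It shows two /login series collapsing into one label set, which Prometheus will not sum for you. Removing the label in the application itself is usually the cleaner fix.

```python
import re


def labeldrop(series_labels, regex):
    """Simplified model of `action: labeldrop`: remove labels whose *name* fully matches."""
    return {name: value for name, value in series_labels.items()
            if re.fullmatch(regex, name) is None}


def distinct(series_list):
    """Number of distinct label sets (i.e. distinct time series)."""
    return len({tuple(sorted(labels.items())) for labels in series_list})


# Hypothetical scraped series; the two /login ones differ only by user_id
scraped = [
    {"__name__": "http_requests_total", "route": "/login", "user_id": "42"},
    {"__name__": "http_requests_total", "route": "/login", "user_id": "43"},
    {"__name__": "http_requests_total", "route": "/logout", "user_id": "42"},
]
after = [labeldrop(labels, "user_id") for labels in scraped]
print(distinct(scraped), "->", distinct(after))  # 3 -> 2
```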

3. Keep Only Specific Targets:

relabel_configs:
  # Only scrape production namespace
  - source_labels: [__meta_kubernetes_namespace]
    regex: 'production'
    action: keep

4. Drop Specific Targets:

relabel_configs:
  # Don't scrape test pods
  - source_labels: [__meta_kubernetes_pod_label_environment]
    regex: 'test'
    action: drop
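Both rules match the regex against the whole label value, and a label that a target doesn't have counts as an empty string. That makes keep strict: anything that doesn't match is gone, including a production-eu namespace. It also makes drop lenient: pods without an environment label survive. Here is a simplified Python model with hypothetical pods:

```python
import re


def filter_targets(targets, source_label, regex, action):
    """Simplified model of `action: keep` and `action: drop` on discovered targets."""
    survivors = []
    for labels in targets:
        value = labels.get(source_label, "")  # a missing label is an empty string
        matched = re.fullmatch(regex, value) is not None  # anchored, like Prometheus
        if (action == "keep" and matched) or (action == "drop" and not matched):
            survivors.append(labels)
    return survivors


# Hypothetical discovered pods
pods = [
    {"__meta_kubernetes_namespace": "production", "__meta_kubernetes_pod_label_environment": "prod"},
    {"__meta_kubernetes_namespace": "production-eu", "__meta_kubernetes_pod_label_environment": "prod"},
    {"__meta_kubernetes_namespace": "staging", "__meta_kubernetes_pod_label_environment": "test"},
    {"__meta_kubernetes_namespace": "production"},  # no environment label
]

kept = filter_targets(pods, "__meta_kubernetes_namespace", "production", "keep")
not_test = filter_targets(pods, "__meta_kubernetes_pod_label_environment", "test", "drop")
print(len(kept), len(not_test))  # 2 3
```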

5. Add Custom Labels:

relabel_configs:
  # Add region label based on node name (node role; for pods use __meta_kubernetes_pod_node_name)
  - source_labels: [__meta_kubernetes_node_name]
    regex: '.*-us-east-.*'
    target_label: region
    replacement: 'us-east'
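When action is omitted it defaults to replace. The region label is only written if the regex matches the entire node name, which is why the pattern needs .* on both sides. Here is a quick check of that anchoring in Python. re.fullmatch mirrors Prometheus's anchored matching for a simple pattern like this one, and the node names are hypothetical.

```python
import re

REGION_REGEX = r".*-us-east-.*"


def region_label(node_name):
    """Mimic the rule above: set region='us-east' only when the regex fully matches."""
    if re.fullmatch(REGION_REGEX, node_name):
        return "us-east"
    return None  # no match: the region label is simply not added


print(region_label("worker-us-east-1a"))  # us-east
print(region_label("worker-eu-west-1a"))  # None
# Without the leading/trailing .* the anchored regex would never match:
print(re.fullmatch(r"us-east", "worker-us-east-1a"))  # None
```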

My Production Kubernetes Configuration

This is what I actually use in production:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

alerting:
  alertmanagers:
    - kubernetes_sd_configs:
        - role: service
          namespaces:
            names:
              - monitoring
      relabel_configs:
        - source_labels: [__meta_kubernetes_service_name]
          regex: alertmanager
          action: keep

rule_files:
  - '/etc/prometheus/alerts/*.yml'
  - '/etc/prometheus/recording-rules/*.yml'

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  # Kubernetes API server
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
  
  # Kubernetes nodes (kubelet)
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
  
  # Kubernetes pods (applications)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      
      # Get metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      
      # Get port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      
      # Add pod labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      
      # Add namespace
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      
      # Add pod name
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
  
  # Kubernetes services (HTTP probes via the blackbox exporter)
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    metrics_path: /probe
    params:
      module: [http_2xx]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter:9115
      - source_labels: [__param_target]
        target_label: instance
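The kubernetes-services job is the least obvious part of this config because it never scrapes the services themselves. Its relabel rules copy each service address into the target URL parameter and point __address__ at the blackbox exporter. Prometheus builds the scrape URL from __scheme__, __address__, __metrics_path__, and any __param_<name> labels, and the `params` block shows up as __param_module. The sketch below replays the three rules on one hypothetical service (my-svc in the default namespace) and assembles the URL. It assumes blackbox-exporter:9115 resolves inside the cluster.

```python
from urllib.parse import urlencode


def scrape_url(labels):
    """Build the URL Prometheus scrapes from a target's final labels (simplified)."""
    params = {name[len("__param_"):]: value
              for name, value in labels.items() if name.startswith("__param_")}
    query = "?" + urlencode(params) if params else ""
    scheme = labels.get("__scheme__", "http")
    path = labels.get("__metrics_path__", "/metrics")
    return f"{scheme}://{labels['__address__']}{path}{query}"


# Labels for a hypothetical service before the relabel rules run
target = {
    "__address__": "my-svc.default.svc:80",
    "__metrics_path__": "/probe",
    "__param_module": "http_2xx",
}
# Rule 1: copy the service address into the `target` URL parameter
target["__param_target"] = target["__address__"]
# Rule 2: actually scrape the blackbox exporter instead of the service
target["__address__"] = "blackbox-exporter:9115"
# Rule 3: keep the probed service as the instance label
target["instance"] = target["__param_target"]

print(scrape_url(target))
```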

Scrape Interval Tuning

Different jobs need different intervals:

scrape_configs:
  # Critical API - scrape frequently
  - job_name: 'api-critical'
    scrape_interval: 5s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_tier]
        regex: 'critical'
        action: keep
  
  # Regular services - normal interval
  - job_name: 'api-regular'
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_tier]
        regex: 'standard'
        action: keep
  
  # Background jobs - slower interval
  - job_name: 'background-jobs'
    scrape_interval: 60s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_tier]
        regex: 'batch'
        action: keep
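The 5s api-critical job interacts with the global timeout. From my reading of Prometheus's config validation, a job that sets no scrape_timeout of its own inherits the global one, capped at the job's interval. A timeout set explicitly above the interval is rejected at load time. The simplified model below assumes the 15s interval and 10s timeout used in the earlier global examples. It is not Prometheus code.

```python
def effective_timing(global_interval, global_timeout, job_interval=None, job_timeout=None):
    """Resolve a job's (interval, timeout) in seconds, modelled on the checks
    Prometheus runs when it loads the config (simplified)."""
    if global_timeout > global_interval:
        raise ValueError("global scrape timeout greater than scrape interval")
    interval = global_interval if job_interval is None else job_interval
    if job_timeout is not None:
        if job_timeout > interval:
            raise ValueError("scrape timeout greater than scrape interval")
        return interval, job_timeout
    # No per-job timeout: inherit the global one, capped at the job's interval
    return interval, min(global_timeout, interval)


# Global settings assumed from earlier examples: 15s interval, 10s timeout
for job, interval in [("api-critical", 5), ("api-regular", 15), ("background-jobs", 60)]:
    print(job, effective_timing(15, 10, interval))
```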

Configuration Best Practices from Experience

1. Use Meaningful Job Names

❌ Bad:

- job_name: 'job1'

✅ Good:

- job_name: 'api-backend'

2. Set Scrape Timeouts Properly

global:
  scrape_interval: 15s
  scrape_timeout: 10s  # Must not exceed the interval

3. Use External Labels for Federated Setup

global:
  external_labels:
    cluster: 'us-east-1-prod'
    datacenter: 'aws'
    team: 'backend'

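These labels are crucial when federating multiple Prometheus servers, and just as important with remote write or a shared Alertmanager. They identify which server a series or alert came from.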
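External labels are not stored with the local series. They are attached when data leaves the server. The simplified model below shows the merge and assumes the behavior I rely on for federation and remote write: a label the series already carries is not overwritten by an external label of the same name. The series and the conflicting team label are hypothetical.

```python
def with_external_labels(series_labels, external_labels):
    """Simplified model of how external labels are attached to outgoing series."""
    merged = dict(external_labels)
    merged.update(series_labels)  # a label already on the series wins
    return merged


external = {"cluster": "us-east-1-prod", "datacenter": "aws", "team": "backend"}
plain = with_external_labels({"__name__": "up", "job": "api"}, external)
conflict = with_external_labels({"__name__": "up", "job": "billing", "team": "payments"}, external)
print(plain)
print(conflict)
```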

4. Organize Rule Files

rule_files:
  - '/etc/prometheus/alerts/api-alerts.yml'
  - '/etc/prometheus/alerts/db-alerts.yml'
  - '/etc/prometheus/recording-rules/api-rules.yml'

5. Enable Configuration Reload

Start Prometheus with --web.enable-lifecycle:

prometheus --config.file=prometheus.yml --web.enable-lifecycle

Then reload without restart:

curl -X POST http://localhost:9090/-/reload

Monitoring Prometheus Itself

Always monitor your monitoring system:

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    metric_relabel_configs:
      # Keep important Prometheus metrics
      - source_labels: [__name__]
        regex: 'prometheus_(tsdb_.*|engine_.*|rule_.*|target_.*)'
        action: keep

Key Prometheus metrics to watch:

prometheus_tsdb_head_series - Number of time series
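prometheus_target_interval_length_seconds - Actual time between scrapes (should stay close to the configured interval)
scrape_duration_seconds - Per-target scrape duration (generated for every target, not affected by metric_relabel_configs)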
prometheus_rule_evaluation_duration_seconds - Rule evaluation time
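Relabel regexes are anchored, so the keep rule above works as an allow-list. Every metric from the prometheus job whose name doesn't fully match the pattern is discarded. That includes Go runtime metrics such as go_goroutines and HTTP handler metrics such as prometheus_http_requests_total. Before deploying a pattern like this, I find it worth checking a few names against it. Here is a small Python sketch. Python's re.fullmatch behaves like Prometheus's anchored matching for a pattern this simple.

```python
import re

# The keep regex from the prometheus job's metric_relabel_configs above
KEEP_REGEX = r"prometheus_(tsdb_.*|engine_.*|rule_.*|target_.*)"


def kept_by_rule(metric_name: str) -> bool:
    """True if the anchored keep rule retains a metric with this name."""
    return re.fullmatch(KEEP_REGEX, metric_name) is not None


for name in ("prometheus_tsdb_head_series",
             "prometheus_rule_evaluation_duration_seconds",
             "prometheus_http_requests_total",
             "go_goroutines"):
    print(f"{name}: {'kept' if kept_by_rule(name) else 'dropped'}")
```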

Validating Configuration

Before deploying, validate your config:

# Check syntax
promtool check config prometheus.yml

# Check rules
promtool check rules alerts/*.yml

# Test queries
promtool query instant http://localhost:9090 'up'
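A reload only helps if the new file is valid, so I chain the check and the reload in deploy scripts. Below is a minimal Python sketch. It assumes promtool is on the PATH and Prometheus runs with --web.enable-lifecycle. The function names are my own and are not part of any Prometheus tooling. In a deploy script you would call validate_and_reload("prometheus.yml").

```python
import subprocess
import urllib.request


def check_command(config_path):
    """promtool invocation used to validate the config before reloading."""
    return ["promtool", "check", "config", config_path]


def reload_request(base_url):
    """POST to the lifecycle reload endpoint (needs --web.enable-lifecycle)."""
    return urllib.request.Request(f"{base_url}/-/reload", method="POST")


def validate_and_reload(config_path, base_url="http://localhost:9090",
                        run=subprocess.run, urlopen=urllib.request.urlopen):
    """Reload Prometheus only if `promtool check config` passes.

    `run` and `urlopen` are injectable so the flow can be exercised offline.
    """
    result = run(check_command(config_path), capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"config check failed:\n{result.stdout}{result.stderr}")
    with urlopen(reload_request(base_url)) as response:
        return response.status  # urlopen raises HTTPError for non-2xx replies
```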

Configuration File Management

How I organize configurations:

prometheus/
├── prometheus.yml          # Main config
├── alerts/
│   ├── api-alerts.yml
│   ├── database-alerts.yml
│   └── infrastructure-alerts.yml
├── recording-rules/
│   └── api-rules.yml
└── docker-compose.yml

Common Configuration Issues I've Fixed

Issue 1: Scrape Timeout Too Long

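# WRONG - timeout longer than interval
global:
  scrape_interval: 15s
  scrape_timeout: 20s  # Prometheus refuses to load this config

# RIGHT
global:
  scrape_interval: 15s
  scrape_timeout: 10s

A timeout equal to the interval is technically accepted, but it leaves no headroom before the next scrape is due, so I keep it a few seconds shorter.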

Issue 2: Wrong Target Address in Docker

# WRONG - localhost doesn't work in Docker
- targets: ['localhost:3000']

# RIGHT - use service name
- targets: ['api:3000']

Issue 3: Forgetting to Reload

After changing config:

# Don't just save the file - reload!
curl -X POST http://localhost:9090/-/reload

Key Takeaways

Start simple - Static configs for development, service discovery for production
Use Kubernetes annotations - Let Prometheus auto-discover targets
Tune scrape intervals - Different intervals for different priorities
Leverage relabeling - Clean up and organize labels
Validate before deploying - Use promtool check config
Enable lifecycle API - Reload without restart
Monitor Prometheus - Your monitoring needs monitoring too

The configuration file seems complex at first, but once you understand the patterns, it becomes your most powerful tool for controlling exactly what and how Prometheus monitors.

In the next article, we'll set up alerting—turning these metrics into actionable notifications.

Previous: PromQL Basics Next: Alerting with Prometheus
