Prometheus Architecture: How the Pieces Fit Together

The Day I Understood the Pull Model

When I first started with Prometheus, I was confused. Every other monitoring system I'd used required me to push metrics from my application to a central server. I had to configure where to send metrics, handle network failures, implement retry logic; it was a pain.

Then I discovered Prometheus's pull-based model, and it clicked. Instead of my application pushing metrics, Prometheus pulls them. My application just needs to expose an HTTP endpoint, and Prometheus does all the work of collecting metrics.

This inversion of control simplifies everything. My application doesn't need to know where Prometheus is, how many Prometheus servers exist, or what to do if Prometheus is down. It just exposes data and moves on.

Understanding this architecture changed how I think about monitoring.

The Core Components

Prometheus is more than just a metrics database. It's an ecosystem of components working together. Let's break down each piece.

1. Prometheus Server: The Heart

The Prometheus server is the core component. It:

Scrapes Metrics (Pulls Data):

  • Periodically fetches metrics from configured targets

  • Typical interval: 15 seconds (the built-in default is 1 minute; both are configurable)

  • Uses HTTP to pull from /metrics endpoints

Stores Time Series Data:

  • Local storage on disk (TSDB - Time Series Database)

  • Highly efficient storage format

  • Configurable retention (default: 15 days)

Evaluates Rules:

  • Recording rules: Pre-compute expensive queries

  • Alerting rules: Trigger alerts when conditions are met

Serves Queries:

  • HTTP API for querying data

  • PromQL query language

  • Powers Grafana dashboards and ad-hoc queries

2. Service Discovery

Prometheus needs to know what to scrape. You can configure targets in two ways:

Static Configuration:
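
For a fixed set of targets, you list them directly in prometheus.yml. A minimal sketch (the job name and addresses are placeholders):

```yaml
scrape_configs:
  - job_name: "my-api"            # placeholder job name
    static_configs:
      - targets: ["10.0.0.5:3000", "10.0.0.6:3000"]   # placeholder host:port pairs
```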

Dynamic Service Discovery: Prometheus integrates with:

  • Kubernetes (my most-used option)

  • Docker/Docker Swarm

  • AWS EC2

  • Azure

  • Consul

  • DNS

  • And many more

For my Kubernetes deployments, Prometheus automatically discovers new pods:
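
A sketch of the kind of scrape config this relies on, using Kubernetes pod discovery plus the common prometheus.io/* annotation convention (relabeling trimmed to the essentials):

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod                 # discover every pod via the Kubernetes API
    relabel_configs:
      # Only keep pods that opt in with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Let a pod override the scrape port with prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: "$1:$2"
        target_label: __address__
```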

When I deploy a new pod with the annotation prometheus.io/scrape: "true", Prometheus automatically starts scraping it. No manual configuration needed.
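
On the workload side, the opt-in is just an annotation on the pod template (the port value is illustrative):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "3000"   # where the container serves /metrics
```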

3. Pushgateway: The Exception to the Rule

The pull model works great for long-running services, but what about short-lived jobs? Batch jobs, cron jobs, or serverless functions might finish before Prometheus scrapes them.

That's where the Pushgateway comes in. It's an intermediary that short-lived jobs push their metrics to; Prometheus then scrapes the Pushgateway itself.

TypeScript Example:
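
A sketch of a short-lived job pushing its metrics with prom-client (the Pushgateway URL, job name, and metric are assumptions for illustration):

```typescript
import { Pushgateway, Registry, Gauge } from 'prom-client';

const registry = new Registry();

const jobDurationSeconds = new Gauge({
  name: 'batch_job_duration_seconds',   // hypothetical metric name
  help: 'Duration of the last batch job run in seconds',
  registers: [registry],
});

async function runBatchJob(): Promise<void> {
  const start = Date.now();
  // ... do the actual batch work here ...
  jobDurationSeconds.set((Date.now() - start) / 1000);

  // Push once at the end; Prometheus scrapes the Pushgateway on its own schedule.
  const gateway = new Pushgateway('http://pushgateway:9091', {}, registry);
  await gateway.pushAdd({ jobName: 'nightly_batch_job' });
}

runBatchJob().catch((err) => {
  console.error('batch job failed', err);
  process.exit(1);
});
```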

Important: The Pushgateway is meant for batch jobs, not as a replacement for the pull model. I learned this the hard way when I tried using it for regular services; it caused more problems than it solved.

4. Exporters: Monitoring Third-Party Systems

Exporters are small programs that expose metrics for systems that don't natively support Prometheus.

Common exporters I use:

| Exporter | Purpose | Metrics Port |
| --- | --- | --- |
| Node Exporter | Host metrics (CPU, memory, disk) | 9100 |
| PostgreSQL Exporter | Database metrics | 9187 |
| Redis Exporter | Redis metrics | 9121 |
| NGINX Exporter | NGINX metrics | 9113 |
| Blackbox Exporter | HTTP/TCP probing | 9115 |

Example: PostgreSQL Exporter

Instead of instrumenting PostgreSQL itself, I run the PostgreSQL exporter as a sidecar:
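
A sketch of the sidecar in a Kubernetes pod spec (image tag, credentials, and connection string are placeholders):

```yaml
containers:
  - name: postgres
    image: postgres:16
    # ... the usual PostgreSQL container config ...
  - name: postgres-exporter
    image: prometheuscommunity/postgres-exporter:latest
    ports:
      - containerPort: 9187        # the port Prometheus scrapes
    env:
      - name: DATA_SOURCE_NAME     # connection string the exporter queries through
        value: "postgresql://exporter:password@localhost:5432/postgres?sslmode=disable"
```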

Prometheus scrapes the exporter, which queries PostgreSQL and converts the data to Prometheus metrics.

5. Alertmanager: Intelligent Alert Routing

When an alert fires, you don't want to spam everyone. Alertmanager handles:

Grouping: Multiple similar alerts get grouped into one notification.

Inhibition: If a high-priority alert fires, suppress related low-priority alerts.

Silencing: Temporarily mute alerts during maintenance.

Routing: Send different alerts to different teams/channels.

My Alertmanager Configuration:
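
A condensed sketch of its shape (receiver names, the Slack webhook, and the PagerDuty key are placeholders):

```yaml
route:
  receiver: slack-warnings           # default: everything goes to Slack
  group_by: [alertname, cluster]
  routes:
    - matchers:
        - severity = "critical"      # critical alerts page instead
      receiver: pagerduty-critical

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "<pagerduty-routing-key>"
  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/..."
        channel: "#alerts"
```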

Critical alerts page me. Warnings go to Slack. This prevents alert fatigue while ensuring I don't miss critical issues.

The Pull Model: Deep Dive

Let's understand why the pull model is so powerful.

How Prometheus Discovers and Scrapes

1. Discovery Phase: Prometheus queries service discovery systems (Kubernetes API, Consul, etc.) to get a list of targets.

2. Scrape Phase: Every 15 seconds (configurable), Prometheus:

  • Makes an HTTP GET request to each target's /metrics endpoint

  • Parses the response (Prometheus text format)

  • Stores the time series in its TSDB

3. Your Application's Responsibility: Just expose metrics at /metrics. That's it.
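
To make that concrete, here's a minimal sketch in TypeScript with Express and prom-client (the port and setup are arbitrary; the next article covers instrumentation properly):

```typescript
import express from 'express';
import { collectDefaultMetrics, register } from 'prom-client';

collectDefaultMetrics();   // standard Node.js process metrics (CPU, memory, event loop)

const app = express();

// The only monitoring-specific code the application needs.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000, () => {
  console.log('metrics exposed at http://localhost:3000/metrics');
});
```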

Pull vs Push: Why Pull Wins

I've worked with both models. Here's why I prefer pull:

Advantages of Pull:

  1. Centralized Control:

    • Prometheus controls scrape frequency

    • Easy to adjust monitoring without changing apps

  2. Failure Detection:

    • If Prometheus can't scrape a target, it knows the service is down

    • Push model: if an app stops pushing, did it crash or are metrics just delayed?

  3. Simpler Application Code:

    • No need to configure where to send metrics

    • No retry logic needed

    • No network error handling

  4. Easy Testing:

    • curl http://localhost:3000/metrics shows your metrics

    • No need to run the entire monitoring stack for development

  5. Multiple Prometheus Servers:

    • Multiple Prometheus instances can scrape the same target

    • Useful for different regions or teams

When Push Makes Sense:

  • Short-lived jobs (use Pushgateway)

  • Behind firewalls (Prometheus can't reach the target)

  • Very high cardinality data that needs aggregation before storage

Data Storage: The Time Series Database

Prometheus stores metrics in a local time-series database (TSDB) on disk.

Storage Structure
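
Roughly, the data directory on disk looks like this (block directory names are example ULIDs):

```
data/
├── wal/                        # write-ahead log for recent samples
├── chunks_head/                # memory-mapped chunks backing the head block
├── 01HEXAMPLEBLOCKULID000000/  # an immutable block (2 hours, or larger after compaction)
│   ├── chunks/                 # compressed sample data
│   ├── index                   # index over series names and labels
│   ├── meta.json               # block time range and compaction history
│   └── tombstones              # deletion markers
└── 01HANOTHERBLOCKULID000000/  # another block
```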

How it works:

  1. Write-Ahead Log (WAL):

    • New samples written here first

    • Prevents data loss on crashes

  2. Head Block:

    • Recent data kept in memory for fast queries

    • Also persisted to disk

  3. 2-Hour Blocks:

    • Every 2 hours, the head block is compacted into an immutable block

    • Compressed and indexed for efficient storage and queries

  4. Compaction:

    • Blocks are progressively compacted into larger blocks

    • Reduces storage and improves query performance

Retention and Sizing

Default Retention: 15 days

Configure Retention:
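
Retention is controlled by flags on the Prometheus binary; for example, to keep 30 days or cap disk usage at 50 GB, whichever limit is hit first (the values here are examples):

```
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB
```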

Storage Estimation:

In my experience, storage usage depends on:

  • Number of time series (metric + label combinations)

  • Scrape interval

  • Retention period

Example calculation:

  • 10,000 time series

  • 15-second scrape interval

  • 30-day retention

  • ~12 bytes per sample as a conservative planning figure
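
Working through those numbers: 30 days is 2,592,000 seconds, so each series collects 2,592,000 / 15 = 172,800 samples, or about 1.7 billion samples across 10,000 series; at roughly 12 bytes per sample that comes to around 20 GB.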

In practice, I've seen much better compression, often 2-4 bytes per sample. A typical installation with 10,000 time series uses 20-50 GB for 30 days.

The Query Engine

Prometheus's query engine powers everything:

  • Grafana dashboards

  • Alert rule evaluation

  • Ad-hoc queries via the UI

PromQL Query Flow:

The engine:

  1. Parses your PromQL query

  2. Determines which time series to fetch

  3. Retrieves data from TSDB

  4. Applies functions and aggregations

  5. Returns results
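
If you want to poke at the API directly, here's a small sketch (it assumes Prometheus at localhost:9090 and Node 18+ for the built-in fetch):

```typescript
async function queryPrometheus(promql: string): Promise<void> {
  const url = new URL('http://localhost:9090/api/v1/query');
  url.searchParams.set('query', promql);

  const response = await fetch(url);
  const body: any = await response.json();
  // body.data.result is a list of { metric: {...labels}, value: [timestamp, "value"] }
  console.log(JSON.stringify(body.data.result, null, 2));
}

queryPrometheus('up').catch(console.error);   // "up" is 1 for targets whose last scrape succeeded
```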

We'll dive deep into PromQL in a dedicated article.

Recording Rules: Pre-Computing Expensive Queries

Some queries are expensive to run repeatedly. Recording rules let you pre-compute them.

Example:

Instead of running this expensive query every time:
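
For instance, a per-service request rate over a hypothetical http_requests_total counter:

```
sum(rate(http_requests_total[5m])) by (service)
```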

Create a recording rule:
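
In a rules file loaded through rule_files, something like:

```yaml
groups:
  - name: http_aggregations
    rules:
      - record: service:http_requests:rate5m   # naming convention: level:metric:operations
        expr: sum(rate(http_requests_total[5m])) by (service)
```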

Now query the pre-computed metric:
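
The pre-computed series is queried by its new name, just like any other metric:

```
service:http_requests:rate5m
```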

This is much faster and reduces load on Prometheus.

Putting It All Together: My Production Setup

Here's how I structure Prometheus in production:

prometheus.yml:
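
A condensed sketch of its shape (paths, intervals, and job names follow the examples above rather than any required values):

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml        # recording and alerting rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  # Application pods discovered through the Kubernetes API (see section 2)
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"

  # Host metrics from the Node Exporter
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
```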

Key Takeaways

  1. Pull-based architecture simplifies your application code

  2. Prometheus server handles scraping, storage, and queries

  3. Service discovery makes dynamic environments manageable

  4. Pushgateway is for batch jobs only

  5. Exporters monitor third-party systems

  6. Alertmanager provides intelligent alert routing

  7. TSDB efficiently stores time series data

  8. Recording rules pre-compute expensive queries

Understanding the architecture helps you:

  • Design better instrumentation

  • Troubleshoot monitoring issues

  • Scale Prometheus effectively

  • Choose the right components for your needs

In the next article, we'll get hands-on with instrumenting TypeScript applications using the prom-client library.

