PromQL Basics: The Query Language That Changed How I Debug

The Query That Saved Production

It was a Friday afternoon (of course), and our API response times had mysteriously tripled. Users were complaining, but all my dashboards showed "green." Average response time: 120ms. Perfectly acceptable.

But I knew something was wrong. So I opened the Prometheus query interface and typed:

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

The result: 8.5 seconds.

99% of requests were fine, but the slowest 1% were taking more than 8 seconds. That's thousands of users per hour experiencing terrible performance, completely hidden by the average.

This single PromQL query revealed the problem in seconds. That's when I fell in love with PromQL.

What Is PromQL?

PromQL (Prometheus Query Language) is a functional query language for selecting and aggregating time series data in real-time.

Think of it like SQL for metrics:

  • SQL queries rows in tables

  • PromQL queries time series data

But unlike SQL, PromQL is designed for real-time monitoring and alerting. Queries typically execute in milliseconds, even when they touch millions of data points.

Your First PromQL Query

The simplest PromQL query is just a metric name:
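Using the request counter from my API:

```promql
http_requests_total
```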

This returns all time series for the http_requests_total metric. In my API, this might return:
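The output might look something like this (labels and values illustrative):

```
http_requests_total{method="GET", route="/api/users", status="200"}   1547
http_requests_total{method="POST", route="/api/users", status="201"}   892
http_requests_total{method="GET", route="/api/orders", status="200"}  2103
```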

Each line is a time series with its current value.

Label Selectors: Filtering Data

To filter time series, use label selectors:

Exact Match (=)
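Assuming a method label on the counter:

```promql
http_requests_total{method="GET"}
```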

Returns only GET requests.

Negative Match (!=)
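With a status label holding the HTTP status code:

```promql
http_requests_total{status!="200"}
```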

Returns all non-200 responses (errors).

Regex Match (=~)
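A regex against the same status label (PromQL regexes are fully anchored):

```promql
http_requests_total{status=~"5.."}
```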

Returns all 5xx errors (500, 501, 502, etc.).

Negative Regex (!~)
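Assuming a route label holding the request path:

```promql
http_requests_total{route!~"/metrics|/health.*"}
```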

Excludes metrics and health check endpoints.

Combining Selectors
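Putting the previous selectors together (same assumed label names):

```promql
http_requests_total{method="POST", status=~"[45]..", route!~"/health.*"}
```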

POST requests that resulted in 4xx or 5xx errors, excluding health checks.

Range Vectors vs Instant Vectors

This is crucial to understand:

Instant Vector

A set of time series, each containing a single sample at one point in time.
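A bare metric name is an instant vector:

```promql
http_requests_total
```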

Returns current values.

Range Vector

A set of time series containing samples over a time range.
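Appending a duration in square brackets turns it into a range vector:

```promql
http_requests_total[5m]
```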

Returns all samples from the last 5 minutes.

Time durations:

  • s - seconds

  • m - minutes

  • h - hours

  • d - days

  • w - weeks

  • y - years

Examples:

  • [30s] - last 30 seconds

  • [5m] - last 5 minutes

  • [1h] - last hour

  • [7d] - last week

The rate() Function: Your Best Friend

Counters only ever go up (resetting to zero when the process restarts), so their raw values aren't useful on their own. The rate() function calculates the per-second average rate of increase and handles counter resets automatically.

Basic Usage
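For example:

```promql
rate(http_requests_total[5m])
```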

This calculates requests per second over the last 5 minutes.

Real example from my API:
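Something along these lines, assuming a route label:

```promql
rate(http_requests_total{route="/api/users"}[5m])
```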

Result:
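An illustrative value (12.4 requests per second):

```
{method="GET", route="/api/users", status="200"}  12.4
```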

Why [5m]?

I use 5-minute ranges because:

  • Long enough to smooth out spikes

  • Short enough to catch issues quickly

  • Works well with 15-second scrape intervals

Rule of thumb: Use at least 4x your scrape interval.

Aggregation Operators

Combine multiple time series into one.

sum() - Total Across All Series
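For example:

```promql
sum(rate(http_requests_total[5m]))
```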

Total requests per second across all endpoints.

sum() by - Group By Labels
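Assuming the endpoint lives in a route label:

```promql
sum by (route) (rate(http_requests_total[5m]))
```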

Requests per second, grouped by endpoint:

sum() without - Exclude Labels
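For example:

```promql
sum without (status) (rate(http_requests_total[5m]))
```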

Sum across all status codes, keep other labels.

Other Aggregation Functions
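Beyond sum(), the same pattern works with several others; a quick sampler:

```promql
avg(rate(http_requests_total[5m]))      # average across series
max(rate(http_requests_total[5m]))      # highest value
min(rate(http_requests_total[5m]))      # lowest value
count(http_requests_total)              # number of series
topk(5, rate(http_requests_total[5m]))  # top 5 series by value
```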

Arithmetic Operators

Perform math on metrics.

Basic Arithmetic
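A couple of everyday conversions (process_resident_memory_bytes is one of the standard process metrics):

```promql
# convert bytes to megabytes
process_resident_memory_bytes / 1024 / 1024

# requests per minute instead of per second
rate(http_requests_total[5m]) * 60
```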

Real Example: Error Rate by Endpoint

This query I use every day:
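Assuming the status and route labels from earlier:

```promql
sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
  /
sum by (route) (rate(http_requests_total[5m]))
  * 100
```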

Returns error percentage per endpoint.

Comparison Operators

Filter based on values.
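For example:

```promql
# only series currently above 100 requests/second
rate(http_requests_total[5m]) > 100

# endpoints with any 5xx traffic at all
rate(http_requests_total{status=~"5.."}[5m]) > 0
```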

Essential Functions for TypeScript Apps

histogram_quantile() - Percentiles

The query that saved my Friday:
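The same shape as the opening query; wrapping the buckets in sum by (le) keeps it correct when several series share a bucket:

```promql
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```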

Understanding the result:

  • 95th percentile = 200ms β†’ 95% of requests complete in under 200ms

  • 99th percentile = 1.5s β†’ 99% complete in under 1.5s

increase() - Total Increase
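increase() gives the total growth of a counter over the window rather than a per-second rate:

```promql
increase(http_requests_total[1h])
```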

irate() - Instant Rate

Like rate(), but calculated from only the last two samples in the range, which makes it more sensitive to spikes.
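For example:

```promql
irate(http_requests_total[5m])
```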

delta() - Difference for Gauges
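A sketch, using the standard process memory gauge:

```promql
delta(process_resident_memory_bytes[1h])
```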

deriv() - Derivative

Rate of change over time.
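For example, on the memory gauge:

```promql
deriv(process_resident_memory_bytes[10m])
```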

Positive = growing (potential leak), Negative = shrinking.

predict_linear() - Trend Prediction
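For example:

```promql
# predicted memory usage 4 hours from now, based on the last hour's trend
predict_linear(process_resident_memory_bytes[1h], 4 * 3600)
```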

Time Functions
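time() returns the evaluation timestamp; combined with the standard process_start_time_seconds gauge, it gives uptime:

```promql
time() - process_start_time_seconds
```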

Time Shifting with offset
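offset shifts the evaluation window into the past, which is handy for week-over-week comparisons:

```promql
# request rate at this time one week ago
rate(http_requests_total[5m] offset 1w)
```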

Useful Queries I Use Daily

1. Request Rate by Endpoint
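Assuming a route label:

```promql
sum by (route) (rate(http_requests_total[5m]))
```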

2. Error Rate Percentage
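```promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
```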

3. Top 5 Slowest Endpoints
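By p95 latency, using the duration histogram from earlier:

```promql
topk(5, histogram_quantile(0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))))
```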

4. Database Query Performance
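Assuming a query-duration histogram named db_query_duration_seconds (the name will vary with your instrumentation):

```promql
histogram_quantile(0.95, sum by (le) (rate(db_query_duration_seconds_bucket[5m])))
```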

5. Memory Usage Percentage
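Using prom-client's default Node.js heap metrics:

```promql
nodejs_heap_size_used_bytes / nodejs_heap_size_total_bytes * 100
```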

6. Active Connections
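Assuming a gauge named active_connections (hypothetical; use whatever gauge your app exposes):

```promql
active_connections
```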

7. Request Success Rate
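```promql
sum(rate(http_requests_total{status=~"2.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
```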

8. Traffic Breakdown by Status Code
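```promql
sum by (status) (rate(http_requests_total[5m]))
```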

Debugging Queries

When a query doesn't work, break it down:

Step 1: Start Simple

Does the metric exist?
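Check with the bare name:

```promql
http_requests_total
```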

Step 2: Add Filters

Are the labels correct?
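For example (route value hypothetical):

```promql
http_requests_total{route="/api/users"}
```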

Step 3: Add Range

Is there data in the time range?
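```promql
http_requests_total{route="/api/users"}[5m]
```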

Step 4: Add Functions

Does the function work?
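```promql
rate(http_requests_total{route="/api/users"}[5m])
```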

Step 5: Aggregate
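Finally, group the result:

```promql
sum by (status) (rate(http_requests_total{route="/api/users"}[5m]))
```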

Common Mistakes I Made

1. Using rate() on Gauges

❌ Wrong:
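Calling rate() on a gauge (here the standard process memory metric) produces meaningless numbers:

```promql
rate(process_resident_memory_bytes[5m])
```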

βœ… Right:
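Gauges go up and down, so use delta() instead:

```promql
delta(process_resident_memory_bytes[5m])
```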

2. Forgetting the Range Vector

❌ Wrong:
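rate() needs a range vector; without one the query is an error:

```promql
rate(http_requests_total)
```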

βœ… Right:
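Add the range in square brackets:

```promql
rate(http_requests_total[5m])
```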

3. Short Time Ranges

❌ Wrong (with 15s scrape interval):
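A 30-second window holds only two samples, so the rate jumps around:

```promql
rate(http_requests_total[30s])
```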

βœ… Right:
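Give it at least 4Γ— the scrape interval:

```promql
rate(http_requests_total[5m])
```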

4. Aggregating Without by/without

❌ Wrong (loses important labels):
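A bare sum() collapses everything into a single number:

```promql
sum(rate(http_requests_total[5m]))
```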

βœ… Right:
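Keep the labels you care about (route and status here, assuming those names):

```promql
sum by (route, status) (rate(http_requests_total[5m]))
```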

Testing Queries in Prometheus UI

Access the Prometheus expression browser:
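On a default installation it lives at:

```
http://localhost:9090/graph
```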

Tips:

  1. Use the "Graph" tab to visualize trends

  2. Use the "Table" tab to see exact values

  3. Click the "Enable Query History" button

  4. Use Ctrl+Space for autocomplete

  5. Check "Use local time" for easier debugging

PromQL Cheat Sheet
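A condensed recap of the queries covered above:

```promql
http_requests_total                             # instant vector
http_requests_total{status=~"5.."}              # regex label match
rate(http_requests_total[5m])                   # per-second rate
increase(http_requests_total[1h])               # total increase
sum by (route) (rate(http_requests_total[5m]))  # aggregate by label
rate(http_requests_total[5m]) > 100             # filter by value
rate(http_requests_total[5m] offset 1w)         # time shift
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))  # p95 latency
predict_linear(process_resident_memory_bytes[1h], 4 * 3600)      # trend prediction
```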

Real Alert Queries

These are queries I actually use in production alerts:
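For instance (thresholds illustrative):

```promql
# overall error rate above 5%
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100 > 5

# p99 latency above 1 second
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
```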

Key Takeaways

  1. Start simple - Begin with metric name, add complexity gradually

  2. rate() is essential - Use it for all counters

  3. histogram_quantile() - Better than averages for latency

  4. by/without - Control label aggregation

  5. [5m] range - Good default for most queries

  6. Test in UI - Verify queries before using in dashboards/alerts

  7. Break down complex queries - Debug step by step

PromQL seems intimidating at first, but once you master these basics, you'll be able to answer any question about your application's behavior in seconds.

In the next article, we'll configure Prometheus to scrape your TypeScript applications and set up service discovery.


Previous: Instrumenting TypeScript Applications Next: Prometheus Configuration
