PromQL Basics: The Query Language That Changed How I Debug

The Query That Saved Production

It was a Friday afternoon (of course), and our API response times had mysteriously tripled. Users were complaining, but all my dashboards showed "green." Average response time: 120ms. Perfectly acceptable.

But I knew something was wrong. So I opened the Prometheus query interface and typed:

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

The result: 8.5 seconds.

99% of requests were fine, but the slowest 1% were taking more than 8 seconds. That's thousands of users per hour experiencing terrible performance, completely hidden by the average.

This single PromQL query revealed the problem in seconds. That's when I fell in love with PromQL.

What Is PromQL?

PromQL (Prometheus Query Language) is a functional query language for selecting and aggregating time series data in real-time.

Think of it like SQL for metrics:

  • SQL queries rows in tables

  • PromQL queries time series data

But unlike SQL, PromQL is designed for real-time monitoring and alerting. Queries typically execute in milliseconds, even when they touch millions of data points.

Your First PromQL Query

The simplest PromQL query is just a metric name:
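Using the request counter from my API:

```promql
http_requests_total
```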

This returns all time series for the http_requests_total metric. In my API, this might return:
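The output might look something like this (labels and values illustrative):

```
http_requests_total{method="GET", route="/api/users", status="200"}   1547
http_requests_total{method="POST", route="/api/users", status="201"}   892
http_requests_total{method="GET", route="/api/orders", status="200"}  2103
```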

Each line is a time series with its current value.

Label Selectors: Filtering Data

To filter time series, use label selectors:

Exact Match (=)
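Assuming a method label on the counter:

```promql
http_requests_total{method="GET"}
```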

Returns only GET requests.

Negative Match (!=)
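With a status label holding the HTTP status code:

```promql
http_requests_total{status!="200"}
```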

Returns all non-200 responses (errors).

Regex Match (=~)
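A regex against the same status label (PromQL regexes are fully anchored):

```promql
http_requests_total{status=~"5.."}
```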

Returns all 5xx errors (500, 501, 502, etc.).

Negative Regex (!~)
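Assuming a route label holding the request path:

```promql
http_requests_total{route!~"/metrics|/health.*"}
```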

Excludes metrics and health check endpoints.

Combining Selectors
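Putting the previous selectors together (same assumed label names):

```promql
http_requests_total{method="POST", status=~"[45]..", route!~"/health.*"}
```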

POST requests that resulted in 4xx or 5xx errors, excluding health checks.

Range Vectors vs Instant Vectors

This is crucial to understand:

Instant Vector

A set of time series, each containing a single sample at one point in time.
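A bare metric name is an instant vector:

```promql
http_requests_total
```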

Returns current values.

Range Vector

A set of time series containing samples over a time range.
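Appending a duration in square brackets turns it into a range vector:

```promql
http_requests_total[5m]
```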

Returns all samples from the last 5 minutes.

Time durations:

  • s - seconds

  • m - minutes

  • h - hours

  • d - days

  • w - weeks

  • y - years

Examples:

  • [30s] - last 30 seconds

  • [5m] - last 5 minutes

  • [1h] - last hour

  • [7d] - last week

The rate() Function: Your Best Friend

Counters only ever go up (resetting to zero when the process restarts), so their raw values aren't useful on their own. The rate() function calculates the per-second average rate of increase and handles counter resets automatically.

Basic Usage
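For example:

```promql
rate(http_requests_total[5m])
```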

This calculates requests per second over the last 5 minutes.

Real example from my API:
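Something along these lines, assuming a route label:

```promql
rate(http_requests_total{route="/api/users"}[5m])
```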

Result:
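An illustrative value (12.4 requests per second):

```
{method="GET", route="/api/users", status="200"}  12.4
```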

Why [5m]?

I use 5-minute ranges because:

  • Long enough to smooth out spikes

  • Short enough to catch issues quickly

  • Works well with 15-second scrape intervals

Rule of thumb: Use at least 4x your scrape interval.

Aggregation Operators

Combine multiple time series into one.

sum() - Total Across All Series
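For example:

```promql
sum(rate(http_requests_total[5m]))
```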

Total requests per second across all endpoints.

sum() by - Group By Labels
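Assuming the endpoint lives in a route label:

```promql
sum by (route) (rate(http_requests_total[5m]))
```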

Requests per second, grouped by endpoint:

sum() without - Exclude Labels
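For example:

```promql
sum without (status) (rate(http_requests_total[5m]))
```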

Sum across all status codes, keep other labels.

Other Aggregation Functions
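Beyond sum(), the same pattern works with several others; a quick sampler:

```promql
avg(rate(http_requests_total[5m]))      # average across series
max(rate(http_requests_total[5m]))      # highest value
min(rate(http_requests_total[5m]))      # lowest value
count(http_requests_total)              # number of series
topk(5, rate(http_requests_total[5m]))  # top 5 series by value
```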

Arithmetic Operators

Perform math on metrics.

Basic Arithmetic
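A couple of everyday conversions (process_resident_memory_bytes is one of the standard process metrics):

```promql
# convert bytes to megabytes
process_resident_memory_bytes / 1024 / 1024

# requests per minute instead of per second
rate(http_requests_total[5m]) * 60
```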

Real Example: Error Rate by Endpoint

This query I use every day:
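Assuming the status and route labels from earlier:

```promql
sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
  /
sum by (route) (rate(http_requests_total[5m]))
  * 100
```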

Returns error percentage per endpoint.

Comparison Operators

Filter based on values.
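For example:

```promql
# only series currently above 100 requests/second
rate(http_requests_total[5m]) > 100

# endpoints with any 5xx traffic at all
rate(http_requests_total{status=~"5.."}[5m]) > 0
```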

Essential Functions for TypeScript Apps

histogram_quantile() - Percentiles

The query that saved my Friday:
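The same shape as the opening query; wrapping the buckets in sum by (le) keeps it correct when several series share a bucket:

```promql
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```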

Understanding the result:

  • 95th percentile = 200ms β†’ 95% of requests complete in under 200ms

  • 99th percentile = 1.5s β†’ 99% complete in under 1.5s

increase() - Total Increase
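increase() gives the total growth of a counter over the window rather than a per-second rate:

```promql
increase(http_requests_total[1h])
```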

irate() - Instant Rate

Like rate(), but calculated from only the last two samples in the range, which makes it more sensitive to spikes.
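For example:

```promql
irate(http_requests_total[5m])
```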

delta() - Difference for Gauges
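A sketch, using the standard process memory gauge:

```promql
delta(process_resident_memory_bytes[1h])
```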

deriv() - Derivative

Rate of change over time.
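For example, on the memory gauge:

```promql
deriv(process_resident_memory_bytes[10m])
```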

Positive = growing (potential leak), Negative = shrinking.

predict_linear() - Trend Prediction
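For example:

```promql
# predicted memory usage 4 hours from now, based on the last hour's trend
predict_linear(process_resident_memory_bytes[1h], 4 * 3600)
```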

Time Functions
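time() returns the evaluation timestamp; combined with the standard process_start_time_seconds gauge, it gives uptime:

```promql
time() - process_start_time_seconds
```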

Time Shifting with offset
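offset shifts the evaluation window into the past, which is handy for week-over-week comparisons:

```promql
# request rate at this time one week ago
rate(http_requests_total[5m] offset 1w)
```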

Useful Queries I Use Daily

1. Request Rate by Endpoint
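Assuming a route label:

```promql
sum by (route) (rate(http_requests_total[5m]))
```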

2. Error Rate Percentage
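```promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
```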

3. Top 5 Slowest Endpoints
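By p95 latency, using the duration histogram from earlier:

```promql
topk(5, histogram_quantile(0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))))
```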

4. Database Query Performance
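Assuming a query-duration histogram named db_query_duration_seconds (the name will vary with your instrumentation):

```promql
histogram_quantile(0.95, sum by (le) (rate(db_query_duration_seconds_bucket[5m])))
```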

5. Memory Usage Percentage
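Using prom-client's default Node.js heap metrics:

```promql
nodejs_heap_size_used_bytes / nodejs_heap_size_total_bytes * 100
```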

6. Active Connections
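Assuming a gauge named active_connections (hypothetical; use whatever gauge your app exposes):

```promql
active_connections
```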

7. Request Success Rate
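```promql
sum(rate(http_requests_total{status=~"2.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
```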

8. Traffic Breakdown by Status Code
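```promql
sum by (status) (rate(http_requests_total[5m]))
```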

Debugging Queries

When a query doesn't work, break it down:

Step 1: Start Simple

Does the metric exist?
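Check with the bare name:

```promql
http_requests_total
```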

Step 2: Add Filters

Are the labels correct?
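For example (route value hypothetical):

```promql
http_requests_total{route="/api/users"}
```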

Step 3: Add Range

Is there data in the time range?
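```promql
http_requests_total{route="/api/users"}[5m]
```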

Step 4: Add Functions

Does the function work?
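```promql
rate(http_requests_total{route="/api/users"}[5m])
```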

Step 5: Aggregate
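Finally, group the result:

```promql
sum by (status) (rate(http_requests_total{route="/api/users"}[5m]))
```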

Common Mistakes I Made

1. Using rate() on Gauges

❌ Wrong:
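Calling rate() on a gauge (here the standard process memory metric) produces meaningless numbers:

```promql
rate(process_resident_memory_bytes[5m])
```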

βœ… Right:
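Gauges go up and down, so use delta() instead:

```promql
delta(process_resident_memory_bytes[5m])
```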

2. Forgetting the Range Vector

❌ Wrong:
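rate() needs a range vector; without one the query is an error:

```promql
rate(http_requests_total)
```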

βœ… Right:
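Add the range in square brackets:

```promql
rate(http_requests_total[5m])
```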

3. Short Time Ranges

❌ Wrong (with 15s scrape interval):
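A 30-second window holds only two samples, so the rate jumps around:

```promql
rate(http_requests_total[30s])
```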

βœ… Right:
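Give it at least 4Γ— the scrape interval:

```promql
rate(http_requests_total[5m])
```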

4. Aggregating Without by/without

❌ Wrong (loses important labels):
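A bare sum() collapses everything into a single number:

```promql
sum(rate(http_requests_total[5m]))
```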

βœ… Right:
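Keep the labels you care about (route and status here, assuming those names):

```promql
sum by (route, status) (rate(http_requests_total[5m]))
```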

Testing Queries in Prometheus UI

Access the Prometheus expression browser:
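On a default installation it lives at:

```
http://localhost:9090/graph
```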

Tips:

  1. Use the "Graph" tab to visualize trends

  2. Use the "Table" tab to see exact values

  3. Click the "Enable Query History" button

  4. Use Ctrl+Space for autocomplete

  5. Check "Use local time" for easier debugging

PromQL Cheat Sheet
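A condensed recap of the queries covered above:

```promql
http_requests_total                             # instant vector
http_requests_total{status=~"5.."}              # regex label match
rate(http_requests_total[5m])                   # per-second rate
increase(http_requests_total[1h])               # total increase
sum by (route) (rate(http_requests_total[5m]))  # aggregate by label
rate(http_requests_total[5m]) > 100             # filter by value
rate(http_requests_total[5m] offset 1w)         # time shift
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))  # p95 latency
predict_linear(process_resident_memory_bytes[1h], 4 * 3600)      # trend prediction
```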

Real Alert Queries

These are queries I actually use in production alerts:
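For instance (thresholds illustrative):

```promql
# overall error rate above 5%
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100 > 5

# p99 latency above 1 second
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
```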

Key Takeaways

  1. Start simple - Begin with metric name, add complexity gradually

  2. rate() is essential - Use it for all counters

  3. histogram_quantile() - Better than averages for latency

  4. by/without - Control label aggregation

  5. [5m] range - Good default for most queries

  6. Test in UI - Verify queries before using in dashboards/alerts

  7. Break down complex queries - Debug step by step

PromQL seems intimidating at first, but once you master these basics, you'll be able to answer any question about your application's behavior in seconds.

In the next article, we'll configure Prometheus to scrape your TypeScript applications and set up service discovery.


Previous: Instrumenting TypeScript Applications Next: Prometheus Configuration
