Alerting with Prometheus: Getting Woken Up Only When It Matters

The 3 AM Alert That Taught Me Everything

I'll never forget the week I first set up Prometheus alerting. I was so excited to have alerts that I created rules for everything. Memory above 50%? Alert. Single 500 error? Alert. Request latency above 100ms? Alert.

That first night, I received 47 alerts. Most were noise. My phone buzzed constantly. I couldn't sleep, not because of real problems, but because my alerts were poorly designed.

By morning, I was exhausted and had learned a critical lesson: good alerting isn't about detecting everything; it's about detecting what matters and ignoring what doesn't.

I spent the next day rewriting every alert rule. The result? I get 1-2 meaningful alerts per week instead of 50 meaningless ones per night.

Let me show you how to create alerts that help you sleep better, not worse.

The Philosophy of Good Alerting

Before writing any alert, ask:

  1. Is this actionable? Can I do something about it right now?

  2. Is this urgent? Does it need immediate attention, or can it wait until morning?

  3. Is this really a problem? Or just a symptom that will self-heal?

If you can't answer "yes" to all three, it shouldn't be a paging alert.

Alert Rules Basics

Alert rules are defined in YAML files and evaluated by Prometheus at regular intervals.

Basic Alert Structure
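Rules live in YAML files that prometheus.yml references under rule_files. Here's the general shape; the group name and the placeholder expression are purely illustrative:

```yaml
# rules/api-alerts.yml -- referenced from prometheus.yml under rule_files
groups:
  - name: typescript-api-alerts       # rules are organized into named groups
    rules:
      - alert: SomethingIsWrong       # alert: unique, descriptive name
        expr: some_metric > 100       # expr: any PromQL expression; the alert is active while it returns results
        for: 5m                       # for: how long the condition must hold before the alert fires
        labels:
          severity: warning           # labels: attached to the alert, used by Alertmanager for routing
        annotations:                  # annotations: human-readable context for whoever gets paged
          summary: "Short description of what is wrong"
          description: "Longer explanation, e.g. value is {{ $value }} on {{ $labels.instance }}"
```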

Key components:

  • alert: Name of the alert (must be unique)

  • expr: PromQL query that defines the condition

  • for: How long the condition must be true before firing

  • labels: Additional labels for routing and categorization

  • annotations: Human-readable information about the alert

My Production Alert Rules for TypeScript APIs

These are the actual alerts I use. They've been refined through real incidents.

1. High Error Rate
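A sketch of the rule (it goes inside a rule group's rules: list, like the skeleton above). The `http_requests_total` counter and its `route`/`status_code` labels are assumptions based on typical prom-client instrumentation; swap in your own names:

```yaml
- alert: HighErrorRate
  expr: |
    sum by (route) (rate(http_requests_total{status_code=~"5.."}[5m]))
      / sum by (route) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 5% on {{ $labels.route }}"
    description: "{{ $value | humanizePercentage }} of requests to {{ $labels.route }} are failing."
    runbook_url: "https://example.com/runbooks/high-error-rate"   # placeholder link
```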

Why this works:

  • Grouped by route to identify specific endpoints

  • 5% threshold catches real problems, ignores noise

  • for: 5m prevents alerts on temporary spikes

  • Includes runbook link for on-call engineer

2. API Latency Too High
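Something along these lines, assuming an `http_request_duration_seconds` histogram (the prom-client naming convention):

```yaml
- alert: HighAPILatency
  expr: |
    histogram_quantile(0.95,
      sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "p95 latency above 1s on {{ $labels.route }}"
    description: "95th percentile latency is {{ $value }}s over the last 5 minutes."
```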

Key decisions:

  • 95th percentile, not average (catches tail latency)

  • 1 second threshold (adjust based on your SLA)

  • for: 10m because brief latency spikes are normal

  • Warning severity (annoying but not critical)

3. Service Down
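This one is the simplest of the lot; the job label below is whatever you named the scrape job:

```yaml
- alert: ServiceDown
  expr: up{job="typescript-api"} == 0    # up is set by Prometheus itself for every scrape target
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} is down"
    description: "Prometheus has not scraped {{ $labels.job }} successfully for over a minute."
    runbook_url: "https://example.com/runbooks/service-down"   # placeholder link
```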

Why only 1 minute:

  • Service being down is immediately critical

  • No need to wait long before alerting

  • This is a true emergency

4. High Memory Usage
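A sketch using the heap gauges that prom-client's collectDefaultMetrics() exposes; if you care about container memory instead, compare process_resident_memory_bytes to your memory limit:

```yaml
- alert: HighMemoryUsage
  expr: nodejs_heap_size_used_bytes / nodejs_heap_size_total_bytes > 0.9
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Heap usage above 90% on {{ $labels.instance }}"
    description: "Heap has stayed above 90% for 15 minutes; look for leaks before the process runs out of memory."
```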

Why 90% and 15 minutes:

  • Node.js GC is aggressive; brief spikes are normal

  • 15 minutes confirms sustained high usage

  • Warning level: serious but not immediately critical

5. Database Connection Pool Exhausted
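A sketch; the `db_pool_connections_*` gauges are hypothetical names for metrics you would export from your own connection pool:

```yaml
- alert: DatabasePoolNearExhaustion
  expr: db_pool_connections_active / db_pool_connections_max > 0.9   # hypothetical custom gauges
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Connection pool almost exhausted on {{ $labels.instance }}"
    description: "{{ $value | humanizePercentage }} of pool connections are in use; queries will start queueing."
```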

Why this matters:

  • When connections run out, app can't query database

  • Early warning before complete failure

  • Critical because it directly impacts users

6. Slow Database Queries
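A sketch assuming you record query times in a histogram; the `db_query_duration_seconds` metric name and the 500ms threshold are assumptions to tune:

```yaml
- alert: SlowDatabaseQueries
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(db_query_duration_seconds_bucket[5m]))
    ) > 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "p95 database query time above 500ms"
    description: "Queries have been slow for 10 minutes; check indexes, locks, and recent query changes."
```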

7. Disk Space Running Out
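A sketch assuming node_exporter is scraped alongside the app; the 10% threshold is a starting point:

```yaml
- alert: DiskSpaceLow
  expr: |
    node_filesystem_avail_bytes{mountpoint="/"}
      / node_filesystem_size_bytes{mountpoint="/"} < 0.10
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Less than 10% disk space left on {{ $labels.instance }}"
    description: "Only {{ $value | humanizePercentage }} of the root filesystem is free."
```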

8. Request Rate Anomaly (Comparison)
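A sketch; the 50% drop factor and the 10-minute window are starting points to tune against your own traffic:

```yaml
- alert: RequestRateDropped
  expr: |
    sum(rate(http_requests_total[5m]))
      < 0.5 * sum(rate(http_requests_total[5m] offset 1h))
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Request rate is less than half of what it was an hour ago"
    description: "Current rate is {{ $value }} req/s; check DNS, load balancers, and upstream routing."
```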

Why compare to 1 hour ago:

  • Detects sudden traffic drops

  • Could indicate users can't reach the service

  • Self-adjusts to traffic patterns

Alertmanager Configuration

Alertmanager receives alerts from Prometheus and routes them to the right people through the right channels.

Basic alertmanager.yml
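A minimal sketch of the idea: warnings land in Slack, critical alerts page through PagerDuty. The receiver names, channel, and keys are placeholders:

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: slack-warnings            # default receiver
  group_by: [alertname, severity]     # batch related alerts into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-critical    # pages the on-call engineer
    - matchers:
        - severity = "warning"
      receiver: slack-warnings

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "<your-pagerduty-integration-key>"
  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder webhook
        channel: "#alerts"
        send_resolved: true
```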

Alert Routing Strategy

Here's how I route alerts:


Severity Guidelines:

Critical: Wake me up at 3 AM

  • Service completely down

  • Data loss occurring

  • Security breach

  • User-facing errors >5%

Warning: Tell me during work hours

  • Performance degradation

  • Resource usage high

  • Non-critical errors

  • Predictive alerts (disk will be full in 4 hours)

Info: Log it, I'll check when convenient

  • Deployment notifications

  • Configuration changes

  • Informational metrics

Silencing and Inhibition

Silencing During Maintenance

During planned maintenance or a deploy, create a silence (in the Alertmanager UI or with amtool) that matches the affected alerts for a fixed window, so known-noisy alerts don't page anyone while you work.

Inhibition Rules

Prevent redundant alerts:
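A sketch of an inhibit_rules block for alertmanager.yml: while a critical alert is firing for an instance, warnings about the same instance stay quiet:

```yaml
inhibit_rules:
  - source_matchers:
      - severity = "critical"     # if a critical alert is active...
    target_matchers:
      - severity = "warning"      # ...suppress warnings...
    equal: [instance]             # ...but only when both alerts are about the same instance
```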

Recording Rules for Alerts

For complex or expensive queries, use recording rules:
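A sketch that pre-computes the error ratio from earlier; the colon-separated series name is illustrative, and the alert then just compares the recorded series against a threshold:

```yaml
groups:
  - name: api-recording-rules
    rules:
      # Pre-computed error ratio, reusable by alerts and Grafana panels alike
      - record: job:http_requests:error_ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))

      # The alert rule becomes trivial to read
      - alert: HighErrorRate
        expr: job:http_requests:error_ratio_rate5m > 0.05
        for: 5m
        labels:
          severity: critical
```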

Benefits:

  • Faster alert evaluation

  • Consistent calculations across alerts and dashboards

  • Easier to understand alert rules

Testing Alerts

Always test before deploying!

1. Validate Syntax

Run promtool check rules against your rule files (promtool ships with Prometheus); it catches YAML mistakes and invalid PromQL before you reload.

2. Test Alert Query

Paste the alert's expr into the Prometheus expression browser and confirm it returns the series you expect, both when things are healthy and when they aren't.

3. Trigger Test Alert

Create a temporary rule with a condition that will fire:
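For example, vector(1) always returns a value, so this fires about a minute after you reload the rules (delete it afterwards):

```yaml
- alert: TestAlert
  expr: vector(1)          # always true, so the alert moves from pending to firing
  for: 1m
  labels:
    severity: warning      # keep it at warning so it goes to Slack, not the pager
  annotations:
    summary: "Test alert, safe to ignore"
```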

4. Check Alertmanager

Confirm in the Alertmanager UI that the test alert arrived, was grouped the way you expected, and reached the right receiver.

Common Alerting Mistakes I Fixed

Mistake 1: Alert Fatigue

❌ Too many alerts:
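Roughly the kind of rules I described at the start, one per blip (metric names as in the sketches above):

```yaml
- alert: SingleServerError
  expr: increase(http_requests_total{status_code=~"5.."}[1m]) > 0   # any single 500 pages
  for: 0m
- alert: MemoryAbove50Percent
  expr: nodejs_heap_size_used_bytes / nodejs_heap_size_total_bytes > 0.5
  for: 0m
```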

✅ Better:
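One rule per user-visible problem, with a real threshold and a for: duration:

```yaml
- alert: HighErrorRate
  expr: |
    sum by (route) (rate(http_requests_total{status_code=~"5.."}[5m]))
      / sum by (route) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
```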

Mistake 2: Alerts Without Action

❌ Vague:
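An annotation that tells the on-call engineer nothing:

```yaml
annotations:
  summary: "Something is wrong with the API"
```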

✅ Actionable:
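An annotation that says what broke, where to look, and what to do (the URLs are placeholders):

```yaml
annotations:
  summary: "Error rate above 5% on {{ $labels.route }}"
  description: "Check recent deploys first, then database health."
  runbook_url: "https://example.com/runbooks/high-error-rate"
  dashboard: "https://grafana.example.com/d/api-overview"
```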

Mistake 3: Wrong for: Duration

❌ Too short:
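A 30-second window fires on every brief spike:

```yaml
- alert: HighAPILatency
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
  for: 30s    # pages on momentary blips
```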

✅ Right:
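The same rule with a duration that only catches sustained problems:

```yaml
- alert: HighAPILatency
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
  for: 10m    # brief latency spikes are normal; sustained ones are not
```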

Mistake 4: Alert on Predictions

❌ Unnecessary complexity:
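For example, a predict_linear-based rule that tries to forecast when the disk fills up:

```yaml
- alert: DiskWillFillIn4Hours
  expr: |
    predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning
```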

✅ Simpler and clearer:
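A plain threshold that says the same thing without the forecasting math:

```yaml
- alert: DiskSpaceLow
  expr: |
    node_filesystem_avail_bytes{mountpoint="/"}
      / node_filesystem_size_bytes{mountpoint="/"} < 0.10
  for: 15m
  labels:
    severity: warning
```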

My Alert Checklist

Before adding any alert, I verify:

  • It's actionable: someone can do something about it right now.

  • Its severity matches its urgency: critical pages, warning waits for work hours.

  • The expression has been tested against real data.

  • The for: duration is long enough to ignore temporary spikes.

  • The annotations include a summary, a runbook link, and a dashboard.

  • It routes to the right receiver in Alertmanager.

Complete Production Setup
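At a minimum, the pieces tie together in prometheus.yml like this (file names and the Alertmanager target are placeholders for your own layout):

```yaml
# prometheus.yml
rule_files:
  - "rules/recording-rules.yml"
  - "rules/api-alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
```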

Key Takeaways

  1. Less is more - Fewer, meaningful alerts beat many noisy ones

  2. Use a for: duration - Prevent alert fatigue from temporary spikes

  3. Clear annotations - Include runbook links and dashboards

  4. Route by severity - Critical alerts page, warnings go to Slack

  5. Test everything - Validate syntax and test queries

  6. Use recording rules - Pre-calculate complex metrics

  7. Silence during maintenance - Don't page yourself during deploys

  8. Review regularly - Tune thresholds based on real incidents

Good alerting isn't about catching every problem; it's about catching the right problems at the right time and providing the information needed to fix them quickly.

In the next article, we'll visualize all this data with Grafana dashboards.

