Alerting with Prometheus: Getting Woken Up Only When It Matters

The 3 AM Alert That Taught Me Everything

I'll never forget the week I first set up Prometheus alerting. I was so excited to have alerts that I created rules for everything. Memory above 50%? Alert. Single 500 error? Alert. Request latency above 100ms? Alert.

That first night, I received 47 alerts. Most were noise. My phone buzzed constantly. I couldn't sleep, not because of real problems, but because my alerts were poorly designed.

By morning, I was exhausted and had learned a critical lesson: good alerting isn't about detecting everything; it's about detecting what matters and ignoring what doesn't.

I spent the next day rewriting every alert rule. The result? I get 1-2 meaningful alerts per week instead of 50 meaningless ones per night.

Let me show you how to create alerts that help you sleep better, not worse.

The Philosophy of Good Alerting

Before writing any alert, ask:

  1. Is this actionable? Can I do something about it right now?

  2. Is this urgent? Does it need immediate attention, or can it wait until morning?

  3. Is this really a problem? Or just a symptom that will self-heal?

If you can't answer "yes" to all three, it shouldn't be a paging alert.

Alert Rules Basics

Alert rules are defined in YAML files and evaluated by Prometheus at regular intervals.

Basic Alert Structure
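Rules live in YAML files that prometheus.yml references under rule_files. Here's the general shape; the group name and the placeholder expression are purely illustrative:

```yaml
# rules/api-alerts.yml -- referenced from prometheus.yml under rule_files
groups:
  - name: typescript-api-alerts       # rules are organized into named groups
    rules:
      - alert: SomethingIsWrong       # alert: unique, descriptive name
        expr: some_metric > 100       # expr: any PromQL expression; the alert is active while it returns results
        for: 5m                       # for: how long the condition must hold before the alert fires
        labels:
          severity: warning           # labels: attached to the alert, used by Alertmanager for routing
        annotations:                  # annotations: human-readable context for whoever gets paged
          summary: "Short description of what is wrong"
          description: "Longer explanation, e.g. value is {{ $value }} on {{ $labels.instance }}"
```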

Key components:

  • alert: Name of the alert (must be unique)

  • expr: PromQL query that defines the condition

  • for: How long the condition must be true before firing

  • labels: Additional labels for routing and categorization

  • annotations: Human-readable information about the alert

My Production Alert Rules for TypeScript APIs

These are the actual alerts I use. They've been refined through real incidents.

1. High Error Rate
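A sketch of the rule (it goes inside a rule group's rules: list, like the skeleton above). The `http_requests_total` counter and its `route`/`status_code` labels are assumptions based on typical prom-client instrumentation; swap in your own names:

```yaml
- alert: HighErrorRate
  expr: |
    sum by (route) (rate(http_requests_total{status_code=~"5.."}[5m]))
      / sum by (route) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 5% on {{ $labels.route }}"
    description: "{{ $value | humanizePercentage }} of requests to {{ $labels.route }} are failing."
    runbook_url: "https://example.com/runbooks/high-error-rate"   # placeholder link
```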

Why this works:

  • Grouped by route to identify specific endpoints

  • 5% threshold catches real problems, ignores noise

  • for: 5m prevents alerts on temporary spikes

  • Includes runbook link for on-call engineer

2. API Latency Too High
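Something along these lines, assuming an `http_request_duration_seconds` histogram (the prom-client naming convention):

```yaml
- alert: HighAPILatency
  expr: |
    histogram_quantile(0.95,
      sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "p95 latency above 1s on {{ $labels.route }}"
    description: "95th percentile latency is {{ $value }}s over the last 5 minutes."
```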

Key decisions:

  • 95th percentile, not average (catches tail latency)

  • 1 second threshold (adjust based on your SLA)

  • for: 10m because brief latency spikes are normal

  • Warning severity (annoying but not critical)

3. Service Down
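This one is the simplest of the lot; the job label below is whatever you named the scrape job:

```yaml
- alert: ServiceDown
  expr: up{job="typescript-api"} == 0    # up is set by Prometheus itself for every scrape target
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} is down"
    description: "Prometheus has not scraped {{ $labels.job }} successfully for over a minute."
    runbook_url: "https://example.com/runbooks/service-down"   # placeholder link
```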

Why only 1 minute:

  • Service being down is immediately critical

  • No need to wait long before alerting

  • This is a true emergency

4. High Memory Usage
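A sketch using the heap gauges that prom-client's collectDefaultMetrics() exposes; if you care about container memory instead, compare process_resident_memory_bytes to your memory limit:

```yaml
- alert: HighMemoryUsage
  expr: nodejs_heap_size_used_bytes / nodejs_heap_size_total_bytes > 0.9
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Heap usage above 90% on {{ $labels.instance }}"
    description: "Heap has stayed above 90% for 15 minutes; look for leaks before the process runs out of memory."
```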

Why 90% and 15 minutes:

  • Node.js GC is aggressive; brief spikes are normal

  • 15 minutes confirms sustained high usage

  • Warning level: serious but not immediately critical

5. Database Connection Pool Exhausted
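A sketch; the `db_pool_connections_*` gauges are hypothetical names for metrics you would export from your own connection pool:

```yaml
- alert: DatabasePoolNearExhaustion
  expr: db_pool_connections_active / db_pool_connections_max > 0.9   # hypothetical custom gauges
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Connection pool almost exhausted on {{ $labels.instance }}"
    description: "{{ $value | humanizePercentage }} of pool connections are in use; queries will start queueing."
```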

Why this matters:

  • When connections run out, app can't query database

  • Early warning before complete failure

  • Critical because it directly impacts users

6. Slow Database Queries
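A sketch assuming you record query times in a histogram; the `db_query_duration_seconds` metric name and the 500ms threshold are assumptions to tune:

```yaml
- alert: SlowDatabaseQueries
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(db_query_duration_seconds_bucket[5m]))
    ) > 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "p95 database query time above 500ms"
    description: "Queries have been slow for 10 minutes; check indexes, locks, and recent query changes."
```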

7. Disk Space Running Out
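A sketch assuming node_exporter is scraped alongside the app; the 10% threshold is a starting point:

```yaml
- alert: DiskSpaceLow
  expr: |
    node_filesystem_avail_bytes{mountpoint="/"}
      / node_filesystem_size_bytes{mountpoint="/"} < 0.10
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Less than 10% disk space left on {{ $labels.instance }}"
    description: "Only {{ $value | humanizePercentage }} of the root filesystem is free."
```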

8. Request Rate Anomaly (Comparison)
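A sketch; the 50% drop factor and the 10-minute window are starting points to tune against your own traffic:

```yaml
- alert: RequestRateDropped
  expr: |
    sum(rate(http_requests_total[5m]))
      < 0.5 * sum(rate(http_requests_total[5m] offset 1h))
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Request rate is less than half of what it was an hour ago"
    description: "Current rate is {{ $value }} req/s; check DNS, load balancers, and upstream routing."
```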

Why compare to 1 hour ago:

  • Detects sudden traffic drops

  • Could indicate users can't reach the service

  • Self-adjusts to traffic patterns

Alertmanager Configuration

Alertmanager receives alerts from Prometheus and routes them to the right people through the right channels.

Basic alertmanager.yml
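A minimal sketch of the idea: warnings land in Slack, critical alerts page through PagerDuty. The receiver names, channel, and keys are placeholders:

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: slack-warnings            # default receiver
  group_by: [alertname, severity]     # batch related alerts into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-critical    # pages the on-call engineer
    - matchers:
        - severity = "warning"
      receiver: slack-warnings

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "<your-pagerduty-integration-key>"
  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder webhook
        channel: "#alerts"
        send_resolved: true
```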

Alert Routing Strategy

Here's how I route alerts:


Severity Guidelines:

Critical: Wake me up at 3 AM

  • Service completely down

  • Data loss occurring

  • Security breach

  • User-facing errors >5%

Warning: Tell me during work hours

  • Performance degradation

  • Resource usage high

  • Non-critical errors

  • Predictive alerts (disk will be full in 4 hours)

Info: Log it, I'll check when convenient

  • Deployment notifications

  • Configuration changes

  • Informational metrics

Silencing and Inhibition

Silencing During Maintenance

During planned maintenance or a deploy, create a silence (in the Alertmanager UI or with amtool) that matches the affected alerts for a fixed window, so known-noisy alerts don't page anyone while you work.

Inhibition Rules

Prevent redundant alerts:
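A sketch of an inhibit_rules block for alertmanager.yml: while a critical alert is firing for an instance, warnings about the same instance stay quiet:

```yaml
inhibit_rules:
  - source_matchers:
      - severity = "critical"     # if a critical alert is active...
    target_matchers:
      - severity = "warning"      # ...suppress warnings...
    equal: [instance]             # ...but only when both alerts are about the same instance
```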

Recording Rules for Alerts

For complex or expensive queries, use recording rules:
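A sketch that pre-computes the error ratio from earlier; the colon-separated series name is illustrative, and the alert then just compares the recorded series against a threshold:

```yaml
groups:
  - name: api-recording-rules
    rules:
      # Pre-computed error ratio, reusable by alerts and Grafana panels alike
      - record: job:http_requests:error_ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))

      # The alert rule becomes trivial to read
      - alert: HighErrorRate
        expr: job:http_requests:error_ratio_rate5m > 0.05
        for: 5m
        labels:
          severity: critical
```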

Benefits:

  • Faster alert evaluation

  • Consistent calculations across alerts and dashboards

  • Easier to understand alert rules

Testing Alerts

Always test before deploying!

1. Validate Syntax

Run promtool check rules against your rule files (promtool ships with Prometheus); it catches YAML mistakes and invalid PromQL before you reload.

2. Test Alert Query

Paste the alert's expr into the Prometheus expression browser and confirm it returns the series you expect, both when things are healthy and when they aren't.

3. Trigger Test Alert

Create a temporary rule with a condition that will fire:
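For example, vector(1) always returns a value, so this fires about a minute after you reload the rules (delete it afterwards):

```yaml
- alert: TestAlert
  expr: vector(1)          # always true, so the alert moves from pending to firing
  for: 1m
  labels:
    severity: warning      # keep it at warning so it goes to Slack, not the pager
  annotations:
    summary: "Test alert, safe to ignore"
```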

4. Check Alertmanager

Confirm in the Alertmanager UI that the test alert arrived, was grouped the way you expected, and reached the right receiver.

Common Alerting Mistakes I Fixed

Mistake 1: Alert Fatigue

❌ Too many alerts:
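Roughly the kind of rules I described at the start, one per blip (metric names as in the sketches above):

```yaml
- alert: SingleServerError
  expr: increase(http_requests_total{status_code=~"5.."}[1m]) > 0   # any single 500 pages
  for: 0m
- alert: MemoryAbove50Percent
  expr: nodejs_heap_size_used_bytes / nodejs_heap_size_total_bytes > 0.5
  for: 0m
```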

✅ Better:
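One rule per user-visible problem, with a real threshold and a for: duration:

```yaml
- alert: HighErrorRate
  expr: |
    sum by (route) (rate(http_requests_total{status_code=~"5.."}[5m]))
      / sum by (route) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
```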

Mistake 2: Alerts Without Action

❌ Vague:
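An annotation that tells the on-call engineer nothing:

```yaml
annotations:
  summary: "Something is wrong with the API"
```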

✅ Actionable:
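An annotation that says what broke, where to look, and what to do (the URLs are placeholders):

```yaml
annotations:
  summary: "Error rate above 5% on {{ $labels.route }}"
  description: "Check recent deploys first, then database health."
  runbook_url: "https://example.com/runbooks/high-error-rate"
  dashboard: "https://grafana.example.com/d/api-overview"
```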

Mistake 3: Wrong for: Duration

❌ Too short:
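A 30-second window fires on every brief spike:

```yaml
- alert: HighAPILatency
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
  for: 30s    # pages on momentary blips
```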

✅ Right:
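The same rule with a duration that only catches sustained problems:

```yaml
- alert: HighAPILatency
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
  for: 10m    # brief latency spikes are normal; sustained ones are not
```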

Mistake 4: Alert on Predictions

❌ Unnecessary complexity:
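For example, a predict_linear-based rule that tries to forecast when the disk fills up:

```yaml
- alert: DiskWillFillIn4Hours
  expr: |
    predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning
```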

✅ Simpler and clearer:
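A plain threshold that says the same thing without the forecasting math:

```yaml
- alert: DiskSpaceLow
  expr: |
    node_filesystem_avail_bytes{mountpoint="/"}
      / node_filesystem_size_bytes{mountpoint="/"} < 0.10
  for: 15m
  labels:
    severity: warning
```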

My Alert Checklist

Before adding any alert, I verify:

  • It's actionable: someone can do something about it right now.

  • Its severity matches its urgency: critical pages, warning waits for work hours.

  • The expression has been tested against real data.

  • The for: duration is long enough to ignore temporary spikes.

  • The annotations include a summary, a runbook link, and a dashboard.

  • It routes to the right receiver in Alertmanager.

Complete Production Setup
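At a minimum, the pieces tie together in prometheus.yml like this (file names and the Alertmanager target are placeholders for your own layout):

```yaml
# prometheus.yml
rule_files:
  - "rules/recording-rules.yml"
  - "rules/api-alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
```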

Key Takeaways

  1. Less is more - Fewer, meaningful alerts beat many noisy ones

  2. Use a for: duration - Prevent alert fatigue from temporary spikes

  3. Clear annotations - Include runbook links and dashboards

  4. Route by severity - Critical alerts page, warnings go to Slack

  5. Test everything - Validate syntax and test queries

  6. Use recording rules - Pre-calculate complex metrics

  7. Silence during maintenance - Don't page yourself during deploys

  8. Review regularly - Tune thresholds based on real incidents

Good alerting isn't about catching every problem; it's about catching the right problems at the right time and providing the information needed to fix them quickly.

In the next article, we'll visualize all this data with Grafana dashboards.

