Alerting with Prometheus: Getting Woken Up Only When It Matters
The 3 AM Alert That Taught Me Everything
The Philosophy of Good Alerting
Alert Rules Basics
Basic Alert Structure
My Production Alert Rules for TypeScript APIs
1. High Error Rate
2. API Latency Too High
3. Service Down
4. High Memory Usage
5. Database Connection Pool Exhausted
6. Slow Database Queries
7. Disk Space Running Out
8. Request Rate Anomaly (Comparison)
Alertmanager Configuration
Basic alertmanager.yml
Alert Routing Strategy
Silencing and Inhibition
Silencing During Maintenance
Inhibition Rules
Recording Rules for Alerts
Testing Alerts
1. Validate Syntax
2. Test Alert Query
3. Trigger Test Alert
4. Check Alertmanager
Common Alerting Mistakes I Fixed
Mistake 1: Alert Fatigue
Mistake 2: Alerts Without Action
Mistake 3: Wrong for Duration
Mistake 4: Alert on Predictions
My Alert Checklist
Complete Production Setup
Key Takeaways