Visualization with Grafana: Making Metrics Beautiful and Useful

The Dashboard That Changed Everything

For weeks after implementing Prometheus, I queried metrics through the Prometheus UI. It worked, but it was clunky. I had to remember complex PromQL queries. I couldn't see trends at a glance. My team couldn't easily check system health.

Then I spent an afternoon setting up Grafana. That evening, my team lead walked by my desk, saw the dashboard on my screen, and stopped. "Wait, is that our API? I can actually see what's happening. This is incredible."

Within a week, everyone on the team had the dashboard bookmarked. We caught issues faster. Stakeholders could see system health without asking me. And most importantly, I could understand system behavior at a glance instead of writing queries.

Good visualization turns raw metrics into understanding.

Setting Up Grafana with Prometheus

Installation with Docker Compose

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
  
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
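    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      # Dashboard JSON files, read by the dashboard provider configured later
      - ./grafana/dashboards:/etc/grafana/dashboards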
    depends_on:
      - prometheus

volumes:
  prometheus-data:
  grafana-data:

Start everything:

docker-compose up -d

Access Grafana at http://localhost:3000 (admin/admin).

Adding Prometheus as a Data Source

Manual Setup:

Go to Configuration → Data Sources (Connections → Data sources in recent Grafana versions)
Click "Add data source"
Select "Prometheus"
URL: http://prometheus:9090 (in Docker) or http://localhost:9090 (local)
Click "Save & Test"

Automated Setup (Provisioning):

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

Now Grafana automatically connects to Prometheus on startup.
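
Before relying on the data source, it can help to confirm that Prometheus itself answers queries. The sketch below only builds the URL for Prometheus's HTTP query endpoint (/api/v1/query). The helper name and the default base URL are assumptions for illustration, and actually fetching the URL is left to you.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def instant_query_url(expr: str, base_url: str = "http://localhost:9090") -> str:
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    # urlencode percent-encodes braces, brackets and quotes in the PromQL text.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


health_url = instant_query_url("up")
rate_url = instant_query_url("sum(rate(http_requests_total[5m]))")
```

Opening health_url in a browser or with curl should return a JSON document if Prometheus is reachable on that address.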

Building Your First Dashboard

Let's build a comprehensive API monitoring dashboard step by step.

Panel 1: Request Rate

What: Requests per second over time

Query:

sum(rate(http_requests_total[5m]))

Panel Configuration:

Visualization: Time series (Graph)
Title: "Requests per Second"
Y-axis label: "req/s"
Legend: "Total Requests"

Advanced: Show by endpoint:

sum(rate(http_requests_total[5m])) by (route)
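
To build intuition for what rate() returns, here is a simplified sketch of the calculation in Python. It is not Prometheus's implementation: the real rate() also extrapolates to the edges of the window. The function name and the sample data are made up for illustration.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter across the sampled window.

    samples: (unix_timestamp_seconds, counter_value) pairs, oldest first.
    A drop in the value is treated as a counter reset.
    No extrapolation to the window edges is applied (unlike the real rate()).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the whole
        # current value counts as increase.
        increase += (curr - prev) if curr >= prev else curr
    return increase / elapsed


# Six scrapes, 60 s apart, covering a 5-minute window.
window = [(0, 1000), (60, 1600), (120, 2200), (180, 2800), (240, 3400), (300, 4000)]
requests_per_second = simple_rate(window)
```

With these samples the counter grows by 3000 over 300 seconds, which is the kind of value the "Requests per Second" panel plots.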

Panel 2: Error Rate

What: Percentage of requests returning 5xx errors

Query:

sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
  * 100

Panel Configuration:

Visualization: Time series
Title: "Error Rate (%)"
Y-axis: Unit = "percent (0-100)"
Thresholds: Yellow at 1%, Red at 5%
Alert: > 5% for 5 minutes
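
The arithmetic behind this panel and its "for 5 minutes" condition can be sketched in a few lines of Python. Both function names are hypothetical, and returning 0.0 when there is no traffic is a choice made for this sketch (the PromQL expression yields NaN when it divides zero by zero).

```python
def error_rate_percent(error_rps: float, total_rps: float) -> float:
    """Share of requests that failed, as a percentage."""
    if total_rps == 0:
        return 0.0  # sketch-only choice; PromQL would produce NaN here
    return error_rps / total_rps * 100


def breached_for(values: list[float], threshold: float, points: int) -> bool:
    """True if the most recent `points` evaluations are all above threshold."""
    if len(values) < points:
        return False
    return all(v > threshold for v in values[-points:])


# One evaluation per minute: the last five are all above 5%.
history = [0.2, 0.3, 6.1, 7.4, 5.5, 8.0, 6.2]
should_alert = breached_for(history, threshold=5.0, points=5)
```

A single spike above 5% does not trigger anything; only a sustained breach does, which is what keeps this alert quiet during brief blips.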

Panel 3: Response Time Percentiles

What: P50, P95, P99 latency

Queries:

# P50 (median)
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# P95
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# P99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Panel Configuration:

Visualization: Time series
Title: "Response Time Percentiles"
Y-axis: Unit = "seconds (s)"
Legend: "P50", "P95", "P99"
Thresholds: Yellow at 0.5s, Red at 1s
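
It helps to know that histogram_quantile estimates percentiles by interpolating inside buckets, so the result is only as precise as your bucket boundaries. Below is a simplified Python sketch of that interpolation. It assumes the lowest bucket starts at 0 and skips edge cases that Prometheus handles; the bucket counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (le, count) buckets.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is the highest finite upper bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p50 = histogram_quantile(0.50, latency_buckets)
p95 = histogram_quantile(0.95, latency_buckets)
```

Here the P95 lands somewhere between 0.5s and 1s purely by interpolation, which is why adding a bucket boundary near your latency target makes the panel more trustworthy.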

Panel 4: Status Code Breakdown

What: Traffic distribution by status code

Query:

sum(rate(http_requests_total[5m])) by (status_code)

Panel Configuration:

Visualization: Bar gauge or Pie chart
Title: "Requests by Status Code"
Legend: Show values and percentages

Panel 5: Memory Usage

What: Node.js heap usage over time

Query:

nodejs_heap_size_used_bytes / nodejs_heap_size_total_bytes * 100

Panel Configuration:

Visualization: Time series with fill
Title: "Memory Usage (%)"
Y-axis: Unit = "percent (0-100)", Min = 0, Max = 100
Thresholds: Yellow at 70%, Red at 90%

Panel 6: Active Database Connections

What: Current active database connections

Query:

db_connections_active

Panel Configuration:

Visualization: Stat (single number)
Title: "Active DB Connections"
Thresholds: Green < 15, Yellow < 18, Red >= 18

My Production API Dashboard Layout

Here's how I organize panels:

┌─────────────────────────────────────────────────────────┐
│ API Performance Dashboard                    [24h] [▼]  │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐    │
│  │   Requests   │ │  Error Rate  │ │   P95 Latency│    │
│  │   125 req/s  │ │    0.23%     │ │    245ms     │    │
│  └──────────────┘ └──────────────┘ └──────────────┘    │
│                                                          │
│  ┌──────────────────────────────────────────────────┐   │
│  │         Request Rate Over Time                   │   │
│  │  [Graph showing requests/sec]                    │   │
│  └──────────────────────────────────────────────────┘   │
│                                                          │
│  ┌──────────────────────────────────────────────────┐   │
│  │         Response Time Percentiles (P50/P95/P99)  │   │
│  │  [Graph showing latency lines]                   │   │
│  └──────────────────────────────────────────────────┘   │
│                                                          │
│  ┌─────────────────────┐ ┌─────────────────────────┐   │
│  │  Error Rate (%)     │ │  Status Code Breakdown  │   │
│  │  [Line graph]       │ │  [Pie chart]            │   │
│  └─────────────────────┘ └─────────────────────────┘   │
│                                                          │
│  ┌─────────────────────┐ ┌─────────────────────────┐   │
│  │  Memory Usage       │ │  DB Connections         │   │
│  │  [Area graph]       │ │  [Line graph]           │   │
│  └─────────────────────┘ └─────────────────────────┘   │
│                                                          │
└─────────────────────────────────────────────────────────┘
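
In dashboard JSON, a layout like this is expressed through each panel's gridPos on Grafana's 24-column grid. The following Python sketch turns rows of panel titles into gridPos values; the function name, the equal-width rows, and the fixed row height are simplifications of mine, not something Grafana requires.

```python
GRID_COLUMNS = 24  # Grafana dashboards are laid out on a 24-column grid


def layout_rows(rows: list[list[str]], row_height: int = 8) -> list[dict]:
    """Give every panel in a row the same width and stack rows top to bottom."""
    panels = []
    for row_index, titles in enumerate(rows):
        width = GRID_COLUMNS // len(titles)
        for col_index, title in enumerate(titles):
            panels.append({
                "title": title,
                "gridPos": {
                    "h": row_height,
                    "w": width,
                    "x": col_index * width,
                    "y": row_index * row_height,
                },
            })
    return panels


panels = layout_rows([
    ["Requests", "Error Rate", "P95 Latency"],
    ["Request Rate Over Time"],
    ["Response Time Percentiles"],
    ["Error Rate (%)", "Status Code Breakdown"],
    ["Memory Usage", "DB Connections"],
])
```

Three stat panels come out 8 columns wide, full-width graphs 24, and side-by-side panels 12 each.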

Dashboard Variables for Flexibility

Variables make dashboards dynamic and reusable.

Example: Environment Variable

Dashboard Settings → Variables → Add Variable:

Name: environment
Type: Query
Data source: Prometheus
Query: label_values(http_requests_total, environment)
Multi-value: Enable
Include All: Enable

Use in queries:

sum(rate(http_requests_total{environment=~"$environment"}[5m]))

Now you can switch between dev/staging/production!
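
The reason the query uses =~ rather than = is that a multi-value selection is interpolated as a regex alternation. This Python sketch approximates that substitution; Grafana's own escaping rules differ in detail, and the function name is hypothetical.

```python
import re


def expand_variable(query: str, name: str, selected: list[str]) -> str:
    """Substitute $name with one value or a (a|b|c) regex group."""
    if not selected:
        raise ValueError("at least one value must be selected")
    escaped = [re.escape(value) for value in selected]
    value = escaped[0] if len(escaped) == 1 else "(" + "|".join(escaped) + ")"
    # Plain string replacement: good enough for a sketch, but it would also
    # match longer names that merely start with the same prefix.
    return query.replace(f"${name}", value)


template = 'sum(rate(http_requests_total{environment=~"$environment"}[5m]))'
single = expand_variable(template, "environment", ["production"])
multi = expand_variable(template, "environment", ["staging", "production"])
```

With an exact-match operator (=) the multi-value form would match nothing, which is a common cause of "No data" after enabling Multi-value.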

Example: Endpoint Variable

Name: endpoint
Query: label_values(http_requests_total, route)
Multi-value: Enable

Use in panels:

sum(rate(http_requests_total{route=~"$endpoint"}[5m]))

Filter dashboards by specific endpoints.

Example: Time Range Variable

Use Grafana's built-in time picker:

sum(rate(http_requests_total[$__rate_interval]))

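$__rate_interval adjusts automatically with the dashboard's time range and the data source's scrape interval, so the rate window is never too short to contain enough samples.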
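
Grafana's documentation gives the formula as max($__interval + scrape interval, 4 * scrape interval). A small Python sketch of that formula, assuming a 15-second scrape interval as the default (adjust it to whatever your data source is configured with):

```python
def rate_interval(interval_s: float, scrape_interval_s: float = 15.0) -> float:
    """$__rate_interval in seconds, given $__interval and the scrape interval."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


zoomed_in = rate_interval(15)    # short time range, small $__interval
zoomed_out = rate_interval(120)  # long time range, larger $__interval
```

Zoomed in, the window never drops below four scrape intervals; zoomed out, it grows with the step between data points so no samples are skipped.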

Essential Dashboard Panels for TypeScript APIs

Traffic Panel

# Total traffic
sum(rate(http_requests_total[5m]))

# By endpoint
topk(10, sum(rate(http_requests_total[5m])) by (route))

# By method
sum(rate(http_requests_total[5m])) by (method)

Error Tracking

# Error count
sum(rate(http_requests_total{status_code=~"5.."}[5m]))

# Error rate by endpoint
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (route)
  /
sum(rate(http_requests_total[5m])) by (route)
  * 100

# 4xx vs 5xx (add as two separate queries, A and B)
sum(rate(http_requests_total{status_code=~"4.."}[5m]))
sum(rate(http_requests_total{status_code=~"5.."}[5m]))

Performance Metrics

# Average response time (aggregated across all series)
sum(rate(http_request_duration_seconds_sum[5m]))
  /
sum(rate(http_request_duration_seconds_count[5m]))

# Slowest endpoints
topk(5,
  histogram_quantile(0.95,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)
  )
)

Resource Usage

# Memory
nodejs_heap_size_used_bytes / 1024 / 1024  # MB

# Memory percentage
nodejs_heap_size_used_bytes / nodejs_heap_size_total_bytes * 100

# Event loop lag
nodejs_eventloop_lag_seconds

# CPU usage (if available)
rate(process_cpu_seconds_total[5m]) * 100

Database Metrics

# Connection pool
db_connections_active
db_connections_idle

# Query rate
sum(rate(db_query_duration_seconds_count[5m])) by (operation)

# Slow queries
histogram_quantile(0.99,
  sum(rate(db_query_duration_seconds_bucket[5m])) by (le, table)
)

Advanced Visualization Techniques

Heatmaps for Latency Distribution

Visualization: Heatmap

Query:

sum(increase(http_request_duration_seconds_bucket[1m])) by (le)

Format: Heatmap

What it shows: Latency distribution over time (darker = more requests at that latency)
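
Prometheus histogram buckets are cumulative: each le bucket also counts everything in the buckets below it. A heatmap needs the count per bucket instead, and as far as I understand it, that conversion is what the Heatmap format handles for you. A Python sketch of the conversion, with invented counts and the lowest bucket assumed to start at 0:

```python
import math


def per_bucket_counts(
    cumulative: list[tuple[float, float]],
) -> list[tuple[float, float, float]]:
    """Turn cumulative (le, count) buckets into (lower, upper, count) cells."""
    cells = []
    lower, previous = 0.0, 0.0
    for upper, count in cumulative:
        # Subtracting the previous cumulative count leaves only the
        # observations that fell between lower and upper.
        cells.append((lower, upper, count - previous))
        lower, previous = upper, count
    return cells


cells = per_bucket_counts([(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)])
```

Each resulting cell is one square in a heatmap column; its count decides how dark the square is drawn.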

Table for Top Endpoints

Visualization: Table

Queries:

Metric        Query
Endpoint      label_values(http_requests_total, route)
Requests/s    sum(rate(http_requests_total[5m])) by (route)
Error %       sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (route) / sum(rate(http_requests_total[5m])) by (route) * 100
P95 Latency   histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))

Alert State Panel

Shows current firing alerts:

Visualization: Alert list

Options: Show "Alerting" and "No Data" states

Dashboard Organization Strategy

I organize dashboards by audience and purpose:

For Developers (Detailed)

API Performance Dashboard - All HTTP metrics
Database Performance Dashboard - Query performance, connections
Application Health Dashboard - Memory, CPU, errors
Business Metrics Dashboard - Signups, purchases, active users

For Operations (Overview)

System Overview Dashboard - High-level health across all services
Infrastructure Dashboard - Host metrics, disk, network
SLA Dashboard - Uptime, error rates, latency

For Management (Executive)

Service Health Dashboard - Red/yellow/green indicators
Business KPIs Dashboard - User metrics, revenue, conversion

Dashboard JSON Export/Import

Export for version control:

Dashboard → Share → Export
Save JSON to grafana/dashboards/api-performance.json
Commit to Git

Auto-provision dashboards:

# grafana/provisioning/dashboards/default.yml
apiVersion: 1

providers:
  - name: 'Default'
    folder: ''
    type: file
    options:
      path: /etc/grafana/dashboards

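Place JSON files in grafana/dashboards/ and make sure that directory is mounted into the Grafana container at /etc/grafana/dashboards (the path the provider reads). Grafana then loads them automatically on startup.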
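
If you would rather generate dashboard files than export them by hand, a script can write the JSON. The Python sketch below builds a deliberately minimal dashboard document. A real export contains many more fields, and the uid, the helper names, and the unit ids ("reqps", "percent") are my assumptions, so treat it as a starting point and compare it with one of your own exports.

```python
import json


def timeseries_panel(panel_id: int, title: str, expr: str, unit: str) -> dict:
    """One full-width time series panel with a single Prometheus query."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        # Full width on the 24-column grid, stacked vertically by id.
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": (panel_id - 1) * 8},
        "fieldConfig": {"defaults": {"unit": unit}, "overrides": []},
        "targets": [{"refId": "A", "expr": expr}],
    }


def minimal_dashboard(title: str, uid: str, panels: list[dict]) -> dict:
    """A small dashboard document; real exports carry many more fields."""
    return {
        "uid": uid,
        "title": title,
        "time": {"from": "now-24h", "to": "now"},
        "refresh": "30s",
        "panels": panels,
    }


dashboard = minimal_dashboard(
    "API Performance",
    "api-performance",  # hypothetical uid
    [
        timeseries_panel(
            1, "Requests per Second", "sum(rate(http_requests_total[5m]))", "reqps"
        ),
        timeseries_panel(
            2,
            "Error Rate (%)",
            'sum(rate(http_requests_total{status_code=~"5.."}[5m]))'
            " / sum(rate(http_requests_total[5m])) * 100",
            "percent",
        ),
    ],
)
dashboard_json = json.dumps(dashboard, indent=2)
```

Writing dashboard_json to grafana/dashboards/api-performance.json puts it where the provider above looks for files.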

Dashboard Best Practices from Experience

1. Use Consistent Time Ranges

All panels should use the same rate window (the range in square brackets), so their values are comparable:

rate(metric[5m])  # Use 5m everywhere

Not:

rate(metric1[1m])   # Inconsistent!
rate(metric2[5m])
rate(metric3[10m])

2. Set Meaningful Y-Axis Ranges

❌ Auto-scale on percentage metrics (0-100% might show as 99.9-100%)

✅ Fix range:

Min: 0
Max: 100

3. Use Units

Always set appropriate units:

Time: seconds (s), milliseconds (ms)
Data: bytes, kilobytes, megabytes
Rate: ops/sec, req/sec
Percentage: percent (0-100)

4. Color Code Thresholds

Green: Normal operation
Yellow: Warning level
Red: Critical level

Example:

0-70%: Green
70-90%: Yellow
90-100%: Red
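
Thresholds work as steps: a color applies from its lower bound up to the next step. A minimal Python sketch of that lookup, using the example ranges above (the function name and the tuple format are mine, not Grafana's):

```python
def threshold_color(
    value: float, steps: list[tuple[float, str]], base: str = "green"
) -> str:
    """Return the color of the highest step whose lower bound value has reached.

    steps: (lower_bound, color) pairs sorted by ascending lower bound.
    """
    color = base
    for bound, step_color in steps:
        if value >= bound:
            color = step_color
    return color


memory_steps = [(70, "yellow"), (90, "red")]
colors = [threshold_color(v, memory_steps) for v in (42, 70, 89.9, 95)]
```

Note that a value exactly on a boundary (70 here) already takes the higher step's color.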

5. Add Panel Descriptions

Click panel title → Edit → Panel → Description:

Shows the 95th percentile response time for all API endpoints.
Values above 1 second indicate performance degradation.
Check database query performance if this spikes.

6. Link to Related Dashboards

Add dashboard links at the top:

API Performance → Database Performance
Database Performance → Infrastructure
Infrastructure → Alert History

7. Use Annotations for Deploys

Mark deployments on graphs:

# Send annotation via API
curl -X POST http://admin:admin@localhost:3000/api/annotations \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Deployed v2.1.3",
    "tags": ["deployment"]
  }'

Now you can correlate performance changes with deployments!

My Essential Dashboard Collection

1. API Performance Dashboard

Focus: Request rates, errors, latency

Panels:

Request rate (line graph)
Error rate % (line graph with threshold)
P50/P95/P99 latency (multi-line graph)
Status code distribution (bar chart)
Top 10 endpoints by traffic (table)
Top 5 slowest endpoints (table)

2. Application Health Dashboard

Focus: Resource usage, application metrics

Panels:

Memory usage % (area graph)
Heap size used vs total (dual line)
Event loop lag (line graph)
Active connections (stat)
Error log rate (line graph)

3. Database Dashboard

Focus: Database performance

Panels:

Query rate by operation (stacked area)
P99 query latency (line graph)
Connection pool usage (dual line: active + idle)
Slow query count (stat with threshold)
Top tables by query count (table)

4. Business Metrics Dashboard

Focus: User behavior and revenue

Panels:

Active users (stat)
Signups per hour (bar chart)
Revenue per hour (area graph)
Conversion rate (line graph)
Popular features (table)

Grafana Alerting

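You can also configure alerts in Grafana (though I prefer Prometheus alerts). The steps below follow the legacy panel-based alerting UI; newer Grafana versions manage rules under Alerting → Alert rules and call notification channels "contact points".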

Create Alert:

Edit panel
Alert tab
Create Alert
Condition: WHEN last() OF query(A, 5m, now) IS ABOVE 5
Notification: Choose channel

Notification Channels:

Email
Slack
PagerDuty
Webhook
Microsoft Teams

Performance Tips

1. Limit Time Series

Too many series slow down dashboards:

# Bad: might return 1000s of series (one per label combination)
rate(http_requests_total[5m])

# Good: aggregate by meaningful labels
sum(rate(http_requests_total[5m])) by (route)

# Better: limit results
topk(10, sum(rate(http_requests_total[5m])) by (route))
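
Series counts grow multiplicatively with labels, which is why aggregation matters so much. A quick Python sketch of the upper bound, using made-up label counts:

```python
import math


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series: the product of distinct values per label."""
    return math.prod(label_value_counts.values())


unaggregated = max_series({"route": 50, "method": 4, "status_code": 8})
by_route = max_series({"route": 50})
```

Not every combination exists in practice, so the real number is usually lower, but the gap between the two results shows what a panel has to fetch and draw.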

2. Use Recording Rules

For complex queries used in multiple panels:

# Prometheus rule file (the group name is an arbitrary example)
groups:
  - name: api_recording_rules
    rules:
      - record: api:request_rate:5m
        expr: sum(rate(http_requests_total[5m])) by (route)

Then in Grafana:

api:request_rate:5m

Much faster!

3. Adjust Refresh Rate

Real-time monitoring: 5-10 seconds
Regular monitoring: 30-60 seconds
Historical analysis: Manual refresh

4. Use Dashboard Caching

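Configure in grafana.ini (query caching is a Grafana Enterprise and Grafana Cloud feature, so check that your edition supports it):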

[caching]
enabled = true
ttl = 60

Key Takeaways

Start with the basics - Request rate, errors, latency
Use variables - Make dashboards flexible and reusable
Organize by audience - Different dashboards for different needs
Set proper units and thresholds - Make interpretation easy
Export dashboards - Version control in Git
Link related dashboards - Easy navigation
Add descriptions - Help future you understand panels
Optimize queries - Limit time series, use recording rules

A good dashboard tells a story about your system's health. It should answer questions before they're asked and make problems obvious at a glance.

In the final article, we'll cover best practices learned from running Prometheus in production.

Previous: Alerting with Prometheus Next: Prometheus Best Practices
