Part 7: Real-World KQL Patterns and Production Use Cases

From Learning to Practice

Throughout this series, we've built a strong foundation in KQL. In this final part, I'll share production-ready query patterns from my work in SRE, security monitoring, and application observability. These are patterns I use daily to keep systems running smoothly.

Production Monitoring Patterns

Pattern 1: Golden Signals Monitoring

The four golden signals (latency, traffic, errors, saturation) form the foundation of my service monitoring.

Complete golden signals dashboard query:

// Golden Signals for Application Service
let timeRange = 5m;
let latencyThreshold = 1000;  // ms
let errorThreshold = 1.0;  // percent
let trafficWindow = 1m;

// 1. Latency (Response Time)
let latencyMetrics = AppRequests
| where TimeGenerated > ago(timeRange)
| summarize 
    P50 = percentile(DurationMs, 50),
    P95 = percentile(DurationMs, 95),
    P99 = percentile(DurationMs, 99),
    AvgLatency = avg(DurationMs)
| extend 
    LatencyStatus = case(
        P95 > latencyThreshold, "πŸ”΄ Critical",
        P95 > (latencyThreshold * 0.8), "🟑 Warning",
        "🟒 Healthy"
    ),
    Signal = "Latency";

// 2. Traffic (Requests Per Second)
let trafficMetrics = AppRequests
| where TimeGenerated > ago(timeRange)
| summarize RequestCount = count()
| extend 
    RequestsPerSecond = RequestCount / (timeRange / 1s),  // timespan / 1s yields seconds as a real number
    TrafficStatus = "🟒 Healthy",
    Signal = "Traffic";

// 3. Errors (Error Rate)
let errorMetrics = AppRequests
| where TimeGenerated > ago(timeRange)
| summarize 
    TotalRequests = count(),
    FailedRequests = countif(Success == false)
| extend ErrorRate = 100.0 * FailedRequests / TotalRequests
| extend 
    ErrorStatus = case(
        ErrorRate > errorThreshold, "πŸ”΄ Critical",
        ErrorRate > (errorThreshold * 0.5), "🟑 Warning",
        "🟒 Healthy"
    ),
    Signal = "Errors";

// 4. Saturation (Resource Utilization)
let saturationMetrics = Perf
| where TimeGenerated > ago(timeRange)
| where CounterName in ("% Processor Time", "% Used Memory")
| summarize AvgValue = avg(CounterValue) by CounterName
| summarize 
    AvgCpu = sumif(AvgValue, CounterName == "% Processor Time"),
    AvgMemory = sumif(AvgValue, CounterName == "% Used Memory")
| extend MaxUtilization = max_of(AvgCpu, AvgMemory)
| extend 
    SaturationStatus = case(
        MaxUtilization > 90, "πŸ”΄ Critical",
        MaxUtilization > 80, "🟑 Warning",
        "🟒 Healthy"
    ),
    Signal = "Saturation";

// Combine all signals
union
    (latencyMetrics | project Signal, Status = LatencyStatus, Value = P95, Unit = "ms"),
    (trafficMetrics | project Signal, Status = TrafficStatus, Value = RequestsPerSecond, Unit = "req/s"),
    (errorMetrics | project Signal, Status = ErrorStatus, Value = ErrorRate, Unit = "%"),
    (saturationMetrics | project Signal, Status = SaturationStatus, Value = MaxUtilization, Unit = "%")
| project Signal, Value, Unit, Status

Pattern 2: Service Level Objectives (SLO) Tracking

Track SLO compliance and error budgets:
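A minimal sketch of this pattern, assuming a workspace-based Application Insights AppRequests table; the 99.9% target and 30-day window are illustrative values you would replace with your own SLO definition:

```kql
// SLO compliance and error budget over a rolling window
let sloTarget = 99.9;   // percent availability target (illustrative)
let sloWindow = 30d;
AppRequests
| where TimeGenerated > ago(sloWindow)
| summarize 
    TotalRequests = count(),
    GoodRequests = countif(Success == true)
| extend Availability = 100.0 * GoodRequests / TotalRequests
| extend 
    ErrorBudgetTotal = 100.0 - sloTarget,      // allowed failure percentage
    ErrorBudgetUsed = 100.0 - Availability
| extend ErrorBudgetRemaining = 100.0 * (ErrorBudgetTotal - ErrorBudgetUsed) / ErrorBudgetTotal
| project 
    Availability, 
    ErrorBudgetRemaining,
    SloStatus = iff(Availability >= sloTarget, "🟒 Within SLO", "πŸ”΄ SLO breached")
```

ErrorBudgetRemaining expresses how much of the allowed failure budget is left as a percentage; a negative value means the budget is exhausted.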

Pattern 3: Anomaly Detection with Baseline

Detect anomalies by comparing against historical baselines:
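One way to sketch this is with KQL's built-in time-series decomposition, again assuming AppRequests; the 14-day lookback and 1.5 sensitivity threshold are illustrative:

```kql
// Flag anomalous request volume against a decomposed baseline
let lookback = 14d;
AppRequests
| where TimeGenerated > ago(lookback)
| make-series RequestCount = count() default = 0
    on TimeGenerated from ago(lookback) to now() step 1h
| extend (Anomalies, Score, Baseline) = series_decompose_anomalies(RequestCount, 1.5, -1, 'linefit')
| render anomalychart with (anomalycolumns = Anomalies)
```

series_decompose_anomalies handles seasonality automatically (the -1 argument asks it to auto-detect the period), which matters for traffic with daily or weekly cycles.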

Pattern 4: Dependency Health Matrix

Monitor all external dependencies comprehensively:
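A sketch assuming the Application Insights AppDependencies table; the failure-rate and latency thresholds are illustrative:

```kql
// Health matrix: one row per dependency type and target
AppDependencies
| where TimeGenerated > ago(1h)
| summarize 
    Calls = count(),
    FailureRate = 100.0 * countif(Success == false) / count(),
    P95LatencyMs = percentile(DurationMs, 95)
    by DependencyType, Target
| extend Health = case(
    FailureRate > 5, "πŸ”΄ Critical",
    FailureRate > 1 or P95LatencyMs > 1000, "🟑 Warning",
    "🟒 Healthy")
| order by FailureRate desc
```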

Security Monitoring Patterns

Pattern 5: Failed Authentication Analysis

Track and analyze authentication failures:
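A sketch against the Microsoft Entra ID SigninLogs table; the 10-attempt threshold is an illustrative starting point for brute-force detection:

```kql
// Accounts with repeated failed sign-ins in the last 24 hours
SigninLogs
| where TimeGenerated > ago(24h)
| where ResultType != "0"          // non-zero ResultType = failed sign-in
| summarize 
    FailedAttempts = count(),
    DistinctIPs = dcount(IPAddress),
    FirstSeen = min(TimeGenerated),
    LastSeen = max(TimeGenerated)
    by UserPrincipalName
| where FailedAttempts > 10
| order by FailedAttempts desc
```

A high DistinctIPs count for one account can indicate password spraying rather than a single misconfigured client.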

Pattern 6: Security Event Correlation

Correlate security events across multiple data sources:
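One correlation worth sketching: a burst of failed sign-ins followed by a success from the same IP, which can indicate a compromised credential. This joins SigninLogs to itself; the thresholds and one-hour window are illustrative:

```kql
// Failed sign-in burst followed by a success from the same IP
let failures = SigninLogs
    | where TimeGenerated > ago(24h) and ResultType != "0"
    | summarize Failures = count(), LastFailure = max(TimeGenerated)
        by UserPrincipalName, IPAddress
    | where Failures >= 5;
let successes = SigninLogs
    | where TimeGenerated > ago(24h) and ResultType == "0"
    | project UserPrincipalName, IPAddress, SuccessTime = TimeGenerated;
failures
| join kind=inner successes on UserPrincipalName, IPAddress
| where SuccessTime between (LastFailure .. (LastFailure + 1h))
| project UserPrincipalName, IPAddress, Failures, LastFailure, SuccessTime
```

The same join shape works across different tables (for example SecurityEvent against SigninLogs) as long as you can agree on a correlation key such as account or IP.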

Performance Analysis Patterns

Pattern 7: Response Time Percentile Distribution

Understand latency distribution beyond averages:
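A sketch over AppRequests; the TailRatio column is an illustrative way to spot operations whose tail behaves much worse than their median:

```kql
// Latency percentile spread per operation
AppRequests
| where TimeGenerated > ago(1h)
| summarize 
    P50 = percentile(DurationMs, 50),
    P75 = percentile(DurationMs, 75),
    P95 = percentile(DurationMs, 95),
    P99 = percentile(DurationMs, 99),
    MaxMs = max(DurationMs)
    by OperationName
| extend TailRatio = round(P99 / P50, 1)   // how much worse the tail is than the median
| order by P99 desc
```

An operation with a healthy average but a TailRatio of 20+ is exactly the kind of issue averages hide.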

Pattern 8: Database Query Performance Analysis

Identify slow queries and optimization opportunities:
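A sketch using AppDependencies, which records the command text in its Data column for SQL calls; the "SQL" type filter and thresholds are illustrative and depend on your instrumentation:

```kql
// Slowest SQL statements by total time spent
AppDependencies
| where TimeGenerated > ago(24h)
| where DependencyType == "SQL"
| summarize 
    Calls = count(),
    AvgMs = avg(DurationMs),
    P95Ms = percentile(DurationMs, 95),
    TotalMs = sum(DurationMs)
    by Data                         // the query text
| where P95Ms > 500
| top 20 by TotalMs desc
```

Sorting by TotalMs rather than AvgMs surfaces cheap-but-frequent queries, which are often the better optimization target.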

Capacity Planning Patterns

Pattern 9: Resource Growth Trend Analysis

Predict future capacity needs:
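A sketch that fits a linear trend to daily CPU averages from the Perf table; the 30-day window is illustrative, and a linear fit is only a rough first approximation of growth:

```kql
// Linear trend of daily average CPU utilization
Perf
| where TimeGenerated > ago(30d)
| where CounterName == "% Processor Time"
| make-series AvgCpu = avg(CounterValue) default = 0
    on TimeGenerated from ago(30d) to now() step 1d
| extend (RSquare, Slope, Variance, RVariance, Intercept, LineFit) = series_fit_line(AvgCpu)
| project Slope, RSquare, LineFit
```

Slope is the daily change in average CPU; RSquare tells you how well a straight line actually describes the data before you trust any extrapolation.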

Pattern 10: Storage Capacity Forecasting

Predict disk space exhaustion:
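A sketch that extrapolates free disk space to zero, assuming the Perf "Free Megabytes" counter; treating the trend as linear is a deliberate simplification:

```kql
// Estimated days until each disk fills, from a 30-day linear trend
Perf
| where TimeGenerated > ago(30d)
| where CounterName == "Free Megabytes" and InstanceName != "_Total"
| make-series FreeMb = avg(CounterValue) default = 0
    on TimeGenerated from ago(30d) to now() step 1d
    by Computer, InstanceName
| extend (RSquare, Slope, Variance, RVariance, Intercept, LineFit) = series_fit_line(FreeMb)
| where Slope < 0                              // only disks whose free space is shrinking
| extend CurrentFreeMb = todouble(FreeMb[array_length(FreeMb) - 1])
| extend DaysToFull = CurrentFreeMb / (-1 * Slope)
| project Computer, InstanceName, CurrentFreeMb, DaysToFull
| order by DaysToFull asc
```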

Troubleshooting Patterns

Pattern 11: Error Spike Investigation

Quickly investigate sudden error increases:
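A sketch for triaging a spike, assuming AppRequests; the window bounds are placeholders you would set to the spike you are investigating:

```kql
// Break a spike window down by result code and operation
let spikeStart = datetime(2024-01-01 12:00);   // placeholder: start of the spike
let spikeEnd = spikeStart + 30m;
AppRequests
| where TimeGenerated between (spikeStart .. spikeEnd)
| where Success == false
| summarize Errors = count() by ResultCode, OperationName
| order by Errors desc
```

If one ResultCode/OperationName pair dominates, you have a single failing code path; an even spread across operations points at shared infrastructure instead.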

Pattern 12: Deployment Impact Analysis

Analyze impact of recent deployments:
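A before/after comparison is a simple way to sketch this, assuming AppRequests; the deployment timestamp and one-hour windows are placeholders:

```kql
// Error rate and P95 latency before vs. after a deployment
let deployTime = datetime(2024-01-01 12:00);   // placeholder: deployment timestamp
let window = 1h;
AppRequests
| where TimeGenerated between ((deployTime - window) .. (deployTime + window))
| extend Phase = iff(TimeGenerated < deployTime, "Before", "After")
| summarize 
    Requests = count(),
    ErrorRate = 100.0 * countif(Success == false) / count(),
    P95Ms = percentile(DurationMs, 95)
    by Phase
```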

Advanced Analytics Patterns

Pattern 13: User Journey Analysis

Track user behavior patterns:
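A sketch of a simple funnel, assuming an AppPageViews table with a UserId column; the page names are illustrative:

```kql
// Three-step conversion funnel per user
AppPageViews
| where TimeGenerated > ago(7d)
| where Name in ("Home", "Cart", "Checkout")
| summarize Steps = make_set(Name) by UserId
| summarize 
    ReachedHome = countif(set_has_element(Steps, "Home")),
    ReachedCart = countif(set_has_element(Steps, "Home") and set_has_element(Steps, "Cart")),
    Converted = countif(set_has_element(Steps, "Home") 
        and set_has_element(Steps, "Cart") 
        and set_has_element(Steps, "Checkout"))
```

This counts set membership rather than ordering; enforcing step order would need per-step timestamps (for example via arg_min per page) and is left out to keep the sketch short.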

Pattern 14: Correlation Analysis Between Metrics

Find correlations between different performance metrics:
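A sketch that computes the Pearson correlation between request latency and CPU over aligned time series, assuming AppRequests and Perf cover the same hosts and time range:

```kql
// Pearson correlation between P95-ish latency and CPU, 15-minute bins
let window = 24h;
let binSize = 15m;
let latency = toscalar(
    AppRequests
    | where TimeGenerated > ago(window)
    | make-series AvgLatency = avg(DurationMs) default = 0
        on TimeGenerated from ago(window) to now() step binSize
    | project AvgLatency);
let cpu = toscalar(
    Perf
    | where TimeGenerated > ago(window) and CounterName == "% Processor Time"
    | make-series AvgCpu = avg(CounterValue) default = 0
        on TimeGenerated from ago(window) to now() step binSize
    | project AvgCpu);
print Correlation = series_pearson_correlation(latency, cpu)
```

A correlation near 1 suggests latency tracks CPU saturation; remember that correlation is not causation, so use it to prioritize hypotheses, not to close investigations.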

Key Takeaways

  • Golden signals (latency, traffic, errors, saturation) provide a comprehensive view of service health

  • SLO tracking with error budgets enables data-driven reliability decisions

  • Anomaly detection with baselines catches issues before they impact users

  • Security patterns help identify and respond to threats quickly

  • Performance analysis beyond averages reveals hidden issues

  • Capacity planning prevents resource exhaustion

  • Deployment impact analysis ensures safe releases

  • Correlation analysis uncovers relationships between metrics

Conclusion

Throughout this KQL series, we've progressed from basic queries to production-ready observability patterns. The key to mastery is practice and iteration: start with simple queries, understand your data, and gradually build more sophisticated analysis.

Remember:

  • Always filter by time first

  • Understand your data schemas

  • Optimize for performance

  • Document your queries

  • Build reusable patterns

  • Share knowledge with your team

Keep querying, keep learning, and use these patterns to build robust observability into your systems. KQL is a powerful tool; wield it wisely!

Additional Resources

  • Microsoft KQL documentation

  • Azure Monitor documentation

  • Log Analytics workspace best practices

  • Community query repositories

  • Azure Monitor Workbooks gallery

Happy querying! πŸš€