Part 7: Real-World KQL Patterns and Production Use Cases

From Learning to Practice

Throughout this series, we've built a strong foundation in KQL. In this final part, I'll share production-ready query patterns from my work in SRE, security monitoring, and application observability. These are patterns I use daily to keep systems running smoothly.

Production Monitoring Patterns

Pattern 1: Golden Signals Monitoring

The four golden signals (latency, traffic, errors, saturation) form the foundation of my service monitoring.

Complete golden signals dashboard query:

// Golden Signals for Application Service
let timeRange = 5m;
let latencyThreshold = 1000;  // ms
let errorThreshold = 1.0;  // percent
let trafficWindow = 1m;

// 1. Latency (Response Time)
let latencyMetrics = AppRequests
| where TimeGenerated > ago(timeRange)
| summarize 
    P50 = percentile(DurationMs, 50),
    P95 = percentile(DurationMs, 95),
    P99 = percentile(DurationMs, 99),
    AvgLatency = avg(DurationMs)
| extend 
    LatencyStatus = case(
        P95 > latencyThreshold, "πŸ”΄ Critical",
        P95 > (latencyThreshold * 0.8), "🟑 Warning",
        "🟒 Healthy"
    ),
    Signal = "Latency";

// 2. Traffic (Requests Per Second)
let trafficMetrics = AppRequests
| where TimeGenerated > ago(timeRange)
| summarize RequestCount = count()
| extend 
    RequestsPerSecond = RequestCount / (timeRange / 1s),  // timespan / 1s yields seconds as a real number
    TrafficStatus = "🟒 Healthy",
    Signal = "Traffic";

// 3. Errors (Error Rate)
let errorMetrics = AppRequests
| where TimeGenerated > ago(timeRange)
| summarize 
    TotalRequests = count(),
    FailedRequests = countif(Success == false)
| extend ErrorRate = 100.0 * FailedRequests / TotalRequests
| extend 
    ErrorStatus = case(
        ErrorRate > errorThreshold, "πŸ”΄ Critical",
        ErrorRate > (errorThreshold * 0.5), "🟑 Warning",
        "🟒 Healthy"
    ),
    Signal = "Errors";

// 4. Saturation (Resource Utilization)
let saturationMetrics = Perf
| where TimeGenerated > ago(timeRange)
| where CounterName in ("% Processor Time", "% Used Memory")
| summarize AvgValue = avg(CounterValue) by CounterName
| summarize 
    AvgCpu = sumif(AvgValue, CounterName == "% Processor Time"),
    AvgMemory = sumif(AvgValue, CounterName == "% Used Memory")
| extend MaxUtilization = max_of(AvgCpu, AvgMemory)
| extend 
    SaturationStatus = case(
        MaxUtilization > 90, "πŸ”΄ Critical",
        MaxUtilization > 80, "🟑 Warning",
        "🟒 Healthy"
    ),
    Signal = "Saturation";

// Combine all signals
union
    (latencyMetrics | project Signal, Status = LatencyStatus, Value = P95, Unit = "ms"),
    (trafficMetrics | project Signal, Status = TrafficStatus, Value = RequestsPerSecond, Unit = "req/s"),
    (errorMetrics | project Signal, Status = ErrorStatus, Value = ErrorRate, Unit = "%"),
    (saturationMetrics | project Signal, Status = SaturationStatus, Value = MaxUtilization, Unit = "%")
| project Signal, Value, Unit, Status

Pattern 2: Service Level Objectives (SLO) Tracking

Track SLO compliance and error budgets:
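A minimal sketch of this pattern, assuming a workspace-based Application Insights AppRequests table; the 99.9% target and 30-day window are illustrative values you would replace with your own SLO definition:

```kql
// SLO compliance and error budget over a rolling window
let sloTarget = 99.9;   // percent availability target (illustrative)
let sloWindow = 30d;
AppRequests
| where TimeGenerated > ago(sloWindow)
| summarize 
    TotalRequests = count(),
    GoodRequests = countif(Success == true)
| extend Availability = 100.0 * GoodRequests / TotalRequests
| extend 
    ErrorBudgetTotal = 100.0 - sloTarget,      // allowed failure percentage
    ErrorBudgetUsed = 100.0 - Availability
| extend ErrorBudgetRemaining = 100.0 * (ErrorBudgetTotal - ErrorBudgetUsed) / ErrorBudgetTotal
| project 
    Availability, 
    ErrorBudgetRemaining,
    SloStatus = iff(Availability >= sloTarget, "🟒 Within SLO", "πŸ”΄ SLO breached")
```

ErrorBudgetRemaining expresses how much of the allowed failure budget is left as a percentage; a negative value means the budget is exhausted.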

Pattern 3: Anomaly Detection with Baseline

Detect anomalies by comparing against historical baselines:
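One way to sketch this is with KQL's built-in time-series decomposition, again assuming AppRequests; the 14-day lookback and 1.5 sensitivity threshold are illustrative:

```kql
// Flag anomalous request volume against a decomposed baseline
let lookback = 14d;
AppRequests
| where TimeGenerated > ago(lookback)
| make-series RequestCount = count() default = 0
    on TimeGenerated from ago(lookback) to now() step 1h
| extend (Anomalies, Score, Baseline) = series_decompose_anomalies(RequestCount, 1.5, -1, 'linefit')
| render anomalychart with (anomalycolumns = Anomalies)
```

series_decompose_anomalies handles seasonality automatically (the -1 argument asks it to auto-detect the period), which matters for traffic with daily or weekly cycles.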

Pattern 4: Dependency Health Matrix

Monitor all external dependencies comprehensively:
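A sketch assuming the Application Insights AppDependencies table; the failure-rate and latency thresholds are illustrative:

```kql
// Health matrix: one row per dependency type and target
AppDependencies
| where TimeGenerated > ago(1h)
| summarize 
    Calls = count(),
    FailureRate = 100.0 * countif(Success == false) / count(),
    P95LatencyMs = percentile(DurationMs, 95)
    by DependencyType, Target
| extend Health = case(
    FailureRate > 5, "πŸ”΄ Critical",
    FailureRate > 1 or P95LatencyMs > 1000, "🟑 Warning",
    "🟒 Healthy")
| order by FailureRate desc
```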

Security Monitoring Patterns

Pattern 5: Failed Authentication Analysis

Track and analyze authentication failures:
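A sketch against the Microsoft Entra ID SigninLogs table; the 10-attempt threshold is an illustrative starting point for brute-force detection:

```kql
// Accounts with repeated failed sign-ins in the last 24 hours
SigninLogs
| where TimeGenerated > ago(24h)
| where ResultType != "0"          // non-zero ResultType = failed sign-in
| summarize 
    FailedAttempts = count(),
    DistinctIPs = dcount(IPAddress),
    FirstSeen = min(TimeGenerated),
    LastSeen = max(TimeGenerated)
    by UserPrincipalName
| where FailedAttempts > 10
| order by FailedAttempts desc
```

A high DistinctIPs count for one account can indicate password spraying rather than a single misconfigured client.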

Pattern 6: Security Event Correlation

Correlate security events across multiple data sources:
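One correlation worth sketching: a burst of failed sign-ins followed by a success from the same IP, which can indicate a compromised credential. This joins SigninLogs to itself; the thresholds and one-hour window are illustrative:

```kql
// Failed sign-in burst followed by a success from the same IP
let failures = SigninLogs
    | where TimeGenerated > ago(24h) and ResultType != "0"
    | summarize Failures = count(), LastFailure = max(TimeGenerated)
        by UserPrincipalName, IPAddress
    | where Failures >= 5;
let successes = SigninLogs
    | where TimeGenerated > ago(24h) and ResultType == "0"
    | project UserPrincipalName, IPAddress, SuccessTime = TimeGenerated;
failures
| join kind=inner successes on UserPrincipalName, IPAddress
| where SuccessTime between (LastFailure .. (LastFailure + 1h))
| project UserPrincipalName, IPAddress, Failures, LastFailure, SuccessTime
```

The same join shape works across different tables (for example SecurityEvent against SigninLogs) as long as you can agree on a correlation key such as account or IP.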

Performance Analysis Patterns

Pattern 7: Response Time Percentile Distribution

Understand latency distribution beyond averages:
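A sketch over AppRequests; the TailRatio column is an illustrative way to spot operations whose tail behaves much worse than their median:

```kql
// Latency percentile spread per operation
AppRequests
| where TimeGenerated > ago(1h)
| summarize 
    P50 = percentile(DurationMs, 50),
    P75 = percentile(DurationMs, 75),
    P95 = percentile(DurationMs, 95),
    P99 = percentile(DurationMs, 99),
    MaxMs = max(DurationMs)
    by OperationName
| extend TailRatio = round(P99 / P50, 1)   // how much worse the tail is than the median
| order by P99 desc
```

An operation with a healthy average but a TailRatio of 20+ is exactly the kind of issue averages hide.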

Pattern 8: Database Query Performance Analysis

Identify slow queries and optimization opportunities:
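A sketch using AppDependencies, which records the command text in its Data column for SQL calls; the "SQL" type filter and thresholds are illustrative and depend on your instrumentation:

```kql
// Slowest SQL statements by total time spent
AppDependencies
| where TimeGenerated > ago(24h)
| where DependencyType == "SQL"
| summarize 
    Calls = count(),
    AvgMs = avg(DurationMs),
    P95Ms = percentile(DurationMs, 95),
    TotalMs = sum(DurationMs)
    by Data                         // the query text
| where P95Ms > 500
| top 20 by TotalMs desc
```

Sorting by TotalMs rather than AvgMs surfaces cheap-but-frequent queries, which are often the better optimization target.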

Capacity Planning Patterns

Pattern 9: Resource Growth Trend Analysis

Predict future capacity needs:
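A sketch that fits a linear trend to daily CPU averages from the Perf table; the 30-day window is illustrative, and a linear fit is only a rough first approximation of growth:

```kql
// Linear trend of daily average CPU utilization
Perf
| where TimeGenerated > ago(30d)
| where CounterName == "% Processor Time"
| make-series AvgCpu = avg(CounterValue) default = 0
    on TimeGenerated from ago(30d) to now() step 1d
| extend (RSquare, Slope, Variance, RVariance, Intercept, LineFit) = series_fit_line(AvgCpu)
| project Slope, RSquare, LineFit
```

Slope is the daily change in average CPU; RSquare tells you how well a straight line actually describes the data before you trust any extrapolation.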

Pattern 10: Storage Capacity Forecasting

Predict disk space exhaustion:
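A sketch that extrapolates free disk space to zero, assuming the Perf "Free Megabytes" counter; treating the trend as linear is a deliberate simplification:

```kql
// Estimated days until each disk fills, from a 30-day linear trend
Perf
| where TimeGenerated > ago(30d)
| where CounterName == "Free Megabytes" and InstanceName != "_Total"
| make-series FreeMb = avg(CounterValue) default = 0
    on TimeGenerated from ago(30d) to now() step 1d
    by Computer, InstanceName
| extend (RSquare, Slope, Variance, RVariance, Intercept, LineFit) = series_fit_line(FreeMb)
| where Slope < 0                              // only disks whose free space is shrinking
| extend CurrentFreeMb = todouble(FreeMb[array_length(FreeMb) - 1])
| extend DaysToFull = CurrentFreeMb / (-1 * Slope)
| project Computer, InstanceName, CurrentFreeMb, DaysToFull
| order by DaysToFull asc
```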

Troubleshooting Patterns

Pattern 11: Error Spike Investigation

Quickly investigate sudden error increases:
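A sketch for triaging a spike, assuming AppRequests; the window bounds are placeholders you would set to the spike you are investigating:

```kql
// Break a spike window down by result code and operation
let spikeStart = datetime(2024-01-01 12:00);   // placeholder: start of the spike
let spikeEnd = spikeStart + 30m;
AppRequests
| where TimeGenerated between (spikeStart .. spikeEnd)
| where Success == false
| summarize Errors = count() by ResultCode, OperationName
| order by Errors desc
```

If one ResultCode/OperationName pair dominates, you have a single failing code path; an even spread across operations points at shared infrastructure instead.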

Pattern 12: Deployment Impact Analysis

Analyze impact of recent deployments:
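A before/after comparison is a simple way to sketch this, assuming AppRequests; the deployment timestamp and one-hour windows are placeholders:

```kql
// Error rate and P95 latency before vs. after a deployment
let deployTime = datetime(2024-01-01 12:00);   // placeholder: deployment timestamp
let window = 1h;
AppRequests
| where TimeGenerated between ((deployTime - window) .. (deployTime + window))
| extend Phase = iff(TimeGenerated < deployTime, "Before", "After")
| summarize 
    Requests = count(),
    ErrorRate = 100.0 * countif(Success == false) / count(),
    P95Ms = percentile(DurationMs, 95)
    by Phase
```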

Advanced Analytics Patterns

Pattern 13: User Journey Analysis

Track user behavior patterns:
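A sketch of a simple funnel, assuming an AppPageViews table with a UserId column; the page names are illustrative:

```kql
// Three-step conversion funnel per user
AppPageViews
| where TimeGenerated > ago(7d)
| where Name in ("Home", "Cart", "Checkout")
| summarize Steps = make_set(Name) by UserId
| summarize 
    ReachedHome = countif(set_has_element(Steps, "Home")),
    ReachedCart = countif(set_has_element(Steps, "Home") and set_has_element(Steps, "Cart")),
    Converted = countif(set_has_element(Steps, "Home") 
        and set_has_element(Steps, "Cart") 
        and set_has_element(Steps, "Checkout"))
```

This counts set membership rather than ordering; enforcing step order would need per-step timestamps (for example via arg_min per page) and is left out to keep the sketch short.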

Pattern 14: Correlation Analysis Between Metrics

Find correlations between different performance metrics:
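A sketch that computes the Pearson correlation between request latency and CPU over aligned time series, assuming AppRequests and Perf cover the same hosts and time range:

```kql
// Pearson correlation between P95-ish latency and CPU, 15-minute bins
let window = 24h;
let binSize = 15m;
let latency = toscalar(
    AppRequests
    | where TimeGenerated > ago(window)
    | make-series AvgLatency = avg(DurationMs) default = 0
        on TimeGenerated from ago(window) to now() step binSize
    | project AvgLatency);
let cpu = toscalar(
    Perf
    | where TimeGenerated > ago(window) and CounterName == "% Processor Time"
    | make-series AvgCpu = avg(CounterValue) default = 0
        on TimeGenerated from ago(window) to now() step binSize
    | project AvgCpu);
print Correlation = series_pearson_correlation(latency, cpu)
```

A correlation near 1 suggests latency tracks CPU saturation; remember that correlation is not causation, so use it to prioritize hypotheses, not to close investigations.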

Key Takeaways

  • Golden signals (latency, traffic, errors, saturation) provide a comprehensive view of service health

  • SLO tracking with error budgets enables data-driven reliability decisions

  • Anomaly detection with baselines catches issues before they impact users

  • Security patterns help identify and respond to threats quickly

  • Performance analysis beyond averages reveals hidden issues

  • Capacity planning prevents resource exhaustion

  • Deployment impact analysis ensures safe releases

  • Correlation analysis uncovers relationships between metrics

Conclusion

Throughout this KQL series, we've progressed from basic queries to production-ready observability patterns. The key to mastery is practice and iteration: start with simple queries, understand your data, and gradually build more sophisticated analysis.

Remember:

  • Always filter by time first

  • Understand your data schemas

  • Optimize for performance

  • Document your queries

  • Build reusable patterns

  • Share knowledge with your team

Keep querying, keep learning, and use these patterns to build robust observability into your systems. KQL is a powerful tool; wield it wisely!

Additional Resources

  • Microsoft KQL documentation

  • Azure Monitor documentation

  • Log Analytics workspace best practices

  • Community query repositories

  • Azure Monitor Workbooks gallery

Happy querying! πŸš€