Part 6: KQL Best Practices and Performance Optimization

Learning to Write Efficient Queries

Early in my KQL journey, I wrote queries that worked but didn't scale. Through production experience and painful lessons, I learned how to write efficient, maintainable queries. In this part, I'll share optimization techniques that made my queries 10x faster.

Query Performance Fundamentals

Understanding Query Execution

KQL queries run against a columnar data store that is optimized for analytics. Understanding this helps you write better queries.

Key concepts I learned:

  1. Time-based partitioning - Data is partitioned by time

  2. Column-store optimization - Only requested columns are scanned

  3. Query caching - Repeated queries are cached

  4. Data distribution - Data is distributed across nodes

Measuring Query Performance

Every query returns performance statistics. Here's what I monitor:

// Example query with statistics  
Perf
| where TimeGenerated > ago(24h)
| where CounterName == "% Processor Time"
| summarize avg(CounterValue) by Computer

Check the query statistics after execution:

  • Records scanned: Total rows examined

  • Execution time: Total query duration

  • Data processed: Amount of data read

  • CPU time: Computation time used

Goal: Minimize all these metrics while getting the results you need.

Performance Optimization Techniques

1. Time Filtering - Always First

This is the #1 optimization rule I follow.

Bad:
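A sketch of the pattern to avoid (the counter name here is just a placeholder):

Perf
| where CounterName == "% Processor Time"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)
| where TimeGenerated > ago(24h)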

Good:
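The same query with the time filter first, so partition pruning kicks in:

Perf
| where TimeGenerated > ago(24h)
| where CounterName == "% Processor Time"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)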

Why: Time filtering enables partition pruning. Without it, the query scans all data.

Real impact from my experience:

  • Bad query: 15 seconds, 100M rows scanned

  • Good query: 0.5 seconds, 2M rows scanned

2. Use Appropriate String Operators

String operations have vastly different performance characteristics.

Performance hierarchy (fastest to slowest):
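  1. == and in - exact value matches

  2. has / has_any - whole-term matches that can use the term index

  3. contains / startswith - partial matches that cannot use the term index

  4. matches regex - regular expressions (slowest)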

Real example from my work:
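The pattern looked like this (the table and search term below are stand-ins, not the production values):

// Slower: substring match that can't use the term index
SecurityEvent
| where TimeGenerated > ago(24h)
| where CommandLine contains "powershell"

// Faster: whole-term match that uses the term index
SecurityEvent
| where TimeGenerated > ago(24h)
| where CommandLine has "powershell"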

3. Project Early to Reduce Data Volume

Select only needed columns early in the query.

Bad:
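Dragging every column through a sort when only a few are needed (illustrative):

SecurityEvent
| where TimeGenerated > ago(24h)
| where EventID == 4625
| sort by TimeGenerated desc
| project TimeGenerated, Account, Computer, IpAddress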

Good:
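Project the needed columns as soon as the rows are filtered:

SecurityEvent
| where TimeGenerated > ago(24h)
| where EventID == 4625
| project TimeGenerated, Account, Computer, IpAddress
| sort by TimeGenerated desc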

4. Optimize Joins

Joins are expensive. Here's how I optimize them:

Tip 1: Filter before joining

Tip 2: Put the smaller table on the left side of the join

Tip 3: Use appropriate join kind
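Putting the three tips together (the tables and filters are placeholders):

// Small, filtered set on the left; filtered subquery on the right; inner join
Heartbeat
| where TimeGenerated > ago(1h)
| where OSType == "Linux"
| distinct Computer
| join kind=inner (
    Perf
    | where TimeGenerated > ago(1h)
    | where CounterName == "% Processor Time"
) on Computer
| summarize AvgCpu = avg(CounterValue) by Computer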

5. Use summarize Instead of distinct

When you just need unique values:

Slower:
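Using SecurityEvent as a stand-in for the three variants:

SecurityEvent
| where TimeGenerated > ago(24h)
| distinct Computer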

Faster:
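SecurityEvent
| where TimeGenerated > ago(24h)
| summarize by Computer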

Even better when you need counts:
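SecurityEvent
| where TimeGenerated > ago(24h)
| summarize EventCount = count() by Computer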

6. Optimize Time Binning

Use appropriate bin sizes for your time range.

My bin size guidelines:

  • Last hour: 1-5 minute bins

  • Last day: 5-15 minute bins

  • Last week: 1 hour bins

  • Last month: 4-6 hour bins

  • Last year: 1 day bins
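For example, a week of CPU data binned hourly (the counter name is a placeholder):

Perf
| where TimeGenerated > ago(7d)
| where CounterName == "% Processor Time"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)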

7. Limit Results During Development

Always use take or limit when developing queries.
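For example, while iterating on a new query:

SecurityEvent
| where TimeGenerated > ago(1h)
| take 100
// Remove the take once the query shape is right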

8. Use let for Complex Calculations

Reuse expensive calculations with let statements.

Bad - Calculated multiple times:
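A sketch of the pattern (the baseline logic here is illustrative):

Perf
| where TimeGenerated > ago(1h)
| where CounterName == "% Processor Time"
| where CounterValue > toscalar(Perf
    | where TimeGenerated > ago(7d)
    | where CounterName == "% Processor Time"
    | summarize avg(CounterValue)) * 1.5
| extend Deviation = CounterValue - toscalar(Perf
    | where TimeGenerated > ago(7d)
    | where CounterName == "% Processor Time"
    | summarize avg(CounterValue))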

Good - Calculated once:
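let Baseline = toscalar(
    Perf
    | where TimeGenerated > ago(7d)
    | where CounterName == "% Processor Time"
    | summarize avg(CounterValue));
Perf
| where TimeGenerated > ago(1h)
| where CounterName == "% Processor Time"
| where CounterValue > Baseline * 1.5
| extend Deviation = CounterValue - Baseline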

9. Avoid Cartesian Products

Be careful with joins that don't use a proper join key.

Bad - Cartesian product:
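Joining on a constant key multiplies every left row by every right row (illustrative):

Perf
| where TimeGenerated > ago(1h)
| extend JoinKey = 1
| join kind=inner (
    Heartbeat
    | where TimeGenerated > ago(1h)
    | extend JoinKey = 1
) on JoinKey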

Good - Proper key:
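Perf
| where TimeGenerated > ago(1h)
| join kind=inner (
    Heartbeat
    | where TimeGenerated > ago(1h)
    | summarize arg_max(TimeGenerated, OSType) by Computer
) on Computer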

10. Use Column Existence Checks

When working with dynamic schemas:
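column_ifexists() falls back to a default when a column is missing, so the query keeps working across schema variations (the MyApp_CL table and Region_s column here are hypothetical):

MyApp_CL
| where TimeGenerated > ago(24h)
| extend Region = column_ifexists("Region_s", "unknown")
| summarize count() by Region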

Query Structure Best Practices

Template for Well-Structured Queries

Here's my standard query template:
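The details vary, but the shape is roughly this:

// Purpose: <what question the query answers>
// Owner:   <who maintains it>
let TimeRange = 24h;                                  // parameters up front
let CpuThreshold = 90;
Perf
| where TimeGenerated > ago(TimeRange)                // 1. time filter first
| where CounterName == "% Processor Time"             // 2. narrow the rows
| project TimeGenerated, Computer, CounterValue       // 3. keep only needed columns
| summarize AvgCpu = avg(CounterValue)
    by Computer, bin(TimeGenerated, 1h)               // 4. aggregate
| where AvgCpu > CpuThreshold                         // 5. filter aggregated results
| order by AvgCpu desc                                // 6. present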

Query Organization

1. Use meaningful variable names:
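For example (the values are placeholders):

// Names that explain intent beat t, x, and val
let lookbackPeriod = 7d;
let cpuAlertThreshold = 90;
Perf
| where TimeGenerated > ago(lookbackPeriod)
| where CounterName == "% Processor Time"
| where CounterValue > cpuAlertThreshold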

2. Break complex queries into steps:
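Each let statement names one logical step (illustrative):

let highCpuComputers = Perf
    | where TimeGenerated > ago(1h)
    | where CounterName == "% Processor Time"
    | summarize AvgCpu = avg(CounterValue) by Computer
    | where AvgCpu > 90;
let latestHeartbeats = Heartbeat
    | where TimeGenerated > ago(1h)
    | summarize arg_max(TimeGenerated, OSType) by Computer;
highCpuComputers
| join kind=inner (latestHeartbeats) on Computer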

3. Comment complex logic:
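For instance:

Perf
| where TimeGenerated > ago(24h)
| where CounterName == "% Free Space"
// Invert free space so rising lines mean filling disks - easier to read on a chart
| extend UsedPct = 100 - CounterValue
| summarize max(UsedPct) by Computer, InstanceName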

Common Anti-Patterns to Avoid

Anti-Pattern 1: Filtering After Aggregation
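An illustrative before and after:

// Wasteful: aggregate everything, then discard most of it
Perf
| where TimeGenerated > ago(24h)
| summarize AvgValue = avg(CounterValue) by Computer, CounterName
| where CounterName == "% Processor Time"

// Better: filter the raw rows first
Perf
| where TimeGenerated > ago(24h)
| where CounterName == "% Processor Time"
| summarize AvgValue = avg(CounterValue) by Computer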

Anti-Pattern 2: Unnecessary Data Scans
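A common example is an unscoped search (sketch):

// Scans every table in the workspace
search "powershell"

// Scoped to one table, one time range, one column
SecurityEvent
| where TimeGenerated > ago(24h)
| where CommandLine has "powershell"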

Anti-Pattern 3: Multiple Similar Queries
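Instead of running one query per event ID, fold them into one (the event IDs are examples):

SecurityEvent
| where TimeGenerated > ago(24h)
| where EventID in (4624, 4625, 4688)
| summarize count() by EventID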

Anti-Pattern 4: Not Using Cached Results
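Within a single query, materialize() is one way to compute a subquery once and reuse the cached result (sketch):

let FailedLogons = materialize(
    SecurityEvent
    | where TimeGenerated > ago(24h)
    | where EventID == 4625);
FailedLogons
| summarize FailuresPerAccount = count() by Account
| top 10 by FailuresPerAccount
| join kind=inner (
    FailedLogons
    | summarize count() by Account, Computer
) on Account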

Testing and Validation

Query Testing Checklist

Before deploying queries to production dashboards or alerts:

1. Test with different time ranges:
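For example:

// Run the same query at 1h, 24h, and 7d and compare row counts and runtime
SecurityEvent
| where TimeGenerated > ago(1h)   // swap in ago(24h), ago(7d)
| where EventID == 4625
| summarize count() by bin(TimeGenerated, 15m)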

2. Verify null handling:
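For example:

// Make sure empty values don't silently drop out of or skew the results
SecurityEvent
| where TimeGenerated > ago(24h)
| where EventID == 4625
| extend AccountName = iff(isempty(Account), "<unknown>", Account)
| summarize count() by AccountName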

3. Test edge cases:

  • No data scenarios

  • Single record scenarios

  • Zero division situations

  • Missing columns

4. Validate aggregations:
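One sanity check I find useful (illustrative):

// Cross-check that per-group counts add up to the raw row count
let RawCount = toscalar(
    SecurityEvent
    | where TimeGenerated > ago(24h)
    | count);
SecurityEvent
| where TimeGenerated > ago(24h)
| summarize PerComputer = count() by Computer
| summarize Total = sum(PerComputer)
| extend Expected = RawCount, Difference = Total - RawCount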

Performance Benchmarking

Compare query variations:
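Run each variant over the same time range and compare the statistics (records scanned, data processed, CPU time) after execution, for example the contains and has variants from earlier:

SecurityEvent
| where TimeGenerated > ago(24h)
| where CommandLine contains "powershell"
| count

SecurityEvent
| where TimeGenerated > ago(24h)
| where CommandLine has "powershell"
| count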

Maintenance and Documentation

Query Documentation Template

I use this template for important queries:
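A comment header at the top of the saved query, along these lines (the fields are illustrative):

// =====================================================================
// Name:         High CPU alert query
// Purpose:      Flag computers averaging above the CPU threshold
// Owner:        <team or person>
// Created:      <date>            Last reviewed: <date>
// Used by:      <dashboard, workbook, or alert rule>
// Notes:        Thresholds should match the capacity baseline
// =====================================================================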

Version Control for Queries

Store important queries in version control so changes are tracked and can be reviewed like any other code.

Cost Optimization

Understanding Query Cost

Log Analytics charges based on data ingestion and retention, but query execution affects performance and user experience.

Reduce query cost by:

  1. Shorter time ranges when possible

  2. Fewer columns in results

  3. Efficient filtering to reduce scans

  4. Appropriate aggregation levels

Data Retention Strategy

Balance cost against how long each type of data actually needs to be kept.

My retention guidelines:

  • Critical logs: 90-180 days

  • Performance metrics: 90 days

  • Debug logs: 30 days

  • Verbose logs: 7 days

Real-World Optimization Case Study

Before Optimization:
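A simplified sketch of the kind of query involved (tables and thresholds are placeholders, not the original production query):

// Unfiltered join, time filter late, every column carried along
SecurityEvent
| join kind=inner (
    Heartbeat
) on Computer
| where TimeGenerated > ago(7d)
| where EventID == 4625
| summarize FailedLogons = count() by Computer, OSType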

After Optimization:
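// Filters and projection pushed into subqueries, constants in let statements
let lookback = 7d;
let failedLogons = SecurityEvent
    | where TimeGenerated > ago(lookback)
    | where EventID == 4625
    | project Computer;
let computerInfo = Heartbeat
    | where TimeGenerated > ago(lookback)
    | summarize arg_max(TimeGenerated, OSType) by Computer;
failedLogons
| summarize FailedLogons = count() by Computer
| join kind=inner (computerInfo) on Computer
| project Computer, OSType, FailedLogons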

Optimization techniques applied:

  1. Time filtering first

  2. Added filters before join

  3. Project columns early

  4. Filter in subqueries

  5. Used let for constants

Key Takeaways

  • Always filter by time first - enables partition pruning

  • Use appropriate string operators: has > contains > regex

  • Project columns early to reduce data volume

  • Optimize joins by filtering first and ordering correctly

  • Use summarize instead of distinct when possible

  • Size time bins appropriately for your range

  • Document complex queries thoroughly

  • Test queries with various time ranges

  • Benchmark query variations

  • Use let for reusable calculations

  • Avoid common anti-patterns

  • Monitor query statistics

In Part 7, we'll apply everything we've learned to real-world production scenarios: anomaly detection, security monitoring, capacity planning, and advanced observability patterns.
