Part 3: Advanced Query Operators and Functions

Moving Beyond the Basics

After mastering fundamental operators, I discovered that KQL's advanced capabilities truly shine when analyzing complex scenarios across multiple data sources. In this part, I'll share the advanced operators and functions that transformed my observability practice.

join - Combining Data from Multiple Tables

The join operator is essential when correlating data across different log sources. Be aware that, unlike SQL, KQL's join defaults to the innerunique kind, which deduplicates matching keys on the left side before joining - convenient for log data, but surprising if you expect SQL's inner-join semantics.

Join Types I Use

1. innerunique (default) - Most common in my queries:

// Correlate VM performance with heartbeat data
Perf
| where TimeGenerated > ago(1h)
| where CounterName == "% Processor Time"
| summarize AvgCpu = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| join kind=innerunique (
    Heartbeat
    | where TimeGenerated > ago(1h)
    | summarize LastHeartbeat = max(TimeGenerated) by Computer
) on Computer
| project Computer, AvgCpu, LastHeartbeat

2. leftouter - When I need all left-side records:
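A sketch using the same Perf and Heartbeat tables as above - machines with no heartbeat still appear, with an empty LastHeartbeat:

// Keep every machine from the Perf side, even without a heartbeat
Perf
| where TimeGenerated > ago(1h)
| where CounterName == "% Processor Time"
| summarize AvgCpu = avg(CounterValue) by Computer
| join kind=leftouter (
    Heartbeat
    | where TimeGenerated > ago(1h)
    | summarize LastHeartbeat = max(TimeGenerated) by Computer
) on Computer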

3. rightouter - Less common but useful:
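The mirror image - every machine that sent a heartbeat survives, even if it reported no performance counters:

// Keep every machine from the Heartbeat side
Perf
| where TimeGenerated > ago(1h)
| where CounterName == "% Processor Time"
| summarize AvgCpu = avg(CounterValue) by Computer
| join kind=rightouter (
    Heartbeat
    | where TimeGenerated > ago(1h)
    | summarize LastHeartbeat = max(TimeGenerated) by Computer
) on Computer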

4. inner - Traditional inner join:
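A sketch assuming workspace-based Application Insights tables (AppRequests and AppExceptions both carry an OperationId):

// inner keeps every matching pair - no left-side deduplication
AppRequests
| where TimeGenerated > ago(1h)
| join kind=inner (
    AppExceptions
    | where TimeGenerated > ago(1h)
) on OperationId
| project Name, OperationId, ExceptionType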

Real-World Join Pattern: Correlating Application and Infrastructure

This is a pattern I use frequently to correlate application errors with infrastructure issues:
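A sketch of the idea, assuming workspace-based Application Insights (AppExceptions with an AppRoleInstance column) alongside the Perf table, with instance names matching computer names:

// Flag 5-minute windows where error spikes coincide with high CPU
AppExceptions
| where TimeGenerated > ago(1h)
| summarize Errors = count() by AppRoleInstance, Bucket = bin(TimeGenerated, 5m)
| join kind=inner (
    Perf
    | where TimeGenerated > ago(1h)
    | where CounterName == "% Processor Time"
    | summarize AvgCpu = avg(CounterValue) by Computer, Bucket = bin(TimeGenerated, 5m)
) on $left.AppRoleInstance == $right.Computer, Bucket
| where Errors > 10 and AvgCpu > 80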

Join Performance Tips from Experience
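The tips that matter most are standard Kusto guidance: filter both sides on time before joining, trim columns early with project, and keep the smaller table on the left. A sketch of what that looks like with the tables from the examples above:

// Filter and trim both sides first - the join then touches far less data
Perf
| where TimeGenerated > ago(1h)
| where CounterName == "% Processor Time"
| project TimeGenerated, Computer, CounterValue
| join kind=innerunique (
    Heartbeat
    | where TimeGenerated > ago(1h)
    | distinct Computer
) on Computer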

union - Combining Similar Tables

Use union when you need to query across multiple tables or workspaces.

Basic Union:
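The built-in Type column tells you which table each row came from, which makes union results easy to break down:

union AzureActivity, AzureDiagnostics
| where TimeGenerated > ago(1h)
| summarize Count = count() by Type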

Union with Wildcards:
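Wildcards pull in every table whose name matches the pattern - here, all the App* tables from workspace-based Application Insights:

union App*
| where TimeGenerated > ago(1h)
| summarize Count = count() by Type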

Cross-Workspace Queries:
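The workspace() function reaches into another Log Analytics workspace; the workspace name here is a made-up example:

// "contoso-secondary" is a placeholder - use your own workspace name or ID
union Heartbeat, workspace("contoso-secondary").Heartbeat
| summarize LastSeen = max(TimeGenerated) by Computer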

Real Pattern: Unified Error Dashboard

This is how I build a unified error view across all application components:
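A sketch assuming workspace-based Application Insights tables, treating every exception plus any warning-or-worse trace (SeverityLevel >= 3) as an error:

union AppExceptions, (AppTraces | where SeverityLevel >= 3)
| where TimeGenerated > ago(24h)
| summarize Errors = count() by Component = AppRoleName, Source = Type, bin(TimeGenerated, 1h)
| order by TimeGenerated asc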

mv-expand - Expanding Multi-Value Fields

mv-expand is crucial when working with arrays or dynamic fields. I use it frequently with Azure resource logs.

Expanding Arrays:
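A self-contained illustration using datatable - each array element becomes its own row:

datatable(Computer:string, Drives:dynamic)[
    "vm-01", dynamic(["C:", "D:"]),
    "vm-02", dynamic(["C:"])
]
| mv-expand Drive = Drives
| project Computer, Drive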

Real Pattern: Analyzing Tags

I use this pattern to analyze Azure resource tags:
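A sketch of the shape of this pattern against Azure Resource Graph's resources table, where tags is a dynamic property bag - mv-expand turns each tag into its own row:

resources
| mv-expand tag = tags
| extend TagName = tostring(bag_keys(tag)[0])
| extend TagValue = tostring(tag[TagName])
| summarize ResourceCount = count() by TagName, TagValue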

Expanding Nested JSON:
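Combine parse_json with mv-expand to walk into nested structures - a self-contained example:

datatable(Record:string)[
    '{"user":"alice","roles":["reader","writer"]}'
]
| extend Parsed = parse_json(Record)
| mv-expand Role = Parsed.roles
| project User = tostring(Parsed.user), Role = tostring(Role)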

parse - Extracting Structured Data

The parse operator extracts fields from strings using patterns.

Simple Parse:
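The pattern is a template of literals and column names; parse slots the text between the literals into new columns:

datatable(Message:string)[
    "User alice logged in from 10.0.0.5"
]
| parse Message with "User " User " logged in from " IP
| project User, IP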

Multiple Parse Patterns:
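When one column carries several message shapes, chain parse statements - rows that don't match a pattern simply get empty values for that pattern's columns:

datatable(Message:string)[
    "Login success for alice",
    "Login failure for bob: bad password"
]
| parse Message with "Login success for " SuccessUser
| parse Message with "Login failure for " FailUser ": " Reason
| project Message, SuccessUser, FailUser, Reason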

Real Pattern: Parsing API Gateway Logs

This is how I parse Application Gateway logs:
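Application Gateway access logs land in AzureDiagnostics with mostly structured columns; parse earns its keep when the interesting detail is buried inside a string such as the request URI. A sketch that pulls the first path segment out as a service name:

AzureDiagnostics
| where Category == "ApplicationGatewayAccessLog"
| parse requestUri_s with "/" Service "/" *
| summarize Requests = count() by Service, httpStatus_d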

make-series - Time Series Analysis

make-series is powerful for creating time-series data with automatic gap filling.

Basic Time Series:
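Unlike summarize with bin(), make-series produces one row per group with arrays of values over a fixed time axis:

Perf
| where CounterName == "% Processor Time"
| make-series AvgCpu = avg(CounterValue) on TimeGenerated from ago(1d) to now() step 1h by Computer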

Time Series with Gap Filling:
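The default clause fills empty buckets - essential for count-based series, where a silent interval would otherwise be a hole rather than a zero:

// Empty 5-minute buckets become 0 instead of missing points
// (series_fill_linear / series_fill_forward offer interpolation instead)
AppRequests
| make-series Requests = count() default=0 on TimeGenerated from ago(6h) to now() step 5m by Name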

Real Pattern: Anomaly Detection

I combine make-series with series_decompose_anomalies for anomaly detection:
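A sketch of the pattern: build the series, score it, then expand back to rows and keep only the flagged points (1.5 is the anomaly threshold - tune it to your data):

Perf
| where CounterName == "% Processor Time"
| make-series Cpu = avg(CounterValue) default=0 on TimeGenerated from ago(1d) to now() step 10m by Computer
| extend (anomalies, score, baseline) = series_decompose_anomalies(Cpu, 1.5, -1, 'linefit')
| mv-expand TimeGenerated to typeof(datetime), Cpu to typeof(double), anomalies to typeof(double)
| where anomalies != 0
| project Computer, TimeGenerated, Cpu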

let - Creating Variables and Functions

let statements make complex queries more readable and reusable.

Variable Definition:
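Scalar variables keep thresholds and windows in one place at the top of the query:

let lookback = 24h;
let cpuThreshold = 80.0;
Perf
| where TimeGenerated > ago(lookback)
| where CounterName == "% Processor Time" and CounterValue > cpuThreshold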

Tabular Variables:
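A let statement can also hold a whole subquery; a single-column tabular variable works directly with the in() operator:

let ActiveComputers = Heartbeat
    | where TimeGenerated > ago(1h)
    | distinct Computer;
Perf
| where TimeGenerated > ago(1h)
| where Computer in (ActiveComputers)
| summarize avg(CounterValue) by Computer, CounterName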

Functions with let:
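A lambda bound with let behaves like an inline function - here a small percentage helper (my own example name):

let toPercent = (part: real, total: real) {
    round(100.0 * part / total, 2)
};
AppRequests
| where TimeGenerated > ago(1h)
| summarize Total = count(), Failed = countif(Success == false)
| extend FailurePct = toPercent(todouble(Failed), todouble(Total))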

Real Pattern: Reusable Time Windows

This is a pattern I use across many queries:
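Defining the window and granularity once means the whole query retunes from three lines at the top:

let startTime = ago(7d);
let endTime = now();
let granularity = 1h;
AppRequests
| where TimeGenerated between (startTime .. endTime)
| summarize Requests = count() by bin(TimeGenerated, granularity)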

Advanced Functions

String Functions:
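A few of the workhorses - tolower, split, and regex extraction - in one self-contained snippet:

print Raw = "ERROR: disk full on /dev/sda1"
| extend Lower = tolower(Raw)
| extend Level = tostring(split(Raw, ": ")[0])
| extend Device = extract(@"on (\S+)$", 1, Raw)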

Array and Bag Functions:
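A quick tour of the common ones - bag_keys for property bags, array_length and pack_array for arrays, set_union for deduplicated merges:

print Bag = dynamic({"region":"eastus","env":"prod"})
| extend Keys = bag_keys(Bag)
| extend KeyCount = array_length(Keys)
| extend Merged = set_union(pack_array(1, 2, 3), pack_array(3, 4))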

Mathematical Functions:
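Row-level math functions work in extend; statistical ones like percentile() and stdev() are aggregates and live inside summarize:

print Value = 1234.5678
| extend Rounded = round(Value, 2), Floored = floor(Value, 100), NaturalLog = log(Value)

Perf
| where CounterName == "% Processor Time"
| summarize P95 = percentile(CounterValue, 95), StDev = stdev(CounterValue) by Computer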

Complex Real-World Query Patterns

Pattern 1: Request Success Rate with Latency Analysis
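A sketch over workspace-based Application Insights requests - success rate and latency percentiles per operation in one pass:

AppRequests
| where TimeGenerated > ago(1h)
| summarize Total = count(),
    Failed = countif(Success == false),
    P50 = percentile(DurationMs, 50),
    P95 = percentile(DurationMs, 95)
    by Name
| extend SuccessRate = round(100.0 * (Total - Failed) / Total, 2)
| order by P95 desc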

Pattern 2: Resource Health Correlation Matrix
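A sketch of the idea: pivot heartbeat freshness, CPU, and memory into one row per machine using tabular let variables and leftouter joins:

let window = 1h;
let cpu = Perf
    | where TimeGenerated > ago(window)
    | where CounterName == "% Processor Time"
    | summarize AvgCpu = avg(CounterValue) by Computer;
let mem = Perf
    | where TimeGenerated > ago(window)
    | where CounterName == "Available MBytes"
    | summarize AvgMemMB = avg(CounterValue) by Computer;
Heartbeat
| where TimeGenerated > ago(window)
| summarize LastSeen = max(TimeGenerated) by Computer
| join kind=leftouter (cpu) on Computer
| join kind=leftouter (mem) on Computer
| project Computer, LastSeen, AvgCpu, AvgMemMB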

Pattern 3: Service Dependency Analysis
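A sketch over the AppDependencies table - call volume, failure rate, and latency per downstream target:

AppDependencies
| where TimeGenerated > ago(1h)
| summarize Calls = count(),
    Failures = countif(Success == false),
    AvgDurationMs = avg(DurationMs)
    by DependencyType, Target
| extend FailureRate = round(100.0 * Failures / Calls, 2)
| order by FailureRate desc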

Query Optimization Techniques

1. Use materialize() for subqueries referenced more than once (full materialized views are an Azure Data Explorer feature, not available in Log Analytics):
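A minimal sketch: the exception set is computed once and reused for both the total and the per-role breakdown:

let errors = materialize(
    AppExceptions
    | where TimeGenerated > ago(1d)
);
let total = toscalar(errors | count);
errors
| summarize Count = count() by AppRoleName
| extend Share = round(100.0 * Count / total, 2)
| order by Count desc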

2. Partition by time first:
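Put the time predicate first so the engine prunes whole data partitions before evaluating anything else:

AppRequests
| where TimeGenerated > ago(1h)   // time filter first
| where Success == false
| where Name has "checkout"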

3. Use summarize instead of distinct when possible:
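summarize with no aggregation expressions is equivalent to distinct and tends to optimize better in grouped pipelines:

// Instead of: Heartbeat | distinct Computer, OSType
Heartbeat
| where TimeGenerated > ago(1h)
| summarize by Computer, OSType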

Key Takeaways

  • join correlates data across tables - use with filters for performance

  • union combines similar tables - great for multi-source queries

  • mv-expand handles arrays and dynamic fields effectively

  • parse extracts structured data from strings

  • make-series creates time-series with automatic gap filling

  • let improves query readability and reusability

  • Advanced functions enable sophisticated data transformation

  • Always optimize by filtering early and reducing data processed

In Part 4, we'll focus specifically on querying Azure Log Analytics workspace, understanding different log schemas, and practical patterns for common Azure resources.
