Monitoring, Logging, and Operational Excellence

Introduction

Through my work on cloud security and operations, I've learned that effective monitoring and logging are what make security incidents detectable and, in many cases, preventable.

Working on security incident response projects revealed common patterns in why breaches go undetected:

Security alerts generated but not effectively monitored
High volume of alerts causing alert fatigue
Decentralized logging across multiple accounts
Logs not protected from tampering or deletion
Insufficient log retention for forensic investigation
No correlation or analysis of security events
Lack of automated response capabilities

The fundamental issue in these cases was treating logging as a compliance checkbox rather than an operational security capability. Collecting logs is meaningless if they can't be searched, correlated, protected, and acted upon.

Through implementing comprehensive observability platforms, I've learned that effective logging architecture requires centralization, immutability, intelligent alerting, and automated response.

This article shares the monitoring and logging patterns I've built into landing zones: centralized log aggregation, SIEM integration, immutable log storage, intelligent alerting that reduces noise, and automated incident response.

Why Centralized Logging Matters

The Problem with Decentralized Logging

Scenario: You have 50 AWS accounts. Each account logs to its own S3 bucket.

When investigating a security incident:

❌ Problem: Check 50 different S3 buckets
   - Find the right account
   - Find the right time range
   - Download logs locally
   - Correlate events manually
   - Time to investigate: Hours to days

✅ Solution: Single centralized log repository
   - Query all accounts from one place
   - Automated correlation
   - Real-time alerting
   - Time to investigate: Minutes

Centralized Logging Benefits

| Benefit | Impact |
| --- | --- |
| Security Incident Response | Detect and investigate threats across all accounts simultaneously |
| Compliance | Single source of truth for auditors, immutable audit trail |
| Cost Optimization | Identify waste across entire organization |
| Troubleshooting | Correlate events across services and accounts |
| Forensics | Comprehensive timeline of events for investigations |
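To make the correlation and forensics benefits concrete, here is a minimal Python sketch of building one timeline from per-account event streams. The field names mirror CloudTrail records, but the sample events and account IDs are made up, and the sketch assumes each account's events are already sorted by time.

```python
import heapq
from datetime import datetime


def parse_time(event):
    # CloudTrail-style ISO 8601 UTC timestamp, e.g. "2024-01-01T10:00:00Z"
    return datetime.strptime(event["eventTime"], "%Y-%m-%dT%H:%M:%SZ")


def merged_timeline(*per_account_events):
    """Merge already-sorted per-account event lists into one ordered timeline."""
    return list(heapq.merge(*per_account_events, key=parse_time))


# Illustrative sample data (not real events)
account_a = [
    {"eventTime": "2024-01-01T10:00:00Z", "account": "111111111111", "eventName": "ConsoleLogin"},
    {"eventTime": "2024-01-01T10:07:00Z", "account": "111111111111", "eventName": "AssumeRole"},
]
account_b = [
    {"eventTime": "2024-01-01T10:05:00Z", "account": "222222222222", "eventName": "CreateAccessKey"},
]

timeline = merged_timeline(account_a, account_b)
for event in timeline:
    print(event["eventTime"], event["account"], event["eventName"])
```

With decentralized logs, this merge is the step an investigator ends up doing by hand; a central repository lets the query engine do it.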

Logging Architecture Patterns

Pattern 1: Hub-and-Spoke Logging

Pattern 2: Real-Time Streaming Architecture

AWS CloudTrail and CloudWatch

Multi-Account CloudTrail Setup

Organization Trail (recommended for landing zones):

# In the Management Account
resource "aws_cloudtrail" "organization_trail" {
  name                          = "organization-trail"
  s3_bucket_name               = aws_s3_bucket.cloudtrail_logs.id
  include_global_service_events = true
  is_multi_region_trail        = true
  is_organization_trail        = true
  enable_log_file_validation   = true
  
  event_selector {
    read_write_type           = "All"
    include_management_events = true
    
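    # Basic event selectors take an ARN prefix, not wildcards:
    # "arn:aws:s3" covers all objects in all buckets
    data_resource {
      type   = "AWS::S3::Object"
      values = ["arn:aws:s3"]
    }
    
    # "arn:aws:lambda" covers all Lambda functions
    data_resource {
      type   = "AWS::Lambda::Function"
      values = ["arn:aws:lambda"]
    }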
  }
  
  insight_selector {
    insight_type = "ApiCallRateInsight"
  }
  
  insight_selector {
    insight_type = "ApiErrorRateInsight"
  }
  
  # Send to CloudWatch Logs for real-time monitoring
  cloud_watch_logs_group_arn = "${aws_cloudwatch_log_group.cloudtrail.arn}:*"
  cloud_watch_logs_role_arn  = aws_iam_role.cloudtrail_cloudwatch.arn
}

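# Central log bucket (in Log Archive account)
resource "aws_s3_bucket" "cloudtrail_logs" {
  bucket = "company-cloudtrail-logs-${data.aws_caller_identity.current.account_id}"

  # Turn on Object Lock at creation so the retention rule below can be applied
  object_lock_enabled = true
}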

# Enable versioning
resource "aws_s3_bucket_versioning" "cloudtrail_logs" {
  bucket = aws_s3_bucket.cloudtrail_logs.id
  
  versioning_configuration {
    status = "Enabled"
  }
}

# Enable S3 Object Lock (immutability)
resource "aws_s3_bucket_object_lock_configuration" "cloudtrail_logs" {
  bucket = aws_s3_bucket.cloudtrail_logs.id
  
  rule {
    default_retention {
      # GOVERNANCE can be bypassed by principals with s3:BypassGovernanceRetention;
      # use COMPLIANCE for true immutability
      mode  = "GOVERNANCE"
      years = 7
    }
  }
}

# Lifecycle policy: Archive to Glacier after 90 days
resource "aws_s3_bucket_lifecycle_configuration" "cloudtrail_logs" {
  bucket = aws_s3_bucket.cloudtrail_logs.id
  
  rule {
    id     = "archive-old-logs"
    status = "Enabled"
    
    # Empty filter: the rule applies to every object in the bucket
    filter {}
    
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    
    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }
}

# Bucket policy: Allow CloudTrail from all organization accounts
resource "aws_s3_bucket_policy" "cloudtrail_logs" {
  bucket = aws_s3_bucket.cloudtrail_logs.id
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AWSCloudTrailAclCheck"
        Effect = "Allow"
        Principal = {
          Service = "cloudtrail.amazonaws.com"
        }
        Action   = "s3:GetBucketAcl"
        Resource = aws_s3_bucket.cloudtrail_logs.arn
      },
      {
        Sid    = "AWSCloudTrailWrite"
        Effect = "Allow"
        Principal = {
          Service = "cloudtrail.amazonaws.com"
        }
        Action   = "s3:PutObject"
        Resource = "${aws_s3_bucket.cloudtrail_logs.arn}/*"
        Condition = {
          StringEquals = {
            "s3:x-amz-acl" = "bucket-owner-full-control"
            "aws:SourceOrgID" = data.aws_organizations_organization.main.id
          }
        }
      }
    ]
  })
}

# CloudWatch Log Group for real-time monitoring
resource "aws_cloudwatch_log_group" "cloudtrail" {
  name              = "/aws/cloudtrail/organization-trail"
  retention_in_days = 90
}

# IAM role for CloudTrail to write to CloudWatch
resource "aws_iam_role" "cloudtrail_cloudwatch" {
  name = "cloudtrail-cloudwatch-logs-role"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "cloudtrail.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy" "cloudtrail_cloudwatch" {
  name = "cloudtrail-cloudwatch-logs-policy"
  role = aws_iam_role.cloudtrail_cloudwatch.id
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "${aws_cloudwatch_log_group.cloudtrail.arn}:*"
      }
    ]
  })
}
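Setting `enable_log_file_validation = true` makes CloudTrail deliver digest files that can reveal modified or deleted log files. The Python sketch below illustrates only the general idea behind that kind of check, a hash chain in which each digest covers the previous one. It is a simplified illustration with hypothetical function names, not the actual CloudTrail digest format, which also relies on signatures.

```python
import hashlib


def chain_hashes(log_blobs):
    """Return chained SHA-256 hex digests, one per log blob."""
    previous = ""
    chain = []
    for blob in log_blobs:
        digest = hashlib.sha256(previous.encode() + blob).hexdigest()
        chain.append(digest)
        previous = digest
    return chain


def first_tampered_index(log_blobs, recorded_chain):
    """Return the index of the first mismatch against the recorded chain, or None."""
    current_chain = chain_hashes(log_blobs)
    for i, (expected, actual) in enumerate(zip(recorded_chain, current_chain)):
        if expected != actual:
            return i
    if len(current_chain) != len(recorded_chain):
        # Blobs were removed from, or appended to, the end of the sequence
        return min(len(current_chain), len(recorded_chain))
    return None


logs = [b"event-1", b"event-2", b"event-3"]
recorded = chain_hashes(logs)

print(first_tampered_index(logs, recorded))
print(first_tampered_index([b"event-1", b"event-X", b"event-3"], recorded))
print(first_tampered_index(logs[:2], recorded))
```

Because each digest depends on the one before it, changing or dropping any single entry is visible from that point onward. For real trails, the `aws cloudtrail validate-logs` CLI command performs the validation.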

Critical CloudWatch Metric Filters

1. Root Account Usage

resource "aws_cloudwatch_log_metric_filter" "root_usage" {
  name           = "root-account-usage"
  log_group_name = aws_cloudwatch_log_group.cloudtrail.name
  
  # HCL has no single-quoted strings: use double quotes and escape the inner ones
  pattern = "{ $.userIdentity.type = \"Root\" && $.userIdentity.invokedBy NOT EXISTS && $.eventType != \"AwsServiceEvent\" }"
  
  metric_transformation {
    name      = "RootAccountUsageCount"
    namespace = "SecurityMetrics"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "root_usage" {
  alarm_name          = "root-account-usage-detected"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "RootAccountUsageCount"
  namespace           = "SecurityMetrics"
  period              = 60
  statistic           = "Sum"
  threshold           = 0
  alarm_description   = "Root account usage detected"
  alarm_actions       = [aws_sns_topic.security_alerts.arn]
}
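To show what the root-usage filter pattern selects, here is a rough Python equivalent of its three conditions applied to a parsed CloudTrail record. It approximates the filter semantics for illustration only, and the sample events are made up.

```python
def is_root_usage(event):
    """Approximate the metric filter: root identity, not invoked by a service, not a service event."""
    identity = event.get("userIdentity", {})
    return (
        identity.get("type") == "Root"
        and "invokedBy" not in identity
        and event.get("eventType") != "AwsServiceEvent"
    )


# Illustrative sample events
console_login = {"userIdentity": {"type": "Root"}, "eventType": "AwsConsoleSignIn"}
service_event = {"userIdentity": {"type": "Root"}, "eventType": "AwsServiceEvent"}
iam_user_call = {"userIdentity": {"type": "IAMUser"}, "eventType": "AwsApiCall"}

for sample in (console_login, service_event, iam_user_call):
    print(sample["eventType"], is_root_usage(sample))
```

Only the first sample counts as root usage; the other two are excluded by the service-event and identity-type conditions.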

2. Unauthorized API Calls

resource "aws_cloudwatch_log_metric_filter" "unauthorized_api_calls" {
  name           = "unauthorized-api-calls"
  log_group_name = aws_cloudwatch_log_group.cloudtrail.name
  
  # HCL has no single-quoted strings: use double quotes and escape the inner ones
  pattern = "{ ($.errorCode = \"*UnauthorizedOperation\") || ($.errorCode = \"AccessDenied*\") }"
  
  metric_transformation {
    name      = "UnauthorizedAPICallsCount"
    namespace = "SecurityMetrics"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "unauthorized_api_calls" {
  alarm_name          = "unauthorized-api-calls"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "UnauthorizedAPICallsCount"
  namespace           = "SecurityMetrics"
  period              = 300
  statistic           = "Sum"
  threshold           = 5  # GreaterThanThreshold: alert on more than 5 unauthorized calls in 5 minutes
  alarm_description   = "Multiple unauthorized API calls detected"
  alarm_actions       = [aws_sns_topic.security_alerts.arn]
}
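The alarm above sums the metric over fixed 300-second periods and fires when a period's sum is strictly greater than the threshold. A small Python sketch of that evaluation, using made-up event timestamps in epoch seconds, shows why exactly five calls do not trigger it:

```python
from collections import Counter


def breaching_periods(event_times, period=300, threshold=5):
    """Return start times of periods whose event count is strictly above the threshold."""
    counts = Counter((t // period) * period for t in event_times)
    return sorted(start for start, n in counts.items() if n > threshold)


# Six events in the first period, two in the second
times = [0, 10, 20, 30, 40, 50, 310, 320]
print(breaching_periods(times))

# Exactly five events in one period stays below the alarm condition
print(breaching_periods([0, 1, 2, 3, 4]))
```

If you want the alarm to fire on the fifth call, use `GreaterThanOrEqualToThreshold` or lower the threshold to 4.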

3. IAM Policy Changes

resource "aws_cloudwatch_log_metric_filter" "iam_policy_changes" {
  name           = "iam-policy-changes"
  log_group_name = aws_cloudwatch_log_group.cloudtrail.name
  
  pattern = "{($.eventName=DeleteGroupPolicy) || ($.eventName=DeleteRolePolicy) || ($.eventName=DeleteUserPolicy) || ($.eventName=PutGroupPolicy) || ($.eventName=PutRolePolicy) || ($.eventName=PutUserPolicy) || ($.eventName=CreatePolicy) || ($.eventName=DeletePolicy) || ($.eventName=CreatePolicyVersion) || ($.eventName=DeletePolicyVersion) || ($.eventName=AttachRolePolicy) || ($.eventName=DetachRolePolicy) || ($.eventName=AttachUserPolicy) || ($.eventName=DetachUserPolicy) || ($.eventName=AttachGroupPolicy) || ($.eventName=DetachGroupPolicy)}"
  
  metric_transformation {
    name      = "IAMPolicyChangesCount"
    namespace = "SecurityMetrics"
    value     = "1"
  }
}

4. Network ACL Changes

resource "aws_cloudwatch_log_metric_filter" "nacl_changes" {
  name           = "network-acl-changes"
  log_group_name = aws_cloudwatch_log_group.cloudtrail.name
  
  pattern = "{($.eventName=CreateNetworkAcl) || ($.eventName=CreateNetworkAclEntry) || ($.eventName=DeleteNetworkAcl) || ($.eventName=DeleteNetworkAclEntry) || ($.eventName=ReplaceNetworkAclEntry) || ($.eventName=ReplaceNetworkAclAssociation)}"
  
  metric_transformation {
    name      = "NetworkACLChangesCount"
    namespace = "SecurityMetrics"
    value     = "1"
  }
}

5. Security Group Changes

resource "aws_cloudwatch_log_metric_filter" "security_group_changes" {
  name           = "security-group-changes"
  log_group_name = aws_cloudwatch_log_group.cloudtrail.name
  
  pattern = "{($.eventName=AuthorizeSecurityGroupIngress) || ($.eventName=AuthorizeSecurityGroupEgress) || ($.eventName=RevokeSecurityGroupIngress) || ($.eventName=RevokeSecurityGroupEgress) || ($.eventName=CreateSecurityGroup) || ($.eventName=DeleteSecurityGroup)}"
  
  metric_transformation {
    name      = "SecurityGroupChangesCount"
    namespace = "SecurityMetrics"
    value     = "1"
  }
}

Azure Monitor and Log Analytics

Centralized Azure Logging

# Log Analytics Workspace (central repository)
resource "azurerm_log_analytics_workspace" "central" {
  name                = "central-log-analytics"
  location            = azurerm_resource_group.logging.location
  resource_group_name = azurerm_resource_group.logging.name
  sku                 = "PerGB2018"
  retention_in_days   = 730  # 2 years
}

# Enable Activity Logs for all subscriptions
resource "azurerm_monitor_diagnostic_setting" "subscription_activity_logs" {
  for_each = toset(var.subscription_ids)
  
  name                       = "activity-logs-to-central-workspace"
  target_resource_id         = "/subscriptions/${each.value}"
  log_analytics_workspace_id = azurerm_log_analytics_workspace.central.id
  
  enabled_log {
    category = "Administrative"
  }
  
  enabled_log {
    category = "Security"
  }
  
  enabled_log {
    category = "ServiceHealth"
  }
  
  enabled_log {
    category = "Alert"
  }
  
  enabled_log {
    category = "Policy"
  }
  
  enabled_log {
    category = "Autoscale"
  }
  
  enabled_log {
    category = "ResourceHealth"
  }
}

# Storage Account for long-term archive
resource "azurerm_storage_account" "log_archive" {
  name                     = "logarchive${random_string.suffix.result}"
  resource_group_name      = azurerm_resource_group.logging.name
  location                 = azurerm_resource_group.logging.location
  account_tier             = "Standard"
  account_replication_type = "GRS"  # Geo-redundant
  
  # Versioning and soft delete (recoverability, not WORM immutability)
  blob_properties {
    versioning_enabled = true
    
    container_delete_retention_policy {
      days = 7
    }
    
    delete_retention_policy {
      days = 365
    }
  }
}

# Lifecycle management: move blobs to Cool after 90 days and to Archive after 365
resource "azurerm_storage_management_policy" "log_archive" {
  storage_account_id = azurerm_storage_account.log_archive.id
  
  rule {
    name    = "archive-old-logs"
    enabled = true
    
    filters {
      blob_types = ["blockBlob"]
    }
    
    actions {
      base_blob {
        tier_to_cool_after_days_since_modification_greater_than    = 90
        tier_to_archive_after_days_since_modification_greater_than = 365
      }
    }
  }
}

Azure Sentinel (SIEM) Integration

# Enable Azure Sentinel
resource "azurerm_sentinel_log_analytics_workspace_onboarding" "main" {
  workspace_id = azurerm_log_analytics_workspace.central.id
}

# Data connectors
resource "azurerm_sentinel_data_connector_azure_active_directory" "aad" {
  name                       = "azure-active-directory"
  log_analytics_workspace_id = azurerm_log_analytics_workspace.central.id
}

resource "azurerm_sentinel_data_connector_azure_security_center" "asc" {
  name                       = "azure-security-center"
  log_analytics_workspace_id = azurerm_log_analytics_workspace.central.id
}

resource "azurerm_sentinel_data_connector_office_365" "o365" {
  name                       = "office-365"
  log_analytics_workspace_id = azurerm_log_analytics_workspace.central.id
  
  exchange_enabled    = true
  sharepoint_enabled  = true
  teams_enabled       = true
}

# Analytics Rule: Detect privileged role assignments
resource "azurerm_sentinel_alert_rule_scheduled" "privileged_role_assignment" {
  name                       = "privileged-role-assignment"
  log_analytics_workspace_id = azurerm_log_analytics_workspace.central.id
  display_name               = "Privileged Azure AD Role Assignment"
  severity                   = "High"
  enabled                    = true
  
  query = <<-QUERY
    AuditLogs
    | where OperationName == "Add member to role"
    | where TargetResources has "Global Administrator" or 
            TargetResources has "Privileged Role Administrator" or
            TargetResources has "Security Administrator"
    | project TimeGenerated, OperationName, InitiatedBy, TargetResources
  QUERY
  
  query_frequency = "PT5M"  # Run every 5 minutes
  query_period    = "PT5M"
  
  trigger_operator  = "GreaterThan"
  trigger_threshold = 0
  
  incident_configuration {
    create_incident = true
    
    grouping {
      enabled                 = true
      lookback_duration       = "PT5H"
      reopen_closed_incidents = false
      
      entity_matching_method = "Selected"
      group_by_entities      = ["Account"]
    }
  }
}

# Analytics Rule: Detect mass file downloads
resource "azurerm_sentinel_alert_rule_scheduled" "mass_file_download" {
  name                       = "mass-file-download"
  log_analytics_workspace_id = azurerm_log_analytics_workspace.central.id
  display_name               = "Mass File Download from SharePoint/OneDrive"
  severity                   = "Medium"
  enabled                    = true
  
  query = <<-QUERY
    OfficeActivity
    | where Operation == "FileDownloaded"
    | summarize DownloadCount = count() by UserId, bin(TimeGenerated, 5m)
    | where DownloadCount > 50  // More than 50 downloads in 5 minutes
  QUERY
  
  query_frequency = "PT5M"
  query_period    = "PT5M"
  
  trigger_operator  = "GreaterThan"
  trigger_threshold = 0
  
  incident_configuration {
    create_incident = true
  }
}

# Automation Rule: Auto-assign incidents to SOC team
resource "azurerm_sentinel_automation_rule" "auto_assign_high_severity" {
  name                       = "00000000-0000-0000-0000-000000000001"  # Must be a UUID; placeholder, generate your own
  log_analytics_workspace_id = azurerm_log_analytics_workspace.central.id
  display_name               = "Auto-assign high severity incidents"
  order                      = 1
  
  triggers_on = "Incidents"
  triggers_when = "Created"
  
  condition {
    property = "IncidentSeverity"
    operator = "Equals"
    values   = ["High"]  # Sentinel severities are High, Medium, Low, Informational
  }
  
  action_incident {
    order  = 1
    status = "Active"
    owner_id = data.azuread_user.soc_lead.object_id
  }
}

Kusto Query Language (KQL) Examples

1. Find Failed Login Attempts

SigninLogs
| where ResultType != "0"  // 0 = success
| where TimeGenerated > ago(24h)
| summarize FailedAttempts = count() by UserPrincipalName, IPAddress
| where FailedAttempts > 5
| order by FailedAttempts desc

2. Track High-Value Resource Changes

AzureActivity
| where OperationNameValue has "write" or OperationNameValue has "delete"
| where ResourceProvider == "Microsoft.Compute" or 
        ResourceProvider == "Microsoft.Storage" or
        ResourceProvider == "Microsoft.KeyVault"
| project TimeGenerated, Caller, OperationNameValue, Resource, ResourceGroup
| order by TimeGenerated desc

3. Detect Anomalous API Call Volumes

AzureActivity
| summarize APICallCount = count() by Caller, bin(TimeGenerated, 1h)
| where APICallCount > 1000  // More than 1000 calls per hour
| order by APICallCount desc
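A fixed threshold such as 1000 calls per hour treats a busy automation identity and a rarely used human account the same way. One alternative, which is not part of the queries above, is to compare each caller's latest hour with its own history. The Python sketch below uses a simple mean plus three standard deviations; the cutoff, the caller names, and the counts are illustrative assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, current, sigmas=3.0):
    """Flag `current` if it exceeds the caller's historical mean by `sigmas` standard deviations."""
    if len(history) < 2:
        return False  # Not enough data to estimate spread
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return current > mu
    return current > mu + sigmas * sd


# caller -> (previous hourly counts, latest hourly count); made-up numbers
hourly_calls = {
    "ci-pipeline": ([1200, 1150, 1300, 1250], 1280),
    "alice@example.com": ([20, 35, 25, 30], 400),
}

flagged = [caller for caller, (history, latest) in hourly_calls.items()
           if is_anomalous(history, latest)]
print(flagged)
```

Here the pipeline's 1280 calls sit within its normal range even though they exceed 1000, while 400 calls from an account that usually makes about 30 is flagged.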

SIEM Integration

Splunk Integration

Architecture:

AWS Accounts → CloudTrail → S3 → Splunk Add-on for AWS → Splunk Enterprise
Azure Subscriptions → Activity Logs → Event Hub → Splunk Add-on for Azure → Splunk Enterprise

Terraform Configuration:

# Notify an SQS queue when new CloudTrail log objects land, so Splunk can pull them
resource "aws_s3_bucket_notification" "cloudtrail_to_splunk" {
  bucket = aws_s3_bucket.cloudtrail_logs.id
  
  queue {
    queue_arn = aws_sqs_queue.cloudtrail_splunk.arn
    events    = ["s3:ObjectCreated:*"]
    filter_prefix = "AWSLogs/"
  }
}

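# SQS queue for Splunk
resource "aws_sqs_queue" "cloudtrail_splunk" {
  name = "cloudtrail-splunk-queue"
}

# The policy lives in its own resource: an inline policy that references
# the queue's own ARN would create a self-referential cycle in Terraform
resource "aws_sqs_queue_policy" "cloudtrail_splunk" {
  queue_url = aws_sqs_queue.cloudtrail_splunk.id
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Service = "s3.amazonaws.com"
        }
        Action   = "SQS:SendMessage"
        Resource = aws_sqs_queue.cloudtrail_splunk.arn
        Condition = {
          ArnEquals = {
            "aws:SourceArn" = aws_s3_bucket.cloudtrail_logs.arn
          }
        }
      }
    ]
  })
}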

# IAM role for Splunk to assume
resource "aws_iam_role" "splunk" {
  name = "splunk-cloudtrail-access"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::${var.splunk_account_id}:root"
        }
        Action = "sts:AssumeRole"
        Condition = {
          StringEquals = {
            "sts:ExternalId" = var.splunk_external_id
          }
        }
      }
    ]
  })
}

resource "aws_iam_role_policy" "splunk_s3_access" {
  role = aws_iam_role.splunk.id
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:ListBucket"
        ]
        Resource = [
          aws_s3_bucket.cloudtrail_logs.arn,
          "${aws_s3_bucket.cloudtrail_logs.arn}/*"
        ]
      },
      {
        Effect = "Allow"
        Action = [
          "sqs:ReceiveMessage",
          "sqs:DeleteMessage",
          "sqs:GetQueueAttributes",
          "sqs:GetQueueUrl"
        ]
        Resource = aws_sqs_queue.cloudtrail_splunk.arn
      }
    ]
  })
}
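In this pull model the consumer reads S3 event notifications from the queue and then fetches the referenced objects. The Splunk add-on handles that itself; the Python sketch below only illustrates what one message contains and how a consumer could extract object locations. The bucket name and key are made-up sample values.

```python
import json
from urllib.parse import unquote_plus


def object_locations(sqs_body):
    """Extract (bucket, key) pairs from an S3 event notification delivered via SQS."""
    message = json.loads(sqs_body)
    locations = []
    # s3:TestEvent messages carry no "Records" list, so they yield nothing
    for record in message.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 notifications
        locations.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return locations


sample_body = json.dumps({
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "company-cloudtrail-logs"},
                "object": {"key": "AWSLogs/111111111111/CloudTrail/us-east-1/2024/01/01/sample%3Dlog.json.gz"},
            },
        }
    ]
})

print(object_locations(sample_body))
print(object_locations(json.dumps({"Service": "Amazon S3", "Event": "s3:TestEvent"})))
```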

Azure Event Hub for Splunk:

# Event Hub Namespace
resource "azurerm_eventhub_namespace" "splunk" {
  name                = "splunk-eventhub"
  location            = azurerm_resource_group.logging.location
  resource_group_name = azurerm_resource_group.logging.name
  sku                 = "Standard"
  capacity            = 2
}

# Event Hub for Activity Logs
resource "azurerm_eventhub" "activity_logs" {
  name                = "activity-logs"
  namespace_name      = azurerm_eventhub_namespace.splunk.name
  resource_group_name = azurerm_resource_group.logging.name
  partition_count     = 4
  message_retention   = 7
}

# Stream Activity Logs to Event Hub
resource "azurerm_monitor_diagnostic_setting" "activity_logs_to_eventhub" {
  for_each = toset(var.subscription_ids)
  
  name               = "activity-logs-to-eventhub-${each.value}"
  target_resource_id = "/subscriptions/${each.value}"
  eventhub_name      = azurerm_eventhub.activity_logs.name
  eventhub_authorization_rule_id = azurerm_eventhub_namespace_authorization_rule.splunk.id
  
  enabled_log {
    category = "Administrative"
  }
  enabled_log {
    category = "Security"
  }
  enabled_log {
    category = "Alert"
  }
  enabled_log {
    category = "Policy"
  }
}

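# Authorization rule used by the diagnostic settings above to stream logs.
# Azure Monitor needs Manage, Send, and Listen on the rule it streams with;
# give Splunk its own listen-only rule for reading.
resource "azurerm_eventhub_namespace_authorization_rule" "splunk" {
  name                = "diagnostics-stream"
  namespace_name      = azurerm_eventhub_namespace.splunk.name
  resource_group_name = azurerm_resource_group.logging.name
  
  listen = true
  send   = true
  manage = true
}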

Datadog Integration

# AWS Integration
resource "datadog_integration_aws" "main" {
  account_id = data.aws_caller_identity.current.account_id
  role_name  = "DatadogAWSIntegrationRole"
  
  host_tags = [
    "env:production",
    "team:platform"
  ]
  
  account_specific_namespace_rules = {
    auto_scaling = true
    cloudtrail   = true
    cloudwatch   = true
    ec2          = true
    ecs          = true
    lambda       = true
    rds          = true
    s3           = true
  }
}

# IAM role for Datadog
resource "aws_iam_role" "datadog" {
  name = "DatadogAWSIntegrationRole"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::464622532012:root"  # Datadog AWS account
        }
        Action = "sts:AssumeRole"
        Condition = {
          StringEquals = {
            "sts:ExternalId" = var.datadog_external_id
          }
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "datadog_read_only" {
  role       = aws_iam_role.datadog.name
  policy_arn = "arn:aws:iam::aws:policy/SecurityAudit"
}

# Azure Monitoring
resource "azurerm_monitor_diagnostic_setting" "activity_logs_to_datadog" {
  for_each = toset(var.subscription_ids)
  
  name               = "datadog-integration-${each.value}"
  target_resource_id = "/subscriptions/${each.value}"
  eventhub_name      = azurerm_eventhub.datadog.name
  eventhub_authorization_rule_id = azurerm_eventhub_namespace_authorization_rule.datadog.id
  
  # Stream the categories Datadog should receive (add more enabled_log blocks as needed)
  enabled_log {
    category = "Administrative"
  }
  enabled_log {
    category = "Security"
  }
  enabled_log {
    category = "ServiceHealth"
  }
}

Log Retention and Immutability

Why Immutability Matters

Scenario: Attacker compromises AWS account

Without Immutability:
1. Attacker gains admin access
2. Deletes CloudTrail logs
3. Covers tracks completely
4. Forensics: Impossible

With Immutability (S3 Object Lock):
1. Attacker gains admin access
2. Attempts to delete CloudTrail logs
3. S3 denies deletion (Object Lock)
4. Forensics: Complete audit trail preserved

S3 Object Lock Implementation

# Enable Object Lock (must be set at bucket creation)
resource "aws_s3_bucket" "immutable_logs" {
  bucket = "immutable-logs-${data.aws_caller_identity.current.account_id}"
  
  object_lock_enabled = true
}

# Object Lock configuration
resource "aws_s3_bucket_object_lock_configuration" "immutable_logs" {
  bucket = aws_s3_bucket.immutable_logs.id
  
  rule {
    default_retention {
      mode  = "COMPLIANCE"  # Cannot be overridden by anyone (including root)
      years = 7
    }
  }
}

# Bucket policy: defense in depth on top of Object Lock.
# The account root user can still remove a bucket policy, so Object Lock
# in COMPLIANCE mode remains the actual guarantee.
# Bucket-level actions must target the bucket ARN, not object ARNs.
resource "aws_s3_bucket_policy" "immutable_logs" {
  bucket = aws_s3_bucket.immutable_logs.id
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid       = "DenyDeleteObject"
        Effect    = "Deny"
        Principal = "*"
        Action = [
          "s3:DeleteObject",
          "s3:DeleteObjectVersion"
        ]
        Resource = "${aws_s3_bucket.immutable_logs.arn}/*"
      },
      {
        Sid       = "DenyDeleteBucketAndLifecycleChanges"
        Effect    = "Deny"
        Principal = "*"
        Action = [
          "s3:DeleteBucket",
          "s3:PutLifecycleConfiguration"
        ]
        Resource = aws_s3_bucket.immutable_logs.arn
      }
    ]
  })
}

Azure Immutable Storage

# Storage account with immutable blob storage
resource "azurerm_storage_account" "immutable_logs" {
  name                     = "immutablelogs${random_string.suffix.result}"
  resource_group_name      = azurerm_resource_group.logging.name
  location                 = azurerm_resource_group.logging.location
  account_tier             = "Standard"
  account_replication_type = "GRS"
  
  blob_properties {
    versioning_enabled = true
    
    # Soft delete for containers and blobs (immutability is set on the container below)
    container_delete_retention_policy {
      days = 7
    }
    
    delete_retention_policy {
      days = 730  # 2 years
    }
  }
}

resource "azurerm_storage_container" "logs" {
  name                  = "logs"
  storage_account_name  = azurerm_storage_account.immutable_logs.name
  container_access_type = "private"
}

# Time-based retention policy (WORM - Write Once Read Many)
# The policy starts unlocked and can still be removed; set locked = true once validated
resource "azurerm_storage_container_immutability_policy" "logs" {
  storage_container_resource_manager_id = azurerm_storage_container.logs.resource_manager_id
  immutability_period_in_days           = 2555  # 7 years (7 x 365 days)
}

What I Learned About Observability

After that $12M healthcare breach and dozens of observability implementations:

Lesson 1: Immutable Logs Are Non-Negotiable

Attackers will delete logs if they can. Make it impossible.

Action: S3 Object Lock (COMPLIANCE mode) or Azure Immutable Blob Storage for all audit logs.

Lesson 2: Centralize Everything

Siloed logs make investigations impossible.

Action: Organization CloudTrail, central Log Analytics workspace, SIEM integration.

Lesson 3: Real-Time Alerting Saves Millions

Detecting breaches in minutes vs months changes everything.

Action: CloudWatch metric filters, Azure Sentinel analytics rules, automated incident response.

Lesson 4: Alert Fatigue Kills SOC Teams

847 daily alerts = every alert ignored.

Action: ML-based anomaly detection, intelligent prioritization, reduce noise by 90%+.
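ML-based detection is one route to less noise. A much simpler and complementary step is suppressing repeats of the same alert for the same entity. The Python sketch below is a minimal illustration of that idea; the alert fields (`rule`, `entity`, `time` in epoch seconds), the one-hour window, and the sample data are assumptions.

```python
def suppress_duplicates(alerts, window_seconds=3600):
    """Keep the first alert per (rule, entity) fingerprint within each suppression window."""
    last_emitted = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        fingerprint = (alert["rule"], alert["entity"])
        previous = last_emitted.get(fingerprint)
        if previous is None or alert["time"] - previous >= window_seconds:
            kept.append(alert)
            last_emitted[fingerprint] = alert["time"]
    return kept


# Illustrative alerts
alerts = [
    {"rule": "unauthorized-api-calls", "entity": "role/ci", "time": 0},
    {"rule": "unauthorized-api-calls", "entity": "role/ci", "time": 60},
    {"rule": "root-account-usage", "entity": "root", "time": 90},
    {"rule": "unauthorized-api-calls", "entity": "role/ci", "time": 120},
    {"rule": "unauthorized-api-calls", "entity": "role/ci", "time": 4000},
]

kept = suppress_duplicates(alerts)
print([(a["rule"], a["time"]) for a in kept])
```

Five raw alerts become three notifications: the repeats within the hour are dropped, and the same rule fires again once the window has passed.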

Lesson 5: Retention Matters for Compliance and Forensics

90-day retention means no evidence after 90 days.

Action:

Hot storage: 90 days (fast querying)
Cold storage: 2 years (compliance)
Archive: 7 years (forensics, regulatory)
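The three tiers above amount to a simple mapping from log age to storage class. A minimal Python sketch, assuming 365-day years and treating anything older than seven years as expired (that last label is an assumption, not something the plan states):

```python
def storage_tier(age_days):
    """Map a log's age in days to the retention tier it belongs in."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 90:
        return "hot"
    if age_days <= 2 * 365:
        return "cold"
    if age_days <= 7 * 365:
        return "archive"
    return "expired"


for age in (30, 91, 800, 3000):
    print(age, storage_tier(age))
```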

Lesson 6: SIEM Integration Enables Correlation

Individual log entries mean nothing. Correlated events tell the story.

Action: Stream all logs to SIEM (Splunk, Datadog, Sentinel), enable correlation rules.

Lesson 7: Automate Response to Common Threats

Manual response to every alert doesn't scale.

Action: Lambda functions for automated remediation (disable compromised credentials, isolate instances, etc.)
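As an illustration of one such remediation, the Python sketch below deactivates an IAM access key through the boto3 `update_access_key` call. The event shape (`userName`, `accessKeyId`) and the handler wiring are hypothetical; the client is passed in so the logic can be dry-run with a stand-in, without AWS credentials.

```python
def disable_access_key(iam_client, user_name, access_key_id):
    """Mark an IAM access key Inactive so it can no longer sign requests."""
    iam_client.update_access_key(
        UserName=user_name,
        AccessKeyId=access_key_id,
        Status="Inactive",
    )
    return {"user": user_name, "access_key_id": access_key_id, "status": "Inactive"}


def lambda_handler(event, context):
    # Hypothetical event shape: {"userName": "...", "accessKeyId": "..."}
    import boto3  # Provided by the AWS Lambda Python runtime

    return disable_access_key(boto3.client("iam"), event["userName"], event["accessKeyId"])


class RecordingIamClient:
    """Stand-in client that records calls, for a dry run without AWS."""

    def __init__(self):
        self.calls = []

    def update_access_key(self, **kwargs):
        self.calls.append(kwargs)


fake = RecordingIamClient()
result = disable_access_key(fake, "compromised-user", "AKIAEXAMPLEKEYID")
print(result["status"], fake.calls)
```

Deactivating rather than deleting the key keeps the action reversible if the alert turns out to be a false positive.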

Lesson 8: Test Your Logging

Logging that isn't tested doesn't work when you need it.

Action: Quarterly testing - verify logs are collected, alerts fire, response automation works.
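One check that is easy to automate between those quarterly tests is log freshness: if a source has not delivered anything recently, collection is probably broken. A minimal Python sketch, with made-up source names and a one-hour tolerance as assumptions:

```python
from datetime import datetime, timedelta, timezone


def stale_sources(last_seen, now, max_age=timedelta(hours=1)):
    """Return log sources whose newest event is older than max_age."""
    return sorted(name for name, seen in last_seen.items() if now - seen > max_age)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "cloudtrail": now - timedelta(minutes=10),
    "azure-activity": now - timedelta(hours=3),
}

print(stale_sources(last_seen, now))
```

In practice the `last_seen` values would come from a query against the central log store for the newest event per source.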

Next Up: Infrastructure as Code for Landing Zones

In Article 7, we'll cover Terraform module architecture, CI/CD pipelines for infrastructure, testing strategies, and state management best practices.

Ready to codify everything? Let's go! 🚀
