Monitoring, Logging, and Operational Excellence


Introduction

Through my work on cloud security and operations, I've learned that effective monitoring and logging are what make security incidents detectable and preventable.

Working on security incident response projects revealed common patterns in why breaches go undetected:

  • Security alerts generated but not effectively monitored

  • High volume of alerts causing alert fatigue

  • Decentralized logging across multiple accounts

  • Logs not protected from tampering or deletion

  • Insufficient log retention for forensic investigation

  • No correlation or analysis of security events

  • Lack of automated response capabilities

The fundamental issue in these cases was treating logging as a compliance checkbox rather than an operational security capability. Collecting logs is meaningless if they can't be searched, correlated, protected, and acted upon.

Through implementing comprehensive observability platforms, I've learned that effective logging architecture requires centralization, immutability, intelligent alerting, and automated response.

This article shares the monitoring and logging patterns I've built into landing zones - covering centralized log aggregation, SIEM integration, immutable log storage, intelligent alerting that reduces noise, and automated incident response capabilities.


Why Centralized Logging Matters

The Problem with Decentralized Logging

Scenario: You have 50 AWS accounts. Each account logs to its own S3 bucket.

When investigating a security incident:

Centralized Logging Benefits

| Benefit | Impact |
| --- | --- |
| Security Incident Response | Detect and investigate threats across all accounts simultaneously |
| Compliance | Single source of truth for auditors, immutable audit trail |
| Cost Optimization | Identify waste across entire organization |
| Troubleshooting | Correlate events across services and accounts |
| Forensics | Comprehensive timeline of events for investigations |


Logging Architecture Patterns

Pattern 1: Hub-and-Spoke Logging


Pattern 2: Real-Time Streaming Architecture


AWS CloudTrail and CloudWatch

Multi-Account CloudTrail Setup

Organization Trail (recommended for landing zones):
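As a minimal sketch of what an organization trail involves, the helper below builds the arguments for CloudTrail's CreateTrail API (boto3's `create_trail`). The function names and the trail and bucket names are hypothetical, and the client is passed in so the sketch makes no assumption about how credentials are obtained.

```python
from typing import Any, Dict, Optional


def organization_trail_params(
    name: str,
    bucket: str,
    kms_key_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for CloudTrail's CreateTrail API."""
    params: Dict[str, Any] = {
        "Name": name,
        "S3BucketName": bucket,            # central bucket in the logging account
        "IsOrganizationTrail": True,       # cover every account in the organization
        "IsMultiRegionTrail": True,        # cover every region
        "IncludeGlobalServiceEvents": True,
        "EnableLogFileValidation": True,   # digest files for tamper detection
    }
    if kms_key_id:
        params["KmsKeyId"] = kms_key_id    # optional SSE-KMS encryption
    return params


def create_organization_trail(cloudtrail_client: Any, **kwargs: Any) -> Dict[str, Any]:
    """Create the trail with a caller-supplied client (e.g. boto3.client("cloudtrail"))."""
    return cloudtrail_client.create_trail(**organization_trail_params(**kwargs))
```

An organization trail has to be created from the management account or a delegated administrator, and the central bucket's policy must allow CloudTrail to write to it; neither step is shown here.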

Critical CloudWatch Metric Filters

1. Root Account Usage

2. Unauthorized API Calls

3. IAM Policy Changes

4. Network ACL Changes

5. Security Group Changes
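The five filters above can be sketched as CloudWatch Logs filter patterns. The patterns below follow the ones commonly published with the CIS AWS Foundations Benchmark, so treat them as an assumption about what the original filters contained. The helper names and the `LandingZone/Security` namespace are hypothetical; the returned dictionary matches the arguments of the CloudWatch Logs `put_metric_filter` call.

```python
from typing import Any, Dict, Iterable


def event_name_pattern(names: Iterable[str]) -> str:
    """Match CloudTrail events whose eventName is any of the given names."""
    return "{ " + " || ".join(f"($.eventName = {n})" for n in names) + " }"


METRIC_FILTERS: Dict[str, str] = {
    "RootAccountUsage": (
        '{ $.userIdentity.type = "Root" && $.userIdentity.invokedBy NOT EXISTS '
        '&& $.eventType != "AwsServiceEvent" }'
    ),
    "UnauthorizedApiCalls": (
        '{ ($.errorCode = "*UnauthorizedOperation") || ($.errorCode = "AccessDenied*") }'
    ),
    "IamPolicyChanges": event_name_pattern([
        "DeleteGroupPolicy", "DeleteRolePolicy", "DeleteUserPolicy",
        "PutGroupPolicy", "PutRolePolicy", "PutUserPolicy",
        "CreatePolicy", "DeletePolicy", "CreatePolicyVersion", "DeletePolicyVersion",
        "AttachRolePolicy", "DetachRolePolicy", "AttachUserPolicy",
        "DetachUserPolicy", "AttachGroupPolicy", "DetachGroupPolicy",
    ]),
    "NetworkAclChanges": event_name_pattern([
        "CreateNetworkAcl", "CreateNetworkAclEntry", "DeleteNetworkAcl",
        "DeleteNetworkAclEntry", "ReplaceNetworkAclEntry", "ReplaceNetworkAclAssociation",
    ]),
    "SecurityGroupChanges": event_name_pattern([
        "AuthorizeSecurityGroupIngress", "AuthorizeSecurityGroupEgress",
        "RevokeSecurityGroupIngress", "RevokeSecurityGroupEgress",
        "CreateSecurityGroup", "DeleteSecurityGroup",
    ]),
}


def metric_filter_params(
    log_group: str, name: str, namespace: str = "LandingZone/Security"
) -> Dict[str, Any]:
    """Build arguments for put_metric_filter; each matching event counts as 1."""
    return {
        "logGroupName": log_group,
        "filterName": name,
        "filterPattern": METRIC_FILTERS[name],
        "metricTransformations": [
            {"metricName": name, "metricNamespace": namespace, "metricValue": "1"}
        ],
    }
```

Each metric still needs a CloudWatch alarm (for example, a threshold of one or more events in five minutes) before anyone gets notified.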


Azure Monitor and Log Analytics

Centralized Azure Logging

Azure Sentinel (SIEM) Integration

Kusto Query Language (KQL) Examples

1. Find Failed Login Attempts

2. Track High-Value Resource Changes

3. Detect Anomalous API Call Volumes
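The idea behind anomalous-volume detection is to compare each caller's current activity against its own baseline. The sketch below shows that logic in Python rather than KQL, using a simple z-score; the function name, the threshold of 3, and the floor of 1.0 on the standard deviation are all assumptions chosen for illustration, not values taken from a production rule.

```python
from statistics import mean, pstdev
from typing import Dict, List


def anomalous_callers(
    history: Dict[str, List[int]],
    current: Dict[str, int],
    threshold: float = 3.0,
) -> List[str]:
    """Return callers whose current call count sits far above their own baseline.

    history maps a caller to past per-interval call counts;
    current maps a caller to the count for the interval being checked.
    """
    flagged: List[str] = []
    for caller, counts in history.items():
        if len(counts) < 2:
            continue  # not enough data to form a baseline
        baseline = mean(counts)
        # Floor the spread so a perfectly flat history does not divide by zero.
        spread = max(pstdev(counts), 1.0)
        z_score = (current.get(caller, 0) - baseline) / spread
        if z_score > threshold:
            flagged.append(caller)
    return flagged
```

For example, a service that normally makes about 10 calls per hour and suddenly makes 400 is flagged, while one that moves from roughly 100 to 104 is not.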


SIEM Integration

Splunk Integration

Architecture:
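Whatever sits between the log source and Splunk, events commonly arrive through Splunk's HTTP Event Collector (HEC). The sketch below assumes HEC is the ingestion endpoint and only builds the request rather than sending it; the function name and the example URL and token are hypothetical, while the `/services/collector/event` path and the `Authorization: Splunk <token>` header are the standard HEC conventions.

```python
import json
from typing import Any, Dict, Optional
from urllib.request import Request


def hec_request(
    base_url: str,
    token: str,
    event: Dict[str, Any],
    sourcetype: str = "aws:cloudtrail",
    event_time: Optional[float] = None,
) -> Request:
    """Build (but do not send) an HTTP Event Collector request for one event."""
    payload: Dict[str, Any] = {"event": event, "sourcetype": sourcetype}
    if event_time is not None:
        payload["time"] = event_time  # epoch seconds; Splunk uses receipt time if omitted
    return Request(
        url=f"{base_url.rstrip('/')}/services/collector/event",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. In practice a managed delivery service usually handles batching and retries, so a hand-rolled sender like this is mainly useful for smoke tests.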

Terraform Configuration:

Azure Event Hub for Splunk:

Datadog Integration


Log Retention and Immutability

Why Immutability Matters

Scenario: Attacker compromises AWS account

S3 Object Lock Implementation
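A minimal sketch of a default retention rule, assuming COMPLIANCE mode and a seven-year period to match the archive tier discussed later in this article. The helper names are hypothetical; the dictionary matches the arguments of the S3 `put_object_lock_configuration` call, and the client is passed in by the caller.

```python
from typing import Any, Dict


def object_lock_params(bucket: str, retention_years: int = 7) -> Dict[str, Any]:
    """Build arguments for S3's PutObjectLockConfiguration API."""
    if retention_years < 1:
        raise ValueError("retention_years must be at least 1")
    return {
        "Bucket": bucket,
        "ObjectLockConfiguration": {
            "ObjectLockEnabled": "Enabled",
            "Rule": {
                "DefaultRetention": {
                    # COMPLIANCE mode: retention cannot be shortened or removed,
                    # not even by the account's root user.
                    "Mode": "COMPLIANCE",
                    "Years": retention_years,
                }
            },
        },
    }


def apply_object_lock(s3_client: Any, bucket: str, retention_years: int = 7) -> Any:
    """Apply the rule with a caller-supplied client (e.g. boto3.client("s3"))."""
    return s3_client.put_object_lock_configuration(
        **object_lock_params(bucket, retention_years)
    )
```

Object Lock only works on versioned buckets. Because COMPLIANCE mode cannot be undone, it is worth trying the configuration with a short retention period on a scratch bucket first.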

Azure Immutable Storage


What I Learned About Observability

After that $12M healthcare breach and dozens of observability implementations:

Lesson 1: Immutable Logs Are Non-Negotiable

Attackers will delete logs if they can. Make it impossible.

Action: S3 Object Lock (COMPLIANCE mode) or Azure Immutable Blob Storage for all audit logs.

Lesson 2: Centralize Everything

Siloed logs make investigations impossible.

Action: Organization CloudTrail, central Log Analytics workspace, SIEM integration.

Lesson 3: Real-Time Alerting Saves Millions

Detecting breaches in minutes vs months changes everything.

Action: CloudWatch metric filters, Azure Sentinel analytics rules, automated incident response.

Lesson 4: Alert Fatigue Kills SOC Teams

847 daily alerts = every alert ignored.

Action: ML-based anomaly detection, intelligent prioritization, reduce noise by 90%+.
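Before any ML is involved, a large share of the noise can usually be removed with plain deduplication and a severity floor. The sketch below illustrates that simpler technique, not ML-based detection; the `Alert` shape, the one-hour window, and the severity names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Alert:
    rule: str
    resource: str
    severity: str
    timestamp: float  # epoch seconds


def triage(
    alerts: Iterable[Alert],
    window_seconds: float = 3600.0,
    min_severity: str = "medium",
) -> List[Alert]:
    """Drop low-severity alerts and repeats, then order by severity and time."""
    cutoff = SEVERITY_RANK[min_severity]
    last_kept: Dict[Tuple[str, str], float] = {}
    kept: List[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if SEVERITY_RANK[alert.severity] > cutoff:
            continue  # below the severity floor
        key = (alert.rule, alert.resource)
        previous = last_kept.get(key)
        if previous is not None and alert.timestamp - previous < window_seconds:
            continue  # same rule and resource inside the suppression window
        last_kept[key] = alert.timestamp
        kept.append(alert)
    # Most severe first; oldest first within a severity level.
    return sorted(kept, key=lambda a: (SEVERITY_RANK[a.severity], a.timestamp))
```

The suppression window is measured from the last alert that was kept, so a rule that fires continuously still surfaces once per window instead of disappearing entirely.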

Lesson 5: Retention Matters for Compliance and Forensics

90-day retention means no evidence after 90 days.

Action:

  • Hot storage: 90 days (fast querying)

  • Cold storage: 2 years (compliance)

  • Archive: 7 years (forensics, regulatory)
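The tiers above can be sketched as age thresholds plus an S3 lifecycle rule. This assumes the periods are cumulative object ages (90 days, 2 years, 7 years from creation) and maps the tiers to the `STANDARD_IA` and `DEEP_ARCHIVE` storage classes; both are assumptions, and the function names and rule ID are hypothetical.

```python
from typing import Any, Dict

HOT_DAYS = 90
COLD_DAYS = 2 * 365      # 730
ARCHIVE_DAYS = 7 * 365   # 2555


def storage_tier(age_days: int) -> str:
    """Return the tier a log object of the given age belongs to."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < HOT_DAYS:
        return "hot"
    if age_days < COLD_DAYS:
        return "cold"
    if age_days < ARCHIVE_DAYS:
        return "archive"
    return "expired"


def lifecycle_rule(prefix: str = "") -> Dict[str, Any]:
    """Build one rule for S3's PutBucketLifecycleConfiguration API."""
    return {
        "ID": "audit-log-retention",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": HOT_DAYS, "StorageClass": "STANDARD_IA"},    # hot -> cold
            {"Days": COLD_DAYS, "StorageClass": "DEEP_ARCHIVE"},  # cold -> archive
        ],
        "Expiration": {"Days": ARCHIVE_DAYS},
    }
```

The rule would be passed as `LifecycleConfiguration={"Rules": [lifecycle_rule()]}` to `put_bucket_lifecycle_configuration`. The expiration should not be shorter than any Object Lock retention on the same bucket, since locked versions cannot be removed early.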

Lesson 6: SIEM Integration Enables Correlation

Individual log entries mean nothing. Correlated events tell the story.

Action: Stream all logs to SIEM (Splunk, Datadog, Sentinel), enable correlation rules.

Lesson 7: Automate Response to Common Threats

Manual response to every alert doesn't scale.

Action: Lambda functions for automated remediation (disable compromised credentials, isolate instances, etc.)
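A minimal sketch of one such remediation: deactivating a compromised access key. The event shape (`detail.finding_type`, `detail.user_name`, `detail.access_key_id`) is hypothetical and would need to be mapped from whatever detection source triggers the function; `update_access_key` with `Status="Inactive"` is the IAM API call that does the actual work.

```python
from typing import Any, Dict


def disable_access_key(iam_client: Any, user_name: str, access_key_id: str) -> Dict[str, str]:
    """Deactivate (not delete) the key so it can be inspected during forensics."""
    iam_client.update_access_key(
        UserName=user_name,
        AccessKeyId=access_key_id,
        Status="Inactive",
    )
    return {
        "action": "access_key_disabled",
        "user": user_name,
        "access_key_id": access_key_id,
    }


def handle_finding(event: Dict[str, Any], iam_client: Any) -> Dict[str, str]:
    """Route a finding to a remediation; unknown finding types are left alone."""
    detail = event.get("detail", {})
    if detail.get("finding_type") != "compromised_credentials":
        return {"action": "ignored"}
    return disable_access_key(iam_client, detail["user_name"], detail["access_key_id"])
```

Inside a Lambda function, the entry point would be a thin `lambda_handler(event, context)` that calls `handle_finding(event, boto3.client("iam"))`. Keeping the client as a parameter makes the logic testable without touching a real account.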

Lesson 8: Test Your Logging

Logging that isn't tested doesn't work when you need it.

Action: Quarterly testing - verify logs are collected, alerts fire, response automation works.
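The "verify logs are collected" part of that test can be sketched as a freshness check: every expected source must have delivered an event recently. The source names, the one-hour default, and the function name are assumptions; the timestamps would come from querying the central log store.

```python
from typing import Dict, Iterable, List


def stale_sources(
    expected: Iterable[str],
    last_event_at: Dict[str, float],
    now: float,
    max_age_seconds: float = 3600.0,
) -> List[str]:
    """Return expected log sources that are silent or have never reported.

    last_event_at maps a source name to the epoch time of its newest event.
    """
    stale: List[str] = []
    for source in expected:
        newest = last_event_at.get(source)
        if newest is None or now - newest > max_age_seconds:
            stale.append(source)
    return stale
```

A non-empty result should fail the test run. Checking that alerts fire and that response automation works still requires injecting a synthetic event and watching it travel end to end.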


Next Up: Infrastructure as Code for Landing Zones

In Article 7, we'll cover Terraform module architecture, CI/CD pipelines for infrastructure, testing strategies, and state management best practices.

Ready to codify everything? Let's go! 🚀
