Landing Zone Design Principles and Architecture Patterns

Introduction

Working with cloud infrastructure at scale has taught me that proper architecture and design principles are the difference between manageable and chaotic cloud environments.

I've worked on several projects where organizations accumulated AWS accounts over time without a clear organizational structure. Common challenges I encountered include:

  • Accounts from discontinued projects remaining active

  • Overlapping IP address ranges preventing network connectivity

  • Inconsistent naming conventions making resources hard to identify

  • Flat organizational structures with no logical grouping

  • Difficulty tracking costs and attributing spending

  • Compliance challenges from inconsistent configurations

These experiences highlighted critical gaps in cloud architecture that stem from not establishing clear design principles upfront. Through redesigning and implementing landing zone architectures across various organizations, I've identified the fundamental principles that create scalable, manageable cloud environments.

This article shares the design principles and architecture patterns I've learned through hands-on experience - what works at scale, what patterns enable growth, and what anti-patterns to avoid.


Core Design Principles

After building and rebuilding landing zones for years, I've distilled the essential principles that separate successful architectures from chaotic ones.

Principle 1: Security by Default

The Principle: Security controls should be automatically applied, not optionally configured.

What This Means:

  • Encryption at rest is required, not suggested

  • Network access is denied by default, allowed by exception

  • Logging is always on, can't be disabled

  • MFA is enforced, not recommended

Why It Matters:

At one company, we made security "configurable." Teams could choose whether to encrypt their databases, enable logging, or use MFA.

Result: 40% of accounts had unencrypted databases. 25% had logging disabled. Root accounts with no MFA.

The Fix: We changed security from opt-in to mandatory:

  • Service Control Policies (SCPs) prevent creating unencrypted resources

  • CloudTrail can't be disabled (SCP prevents it)

  • MFA required for all human access (enforced at identity provider)

Real Example - AWS SCP for Encryption:
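A minimal sketch of such an SCP (statement IDs are illustrative; `s3:x-amz-server-side-encryption` and `ec2:Encrypted` are the relevant condition keys):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedS3Uploads",
      "Effect": "Deny",
      "Action": "s3:PutObject",
      "Resource": "*",
      "Condition": {
        "Null": { "s3:x-amz-server-side-encryption": "true" }
      }
    },
    {
      "Sid": "DenyUnencryptedEBSVolumes",
      "Effect": "Deny",
      "Action": "ec2:CreateVolume",
      "Resource": "*",
      "Condition": {
        "Bool": { "ec2:Encrypted": "false" }
      }
    }
  ]
}
```

Attached at the OU level, this denies S3 uploads that omit a server-side encryption header and EBS volumes created without encryption, regardless of the caller's IAM permissions.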

The Result: Security violations dropped from 40+ per week to 0-2 per week (and those 2 were quickly caught and fixed).

Principle 2: Separation of Concerns

The Principle: Different types of workloads and teams should be isolated from each other.

Separation Dimensions:

1. Environment Separation

  • Production ≠ Staging ≠ Development

  • No shared resources between environments

  • Different access controls

  • Different change management processes

2. Security Boundary Separation

  • Compliance-regulated data isolated from non-regulated

  • PCI workloads separate from non-PCI

  • HIPAA data isolated

  • Multi-tenant SaaS: Customer A isolated from Customer B

3. Team/Business Unit Separation

  • Marketing team ≠ Engineering team

  • Product A ≠ Product B

  • Each team can operate independently

4. Function Separation

  • Networking separate from compute

  • Logging separate from applications

  • Security services separate from workloads

Visual: Separation Architecture


Why It Matters:

At one company, they had a single "shared" account for dev, staging, and prod. A developer testing a database migration script accidentally ran it against production. $180,000 in lost orders before we caught it.

After separation: Impossible to accidentally affect prod from dev (different accounts, different credentials, different networks).

Principle 3: Least Privilege Access

The Principle: Users and services get minimum necessary permissions, nothing more.

Implementation:

Role-Based Access Control (RBAC):

Time-Bound Access:

  • Admin access granted for 4 hours, then expires

  • Emergency access auto-expires after 2 hours

  • All privileged access logged and reviewed
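With AWS IAM Identity Center, time-bound sessions can be sketched via permission set session durations (names and durations here are illustrative):

```hcl
resource "aws_ssoadmin_permission_set" "developer" {
  name             = "Developer"
  instance_arn     = var.sso_instance_arn
  session_duration = "PT8H" # normal working session
}

resource "aws_ssoadmin_permission_set" "break_glass_admin" {
  name             = "BreakGlassAdmin"
  instance_arn     = var.sso_instance_arn
  session_duration = "PT4H" # admin access expires after 4 hours
}
```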

Real Example - AWS IAM Policy for Developer:
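A trimmed-down sketch of a least-privilege developer policy (the service list and tag condition are illustrative, and not every action supports resource-tag conditions):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowDevTaggedResources",
      "Effect": "Allow",
      "Action": ["ec2:*", "logs:*", "cloudwatch:*"],
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/Environment": "dev" }
      }
    },
    {
      "Sid": "DenyIdentityAndOrgChanges",
      "Effect": "Deny",
      "Action": ["iam:*", "organizations:*"],
      "Resource": "*"
    }
  ]
}
```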

Principle 4: Automation Over Manual Processes

The Principle: If humans do it more than twice, automate it.

What to Automate:

Account/Subscription Provisioning:

  • ❌ Manual: Fill out form, wait 3 days for networking team, wait 2 days for security review

  • ✅ Automated: Self-service portal, account created in 15 minutes with all controls

Security Baseline Configuration:

  • ❌ Manual: Follow 50-page runbook, miss steps, inconsistent results

  • ✅ Automated: Terraform applies identical baseline every time

Compliance Scanning:

  • ❌ Manual: Weekly reviews, spreadsheets, missed violations

  • ✅ Automated: Continuous scanning, automatic remediation, alerts on drift

Cost Reporting:

  • ❌ Manual: Monthly manual cost allocation, Excel gymnastics

  • ✅ Automated: Automated tagging, real-time chargeback dashboards

Why It Matters:

Manual processes don't scale:

  • 10 accounts: Manual is fine (annoying but manageable)

  • 50 accounts: Manual is painful (full-time job for one person)

  • 100+ accounts: Manual is impossible (team of people can't keep up)

Automated processes scale with near-constant effort:

  • 10 accounts: Same effort as 100 accounts

  • Platform team size doesn't grow with account count

  • Consistency guaranteed (computers don't forget steps)

Principle 5: Observable and Auditable

The Principle: Everything that happens should be logged, searchable, and traceable.

What to Log:

API Calls (CloudTrail, Activity Logs, Audit Logs):

  • Who did what, when, from where

  • Successful and failed attempts

  • Changes to infrastructure

  • Access to sensitive data

Network Traffic (VPC Flow Logs):

  • Source and destination IPs

  • Ports and protocols

  • Accepted and rejected connections

  • Traffic patterns for anomaly detection

Application Logs:

  • Application events

  • Errors and exceptions

  • Performance metrics

  • User actions

Security Events:

  • Login attempts (successful and failed)

  • Permission changes

  • Security group modifications

  • Encryption key usage

Implementation:

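A sketch of the centralized-logging baseline in Terraform (bucket and trail names are assumptions):

```hcl
resource "aws_cloudtrail" "org" {
  name                       = "org-trail"
  s3_bucket_name             = var.audit_log_bucket # central, immutable log bucket
  is_organization_trail      = true                 # covers every account in the org
  is_multi_region_trail      = true
  enable_log_file_validation = true                 # tamper evidence
}

resource "aws_flow_log" "vpc" {
  vpc_id               = var.vpc_id
  traffic_type         = "ALL" # accepted and rejected traffic
  log_destination      = var.flow_log_bucket_arn
  log_destination_type = "s3"
}
```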

Why It Matters:

Story time: Security incident at 2am. Unauthorized EC2 instance running cryptocurrency mining.

Without centralized logging:

  • Took 6 hours to find which account

  • Another 4 hours to determine how it was created

  • Never definitively identified the attack vector

  • Cost: $14,000 in compute + 10 hours of engineer time

With centralized logging:

  • Query: "Show me all EC2 RunInstances calls in last 24 hours"

  • Found suspicious API call in 5 minutes

  • Traced to compromised access key

  • Revoked key, terminated instances

  • Total time: 20 minutes

  • Cost: $200 in compute

The logging made the difference between a $14K+ incident and a $200 nuisance.

Principle 6: Immutable Infrastructure

The Principle: Infrastructure is replaced, not modified.

What This Means:

❌ Mutable (Old Way):

  • SSH into server

  • Modify configuration files

  • Install packages

  • Server is now a "snowflake" (unique, undocumented)

✅ Immutable (New Way):

  • Define infrastructure in Terraform

  • Deploy new version

  • Destroy old version

  • Infrastructure is code (documented, version-controlled)

Benefits:

Reproducibility:

  • Can recreate entire environment from code

  • No drift (what's deployed matches what's in Git)

  • Disaster recovery is terraform apply

Auditability:

  • All changes in Git history

  • Code review for infrastructure changes

  • Rollback is git revert + terraform apply

Testing:

  • Test infrastructure changes in dev/staging first

  • Validate before applying to production

  • Catch errors before they impact users

Principle 7: Defense in Depth

The Principle: Multiple layers of security, no single point of failure.

Security Layers:

Why It Matters:

If one layer fails, others provide protection:

  • Attacker bypasses firewall → Network segmentation limits access

  • Attacker compromises credentials → MFA prevents login

  • Attacker gains access → Monitoring detects anomaly

  • Attacker modifies resources → Audit logs provide evidence

Real Example:

Attack timeline with defense in depth:

  1. 10:00: Phishing email sent to developer

  2. 10:15: Developer clicks malicious link

  3. 10:20: Attacker harvests credentials

  4. 10:25: Attacker attempts login → MFA blocks (Layer 3)

  5. 10:30: Attacker attempts to bypass MFA → Anomaly detected (Layer 2)

  6. 10:35: Security team alerted → Credentials revoked (Layer 3)

  7. 10:40: Incident contained, zero damage

Without defense in depth: Attacker gains access at 10:25, unknown duration until discovery.

Principle 8: Cattle, Not Pets

The Principle: Infrastructure is disposable and replaceable, not precious and unique.

The Pet Model (Old Way):

  • Servers have names ("prod-db-01")

  • Manually configured and maintained

  • When broken, we nurse them back to health

  • Fear of deleting them (might break something)

  • Irreplaceable

The Cattle Model (New Way):

  • Resources are numbered ("web-server-0042")

  • Automatically provisioned from code

  • When broken, we destroy and replace

  • No fear of deletion (can recreate anytime)

  • Disposable

Implementation:

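The cattle model sketched in Terraform — instances come from a launch template and are replaced, never repaired (AMI and subnet variables are assumptions):

```hcl
resource "aws_launch_template" "web" {
  name_prefix   = "web-server-"
  image_id      = var.ami_id # immutable, pre-baked image
  instance_type = "t3.small"
}

resource "aws_autoscaling_group" "web" {
  min_size            = 2
  max_size            = 10
  desired_capacity    = 3
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  # Unhealthy instances are terminated and replaced automatically
  health_check_type = "ELB"
}
```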

Why It Matters:

Disaster Recovery:

  • Pets: Must restore specific server (slow, risky)

  • Cattle: Launch new instances from code (fast, reliable)

Scaling:

  • Pets: Manually provision new servers (weeks)

  • Cattle: Auto-scaling handles it (minutes)

Updates:

  • Pets: SSH and update each server (error-prone)

  • Cattle: Deploy new version, drain old instances


Account/Subscription Organization Patterns

The foundation of landing zone design is how you organize accounts or subscriptions. Get this wrong, and everything built on top is compromised.

Pattern 1: Environment-Based Organization

Structure:
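An illustrative hierarchy for this pattern (OU and account names are assumptions):

```
Root
├── Security OU
├── Infrastructure OU
├── Production OU
│   ├── prod-app-a
│   └── prod-app-b
├── Staging OU
└── Development OU
```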

When to Use:

  • Small to medium organizations (<100 accounts)

  • Single product or closely related products

  • Simple organizational structure

Advantages:

  • ✅ Clear environment separation

  • ✅ Easy to understand

  • ✅ Simple policy application (all prod accounts get prod policies)

Disadvantages:

  • ❌ Doesn't scale well with multiple business units

  • ❌ Mixed teams in same OU (marketing and engineering both in prod)

  • ❌ Hard to allocate costs to different business units

Real Example: a SaaS startup with a single product — one OU per environment, with a handful of accounts in each.

Pattern 2: Business Unit Organization

Structure:

When to Use:

  • Multiple business units with independent P&Ls

  • Distributed teams (different geographies, different management)

  • Need for cost allocation by business unit

Advantages:

  • ✅ Clear cost ownership

  • ✅ Independent operation (BU-A can't affect BU-B)

  • ✅ Different policies per business unit (if needed)

Disadvantages:

  • ❌ Potential duplication (each BU builds own shared services)

  • ❌ Less standardization across org

  • ❌ More complex to manage

Real Example: an e-commerce company with multiple brands — one OU per brand, each brand team operating its accounts independently.

Pattern 3: Hybrid (Environment × Business Unit)

Structure:

When to Use:

  • Medium to large organizations (100-500 accounts)

  • Need both environment policies AND business unit separation

  • Centralized platform team with distributed application teams

Advantages:

  • ✅ Environment-level policies (all prod accounts locked down)

  • ✅ Business unit cost allocation (tagging)

  • ✅ Combines benefits of both patterns

Disadvantages:

  • ❌ More complex hierarchy

  • ❌ Two dimensions to manage (environment AND business unit)

Pattern 4: Workload-Based Organization

Structure:

When to Use:

  • Specialized workload types with different requirements

  • Different compliance needs (PCI vs HIPAA vs general)

  • Different architectural patterns per workload type

Advantages:

  • ✅ Workload-specific policies (data lake accounts get different policies than web accounts)

  • ✅ Specialized configuration per workload type

  • ✅ Clear functional boundaries

Disadvantages:

  • ❌ Can be confusing for teams ("where does this workload go?")

  • ❌ Doesn't inherently separate environments

After trying all these patterns, here's what I recommend:

  • For startups and small companies (<50 accounts): environment-based organization — simple and sufficient

  • For mid-size companies (50-200 accounts): hybrid (environment × business unit) — environment-level policies plus business unit cost ownership

  • For large enterprises (200+ accounts): hybrid, adding workload-based OUs where compliance requires it (PCI, HIPAA)

My Hard-Learned Lesson

At one company, we started with environment-based (simple). As we grew to 8 business units, we tried to retrofit business unit organization.

The migration was a nightmare:

  • 6 months to reorganize 200 accounts

  • Broke network connectivity during migration

  • Confused teams ("wait, which OU is my account in now?")

  • Cost: 4 engineers × 6 months = $360,000 in labor

The lesson: Design for your 3-year state, not your current state.

If you anticipate multiple business units, design for it from day one. Reorganizing later is 10x harder.


Management Group and OU Hierarchies

The management hierarchy defines how policies and governance flow down through your organization. This is where you enforce standards that can't be violated.

AWS Organizations - Organizational Units (OUs)

Structure:
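A typical layout, loosely following AWS's multi-account guidance (names are illustrative):

```
Management Account (org root)
├── Security OU          (log-archive, security-tooling)
├── Infrastructure OU    (network, shared-services)
├── Workloads OU
│   ├── Production OU
│   └── Non-Production OU
├── Sandbox OU
└── Suspended OU         (quarantine for decommissioned accounts)
```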

Service Control Policies (SCPs) - The Guardrails:

SCPs are the ultimate authority. Even root users can't bypass SCPs.

Example - Prevent Leaving Organization:
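A minimal sketch:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyLeavingOrganization",
      "Effect": "Deny",
      "Action": "organizations:LeaveOrganization",
      "Resource": "*"
    }
  ]
}
```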

Example - Require Encryption:
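For example, denying unencrypted RDS instances (`rds:StorageEncrypted` is the relevant condition key; the same pattern works for EBS and S3):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedRDS",
      "Effect": "Deny",
      "Action": "rds:CreateDBInstance",
      "Resource": "*",
      "Condition": {
        "Bool": { "rds:StorageEncrypted": "false" }
      }
    }
  ]
}
```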

Example - Restrict Regions:
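A common region-restriction sketch — deny everything outside approved regions, exempting global services (the region list is illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideApprovedRegions",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "organizations:*",
        "sts:*",
        "support:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["us-east-1", "eu-west-1"]
        }
      }
    }
  ]
}
```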

Real Story - SCPs Saved Us:

At one company, a contractor accidentally created a 1TB EBS volume in ap-south-1. Why? They were used to working in ap-south-1 at their previous client.

Without region SCP: 1TB volume created in unsupported region, $120/month until discovered 6 months later = $720 wasted.

With region SCP: Request immediately denied. "Access Denied: Region ap-south-1 is not approved." $0 wasted.

SCPs are guardrails that prevent costly mistakes.

Azure Management Groups

Structure:
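A typical hierarchy, loosely following the Cloud Adoption Framework (names are illustrative):

```
Tenant Root Group
├── Platform MG          (identity, management, connectivity subscriptions)
├── Landing Zones MG
│   ├── Corp MG
│   └── Online MG
├── Sandbox MG
└── Decommissioned MG
```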

Azure Policy at Management Group Level:

Example - Require Encryption:
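A sketch of a custom policy rule denying storage accounts without infrastructure encryption (built-in policies cover most encryption requirements; the alias shown is one option):

```json
{
  "policyRule": {
    "if": {
      "allOf": [
        { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
        { "field": "Microsoft.Storage/storageAccounts/encryption.requireInfrastructureEncryption", "notEquals": true }
      ]
    },
    "then": { "effect": "deny" }
  }
}
```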

Example - Allowed Regions:
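The built-in "Allowed locations" policy does this; its rule boils down to something like the following (the region list is illustrative — the real built-in is parameterized):

```json
{
  "policyRule": {
    "if": {
      "not": { "field": "location", "in": ["eastus", "westeurope"] }
    },
    "then": { "effect": "deny" }
  }
}
```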

Terraform for Azure Policy Assignment:
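A sketch of assigning the built-in "Allowed locations" policy at a management group (the management group name is an assumption):

```hcl
resource "azurerm_management_group_policy_assignment" "allowed_locations" {
  name                 = "allowed-locations"
  management_group_id  = azurerm_management_group.landing_zones.id
  policy_definition_id = "/providers/Microsoft.Authorization/policyDefinitions/e56962a6-4747-49cd-b67b-bf8b01975c4c"

  parameters = jsonencode({
    listOfAllowedLocations = { value = ["eastus", "westeurope"] }
  })
}
```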

GCP Folder Hierarchies

Structure:
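A typical folder layout (names are illustrative):

```
Organization
├── fldr-platform        (networking, logging, shared services projects)
├── fldr-production
├── fldr-staging
└── fldr-development
```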

Organization Policies:

Example - Restrict External IPs:
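A sketch using the `constraints/compute.vmExternalIpAccess` constraint in Terraform:

```hcl
resource "google_organization_policy" "no_external_ips" {
  org_id     = var.org_id
  constraint = "constraints/compute.vmExternalIpAccess"

  list_policy {
    deny {
      all = true # no VM may have an external IP unless explicitly exempted
    }
  }
}
```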

Example - Allowed Regions:
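A sketch using `constraints/gcp.resourceLocations` (the `in:` value groups are shortcuts for US and EU locations):

```hcl
resource "google_organization_policy" "allowed_locations" {
  org_id     = var.org_id
  constraint = "constraints/gcp.resourceLocations"

  list_policy {
    allow {
      values = ["in:us-locations", "in:eu-locations"]
    }
  }
}
```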

Hierarchy Design Best Practices

Principle 1: Policy Inheritance

Policies flow downward and accumulate:

Principle 2: Start Restrictive, Selectively Allow

Principle 3: Separate Platform from Workloads

My Painful Lesson:

At one company, we put ALL accounts under the same OU with the same policies.

Problem: Network team needed to create VPCs, but SCP denied VPC creation (security team added this).

Workaround: Security team exempted specific IAM roles from SCP.

Result: Complex SCP with 40+ exemptions. Impossible to understand. Security gaps from over-complicated logic.

The Fix: Separate OUs for platform vs workloads:

  • Platform OU: Different policies (allows infrastructure provisioning)

  • Workload OU: Standard restrictive policies

Lesson: Design OUs around different policy needs, not arbitrary organizational structure.


Hub-and-Spoke Network Topology

Network architecture is the foundation of your landing zone. Get it wrong, and you'll spend years untangling it.

The Problem: Mesh Networking Doesn't Scale

Mesh (Point-to-Point) Peering:


Number of peering connections:

  • 4 accounts = 6 peering connections

  • 10 accounts = 45 peering connections

  • 50 accounts = 1,225 peering connections (!!)

  • 100 accounts = 4,950 peering connections (!!!)

Formula: n * (n-1) / 2

Problems:

  • ❌ Management nightmare

  • ❌ IP address conflicts

  • ❌ No centralized security inspection

  • ❌ Difficult to implement shared services

  • ❌ Scales as O(n²)

The Solution: Hub-and-Spoke

Hub-and-Spoke Topology:


Number of connections:

  • 4 accounts = 4 connections to hub

  • 10 accounts = 10 connections

  • 50 accounts = 50 connections

  • 100 accounts = 100 connections

Scales as O(n) instead of O(n²)!

AWS Implementation - Transit Gateway

Architecture:

Routing Rules:
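A sketch of the hub and one spoke attachment in Terraform, with a dedicated route table so prod spokes can't reach dev spokes (variable names are assumptions):

```hcl
resource "aws_ec2_transit_gateway" "hub" {
  description                     = "Landing zone hub"
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
}

# Attach a spoke VPC to the hub
resource "aws_ec2_transit_gateway_vpc_attachment" "prod_a" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  vpc_id             = var.prod_a_vpc_id
  subnet_ids         = var.prod_a_subnet_ids
}

# Environment-specific route table controls which spokes can talk
resource "aws_ec2_transit_gateway_route_table" "prod" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
}

resource "aws_ec2_transit_gateway_route_table_association" "prod_a" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.prod_a.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}
```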

Azure Implementation - Hub VNet with Peering

Architecture:
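A sketch of one hub-spoke peering pair in Terraform (network and resource group names are assumptions; peering is created in both directions):

```hcl
resource "azurerm_virtual_network_peering" "hub_to_spoke1" {
  name                      = "peer-hub-to-spoke1"
  resource_group_name       = var.hub_rg_name
  virtual_network_name      = azurerm_virtual_network.hub.name
  remote_virtual_network_id = azurerm_virtual_network.spoke1.id
  allow_forwarded_traffic   = true # let spoke traffic transit the hub firewall
}

resource "azurerm_virtual_network_peering" "spoke1_to_hub" {
  name                      = "peer-spoke1-to-hub"
  resource_group_name       = var.spoke1_rg_name
  virtual_network_name      = azurerm_virtual_network.spoke1.name
  remote_virtual_network_id = azurerm_virtual_network.hub.id
  allow_forwarded_traffic   = true
}
```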

Traffic Flow with Inspection

Spoke-to-Spoke Traffic (with Firewall Inspection):


User-Defined Routes (UDRs) Force Traffic Through Firewall:
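A sketch of a route table that sends all spoke egress through the hub firewall (the firewall IP is an assumption):

```hcl
resource "azurerm_route_table" "spoke" {
  name                = "rt-spoke-to-hub"
  location            = var.location
  resource_group_name = var.spoke_rg_name

  route {
    name                   = "default-via-firewall"
    address_prefix         = "0.0.0.0/0"
    next_hop_type          = "VirtualAppliance"
    next_hop_in_ip_address = "10.0.1.4" # Azure Firewall private IP (assumption)
  }
}

resource "azurerm_subnet_route_table_association" "spoke" {
  subnet_id      = var.spoke_subnet_id
  route_table_id = azurerm_route_table.spoke.id
}
```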

IP Address Planning

Critical for Success:

| Network Segment | CIDR Block | Addresses | Purpose |
| --- | --- | --- | --- |
| Hub VNet | 10.0.0.0/16 | 65,536 | Central connectivity |
| Production Spoke 1 | 10.1.0.0/16 | 65,536 | App A production |
| Production Spoke 2 | 10.2.0.0/16 | 65,536 | App B production |
| Staging Spoke | 10.10.0.0/16 | 65,536 | Staging environments |
| Development Spoke | 10.20.0.0/16 | 65,536 | Development sandboxes |
| Shared Services | 10.100.0.0/16 | 65,536 | DNS, AD, Monitoring |
| On-Premises | 192.168.0.0/16 | 65,536 | Corporate network |

Rules:

  • No overlaps: Each spoke has unique CIDR

  • Room to grow: Leave gaps for future spokes

  • Consistent sizing: Similar-sized blocks for easier management

  • Document everything: IP address registry is essential

My Painful Story:

At one company, they didn't plan IP addresses. Each team picked their own CIDRs.

Result:

  • 40% of VPCs used 10.0.0.0/24

  • Couldn't peer them (overlapping IPs)

  • Had to renumber 15 VPCs (months of work)

  • Broke applications during migration

  • Cost: 3 engineers × 4 months = $180,000

The lesson: Spend 1 day planning IP addresses to avoid months of rework.


Resource Organization Strategies

Beyond accounts/subscriptions, how do you organize resources within each account?

Tagging Strategy

Tags are metadata attached to resources. They enable:

  • Cost allocation

  • Resource discovery

  • Automation

  • Compliance tracking

Required Tags (Mandatory on All Resources):

| Tag Key | Values | Purpose | Example |
| --- | --- | --- | --- |
| Environment | prod, staging, dev | Environment classification | Environment=prod |
| CostCenter | Business unit code | Cost allocation | CostCenter=engineering |
| Owner | Email or team name | Accountability | |
| Application | Application name | Workload identification | Application=payment-api |
| DataClassification | public, internal, confidential, restricted | Security and compliance | DataClassification=confidential |
| Compliance | pci, hipaa, sox, none | Regulatory requirements | Compliance=pci |

Optional Tags (Recommended):

| Tag Key | Purpose | Example |
| --- | --- | --- |
| Project | Project tracking | Project=mobile-app-redesign |
| Terraform | IaC management | Terraform=true |
| BackupPolicy | Backup requirements | BackupPolicy=daily-7day-retention |
| ManagedBy | Automation tool | ManagedBy=terraform |
| CreatedDate | Resource creation tracking | CreatedDate=2024-01-15 |
| ExpirationDate | Cleanup automation | ExpirationDate=2024-06-30 |

Terraform - Enforce Tagging:
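One way to guarantee tags on everything Terraform manages is provider-level `default_tags` (a baseline rather than hard enforcement — pair it with an SCP or tag policy for that):

```hcl
provider "aws" {
  region = var.region

  default_tags {
    tags = {
      Environment = var.environment
      CostCenter  = var.cost_center
      Owner       = var.owner
      ManagedBy   = "terraform"
    }
  }
}
```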

Azure - Require tags via Policy:
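A sketch of a custom policy that denies resources missing a CostCenter tag (the built-in "Require a tag on resources" policy is the parameterized equivalent):

```json
{
  "policyRule": {
    "if": {
      "field": "tags['CostCenter']",
      "exists": "false"
    },
    "then": { "effect": "deny" }
  }
}
```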

Automated Tagging in Terraform:
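A common pattern is merging shared tags with per-resource ones (names are illustrative):

```hcl
locals {
  common_tags = {
    Environment = var.environment
    CostCenter  = var.cost_center
    Owner       = var.owner
  }
}

resource "aws_s3_bucket" "data" {
  bucket = "acme-${var.environment}-data"

  tags = merge(local.common_tags, {
    Application        = "data-pipeline"
    DataClassification = "confidential"
  })
}
```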

Naming Conventions

Consistent naming makes resources discoverable and reduces confusion.

Pattern: {scope}-{env}-{app}-{qualifier}, adapted per resource type (see the tables below).

Account/Subscription Names: {org}-{env}-{purpose} — for example, acme-prod-payments.

VPC/VNet Names: {env}-{purpose}-vpc — for example, prod-app-vpc.

Terraform - Enforce Naming via Validation:
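Variable validation blocks catch bad names at plan time (the regex is illustrative):

```hcl
variable "environment" {
  type = string

  validation {
    condition     = contains(["prod", "staging", "dev"], var.environment)
    error_message = "Environment must be prod, staging, or dev."
  }
}

variable "app_name" {
  type = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{2,20}$", var.app_name))
    error_message = "App name must be lowercase letters, digits, and hyphens."
  }
}
```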

Resource Group Strategy (Azure)

In Azure, Resource Groups are containers for related resources.

Pattern 1: One Resource Group per Environment per Application

Pattern 2: Separate by Resource Lifecycle

Terraform - Azure Resource Groups:
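A sketch of per-environment resource groups following the rg-{env}-{app}-{location} pattern:

```hcl
resource "azurerm_resource_group" "api" {
  for_each = toset(["prod", "staging", "dev"])

  name     = "rg-${each.key}-api-eastus"
  location = "eastus"

  tags = {
    Environment = each.key
    Application = "api"
  }
}
```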


Naming Conventions and Standards

Naming conventions are essential for resource discovery, automation, and reducing operational errors.

Core Principles

1. Predictable: Anyone should be able to guess the name format

2. Descriptive: Name should convey purpose

3. Unique: No name collisions

4. Sortable: Logical alphabetical ordering

5. Automatable: Easy to generate programmatically

Standard Naming Pattern

General Format: {resource-type}-{env}-{app}-{qualifier} (order and separators vary by provider — see the tables below).

Examples: rg-prod-api-eastus, vm-prod-web-01, prod-api-web-sg.

Cloud Provider Naming

AWS Resources:

| Resource | Pattern | Example |
| --- | --- | --- |
| S3 Bucket | {org}-{env}-{purpose}-{random} | acme-prod-data-lake-x7k2 |
| EC2 Instance | {env}-{app}-{role}-{az}-{num} | prod-api-web-1a-01 |
| RDS Instance | {env}-{app}-{purpose}-{num} | prod-api-db-01 |
| Lambda Function | {env}-{app}-{purpose} | prod-api-order-processor |
| VPC | {env}-{purpose}-vpc | prod-app-vpc |
| Security Group | {env}-{app}-{purpose}-sg | prod-api-web-sg |

Azure Resources:

| Resource | Pattern | Example |
| --- | --- | --- |
| Resource Group | rg-{env}-{app}-{location} | rg-prod-api-eastus |
| Virtual Machine | vm-{env}-{app}-{num} | vm-prod-web-01 |
| SQL Database | sqldb-{env}-{app} | sqldb-prod-orders |
| Storage Account | st{env}{app}{random} | stprodapix7k2 (no hyphens allowed) |
| App Service | app-{env}-{app} | app-prod-api |
| Key Vault | kv-{env}-{app} | kv-prod-api |

GCP Resources:

| Resource | Pattern | Example |
| --- | --- | --- |
| Project | {org}-{env}-{app} | acme-prod-api |
| Compute Instance | {env}-{app}-{role}-{num} | prod-api-web-01 |
| Cloud Storage Bucket | {org}-{env}-{purpose} | acme-prod-backups |
| Cloud SQL Instance | {env}-{app}-db-{num} | prod-api-db-01 |

Terraform - Automated Naming
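A sketch of generating names once and reusing them (prefix and resource names are illustrative):

```hcl
locals {
  name_prefix = "${var.environment}-${var.app_name}" # e.g. "prod-api"
}

resource "aws_security_group" "web" {
  name   = "${local.name_prefix}-web-sg" # "prod-api-web-sg"
  vpc_id = var.vpc_id
}

resource "random_id" "suffix" {
  byte_length = 2
}

resource "aws_s3_bucket" "data" {
  # globally unique bucket names need a random suffix
  bucket = "acme-${local.name_prefix}-data-${random_id.suffix.hex}"
}
```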


Tagging Strategy

Already covered extensively in Resource Organization Strategies section above.


Multi-Region Architecture

Design Patterns

Pattern 1: Active-Active (High Availability)

Deploy fully operational instances in multiple regions, load-balanced globally.

Use Cases:

  • Global SaaS applications

  • Maximum uptime requirements (99.99%+)

  • Minimize latency for global users

Trade-offs:

  • Higher cost (duplicate infrastructure)

  • Complex data synchronization

  • More operational complexity

Pattern 2: Active-Passive (Disaster Recovery)

Primary region active, secondary region on standby.

Use Cases:

  • Cost optimization

  • Regulatory requirements (data residency)

  • RTO 15-60 minutes acceptable

Trade-offs:

  • Lower cost (minimal standby capacity)

  • Longer recovery time

  • Periodic testing required

Pattern 3: Region-Specific Services

Different regions serve different purposes or customer segments.

Use Cases:

  • Data residency requirements (GDPR, China data laws)

  • Compliance segregation

  • Market-specific features

Implementation Considerations

Data Synchronization:

  • Synchronous replication: Zero data loss, higher latency

  • Asynchronous replication: Potential data loss, lower latency

  • Conflict resolution: Last-write-wins, vector clocks, application logic

DNS Routing:

  • Geolocation: Route based on user's geographic location

  • Latency-based: Route to lowest latency endpoint

  • Failover: Automatic failover to healthy region

  • Weighted: Control traffic distribution

Cost Optimization:

  • Cross-region data transfer: $0.02/GB (expensive at scale)

  • Replicate only essential data

  • Use content delivery networks (CDNs) for static content

  • Consider active-passive for non-critical workloads


Disaster Recovery Considerations

Recovery Objectives

RTO (Recovery Time Objective): How long can the business tolerate downtime?

RPO (Recovery Point Objective): How much data can the business afford to lose?

Example:

  • E-commerce site during holiday season: RTO=15 minutes, RPO=1 minute

  • Internal HR system: RTO=4 hours, RPO=24 hours

DR Strategies

Backup & Restore (Lowest Cost, Slowest Recovery)

  • RTO: Hours to days

  • RPO: Hours

  • Cost: Low (storage only)

  • Implementation: Automated backups, cross-region replication

Pilot Light (Minimal Core)

  • RTO: 10s of minutes

  • RPO: Minutes

  • Cost: Medium (core infrastructure running)

  • Implementation: Database replicas, AMIs ready, scale on demand

Warm Standby (Reduced Capacity)

  • RTO: Minutes

  • RPO: Seconds

  • Cost: Medium-High (scaled-down production)

  • Implementation: Minimal compute running, auto-scale on failover

Active-Active (Zero Downtime)

  • RTO: Seconds (automatic)

  • RPO: Near-zero

  • Cost: High (full duplication)

  • Implementation: Global load balancing, data replication

Testing DR

Monthly:

  • Restore backups to test environment

  • Verify backup integrity

  • Document restoration time

Quarterly:

  • Activate pilot light/warm standby

  • Test application functionality

  • Measure actual RTO/RPO

Annually:

  • Full DR failover simulation

  • Executive-level tabletop exercise

  • Update runbooks based on learnings


Scalability Patterns

Design for Growth

Start Small, Scale Incrementally:

  • Begin with 10-20 accounts

  • Establish patterns that work at 100+ accounts

  • Automate from day one (even if "overkill" initially)

Account Vending Automation

Self-Service Portal:

  • Team requests account via form

  • Automated approval workflow

  • Terraform provisions account + baseline

  • Account ready in 15 minutes

Benefits:

  • Platform team doesn't bottleneck growth

  • Consistent configuration every time

  • Scales to 100s of accounts without growing team

Infrastructure as Code Modularity

Terraform Module Structure:
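One layout that keeps modules reusable across accounts (directory names are assumptions):

```
terraform/
├── modules/
│   ├── account-baseline/    # IAM roles, CloudTrail, Config, budgets
│   ├── network-spoke/       # VPC, subnets, TGW attachment, routes
│   └── security-baseline/   # GuardDuty, logging, guardrail policies
└── live/
    ├── prod/
    ├── staging/
    └── dev/
```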

Module Reuse:
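Each environment instantiates shared modules with different inputs (module name and variables are illustrative):

```hcl
module "spoke_network" {
  source = "../../modules/network-spoke"

  environment = "prod"
  cidr_block  = "10.1.0.0/16"
  # plus any app-specific inputs
}
```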

Policy as Code

Manage all policies in Git, deploy via CI/CD.

Benefits:

  • Version control for policies

  • Code review for policy changes

  • Automated testing

  • Consistent deployment


Common Architecture Anti-Patterns

Anti-Pattern 1: Everything in One Account

Symptoms:

  • Production, staging, dev in one account

  • Multiple teams sharing same account

  • No blast radius isolation

Consequences:

  • Developer error affects production

  • Can't segregate costs

  • Compliance violations

  • One policy for all environments (conflicts)

Fix: Separate accounts by environment and team.

Anti-Pattern 2: No Automation

Symptoms:

  • Manual account setup (50-page runbook)

  • Clicking through console

  • Copy-paste configuration

Consequences:

  • Takes weeks to provision accounts

  • Inconsistent configuration

  • Human errors

  • Doesn't scale

Fix: Automate account provisioning with Terraform.

Anti-Pattern 3: Unplanned IP Addresses

Symptoms:

  • Teams pick their own CIDRs

  • Overlapping IP ranges

  • Can't peer VPCs

Consequences:

  • Network connectivity impossible

  • Have to renumber VPCs (months of work)

  • Applications break during migration

Fix: Centralized IP address management (IPAM).

Anti-Pattern 4: Security as Afterthought

Symptoms:

  • "We'll add MFA later"

  • Unencrypted databases

  • No logging

  • Weak IAM policies

Consequences:

  • Security incidents

  • Compliance failures

  • Audit findings

  • Difficult to retrofit security

Fix: Security baseline mandatory from day one.

Anti-Pattern 5: No DR Plan

Symptoms:

  • Backups disabled ("too expensive")

  • Single region deployment

  • No failover testing

Consequences:

  • Data loss in outages

  • Long recovery times

  • Business impact

  • Regulatory violations

Fix: Define RTO/RPO, implement appropriate DR strategy, test regularly.

Anti-Pattern 6: Over-Centralization

Symptoms:

  • Platform team controls everything

  • 2-week wait for account provisioning

  • Teams can't self-service

Consequences:

  • Platform team is bottleneck

  • Slow development velocity

  • Shadow IT (teams work around controls)

  • Frustrated developers

Fix: Self-service with guardrails (automated provisioning, policy enforcement).

Anti-Pattern 7: No Tagging

Symptoms:

  • Resources without tags

  • Can't identify ownership

  • Can't allocate costs

Consequences:

  • Surprise bills

  • Can't optimize spending

  • Can't find resources

  • No accountability

Fix: Mandatory tagging policy enforced via SCPs/Azure Policy.


What I Learned About Landing Zone Design

Lesson 1: Design for Future Scale

Don't optimize for 10 accounts if you'll have 100 in 2 years. Reorganization is expensive and disruptive.

Action: Design OU/MG structure for anticipated growth, even if "overkill" initially.

Lesson 2: Security Isn't Optional

Build security into the foundation. Retrofitting is 10x harder than building it in.

Action: Mandatory encryption, logging, MFA enforced via SCPs/Policies from day one.

Lesson 3: Automation Unlocks Scale

Manual processes break at 50 accounts. Automated processes scale to 1,000+.

Action: Invest in account vending automation early, even if it seems excessive initially.

Lesson 4: Network Planning is Critical

One day of IP planning saves months of renumbering later.

Action: Create IP address management plan before deploying first workload.

Lesson 5: Observability is Non-Negotiable

Centralized logging saved us during security incidents. It's not optional.

Action: Centralized logging to immutable storage, SIEM integration, CloudTrail/Activity Logs mandatory.

Lesson 6: Test Your DR

Untested DR is fantasy. Schedule regular DR tests.

Action: Monthly backup restoration, quarterly DR activation, annual full failover.

Lesson 7: Policy as Code

Manage governance as code for version control, review, and automation.

Action: All SCPs/Azure Policies in Git, deployed via Terraform/CI-CD.

Lesson 8: Balance Control and Autonomy

Too much control: slow, frustrated teams. Too much autonomy: chaos, security gaps.

Action: Self-service account provisioning with policy guardrails. Teams can provision, but can't bypass security.


Next Up: Identity, Access, and Security Foundations

In Article 3, we'll dive deep into IAM strategy, SSO implementation, zero-trust architecture, and building security into every layer of your landing zone.

Ready to secure your cloud? Let's go! 🔐
