Landing Zone Design Principles and Architecture Patterns

Introduction

Working with cloud infrastructure at scale has taught me that proper architecture and design principles are the difference between manageable and chaotic cloud environments.

I've worked on several projects where organizations accumulated AWS accounts over time without a clear organizational structure. Common challenges I encountered include:

  • Accounts from discontinued projects remaining active

  • Overlapping IP address ranges preventing network connectivity

  • Inconsistent naming conventions making resources hard to identify

  • Flat organizational structures with no logical grouping

  • Difficulty tracking costs and attributing spending

  • Compliance challenges from inconsistent configurations

These experiences highlighted critical gaps in cloud architecture that stem from not establishing clear design principles upfront. Through redesigning and implementing landing zone architectures across various organizations, I've identified the fundamental principles that create scalable, manageable cloud environments.

This article shares the design principles and architecture patterns I've learned through hands-on experience - what works at scale, what patterns enable growth, and what anti-patterns to avoid.


Core Design Principles

After building and rebuilding landing zones for years, I've distilled the essential principles that separate successful architectures from chaotic ones.

Principle 1: Security by Default

The Principle: Security controls should be automatically applied, not optionally configured.

What This Means:

  • Encryption at rest is required, not suggested

  • Network access is denied by default, allowed by exception

  • Logging is always on, can't be disabled

  • MFA is enforced, not recommended

Why It Matters:

At one company, we made security "configurable." Teams could choose whether to encrypt their databases, enable logging, or use MFA.

Result: 40% of accounts had unencrypted databases. 25% had logging disabled. Root accounts with no MFA.

The Fix: We changed security from opt-in to mandatory:

  • Service Control Policies (SCPs) prevent creating unencrypted resources

  • CloudTrail can't be disabled (SCP prevents it)

  • MFA required for all human access (enforced at identity provider)

Real Example - AWS SCP for Encryption:
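A minimal sketch of such an SCP (statement IDs are illustrative; `s3:x-amz-server-side-encryption` and `ec2:Encrypted` are the relevant condition keys):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedS3Uploads",
      "Effect": "Deny",
      "Action": "s3:PutObject",
      "Resource": "*",
      "Condition": {
        "Null": { "s3:x-amz-server-side-encryption": "true" }
      }
    },
    {
      "Sid": "DenyUnencryptedEBSVolumes",
      "Effect": "Deny",
      "Action": "ec2:CreateVolume",
      "Resource": "*",
      "Condition": {
        "Bool": { "ec2:Encrypted": "false" }
      }
    }
  ]
}
```

Attached at the OU level, this denies S3 uploads that omit a server-side encryption header and EBS volumes created without encryption, regardless of the caller's IAM permissions.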

The Result: Security violations dropped from 40+ per week to 0-2 per week (and those 2 were quickly caught and fixed).

Principle 2: Separation of Concerns

The Principle: Different types of workloads and teams should be isolated from each other.

Separation Dimensions:

1. Environment Separation

  • Production ≠ Staging ≠ Development

  • No shared resources between environments

  • Different access controls

  • Different change management processes

2. Security Boundary Separation

  • Compliance-regulated data isolated from non-regulated

  • PCI workloads separate from non-PCI

  • HIPAA data isolated

  • Multi-tenant SaaS: Customer A isolated from Customer B

3. Team/Business Unit Separation

  • Marketing team ≠ Engineering team

  • Product A ≠ Product B

  • Each team can operate independently

4. Function Separation

  • Networking separate from compute

  • Logging separate from applications

  • Security services separate from workloads

Visual: Separation Architecture


Why It Matters:

At one company, they had a single "shared" account for dev, staging, and prod. A developer testing a database migration script accidentally ran it against production. $180,000 in lost orders before we caught it.

After separation: Impossible to accidentally affect prod from dev (different accounts, different credentials, different networks).

Principle 3: Least Privilege Access

The Principle: Users and services get minimum necessary permissions, nothing more.

Implementation:

Role-Based Access Control (RBAC):

Time-Bound Access:

  • Admin access granted for 4 hours, then expires

  • Emergency access auto-expires after 2 hours

  • All privileged access logged and reviewed
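With AWS IAM Identity Center, time-bound sessions can be sketched via permission set session durations (names and durations here are illustrative):

```hcl
resource "aws_ssoadmin_permission_set" "developer" {
  name             = "Developer"
  instance_arn     = var.sso_instance_arn
  session_duration = "PT8H" # normal working session
}

resource "aws_ssoadmin_permission_set" "break_glass_admin" {
  name             = "BreakGlassAdmin"
  instance_arn     = var.sso_instance_arn
  session_duration = "PT4H" # admin access expires after 4 hours
}
```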

Real Example - AWS IAM Policy for Developer:
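A trimmed-down sketch of a least-privilege developer policy (the service list and tag condition are illustrative, and not every action supports resource-tag conditions):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowDevTaggedResources",
      "Effect": "Allow",
      "Action": ["ec2:*", "logs:*", "cloudwatch:*"],
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/Environment": "dev" }
      }
    },
    {
      "Sid": "DenyIdentityAndOrgChanges",
      "Effect": "Deny",
      "Action": ["iam:*", "organizations:*"],
      "Resource": "*"
    }
  ]
}
```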

Principle 4: Automation Over Manual Processes

The Principle: If humans do it more than twice, automate it.

What to Automate:

Account/Subscription Provisioning:

  • ❌ Manual: Fill out form, wait 3 days for networking team, wait 2 days for security review

  • ✅ Automated: Self-service portal, account created in 15 minutes with all controls

Security Baseline Configuration:

  • ❌ Manual: Follow 50-page runbook, miss steps, inconsistent results

  • ✅ Automated: Terraform applies identical baseline every time

Compliance Scanning:

  • ❌ Manual: Weekly reviews, spreadsheets, missed violations

  • ✅ Automated: Continuous scanning, automatic remediation, alerts on drift

Cost Reporting:

  • ❌ Manual: Monthly manual cost allocation, Excel gymnastics

  • ✅ Automated: Automated tagging, real-time chargeback dashboards

Why It Matters:

Manual processes don't scale:

  • 10 accounts: Manual is fine (annoying but manageable)

  • 50 accounts: Manual is painful (full-time job for one person)

  • 100+ accounts: Manual is impossible (team of people can't keep up)

Automated processes scale with near-constant effort:

  • 10 accounts: Same effort as 100 accounts

  • Platform team size doesn't grow with account count

  • Consistency guaranteed (computers don't forget steps)

Principle 5: Observable and Auditable

The Principle: Everything that happens should be logged, searchable, and traceable.

What to Log:

API Calls (CloudTrail, Activity Logs, Audit Logs):

  • Who did what, when, from where

  • Successful and failed attempts

  • Changes to infrastructure

  • Access to sensitive data

Network Traffic (VPC Flow Logs):

  • Source and destination IPs

  • Ports and protocols

  • Accepted and rejected connections

  • Traffic patterns for anomaly detection

Application Logs:

  • Application events

  • Errors and exceptions

  • Performance metrics

  • User actions

Security Events:

  • Login attempts (successful and failed)

  • Permission changes

  • Security group modifications

  • Encryption key usage

Implementation:

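A sketch of the centralized-logging baseline in Terraform (bucket and trail names are assumptions):

```hcl
resource "aws_cloudtrail" "org" {
  name                       = "org-trail"
  s3_bucket_name             = var.audit_log_bucket # central, immutable log bucket
  is_organization_trail      = true                 # covers every account in the org
  is_multi_region_trail      = true
  enable_log_file_validation = true                 # tamper evidence
}

resource "aws_flow_log" "vpc" {
  vpc_id               = var.vpc_id
  traffic_type         = "ALL" # accepted and rejected traffic
  log_destination      = var.flow_log_bucket_arn
  log_destination_type = "s3"
}
```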

Why It Matters:

Story time: Security incident at 2am. Unauthorized EC2 instance running cryptocurrency mining.

Without centralized logging:

  • Took 6 hours to find which account

  • Another 4 hours to determine how it was created

  • Never definitively identified the attack vector

  • Cost: $14,000 in compute + 10 hours of engineer time

With centralized logging:

  • Query: "Show me all EC2 RunInstances calls in last 24 hours"

  • Found suspicious API call in 5 minutes

  • Traced to compromised access key

  • Revoked key, terminated instances

  • Total time: 20 minutes

  • Cost: $200 in compute

The logging made the difference between a $14K+ incident and a $200 nuisance.

Principle 6: Immutable Infrastructure

The Principle: Infrastructure is replaced, not modified.

What This Means:

❌ Mutable (Old Way):

  • SSH into server

  • Modify configuration files

  • Install packages

  • Server is now a "snowflake" (unique, undocumented)

✅ Immutable (New Way):

  • Define infrastructure in Terraform

  • Deploy new version

  • Destroy old version

  • Infrastructure is code (documented, version-controlled)

Benefits:

Reproducibility:

  • Can recreate entire environment from code

  • No drift (what's deployed matches what's in Git)

  • Disaster recovery is terraform apply

Auditability:

  • All changes in Git history

  • Code review for infrastructure changes

  • Rollback is git revert + terraform apply

Testing:

  • Test infrastructure changes in dev/staging first

  • Validate before applying to production

  • Catch errors before they impact users

Principle 7: Defense in Depth

The Principle: Multiple layers of security, no single point of failure.

Security Layers:

Why It Matters:

If one layer fails, others provide protection:

  • Attacker bypasses firewall → Network segmentation limits access

  • Attacker compromises credentials → MFA prevents login

  • Attacker gains access → Monitoring detects anomaly

  • Attacker modifies resources → Audit logs provide evidence

Real Example:

Attack timeline with defense in depth:

  1. 10:00: Phishing email sent to developer

  2. 10:15: Developer clicks malicious link

  3. 10:20: Attacker harvests credentials

  4. 10:25: Attacker attempts login → MFA blocks (Layer 3)

  5. 10:30: Attacker attempts to bypass MFA → Anomaly detected (Layer 2)

  6. 10:35: Security team alerted → Credentials revoked (Layer 3)

  7. 10:40: Incident contained, zero damage

Without defense in depth: Attacker gains access at 10:25, unknown duration until discovery.

Principle 8: Cattle, Not Pets

The Principle: Infrastructure is disposable and replaceable, not precious and unique.

The Pet Model (Old Way):

  • Servers have names ("prod-db-01")

  • Manually configured and maintained

  • When broken, we nurse them back to health

  • Fear of deleting them (might break something)

  • Irreplaceable

The Cattle Model (New Way):

  • Resources are numbered ("web-server-0042")

  • Automatically provisioned from code

  • When broken, we destroy and replace

  • No fear of deletion (can recreate anytime)

  • Disposable

Implementation:

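The cattle model sketched in Terraform — instances come from a launch template and are replaced, never repaired (AMI and subnet variables are assumptions):

```hcl
resource "aws_launch_template" "web" {
  name_prefix   = "web-server-"
  image_id      = var.ami_id # immutable, pre-baked image
  instance_type = "t3.small"
}

resource "aws_autoscaling_group" "web" {
  min_size            = 2
  max_size            = 10
  desired_capacity    = 3
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  # Unhealthy instances are terminated and replaced automatically
  health_check_type = "ELB"
}
```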

Why It Matters:

Disaster Recovery:

  • Pets: Must restore specific server (slow, risky)

  • Cattle: Launch new instances from code (fast, reliable)

Scaling:

  • Pets: Manually provision new servers (weeks)

  • Cattle: Auto-scaling handles it (minutes)

Updates:

  • Pets: SSH and update each server (error-prone)

  • Cattle: Deploy new version, drain old instances


Account/Subscription Organization Patterns

The foundation of landing zone design is how you organize accounts or subscriptions. Get this wrong, and everything built on top is compromised.

Pattern 1: Environment-Based Organization

Structure:
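An illustrative hierarchy for this pattern (OU and account names are assumptions):

```
Root
├── Security OU
├── Infrastructure OU
├── Production OU
│   ├── prod-app-a
│   └── prod-app-b
├── Staging OU
└── Development OU
```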

When to Use:

  • Small to medium organizations (<100 accounts)

  • Single product or closely related products

  • Simple organizational structure

Advantages:

  • ✅ Clear environment separation

  • ✅ Easy to understand

  • ✅ Simple policy application (all prod accounts get prod policies)

Disadvantages:

  • ❌ Doesn't scale well with multiple business units

  • ❌ Mixed teams in same OU (marketing and engineering both in prod)

  • ❌ Hard to allocate costs to different business units

Real Example: a SaaS startup with a single product — one OU per environment, with a handful of accounts in each.

Pattern 2: Business Unit Organization

Structure:

When to Use:

  • Multiple business units with independent P&Ls

  • Distributed teams (different geographies, different management)

  • Need for cost allocation by business unit

Advantages:

  • ✅ Clear cost ownership

  • ✅ Independent operation (BU-A can't affect BU-B)

  • ✅ Different policies per business unit (if needed)

Disadvantages:

  • ❌ Potential duplication (each BU builds own shared services)

  • ❌ Less standardization across org

  • ❌ More complex to manage

Real Example: an e-commerce company with multiple brands — one OU per brand, each brand team operating its accounts independently.

Pattern 3: Hybrid (Environment × Business Unit)

Structure:

When to Use:

  • Medium to large organizations (100-500 accounts)

  • Need both environment policies AND business unit separation

  • Centralized platform team with distributed application teams

Advantages:

  • ✅ Environment-level policies (all prod accounts locked down)

  • ✅ Business unit cost allocation (tagging)

  • ✅ Combines benefits of both patterns

Disadvantages:

  • ❌ More complex hierarchy

  • ❌ Two dimensions to manage (environment AND business unit)

Pattern 4: Workload-Based Organization

Structure:

When to Use:

  • Specialized workload types with different requirements

  • Different compliance needs (PCI vs HIPAA vs general)

  • Different architectural patterns per workload type

Advantages:

  • ✅ Workload-specific policies (data lake accounts get different policies than web accounts)

  • ✅ Specialized configuration per workload type

  • ✅ Clear functional boundaries

Disadvantages:

  • ❌ Can be confusing for teams ("where does this workload go?")

  • ❌ Doesn't inherently separate environments

After trying all these patterns, here's what I recommend:

  • For startups and small companies (<50 accounts): environment-based organization — simple and sufficient

  • For mid-size companies (50-200 accounts): hybrid (environment × business unit) — environment-level policies plus business unit cost ownership

  • For large enterprises (200+ accounts): hybrid, adding workload-based OUs where compliance requires it (PCI, HIPAA)

My Hard-Learned Lesson

At one company, we started with environment-based (simple). As we grew to 8 business units, we tried to retrofit business unit organization.

The migration was a nightmare:

  • 6 months to reorganize 200 accounts

  • Broke network connectivity during migration

  • Confused teams ("wait, which OU is my account in now?")

  • Cost: 4 engineers × 6 months = $360,000 in labor

The lesson: Design for your 3-year state, not your current state.

If you anticipate multiple business units, design for it from day one. Reorganizing later is 10x harder.


Management Group and OU Hierarchies

The management hierarchy defines how policies and governance flow down through your organization. This is where you enforce standards that can't be violated.

AWS Organizations - Organizational Units (OUs)

Structure:
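A typical layout, loosely following AWS's multi-account guidance (names are illustrative):

```
Management Account (org root)
├── Security OU          (log-archive, security-tooling)
├── Infrastructure OU    (network, shared-services)
├── Workloads OU
│   ├── Production OU
│   └── Non-Production OU
├── Sandbox OU
└── Suspended OU         (quarantine for decommissioned accounts)
```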

Service Control Policies (SCPs) - The Guardrails:

SCPs are the ultimate authority. Even root users can't bypass SCPs.

Example - Prevent Leaving Organization:
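A minimal sketch:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyLeavingOrganization",
      "Effect": "Deny",
      "Action": "organizations:LeaveOrganization",
      "Resource": "*"
    }
  ]
}
```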

Example - Require Encryption:
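For example, denying unencrypted RDS instances (`rds:StorageEncrypted` is the relevant condition key; the same pattern works for EBS and S3):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedRDS",
      "Effect": "Deny",
      "Action": "rds:CreateDBInstance",
      "Resource": "*",
      "Condition": {
        "Bool": { "rds:StorageEncrypted": "false" }
      }
    }
  ]
}
```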

Example - Restrict Regions:
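A common region-restriction sketch — deny everything outside approved regions, exempting global services (the region list is illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideApprovedRegions",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "organizations:*",
        "sts:*",
        "support:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["us-east-1", "eu-west-1"]
        }
      }
    }
  ]
}
```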

Real Story - SCPs Saved Us:

At one company, a contractor accidentally created a 1TB EBS volume in ap-south-1. Why? They were used to working in ap-south-1 at their previous client.

Without region SCP: 1TB volume created in unsupported region, $120/month until discovered 6 months later = $720 wasted.

With region SCP: Request immediately denied. "Access Denied: Region ap-south-1 is not approved." $0 wasted.

SCPs are guardrails that prevent costly mistakes.

Azure Management Groups

Structure:
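A typical hierarchy, loosely following the Cloud Adoption Framework (names are illustrative):

```
Tenant Root Group
├── Platform MG          (identity, management, connectivity subscriptions)
├── Landing Zones MG
│   ├── Corp MG
│   └── Online MG
├── Sandbox MG
└── Decommissioned MG
```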

Azure Policy at Management Group Level:

Example - Require Encryption:
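A sketch of a custom policy rule denying storage accounts without infrastructure encryption (built-in policies cover most encryption requirements; the alias shown is one option):

```json
{
  "policyRule": {
    "if": {
      "allOf": [
        { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
        { "field": "Microsoft.Storage/storageAccounts/encryption.requireInfrastructureEncryption", "notEquals": true }
      ]
    },
    "then": { "effect": "deny" }
  }
}
```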

Example - Allowed Regions:
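The built-in "Allowed locations" policy does this; its rule boils down to something like the following (the region list is illustrative — the real built-in is parameterized):

```json
{
  "policyRule": {
    "if": {
      "not": { "field": "location", "in": ["eastus", "westeurope"] }
    },
    "then": { "effect": "deny" }
  }
}
```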

Terraform for Azure Policy Assignment:
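A sketch of assigning the built-in "Allowed locations" policy at a management group (the management group name is an assumption):

```hcl
resource "azurerm_management_group_policy_assignment" "allowed_locations" {
  name                 = "allowed-locations"
  management_group_id  = azurerm_management_group.landing_zones.id
  policy_definition_id = "/providers/Microsoft.Authorization/policyDefinitions/e56962a6-4747-49cd-b67b-bf8b01975c4c"

  parameters = jsonencode({
    listOfAllowedLocations = { value = ["eastus", "westeurope"] }
  })
}
```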

GCP Folder Hierarchies

Structure:
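A typical folder layout (names are illustrative):

```
Organization
├── fldr-platform        (networking, logging, shared services projects)
├── fldr-production
├── fldr-staging
└── fldr-development
```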

Organization Policies:

Example - Restrict External IPs:
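A sketch using the `constraints/compute.vmExternalIpAccess` constraint in Terraform:

```hcl
resource "google_organization_policy" "no_external_ips" {
  org_id     = var.org_id
  constraint = "constraints/compute.vmExternalIpAccess"

  list_policy {
    deny {
      all = true # no VM may have an external IP unless explicitly exempted
    }
  }
}
```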

Example - Allowed Regions:
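A sketch using `constraints/gcp.resourceLocations` (the `in:` value groups are shortcuts for US and EU locations):

```hcl
resource "google_organization_policy" "allowed_locations" {
  org_id     = var.org_id
  constraint = "constraints/gcp.resourceLocations"

  list_policy {
    allow {
      values = ["in:us-locations", "in:eu-locations"]
    }
  }
}
```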

Hierarchy Design Best Practices

Principle 1: Policy Inheritance

Policies flow downward and accumulate:

Principle 2: Start Restrictive, Selectively Allow

Principle 3: Separate Platform from Workloads

My Painful Lesson:

At one company, we put ALL accounts under the same OU with the same policies.

Problem: Network team needed to create VPCs, but SCP denied VPC creation (security team added this).

Workaround: Security team exempted specific IAM roles from SCP.

Result: Complex SCP with 40+ exemptions. Impossible to understand. Security gaps from over-complicated logic.

The Fix: Separate OUs for platform vs workloads:

  • Platform OU: Different policies (allows infrastructure provisioning)

  • Workload OU: Standard restrictive policies

Lesson: Design OUs around different policy needs, not arbitrary organizational structure.


Hub-and-Spoke Network Topology

Network architecture is the foundation of your landing zone. Get it wrong, and you'll spend years untangling it.

The Problem: Mesh Networking Doesn't Scale

Mesh (Point-to-Point) Peering:


Number of peering connections:

  • 4 accounts = 6 peering connections

  • 10 accounts = 45 peering connections

  • 50 accounts = 1,225 peering connections (!!)

  • 100 accounts = 4,950 peering connections (!!!)

Formula: n * (n-1) / 2

Problems:

  • ❌ Management nightmare

  • ❌ IP address conflicts

  • ❌ No centralized security inspection

  • ❌ Difficult to implement shared services

  • ❌ Scales as O(n²)

The Solution: Hub-and-Spoke

Hub-and-Spoke Topology:


Number of connections:

  • 4 accounts = 4 connections to hub

  • 10 accounts = 10 connections

  • 50 accounts = 50 connections

  • 100 accounts = 100 connections

Scales as O(n) instead of O(n²)!

AWS Implementation - Transit Gateway

Architecture:

Routing Rules:
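A sketch of the hub and one spoke attachment in Terraform, with a dedicated route table so prod spokes can't reach dev spokes (variable names are assumptions):

```hcl
resource "aws_ec2_transit_gateway" "hub" {
  description                     = "Landing zone hub"
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
}

# Attach a spoke VPC to the hub
resource "aws_ec2_transit_gateway_vpc_attachment" "prod_a" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  vpc_id             = var.prod_a_vpc_id
  subnet_ids         = var.prod_a_subnet_ids
}

# Environment-specific route table controls which spokes can talk
resource "aws_ec2_transit_gateway_route_table" "prod" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
}

resource "aws_ec2_transit_gateway_route_table_association" "prod_a" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.prod_a.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}
```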

Azure Implementation - Hub VNet with Peering

Architecture:
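A sketch of one hub-spoke peering pair in Terraform (network and resource group names are assumptions; peering is created in both directions):

```hcl
resource "azurerm_virtual_network_peering" "hub_to_spoke1" {
  name                      = "peer-hub-to-spoke1"
  resource_group_name       = var.hub_rg_name
  virtual_network_name      = azurerm_virtual_network.hub.name
  remote_virtual_network_id = azurerm_virtual_network.spoke1.id
  allow_forwarded_traffic   = true # let spoke traffic transit the hub firewall
}

resource "azurerm_virtual_network_peering" "spoke1_to_hub" {
  name                      = "peer-spoke1-to-hub"
  resource_group_name       = var.spoke1_rg_name
  virtual_network_name      = azurerm_virtual_network.spoke1.name
  remote_virtual_network_id = azurerm_virtual_network.hub.id
  allow_forwarded_traffic   = true
}
```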

Traffic Flow with Inspection

Spoke-to-Spoke Traffic (with Firewall Inspection):


User-Defined Routes (UDRs) Force Traffic Through Firewall:
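A sketch of a route table that sends all spoke egress through the hub firewall (the firewall IP is an assumption):

```hcl
resource "azurerm_route_table" "spoke" {
  name                = "rt-spoke-to-hub"
  location            = var.location
  resource_group_name = var.spoke_rg_name

  route {
    name                   = "default-via-firewall"
    address_prefix         = "0.0.0.0/0"
    next_hop_type          = "VirtualAppliance"
    next_hop_in_ip_address = "10.0.1.4" # Azure Firewall private IP (assumption)
  }
}

resource "azurerm_subnet_route_table_association" "spoke" {
  subnet_id      = var.spoke_subnet_id
  route_table_id = azurerm_route_table.spoke.id
}
```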

IP Address Planning

Critical for Success:

| Network Segment | CIDR Block | Addresses | Purpose |
| --- | --- | --- | --- |
| Hub VNet | 10.0.0.0/16 | 65,536 | Central connectivity |
| Production Spoke 1 | 10.1.0.0/16 | 65,536 | App A production |
| Production Spoke 2 | 10.2.0.0/16 | 65,536 | App B production |
| Staging Spoke | 10.10.0.0/16 | 65,536 | Staging environments |
| Development Spoke | 10.20.0.0/16 | 65,536 | Development sandboxes |
| Shared Services | 10.100.0.0/16 | 65,536 | DNS, AD, Monitoring |
| On-Premises | 192.168.0.0/16 | 65,536 | Corporate network |

Rules:

  • No overlaps: Each spoke has unique CIDR

  • Room to grow: Leave gaps for future spokes

  • Consistent sizing: Similar-sized blocks for easier management

  • Document everything: IP address registry is essential

My Painful Story:

At one company, they didn't plan IP addresses. Each team picked their own CIDRs.

Result:

  • 40% of VPCs used 10.0.0.0/24

  • Couldn't peer them (overlapping IPs)

  • Had to renumber 15 VPCs (months of work)

  • Broke applications during migration

  • Cost: 3 engineers × 4 months = $180,000

The lesson: Spend 1 day planning IP addresses to avoid months of rework.


Resource Organization Strategies

Beyond accounts/subscriptions, how do you organize resources within each account?

Tagging Strategy

Tags are metadata attached to resources. They enable:

  • Cost allocation

  • Resource discovery

  • Automation

  • Compliance tracking

Required Tags (Mandatory on All Resources):

| Tag Key | Values | Purpose | Example |
| --- | --- | --- | --- |
| Environment | prod, staging, dev | Environment classification | Environment=prod |
| CostCenter | Business unit code | Cost allocation | CostCenter=engineering |
| Owner | Email or team name | Accountability | |
| Application | Application name | Workload identification | Application=payment-api |
| DataClassification | public, internal, confidential, restricted | Security and compliance | DataClassification=confidential |
| Compliance | pci, hipaa, sox, none | Regulatory requirements | Compliance=pci |

Optional Tags (Recommended):

| Tag Key | Purpose | Example |
| --- | --- | --- |
| Project | Project tracking | Project=mobile-app-redesign |
| Terraform | IaC management | Terraform=true |
| BackupPolicy | Backup requirements | BackupPolicy=daily-7day-retention |
| ManagedBy | Automation tool | ManagedBy=terraform |
| CreatedDate | Resource creation tracking | CreatedDate=2024-01-15 |
| ExpirationDate | Cleanup automation | ExpirationDate=2024-06-30 |

Terraform - Enforce Tagging:
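One way to guarantee tags on everything Terraform manages is provider-level `default_tags` (a baseline rather than hard enforcement — pair it with an SCP or tag policy for that):

```hcl
provider "aws" {
  region = var.region

  default_tags {
    tags = {
      Environment = var.environment
      CostCenter  = var.cost_center
      Owner       = var.owner
      ManagedBy   = "terraform"
    }
  }
}
```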

Azure - Require tags via Policy:
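A sketch of a custom policy that denies resources missing a CostCenter tag (the built-in "Require a tag on resources" policy is the parameterized equivalent):

```json
{
  "policyRule": {
    "if": {
      "field": "tags['CostCenter']",
      "exists": "false"
    },
    "then": { "effect": "deny" }
  }
}
```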

Automated Tagging in Terraform:
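A common pattern is merging shared tags with per-resource ones (names are illustrative):

```hcl
locals {
  common_tags = {
    Environment = var.environment
    CostCenter  = var.cost_center
    Owner       = var.owner
  }
}

resource "aws_s3_bucket" "data" {
  bucket = "acme-${var.environment}-data"

  tags = merge(local.common_tags, {
    Application        = "data-pipeline"
    DataClassification = "confidential"
  })
}
```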

Naming Conventions

Consistent naming makes resources discoverable and reduces confusion.

Pattern: {scope}-{env}-{app}-{qualifier}, adapted per resource type (see the tables below).

Account/Subscription Names: {org}-{env}-{purpose} — for example, acme-prod-payments.

VPC/VNet Names: {env}-{purpose}-vpc — for example, prod-app-vpc.

Terraform - Enforce Naming via Validation:
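Variable validation blocks catch bad names at plan time (the regex is illustrative):

```hcl
variable "environment" {
  type = string

  validation {
    condition     = contains(["prod", "staging", "dev"], var.environment)
    error_message = "Environment must be prod, staging, or dev."
  }
}

variable "app_name" {
  type = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{2,20}$", var.app_name))
    error_message = "App name must be lowercase letters, digits, and hyphens."
  }
}
```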

Resource Group Strategy (Azure)

In Azure, Resource Groups are containers for related resources.

Pattern 1: One Resource Group per Environment per Application

Pattern 2: Separate by Resource Lifecycle

Terraform - Azure Resource Groups:
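A sketch of per-environment resource groups following the rg-{env}-{app}-{location} pattern:

```hcl
resource "azurerm_resource_group" "api" {
  for_each = toset(["prod", "staging", "dev"])

  name     = "rg-${each.key}-api-eastus"
  location = "eastus"

  tags = {
    Environment = each.key
    Application = "api"
  }
}
```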


Naming Conventions and Standards

Naming conventions are essential for resource discovery, automation, and reducing operational errors.

Core Principles

1. Predictable: Anyone should be able to guess the name format

2. Descriptive: Name should convey purpose

3. Unique: No name collisions

4. Sortable: Logical alphabetical ordering

5. Automatable: Easy to generate programmatically

Standard Naming Pattern

General Format: {resource-type}-{env}-{app}-{qualifier} (order and separators vary by provider — see the tables below).

Examples: rg-prod-api-eastus, vm-prod-web-01, prod-api-web-sg.

Cloud Provider Naming

AWS Resources:

| Resource | Pattern | Example |
| --- | --- | --- |
| S3 Bucket | {org}-{env}-{purpose}-{random} | acme-prod-data-lake-x7k2 |
| EC2 Instance | {env}-{app}-{role}-{az}-{num} | prod-api-web-1a-01 |
| RDS Instance | {env}-{app}-{purpose}-{num} | prod-api-db-01 |
| Lambda Function | {env}-{app}-{purpose} | prod-api-order-processor |
| VPC | {env}-{purpose}-vpc | prod-app-vpc |
| Security Group | {env}-{app}-{purpose}-sg | prod-api-web-sg |

Azure Resources:

| Resource | Pattern | Example |
| --- | --- | --- |
| Resource Group | rg-{env}-{app}-{location} | rg-prod-api-eastus |
| Virtual Machine | vm-{env}-{app}-{num} | vm-prod-web-01 |
| SQL Database | sqldb-{env}-{app} | sqldb-prod-orders |
| Storage Account | st{env}{app}{random} | stprodapix7k2 (no hyphens allowed) |
| App Service | app-{env}-{app} | app-prod-api |
| Key Vault | kv-{env}-{app} | kv-prod-api |

GCP Resources:

| Resource | Pattern | Example |
| --- | --- | --- |
| Project | {org}-{env}-{app} | acme-prod-api |
| Compute Instance | {env}-{app}-{role}-{num} | prod-api-web-01 |
| Cloud Storage Bucket | {org}-{env}-{purpose} | acme-prod-backups |
| Cloud SQL Instance | {env}-{app}-db-{num} | prod-api-db-01 |

Terraform - Automated Naming
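A sketch of generating names once and reusing them (prefix and resource names are illustrative):

```hcl
locals {
  name_prefix = "${var.environment}-${var.app_name}" # e.g. "prod-api"
}

resource "aws_security_group" "web" {
  name   = "${local.name_prefix}-web-sg" # "prod-api-web-sg"
  vpc_id = var.vpc_id
}

resource "random_id" "suffix" {
  byte_length = 2
}

resource "aws_s3_bucket" "data" {
  # globally unique bucket names need a random suffix
  bucket = "acme-${local.name_prefix}-data-${random_id.suffix.hex}"
}
```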


Tagging Strategy

Already covered extensively in Resource Organization Strategies section above.


Multi-Region Architecture

Design Patterns

Pattern 1: Active-Active (High Availability)

Deploy fully operational instances in multiple regions, load-balanced globally.

Use Cases:

  • Global SaaS applications

  • Maximum uptime requirements (99.99%+)

  • Minimize latency for global users

Trade-offs:

  • Higher cost (duplicate infrastructure)

  • Complex data synchronization

  • More operational complexity

Pattern 2: Active-Passive (Disaster Recovery)

Primary region active, secondary region on standby.

Use Cases:

  • Cost optimization

  • Regulatory requirements (data residency)

  • RTO 15-60 minutes acceptable

Trade-offs:

  • Lower cost (minimal standby capacity)

  • Longer recovery time

  • Periodic testing required

Pattern 3: Region-Specific Services

Different regions serve different purposes or customer segments.

Use Cases:

  • Data residency requirements (GDPR, China data laws)

  • Compliance segregation

  • Market-specific features

Implementation Considerations

Data Synchronization:

  • Synchronous replication: Zero data loss, higher latency

  • Asynchronous replication: Potential data loss, lower latency

  • Conflict resolution: Last-write-wins, vector clocks, application logic

DNS Routing:

  • Geolocation: Route based on user's geographic location

  • Latency-based: Route to lowest latency endpoint

  • Failover: Automatic failover to healthy region

  • Weighted: Control traffic distribution

Cost Optimization:

  • Cross-region data transfer: $0.02/GB (expensive at scale)

  • Replicate only essential data

  • Use content delivery networks (CDNs) for static content

  • Consider active-passive for non-critical workloads


Disaster Recovery Considerations

Recovery Objectives

RTO (Recovery Time Objective): How long can the business tolerate downtime?

RPO (Recovery Point Objective): How much data can the business afford to lose?

Example:

  • E-commerce site during holiday season: RTO=15 minutes, RPO=1 minute

  • Internal HR system: RTO=4 hours, RPO=24 hours

DR Strategies

Backup & Restore (Lowest Cost, Slowest Recovery)

  • RTO: Hours to days

  • RPO: Hours

  • Cost: Low (storage only)

  • Implementation: Automated backups, cross-region replication

Pilot Light (Minimal Core)

  • RTO: 10s of minutes

  • RPO: Minutes

  • Cost: Medium (core infrastructure running)

  • Implementation: Database replicas, AMIs ready, scale on demand

Warm Standby (Reduced Capacity)

  • RTO: Minutes

  • RPO: Seconds

  • Cost: Medium-High (scaled-down production)

  • Implementation: Minimal compute running, auto-scale on failover

Active-Active (Zero Downtime)

  • RTO: Seconds (automatic)

  • RPO: Near-zero

  • Cost: High (full duplication)

  • Implementation: Global load balancing, data replication

Testing DR

Monthly:

  • Restore backups to test environment

  • Verify backup integrity

  • Document restoration time

Quarterly:

  • Activate pilot light/warm standby

  • Test application functionality

  • Measure actual RTO/RPO

Annually:

  • Full DR failover simulation

  • Executive-level tabletop exercise

  • Update runbooks based on learnings


Scalability Patterns

Design for Growth

Start Small, Scale Incrementally:

  • Begin with 10-20 accounts

  • Establish patterns that work at 100+ accounts

  • Automate from day one (even if "overkill" initially)

Account Vending Automation

Self-Service Portal:

  • Team requests account via form

  • Automated approval workflow

  • Terraform provisions account + baseline

  • Account ready in 15 minutes

Benefits:

  • Platform team doesn't bottleneck growth

  • Consistent configuration every time

  • Scales to 100s of accounts without growing team

Infrastructure as Code Modularity

Terraform Module Structure:
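One layout that keeps modules reusable across accounts (directory names are assumptions):

```
terraform/
├── modules/
│   ├── account-baseline/    # IAM roles, CloudTrail, Config, budgets
│   ├── network-spoke/       # VPC, subnets, TGW attachment, routes
│   └── security-baseline/   # GuardDuty, logging, guardrail policies
└── live/
    ├── prod/
    ├── staging/
    └── dev/
```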

Module Reuse:
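Each environment instantiates shared modules with different inputs (module name and variables are illustrative):

```hcl
module "spoke_network" {
  source = "../../modules/network-spoke"

  environment = "prod"
  cidr_block  = "10.1.0.0/16"
  # plus any app-specific inputs
}
```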

Policy as Code

Manage all policies in Git, deploy via CI/CD.

Benefits:

  • Version control for policies

  • Code review for policy changes

  • Automated testing

  • Consistent deployment


Common Architecture Anti-Patterns

Anti-Pattern 1: Everything in One Account

Symptoms:

  • Production, staging, dev in one account

  • Multiple teams sharing same account

  • No blast radius isolation

Consequences:

  • Developer error affects production

  • Can't segregate costs

  • Compliance violations

  • One policy for all environments (conflicts)

Fix: Separate accounts by environment and team.

Anti-Pattern 2: No Automation

Symptoms:

  • Manual account setup (50-page runbook)

  • Clicking through console

  • Copy-paste configuration

Consequences:

  • Takes weeks to provision accounts

  • Inconsistent configuration

  • Human errors

  • Doesn't scale

Fix: Automate account provisioning with Terraform.

Anti-Pattern 3: Unplanned IP Addresses

Symptoms:

  • Teams pick their own CIDRs

  • Overlapping IP ranges

  • Can't peer VPCs

Consequences:

  • Network connectivity impossible

  • Have to renumber VPCs (months of work)

  • Applications break during migration

Fix: Centralized IP address management (IPAM).

Anti-Pattern 4: Security as Afterthought

Symptoms:

  • "We'll add MFA later"

  • Unencrypted databases

  • No logging

  • Weak IAM policies

Consequences:

  • Security incidents

  • Compliance failures

  • Audit findings

  • Difficult to retrofit security

Fix: Security baseline mandatory from day one.

Anti-Pattern 5: No DR Plan

Symptoms:

  • Backups disabled ("too expensive")

  • Single region deployment

  • No failover testing

Consequences:

  • Data loss in outages

  • Long recovery times

  • Business impact

  • Regulatory violations

Fix: Define RTO/RPO, implement appropriate DR strategy, test regularly.

Anti-Pattern 6: Over-Centralization

Symptoms:

  • Platform team controls everything

  • 2-week wait for account provisioning

  • Teams can't self-service

Consequences:

  • Platform team is bottleneck

  • Slow development velocity

  • Shadow IT (teams work around controls)

  • Frustrated developers

Fix: Self-service with guardrails (automated provisioning, policy enforcement).

Anti-Pattern 7: No Tagging

Symptoms:

  • Resources without tags

  • Can't identify ownership

  • Can't allocate costs

Consequences:

  • Surprise bills

  • Can't optimize spending

  • Can't find resources

  • No accountability

Fix: Mandatory tagging policy enforced via SCPs/Azure Policy.


What I Learned About Landing Zone Design

Lesson 1: Design for Future Scale

Don't optimize for 10 accounts if you'll have 100 in 2 years. Reorganization is expensive and disruptive.

Action: Design OU/MG structure for anticipated growth, even if "overkill" initially.

Lesson 2: Security Isn't Optional

Build security into the foundation. Retrofitting is 10x harder than building it in.

Action: Mandatory encryption, logging, MFA enforced via SCPs/Policies from day one.

Lesson 3: Automation Unlocks Scale

Manual processes break at 50 accounts. Automated processes scale to 1,000+.

Action: Invest in account vending automation early, even if it seems excessive initially.

Lesson 4: Network Planning is Critical

One day of IP planning saves months of renumbering later.

Action: Create IP address management plan before deploying first workload.

Lesson 5: Observability is Non-Negotiable

Centralized logging saved us during security incidents. It's not optional.

Action: Centralized logging to immutable storage, SIEM integration, CloudTrail/Activity Logs mandatory.

Lesson 6: Test Your DR

Untested DR is fantasy. Schedule regular DR tests.

Action: Monthly backup restoration, quarterly DR activation, annual full failover.

Lesson 7: Policy as Code

Manage governance as code for version control, review, and automation.

Action: All SCPs/Azure Policies in Git, deployed via Terraform/CI-CD.

Lesson 8: Balance Control and Autonomy

Too much control: slow, frustrated teams. Too much autonomy: chaos, security gaps.

Action: Self-service account provisioning with policy guardrails. Teams can provision, but can't bypass security.


Next Up: Identity, Access, and Security Foundations

In Article 3, we'll dive deep into IAM strategy, SSO implementation, zero-trust architecture, and building security into every layer of your landing zone.

Ready to secure your cloud? Let's go! 🔐
