Migration and Workload Onboarding

Article 11 of 12 in the Cloud Landing Zone Series

Introduction

Building a landing zone is only half the challenge - the real test is migrating existing workloads into it.

Across projects that moved applications from on-premises data centers and brownfield cloud accounts into proper landing zones, I've seen the same challenges come up repeatedly:

Underestimating application complexity and dependencies
Discovering undocumented dependencies during migration
Missing cutover windows due to unforeseen issues
Emergency rollbacks when migrations don't go as planned
Insufficient testing before production cutover

Migration is where technical plans meet operational reality. What looked straightforward on paper becomes complex when dealing with legacy applications, hidden dependencies, and business continuity requirements.

This article shares the migration patterns and strategies I've learned through hands-on experience - covering discovery and assessment, dependency mapping, migration patterns, wave planning, testing strategies, and how to execute successful cutovers with minimal risk.

Discovery and Assessment

Application Discovery

# scripts/discover_applications.py
import boto3
import json
from collections import defaultdict

def discover_aws_applications():
    """
    Discover all applications across existing AWS accounts
    """
    
    ec2 = boto3.client('ec2')
    rds = boto3.client('rds')
    elbv2 = boto3.client('elbv2')
    
    applications = defaultdict(lambda: {
        'compute': [],
        'databases': [],
        'load_balancers': [],
        'storage': [],
        'dependencies': []
    })
    
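    # Discover EC2 instances grouped by Application tag.
    # A paginator is used so accounts with many instances are fully covered.
    pages = ec2.get_paginator('describe_instances').paginate()
    for reservation in (r for page in pages for r in page['Reservations']):
        for instance in reservation['Instances']: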
            app_name = get_tag(instance, 'Application') or 'untagged'
            
            applications[app_name]['compute'].append({
                'type': 'EC2',
                'id': instance['InstanceId'],
                'instance_type': instance['InstanceType'],
                'vpc_id': instance.get('VpcId'),
                'subnet_id': instance.get('SubnetId'),
                'security_groups': [sg['GroupId'] for sg in instance.get('SecurityGroups', [])],
                'private_ip': instance.get('PrivateIpAddress'),
                'public_ip': instance.get('PublicIpAddress'),
                'tags': instance.get('Tags', [])
            })
    
    # Discover RDS databases
    databases = rds.describe_db_instances()
    for db in databases['DBInstances']:
        app_name = get_tag_rds(db, 'Application') or 'untagged'
        
        applications[app_name]['databases'].append({
            'type': 'RDS',
            'id': db['DBInstanceIdentifier'],
            'engine': db['Engine'],
            'size': db['DBInstanceClass'],
            'multi_az': db['MultiAZ'],
            'storage_encrypted': db['StorageEncrypted']
        })
    
    # Discover Load Balancers
    load_balancers = elbv2.describe_load_balancers()
    for lb in load_balancers['LoadBalancers']:
        app_name = get_tag_elb(lb, 'Application') or 'untagged'
        
        # Find target instances
        target_groups = elbv2.describe_target_groups(
            LoadBalancerArn=lb['LoadBalancerArn']
        )
        
        targets = []
        for tg in target_groups['TargetGroups']:
            health = elbv2.describe_target_health(
                TargetGroupArn=tg['TargetGroupArn']
            )
            targets.extend([t['Target']['Id'] for t in health['TargetHealthDescriptions']])
        
        applications[app_name]['load_balancers'].append({
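            # Map the API's type names; gateway load balancers are not NLBs
            'type': {'application': 'ALB', 'network': 'NLB', 'gateway': 'GWLB'}.get(lb['Type'], lb['Type']),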
            'arn': lb['LoadBalancerArn'],
            'dns_name': lb['DNSName'],
            'targets': targets
        })
    
    # Generate dependency map
    for app_name, app in applications.items():
        app['dependencies'] = discover_dependencies(app)
    
    return dict(applications)

def discover_dependencies(app):
    """Analyze network traffic to discover application dependencies"""
    
    # Use VPC Flow Logs to discover communication patterns
    # This would analyze flow logs to determine which applications communicate
    
    return []

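def get_tag(resource, key, tag_field='Tags'):
    """Get tag value from an AWS resource description"""
    for tag in resource.get(tag_field, []):
        if tag['Key'] == key:
            return tag['Value']
    return None

def get_tag_rds(db, key):
    """RDS returns tags under 'TagList' rather than 'Tags'"""
    return get_tag(db, key, tag_field='TagList')

def get_tag_elb(lb, key):
    """Load balancer tags are not part of describe_load_balancers; fetch them separately"""
    elbv2 = boto3.client('elbv2')
    response = elbv2.describe_tags(ResourceArns=[lb['LoadBalancerArn']])
    for description in response['TagDescriptions']:
        value = get_tag(description, key)
        if value:
            return value
    return None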

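# Export to JSON for migration planning
if __name__ == '__main__':
    applications = discover_aws_applications()
    with open('application_inventory.json', 'w') as f:
        # default=str keeps non-JSON values (such as datetimes) from failing the dump
        json.dump(applications, f, indent=2, default=str)

    print(f"Discovered {len(applications)} applications")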
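
The `discover_dependencies` stub above leaves the flow log analysis open. As a minimal sketch of what that analysis could look like, the function below reads records in the default (version 2) VPC Flow Log format and maps accepted flows between known private IPs to application-level dependencies. The function name, the `ip_to_app` lookup (which could be built from the `private_ip` values in the inventory), and the "lower port is the server" heuristic are all assumptions for illustration, not part of the original script.

```python
from collections import defaultdict

# Field positions in the default (version 2) VPC Flow Log record format:
# version account-id interface-id srcaddr dstaddr srcport dstport protocol
# packets bytes start end action log-status
SRCADDR, DSTADDR, SRCPORT, DSTPORT, ACTION = 3, 4, 5, 6, 12


def dependencies_from_flow_logs(records, ip_to_app):
    """Derive {app: [apps it depends on]} from flow log lines (hypothetical helper)."""
    dependencies = defaultdict(set)
    for record in records:
        fields = record.split()
        # Skip malformed lines, rejected traffic, and NODATA/SKIPDATA records
        if len(fields) < 14 or fields[ACTION] != "ACCEPT":
            continue
        src_app = ip_to_app.get(fields[SRCADDR])
        dst_app = ip_to_app.get(fields[DSTADDR])
        if src_app is None or dst_app is None or src_app == dst_app:
            continue
        src_port, dst_port = int(fields[SRCPORT]), int(fields[DSTPORT])
        # Heuristic (assumption): the side on the lower port is the server,
        # so the side on the higher, ephemeral port is the dependent client.
        if dst_port < src_port:
            dependencies[src_app].add(dst_app)
        elif src_port < dst_port:
            dependencies[dst_app].add(src_app)
    return {app: sorted(deps) for app, deps in dependencies.items()}
```

The returned lists could be assigned to each application's `dependencies` entry. The port heuristic will misread some protocols, so treat the output as a starting point for review rather than a final dependency map.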

Dependency Mapping

# scripts/analyze_dependencies.py
import networkx as nx
import matplotlib.pyplot as plt

def create_dependency_graph(applications):
    """Create dependency graph from discovered applications"""
    
    G = nx.DiGraph()
    
    # Add nodes (applications)
    for app_name in applications:
        G.add_node(app_name)
    
    # Add edges (dependencies)
    for app_name, app in applications.items():
        for dep in app.get('dependencies', []):
            G.add_edge(app_name, dep)
    
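    # Find migration waves (topological generations).
    # Edges point from an application to what it depends on, so the first
    # generation holds applications that nothing else depends on.
    try:
        migration_waves = list(nx.topological_generations(G))
        
        print("Migration Waves:")
        for i, wave in enumerate(migration_waves, 1):
            print(f"Wave {i}: {', '.join(wave)}")
        
        return migration_waves
    except nx.NetworkXUnfeasible:
        # Raised when the graph contains a cycle (it is not a NetworkXError subclass)
        print("ERROR: Circular dependencies detected!")
        cycles = list(nx.simple_cycles(G))
        print(f"Circular dependencies: {cycles}")
        return None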

def visualize_dependencies(applications):
    """Visualize application dependency graph"""
    
    G = nx.DiGraph()
    
    for app_name, app in applications.items():
        G.add_node(app_name)
        for dep in app.get('dependencies', []):
            G.add_edge(app_name, dep)
    
    pos = nx.spring_layout(G)
    nx.draw(G, pos, with_labels=True, node_color='lightblue', 
            node_size=2000, font_size=10, arrows=True)
    
    plt.savefig('dependency_graph.png')
    print("Dependency graph saved to dependency_graph.png")
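
If installing networkx is not an option, the wave calculation can be approximated with the standard library. The sketch below uses `graphlib.TopologicalSorter` (Python 3.9+) rather than the networkx approach above, and it orders waves so that dependencies come before the applications that use them; reverse the result if consumers should move first. The `plan_waves` name and the input shape (application name mapped to a list of dependency names) are assumptions for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def plan_waves(dependencies):
    """Group applications into waves, dependencies first (hypothetical helper).

    dependencies maps each application name to the names it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    try:
        sorter.prepare()
    except CycleError as exc:
        # The second element of args holds the nodes that form the cycle
        raise ValueError(f"Circular dependency detected: {exc.args[1]}") from exc

    waves = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies already placed
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Calling `plan_waves({"web": ["api"], "api": ["db", "cache"], "db": [], "cache": []})` groups `cache` and `db` into the first wave, followed by `api`, then `web`.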

Migration Patterns

Pattern 1: Lift-and-Shift (Rehost)

Use case: Legacy applications, tight migration timeline

# scripts/lift_and_shift.py
from datetime import datetime

import boto3

def migrate_ec2_lift_and_shift(source_instance_id, target_account_id, target_vpc_id):
    """
    Lift-and-shift migration using AMI
    """
    
    ec2_source = boto3.client('ec2')
    
    # 1. Create AMI from source instance
    print(f"Creating AMI from instance {source_instance_id}")
    ami = ec2_source.create_image(
        InstanceId=source_instance_id,
        Name=f'migration-{source_instance_id}-{datetime.now().strftime("%Y%m%d%H%M%S")}',
        Description='Migration AMI for lift-and-shift',
        NoReboot=True  # no downtime, but the image may capture an inconsistent file system
    )
    
    ami_id = ami['ImageId']
    print(f"AMI created: {ami_id}")
    
    # 2. Wait for AMI to be available
    waiter = ec2_source.get_waiter('image_available')
    waiter.wait(ImageIds=[ami_id])
    
    # 3. Share AMI with target account
    ec2_source.modify_image_attribute(
        ImageId=ami_id,
        LaunchPermission={
            'Add': [{'UserId': target_account_id}]
        }
    )
    print(f"AMI shared with account {target_account_id}")
    
    # 4. Copy AMI to target account
    # (This would be done by assuming role in target account)
    
    # 5. Launch instance in target account from AMI
    # (This would create EC2 instance in new VPC)
    
    return ami_id
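
Steps 4 and 5 are left as comments above. A minimal sketch of how they might look is shown below, assuming `ec2_target` is a boto3 EC2 client created with credentials for the target account (for example after assuming a role there). The function name and its parameters are hypothetical. Copying a shared AMI generally also requires access to its backing snapshots, and to the KMS key if they are encrypted, which this sketch does not set up.

```python
def launch_from_shared_ami(ec2_target, shared_ami_id, source_region,
                           subnet_id, security_group_ids, instance_type):
    """Copy a shared AMI into the target account and launch one instance from it.

    ec2_target is assumed to be a boto3 EC2 client for the target account.
    """
    # Step 4: copy the image so the target account owns an independent AMI
    copied = ec2_target.copy_image(
        Name=f"migrated-{shared_ami_id}",
        SourceImageId=shared_ami_id,
        SourceRegion=source_region,
    )
    target_ami_id = copied["ImageId"]

    # Wait until the copy is usable before launching from it
    ec2_target.get_waiter("image_available").wait(ImageIds=[target_ami_id])

    # Step 5: launch the instance into the landing zone subnet
    launched = ec2_target.run_instances(
        ImageId=target_ami_id,
        InstanceType=instance_type,
        SubnetId=subnet_id,
        SecurityGroupIds=security_group_ids,
        MinCount=1,
        MaxCount=1,
    )
    return launched["Instances"][0]["InstanceId"]
```

Passing the client in, rather than creating it inside the function, keeps the cross-account credential handling separate and makes the function easy to exercise with a stub.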

Pattern 2: Refactor (Re-architect)

Use case: Modernize to serverless, containers

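# Example: Migrate VM-based app to Lambda + API Gateway
import boto3

def migrate_to_serverless(source_app, target_account_id, target_subnet_ids,
                          target_security_group_id):
    """
    Refactor VM-based application to serverless architecture
    """
    
    # 1. Package application code and upload it to the artifacts bucket
    #    (package_lambda_code is a placeholder for your own build step)
    package_lambda_code(source_app)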
    
    # 2. Create Lambda function
    lambda_client = boto3.client('lambda')
    
    response = lambda_client.create_function(
        FunctionName=f"{source_app['name']}-api",
        Runtime='python3.11',
        Role=f"arn:aws:iam::{target_account_id}:role/LambdaExecutionRole",
        Handler='index.handler',
        Code={
            'S3Bucket': 'migration-artifacts',
            'S3Key': f"{source_app['name']}/lambda.zip"
        },
        Environment={
            'Variables': source_app['environment_variables']
        },
        VpcConfig={
            'SubnetIds': target_subnet_ids,
            'SecurityGroupIds': [target_security_group_id]
        }
    )
    
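    # 3. Create API Gateway (HTTP API quick create with the Lambda as target)
    apigw = boto3.client('apigatewayv2')
    
    api = apigw.create_api(
        Name=f"{source_app['name']}-api",
        ProtocolType='HTTP',
        Target=response['FunctionArn']
    )
    
    # Allow API Gateway to invoke the function; without this, requests fail
    lambda_client.add_permission(
        FunctionName=response['FunctionName'],
        StatementId='AllowApiGatewayInvoke',
        Action='lambda:InvokeFunction',
        Principal='apigateway.amazonaws.com'
    )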
    
    return {
        'lambda_arn': response['FunctionArn'],
        'api_endpoint': api['ApiEndpoint']
    }

Pattern 3: Replatform

Use case: Move to managed services (RDS, ElastiCache, etc.)

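import json

import boto3

def migrate_database_to_rds(source_db_server):
    """
    Migrate self-managed database to Amazon RDS

    Assumes target_security_group_id, target_subnet_group, kms_key_id and
    target_password are defined as module-level configuration, and that
    map_instance_class is a helper you provide.
    """
    
    rds = boto3.client('rds')
    
    # 1. Create RDS instance with same specs
    db = rds.create_db_instance(
        DBInstanceIdentifier=f"{source_db_server['name']}-rds",
        DBInstanceClass=map_instance_class(source_db_server['specs']),
        Engine='postgres',
        EngineVersion='15.3',
        MasterUsername='dbadmin',  # 'admin' is a reserved word in RDS for PostgreSQL
        MasterUserPassword=target_password,  # same secret the DMS target endpoint uses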
        AllocatedStorage=source_db_server['storage_gb'],
        VpcSecurityGroupIds=[target_security_group_id],
        DBSubnetGroupName=target_subnet_group,
        BackupRetentionPeriod=35,
        PreferredBackupWindow='03:00-04:00',
        PreferredMaintenanceWindow='Mon:04:00-Mon:05:00',
        MultiAZ=True,
        StorageEncrypted=True,
        KmsKeyId=kms_key_id,
        EnableCloudwatchLogsExports=['postgresql'],
        DeletionProtection=True
    )
    
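    # 2. Wait for DB to be available, then re-read it: the create response
    #    does not yet include the endpoint address
    db_id = db['DBInstance']['DBInstanceIdentifier']
    waiter = rds.get_waiter('db_instance_available')
    waiter.wait(DBInstanceIdentifier=db_id)
    db_instance = rds.describe_db_instances(DBInstanceIdentifier=db_id)['DBInstances'][0]
    
    # 3. Restore data using DMS (Database Migration Service)
    restore_data_via_dms(source_db_server, db_instance)
    
    return db_instance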

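def restore_data_via_dms(source, target):
    """
    Use AWS DMS for database migration

    Assumes security_group_id, subnet_group and target_password are
    defined as module-level configuration.
    """
    
    dms = boto3.client('dms')
    
    # Create replication instance
    replication_instance = dms.create_replication_instance(
        ReplicationInstanceIdentifier='migration-replication',
        ReplicationInstanceClass='dms.c5.large',
        AllocatedStorage=100,
        VpcSecurityGroupIds=[security_group_id],
        ReplicationSubnetGroupIdentifier=subnet_group
    )
    
    # The instance must be available before a task can be created on it
    dms.get_waiter('replication_instance_available').wait(
        Filters=[{
            'Name': 'replication-instance-arn',
            'Values': [replication_instance['ReplicationInstance']['ReplicationInstanceArn']]
        }]
    )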
    
    # Create source endpoint
    source_endpoint = dms.create_endpoint(
        EndpointIdentifier='source-database',
        EndpointType='source',
        EngineName='postgres',
        ServerName=source['hostname'],
        Port=5432,
        DatabaseName=source['database_name'],
        Username=source['username'],
        Password=source['password']
    )
    
    # Create target endpoint
    target_endpoint = dms.create_endpoint(
        EndpointIdentifier='target-rds',
        EndpointType='target',
        EngineName='postgres',
        ServerName=target['Endpoint']['Address'],
        Port=5432,
        # No DBName was set at creation, so fall back to the default database
        DatabaseName=target.get('DBName', 'postgres'),
        Username=target['MasterUsername'],
        Password=target_password
    )
    
    # Create replication task
    task = dms.create_replication_task(
        ReplicationTaskIdentifier='db-migration-task',
        SourceEndpointArn=source_endpoint['Endpoint']['EndpointArn'],
        TargetEndpointArn=target_endpoint['Endpoint']['EndpointArn'],
        ReplicationInstanceArn=replication_instance['ReplicationInstance']['ReplicationInstanceArn'],
        MigrationType='full-load-and-cdc',  # Full load + change data capture
        TableMappings=json.dumps({
            'rules': [{
                'rule-type': 'selection',
                'rule-id': '1',
                'rule-name': '1',
                'object-locator': {
                    'schema-name': '%',
                    'table-name': '%'
                },
                'rule-action': 'include'
            }]
        })
    )
    
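    # Start replication once the task has finished initializing
    task_arn = task['ReplicationTask']['ReplicationTaskArn']
    dms.get_waiter('replication_task_ready').wait(
        Filters=[{'Name': 'replication-task-arn', 'Values': [task_arn]}]
    )
    dms.start_replication_task(
        ReplicationTaskArn=task_arn,
        StartReplicationTaskType='start-replication'
    )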
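
Before scheduling the final sync, it helps to confirm that the full load has finished and the task is still running in CDC mode. The sketch below is one hedged way to check that, assuming `dms` is a boto3 DMS client and `task_arn` is the ARN returned when the task was created. The `ready_for_cutover` name and the readiness criteria are assumptions; replication lag (for example the `CDCLatencySource` and `CDCLatencyTarget` CloudWatch metrics) would still need to be checked separately.

```python
def ready_for_cutover(dms, task_arn):
    """Return (ready, details) for a full-load-and-cdc task (hypothetical helper).

    dms is assumed to be a boto3 DMS client.
    """
    response = dms.describe_replication_tasks(
        Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
    )
    task = response["ReplicationTasks"][0]
    stats = task.get("ReplicationTaskStats", {})

    details = {
        "status": task["Status"],
        "full_load_percent": stats.get("FullLoadProgressPercent", 0),
        "tables_errored": stats.get("TablesErrored", 0),
    }
    # Assumed criteria: task still running (applying changes), full load
    # complete, and no tables in an error state
    ready = (
        details["status"] == "running"
        and details["full_load_percent"] == 100
        and details["tables_errored"] == 0
    )
    return ready, details
```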

Migration Wave Planning

Wave Strategy

Wave 1 (Week 1): Non-production applications with no dependencies
  - Development environments
  - Test applications
  - Internal tools

Wave 2 (Week 2): Non-critical production with minimal dependencies
  - Monitoring dashboards
  - Logging aggregators
  - Internal APIs

Wave 3 (Week 3): Supporting production services
  - Authentication services
  - Session storage (Redis)
  - Background job processors

Wave 4 (Week 4): Critical production applications
  - Payment processing
  - User-facing APIs
  - Core business logic

Wave 5 (Week 5): Data tier
  - Production databases (with CDC replication)
  - Data warehouses
  - Analytics platforms

Cutover Checklist

# cutover_checklist.yml
pre_cutover:
  - name: "Verify application inventory"
    owner: "Platform Team"
    deadline: "3 days before cutover"
  
  - name: "Backup all source systems"
    owner: "Operations Team"
    deadline: "1 day before cutover"
  
  - name: "Freeze code deployments"
    owner: "Engineering Team"
    deadline: "12 hours before cutover"
  
  - name: "Communicate maintenance window"
    owner: "Product Team"
    deadline: "7 days before cutover"

cutover:
  - name: "Stop source application"
    duration: "5 minutes"
    owner: "Operations Team"
  
  - name: "Final data sync"
    duration: "15-30 minutes"
    owner: "DBA Team"
  
  - name: "Update DNS records"
    duration: "5 minutes"
    owner: "Network Team"
  
  - name: "Start target application"
    duration: "10 minutes"
    owner: "Operations Team"
  
  - name: "Smoke tests"
    duration: "15 minutes"
    owner: "QA Team"
  
  - name: "Monitor for errors"
    duration: "2 hours"
    owner: "SRE Team"

post_cutover:
  - name: "Validate data integrity"
    owner: "DBA Team"
    deadline: "24 hours after cutover"
  
  - name: "Performance testing"
    owner: "SRE Team"
    deadline: "48 hours after cutover"
  
  - name: "Decommission source systems"
    owner: "Operations Team"
    deadline: "30 days after cutover"
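
The durations in the cutover section add up to the maintenance window you need to communicate. As a small worked example, the sketch below parses the duration strings from the checklist above and sums the best and worst cases. The step list is copied from the checklist; the function names are assumptions for illustration.

```python
import re

# Steps and durations copied from the cutover section of the checklist
CUTOVER_STEPS = [
    ("Stop source application", "5 minutes"),
    ("Final data sync", "15-30 minutes"),
    ("Update DNS records", "5 minutes"),
    ("Start target application", "10 minutes"),
    ("Smoke tests", "15 minutes"),
    ("Monitor for errors", "2 hours"),
]


def parse_duration(text):
    """Convert '5 minutes', '15-30 minutes' or '2 hours' to (min, max) minutes."""
    match = re.fullmatch(r"(\d+)(?:-(\d+))?\s+(minute|hour)s?", text.strip())
    if match is None:
        raise ValueError(f"Unrecognized duration: {text!r}")
    low = int(match.group(1))
    high = int(match.group(2) or low)
    factor = 60 if match.group(3) == "hour" else 1
    return low * factor, high * factor


def cutover_window(steps):
    """Sum the best-case and worst-case minutes across all steps."""
    bounds = [parse_duration(duration) for _, duration in steps]
    return sum(low for low, _ in bounds), sum(high for _, high in bounds)
```

For the checklist above, `cutover_window(CUTOVER_STEPS)` gives 170 to 185 minutes in total. Excluding the two-hour monitoring step, the steps up to and including smoke tests take 50 to 65 minutes.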

Rollback Strategy

Automated Rollback

# scripts/rollback.py
def rollback_migration(migration_id):
    """
    Automated rollback procedure

    The helpers called here (get_migration_metadata, update_dns_records,
    stop_application, start_application, verify_health, escalate_to_oncall,
    notify_rollback) are placeholders for environment-specific tooling.
    """
    
    print(f"ROLLBACK: Starting rollback for migration {migration_id}")
    
    # 1. Retrieve migration metadata
    print("Step 1: Retrieving migration metadata")
    migration = get_migration_metadata(migration_id)
    
    # 2. Redirect traffic back to source
    print("Step 2: Redirecting traffic to source systems")
    update_dns_records(migration['dns_records'], rollback=True)
    
    # 3. Stop target application
    print("Step 3: Stopping target application")
    stop_application(migration['target_resources'])
    
    # 4. Restart source application
    print("Step 4: Restarting source application")
    start_application(migration['source_resources'])
    
    # 5. Verify source is healthy
    print("Step 5: Verifying source health")
    health_check_passed = verify_health(migration['source_health_check_url'])
    
    if health_check_passed:
        print("✓ Rollback successful - source application healthy")
    else:
        print("✗ Rollback failed - escalating to on-call")
        escalate_to_oncall(migration_id)
    
    # 6. Notify stakeholders
    notify_rollback(migration_id, health_check_passed)
    
    return health_check_passed
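
The rollback script relies on a `verify_health` helper that is not shown. A minimal sketch of one possible implementation follows, using only the standard library and retrying a few times so a slow restart is not mistaken for a failed rollback. The retry counts, the expectation of an HTTP 200, and the injectable `opener` parameter are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def verify_health(url, attempts=5, delay_seconds=10, timeout=5,
                  opener=urllib.request.urlopen):
    """Return True once the health endpoint answers with HTTP 200.

    opener is injectable so the check can be exercised without a network.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or an HTTP error status: retry
            pass
        if attempt < attempts:
            time.sleep(delay_seconds)
    return False
```

With the defaults, a source system that never recovers is reported as unhealthy after roughly a minute, which leaves room inside a 15-minute rollback budget.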

What I Learned About Migration

Lesson 1: Discovery Always Takes Longer Than Planned

Budget 2-3x your estimated discovery time. Unknown dependencies appear mid-migration.

Action: Automated discovery tools, network traffic analysis, 4 weeks minimum for discovery.

Lesson 2: Dependencies Are Never Fully Documented

Documentation lies. Network traffic analysis reveals true dependencies.

Action: VPC Flow Log analysis, packet capture, dependency mapping tools.

Lesson 3: Migrate in Small Batches

Migrating 50 applications at once = guaranteed failure. Migrate 5 at a time.

Action: Wave-based migration, 1-2 week waves, validate each wave before next.
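
As a rough sketch of what "validate each wave before next" can look like in code, the function below splits a wave into batches of five and stops at the first batch that fails validation. The `migrate` and `validate` callables are hypothetical hooks for your own tooling, and the batch size of five simply mirrors the number above.

```python
def split_into_batches(apps, size=5):
    """Split a list of application names into consecutive batches."""
    return [apps[i:i + size] for i in range(0, len(apps), size)]


def run_wave(apps, migrate, validate, batch_size=5):
    """Migrate batch by batch; stop as soon as a batch fails validation.

    Returns (completed, failed): applications in fully validated batches,
    and the applications that failed validation in the batch that stopped
    the wave (empty if the whole wave succeeded).
    """
    completed = []
    for batch in split_into_batches(apps, batch_size):
        for app in batch:
            migrate(app)
        failed = [app for app in batch if not validate(app)]
        if failed:
            # Do not start the next batch until this one is fixed or rolled back
            return completed, failed
        completed.extend(batch)
    return completed, []
```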

Lesson 4: Always Have a Rollback Plan

If you can't roll back in under 15 minutes, don't start the migration.

Action: Documented rollback procedures, automated rollback scripts, tested before cutover.

Lesson 5: Communication Prevents Panic

Stakeholders panic when migrations exceed maintenance windows without updates.

Action: Real-time status updates, Slack channel for migration, hourly updates during cutover.

Lesson 6: Data Migration is Always the Bottleneck

Application migration: hours. Data migration: days/weeks.

Action: AWS DMS with CDC, parallel data sync, minimize downtime to final sync only.

Lesson 7: Testing After Migration is Critical

"It works in staging" ≠ "It works in production with real traffic"

Action: Smoke tests, performance tests, gradual traffic shifting (canary deployments).
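
One way to shift traffic gradually is with Route 53 weighted records. The sketch below only builds the change batch, which would then be passed to the Route 53 `change_resource_record_sets` call; the function name, record names, and the choice of CNAME records are assumptions, and a zone apex would need alias records instead.

```python
def weighted_change_batch(record_name, source_dns, target_dns,
                          target_percent, ttl=60):
    """Build a Route 53 ChangeBatch sending target_percent of traffic to the target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")

    def weighted_record(set_identifier, value, weight):
        # Two records share a name and are told apart by SetIdentifier
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_identifier,
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": value}],
            },
        }

    return {
        "Comment": f"Canary shift: {target_percent}% to target",
        "Changes": [
            weighted_record("source", source_dns, 100 - target_percent),
            weighted_record("target", target_dns, target_percent),
        ],
    }
```

A typical progression might be 10, 25, 50, then 100 percent, with smoke tests and error-rate checks between steps. Setting the target weight back to 0 doubles as a fast rollback path, subject to the record TTL.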

Next: Real-World Production Example - Complete end-to-end landing zone implementation with full Terraform code, architecture diagrams, and lessons learned.


Last updated 1 month ago
