AAP Production Best Practices and Enterprise Deployment

The Production Deployment That Went Wrong (Then Right)

Year 1: Single AAP controller managing 500 servers. Load average: 2.5. Life: Good.

Year 2: Growth to 2,000 servers across 4 regions. Same single controller. Load average: 18. Database locks. Timeout errors. Angry users. On-call pages at 3 AM. Life: Not good.

The re-architecture: 3-node HA cluster, dedicated automation mesh nodes in each region, PostgreSQL clustering, Redis for caching, proper backup/DR strategy, comprehensive monitoring.

Result after re-architecture:

Load average: 18 → 3.5 (while managing four times as many servers as in Year 1)
Job execution time: 15 minutes → 4 minutes average
Availability: 97.3% → 99.94%
Database deadlocks: 47 per day → 0 per month
3 AM pages: 28 per month → 1 per year
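
To put the availability figures in perspective, the percentages can be converted into hours of downtime per year. This is a back-of-the-envelope sketch; the 8,760-hour year and the function name are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


before = downtime_hours_per_year(97.3)
after = downtime_hours_per_year(99.94)
print(f"Before: {before:.1f} h/year, after: {after:.1f} h/year")
```

Under these assumptions, 97.3% availability allows roughly 237 hours of downtime per year, while 99.94% allows about 5 hours.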

This article teaches you how to deploy AAP at enterprise scale from day one.

What You'll Learn

High Availability (HA) architecture patterns
Disaster Recovery (DR) strategies
Security hardening and compliance
Performance optimization and scaling
Backup and restore procedures
Monitoring and observability
Capacity planning
Migration strategies

Enterprise Architecture Patterns

Pattern 1: High Availability Cluster (Up to 5,000 Nodes)

Architecture:
  Automation Controller: 3 nodes (active-active-active)
  PostgreSQL: 3 nodes (Primary + 2 replicas)
  Redis: 3 nodes (cluster mode)
  Load Balancer: HAProxy or cloud LB
  
Capacity:
  - 5,000 managed nodes
  - 500 concurrent jobs
  - 50 API requests/second

High Availability Configuration

PostgreSQL HA with Patroni

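# Patroni configuration for PostgreSQL HA
# Rendered as an Ansible template: the Jinja2 password placeholders are filled from vaulted variables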
---
scope: aap-postgres-cluster
namespace: /db/
name: postgres-node-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: postgres-node-1.example.com:8008

etcd:
  hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 500
        shared_buffers: 4GB
        effective_cache_size: 12GB
        maintenance_work_mem: 1GB
        checkpoint_completion_target: 0.9
        wal_buffers: 16MB
        default_statistics_target: 100
        random_page_cost: 1.1
        effective_io_concurrency: 200
        work_mem: 10MB
        min_wal_size: 1GB
        max_wal_size: 4GB

postgresql:
  listen: 0.0.0.0:5432
  connect_address: postgres-node-1.example.com:5432
  data_dir: /var/lib/postgresql/15/main
  pgpass: /tmp/pgpass
  authentication:
    replication:
      username: replicator
      password: "{{ vault_replication_password }}"
    superuser:
      username: postgres
      password: "{{ vault_postgres_password }}"
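
As a rough sanity check on the memory parameters above, the following sketch adds up a simple estimate of PostgreSQL memory use. The formula is a simplifying assumption (a single query can use work_mem more than once, and the operating system needs headroom too), and the function name is hypothetical.

```python
def estimate_postgres_memory_mb(
    shared_buffers_mb: int,
    work_mem_mb: int,
    max_connections: int,
    maintenance_work_mem_mb: int,
) -> int:
    """Rough estimate: shared buffers + one work_mem per connection + maintenance memory."""
    return shared_buffers_mb + work_mem_mb * max_connections + maintenance_work_mem_mb


# Values taken from the Patroni parameters above (4GB, 10MB, 500 connections, 1GB)
estimate_mb = estimate_postgres_memory_mb(4096, 10, 500, 1024)
print(f"Estimated memory: {estimate_mb} MB (~{estimate_mb / 1024:.1f} GB)")
```

With these values the estimate comes to about 9.9 GB, so the database host needs comfortably more RAM than that if every connection is busy at once.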

Redis Cluster Configuration

# Redis cluster for AAP caching
---
- name: Configure Redis cluster
  hosts: redis_nodes
  become: true
  
  tasks:
    - name: Install Redis
      ansible.builtin.package:
        name: redis
        state: present
    
    - name: Configure Redis cluster mode
      ansible.builtin.template:
        src: redis-cluster.conf.j2
        dest: /etc/redis/redis.conf
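      notify: Restart Redis

  handlers:
    # Handler referenced by the template task above
    - name: Restart Redis
      ansible.builtin.service:
        name: redis
        state: restarted

# redis-cluster.conf.j2
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
appendonly yes
bind 0.0.0.0
protected-mode yes
requirepass {{ redis_password }}
# Replicas need the same password to authenticate to their primary
masterauth {{ redis_password }}
maxmemory 4gb
maxmemory-policy allkeys-lru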

HAProxy Load Balancer

# HAProxy configuration for AAP
global
  log /dev/log local0
  maxconn 4096
  user haproxy
  group haproxy
  daemon

defaults
  log global
  mode http
  option httplog
  option dontlognull
  timeout connect 5000ms
  timeout client  50000ms
  timeout server  50000ms

frontend aap_frontend
  bind *:443 ssl crt /etc/ssl/certs/aap.pem
  default_backend aap_controllers

  # Health check endpoint
  acl health_check path /api/v2/ping/
  use_backend aap_controllers if health_check

backend aap_controllers
  balance roundrobin
  # A controller stays in rotation only while the ping endpoint returns 200
  option httpchk GET /api/v2/ping/
  http-check expect status 200

  # "verify none" skips certificate validation; use "verify required" with a CA file in production
  server controller-1 10.0.1.10:443 check ssl verify none
  server controller-2 10.0.1.11:443 check ssl verify none
  server controller-3 10.0.1.12:443 check ssl verify none

# Statistics interface
listen stats
  bind *:8404
  stats enable
  stats uri /stats
  stats refresh 30s
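
The health check that HAProxy performs can be approximated in a few lines, which is handy when debugging why a controller was taken out of rotation. This is a minimal sketch, not HAProxy's implementation: the function names are hypothetical, and certificate verification is disabled only to mirror the "verify none" setting above.

```python
import ssl
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    context = ssl.create_default_context()
    context.check_hostname = False       # mirrors HAProxy's "verify none"
    context.verify_mode = ssl.CERT_NONE  # do not use in production
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but not with a success status
    except OSError:
        return 0  # connection refused, timeout, DNS failure, and similar


def healthy_backends(servers: Iterable[str], check: Callable[[str], int] = probe) -> List[str]:
    """Keep only the servers whose ping endpoint answers with HTTP 200."""
    return [s for s in servers if check(f"https://{s}/api/v2/ping/") == 200]
```

Calling `healthy_backends(["10.0.1.10:443", "10.0.1.11:443", "10.0.1.12:443"])` would probe the real controllers. Passing a fake `check` function makes it possible to exercise the filtering logic without any network access.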

Disaster Recovery Strategy

Backup Configuration

---
- name: AAP backup automation
  hosts: primary_controller
  become: true
  
  vars:
    backup_dir: /backup/aap
    retention_days: 30
  
  tasks:
    # Backup PostgreSQL database
    - name: Create database backup
      ansible.builtin.command: >
        pg_dump -h {{ database_host }} 
        -U {{ database_user }} 
        -d {{ database_name }} 
        -F c 
        -f {{ backup_dir }}/aap_db_{{ ansible_date_time.iso8601 }}.dump
      environment:
        PGPASSWORD: "{{ database_password }}"
    
    # Backup configuration files
    - name: Backup AAP configuration
      ansible.builtin.archive:
        path:
          - /etc/tower
          - /etc/ansible-automation-platform
        dest: "{{ backup_dir }}/aap_config_{{ ansible_date_time.iso8601 }}.tar.gz"
    
    # Backup projects
    - name: Backup project files
      ansible.builtin.archive:
        path: /var/lib/awx/projects
        dest: "{{ backup_dir }}/aap_projects_{{ ansible_date_time.iso8601 }}.tar.gz"
    
    # Backup custom execution environments
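    - name: Export execution environment images
      # Image names contain "/" and ":", so sanitize them before using them in a filename
      ansible.builtin.command: >
        podman save
        -o {{ backup_dir }}/ee_{{ item | regex_replace('[/:]', '_') }}_{{ ansible_date_time.iso8601 }}.tar
        {{ item }}
      loop: "{{ custom_execution_environments }}"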
    
    # Upload to S3/Azure Blob
    - name: Upload backup to cloud storage
      amazon.aws.s3_object:
        bucket: aap-backups
        object: "{{ ansible_date_time.date }}/{{ item | basename }}"
        src: "{{ item }}"
        mode: put
        encrypt: true
      loop:
        - "{{ backup_dir }}/aap_db_{{ ansible_date_time.iso8601 }}.dump"
        - "{{ backup_dir }}/aap_config_{{ ansible_date_time.iso8601 }}.tar.gz"
        - "{{ backup_dir }}/aap_projects_{{ ansible_date_time.iso8601 }}.tar.gz"
    
    # Cleanup old backups
    - name: Remove backups older than retention period
      ansible.builtin.find:
        paths: "{{ backup_dir }}"
        age: "{{ retention_days }}d"
        recurse: true
      register: old_backups
    
    - name: Delete old backup files
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_backups.files }}"
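
The retention step at the end of the playbook boils down to comparing each file's modification time against a cutoff. The sketch below shows that logic on its own, assuming modification time is the right age measure; the function name and the `now` parameter are hypothetical and exist only to make the logic easy to test.

```python
import time
from pathlib import Path
from typing import List, Optional


def expired_backups(backup_dir: str, retention_days: int, now: Optional[float] = None) -> List[Path]:
    """Return files under backup_dir whose modification time is older than the retention period."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 24 * 60 * 60  # retention period in seconds
    return sorted(
        path
        for path in Path(backup_dir).rglob("*")  # recurse like the find task does
        if path.is_file() and path.stat().st_mtime < cutoff
    )
```

Listing the candidates first and deleting them in a separate step, as the playbook does, leaves room for a dry run before anything is removed.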

Disaster Recovery Playbook

---
- name: AAP disaster recovery
  hosts: dr_controller
  become: true
  
  tasks:
    # Restore database
    # The object keys below assume that a copy of the most recent backup is
    # published under latest/ with these fixed names.
    - name: Download database backup from S3
      amazon.aws.s3_object:
        bucket: aap-backups
        object: "latest/aap_db_backup.dump"
        dest: /tmp/aap_db_restore.dump
        mode: get
    
    - name: Restore PostgreSQL database
      ansible.builtin.command: >
        pg_restore -h {{ dr_database_host }}
        -U {{ database_user }}
        -d {{ database_name }}
        --clean --if-exists
        /tmp/aap_db_restore.dump
      environment:
        PGPASSWORD: "{{ database_password }}"
    
    # Restore configuration
    - name: Download configuration backup
      amazon.aws.s3_object:
        bucket: aap-backups
        object: "latest/aap_config_backup.tar.gz"
        dest: /tmp/aap_config_restore.tar.gz
        mode: get
    
    - name: Extract configuration files
      ansible.builtin.unarchive:
        src: /tmp/aap_config_restore.tar.gz
        dest: /
        remote_src: true
    
    # Restart services
    - name: Restart AAP services
      ansible.builtin.systemd:
        name: "{{ item }}"
        state: restarted
      loop:
        - automation-controller
        - receptor
    
    # Verify restoration
    - name: Verify AAP health
      ansible.builtin.uri:
        url: https://{{ ansible_host }}/api/v2/ping/
        validate_certs: false
      register: health_check
      retries: 5
      delay: 10
      until: health_check.status == 200

Security Hardening

SSL/TLS Configuration

---
- name: Configure SSL/TLS for AAP
  hosts: aap_controllers
  become: true
  
  tasks:
    - name: Install SSL certificate
      ansible.builtin.copy:
        src: "{{ item.src }}"
        dest: "{{ item.dest }}"
        mode: '0600'
      loop:
        - { src: 'certs/aap.crt', dest: '/etc/tower/tower.cert' }
        - { src: 'certs/aap.key', dest: '/etc/tower/tower.key' }
    
    - name: Configure strong TLS ciphers
      ansible.builtin.lineinfile:
        path: /etc/nginx/nginx.conf
        # Directives in nginx.conf are usually indented, so allow leading whitespace
        regexp: '^\s*ssl_ciphers'
        line: 'ssl_ciphers HIGH:!aNULL:!MD5:!RC4;'
    
    - name: Enable HSTS
      ansible.builtin.lineinfile:
        path: /etc/nginx/nginx.conf
        insertafter: 'ssl_ciphers'
        line: 'add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;'

RBAC Security Best Practices

# Least privilege organization design
Organizations:
  Production:
    Teams:
      - Platform Team: Admin rights
      - App Team: Execute only
    
    Credentials:
      - SSH Keys: Encrypted, rotated quarterly
      - Cloud Credentials: Assume role (not static keys)
      - Vault Integration: Dynamic secrets
  
  Development:
    Teams:
      - Developers: Admin rights (dev only)
    
    Isolated from Production:
      - Separate credentials
      - Separate inventories
      - Different execution nodes

Audit Logging

---
- name: Configure comprehensive audit logging
  hosts: aap_controllers
  become: true
  
  tasks:
    - name: Enable AAP activity stream
      ansible.builtin.lineinfile:
        path: /etc/tower/settings.py
        regexp: '^ACTIVITY_STREAM_ENABLED'
        line: 'ACTIVITY_STREAM_ENABLED = True'
    
    - name: Configure log forwarding to SIEM
      ansible.builtin.template:
        src: rsyslog-aap.conf.j2
        dest: /etc/rsyslog.d/aap.conf
      notify: Restart rsyslog

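# rsyslog-aap.conf.j2
# Forward AAP logs to Splunk/ELK over TCP
*.* @@siem.example.com:514

# Application logs: read the log files so they are forwarded as well
module(load="imfile")
input(type="imfile"
      File="/var/log/tower/*.log"
      Tag="aap-controller:"
      Severity="info"
      Facility="local0")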

Performance Optimization

Database Performance Tuning

-- PostgreSQL performance tuning for AAP

-- Index optimization (CREATE INDEX CONCURRENTLY cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY idx_main_job_created 
  ON main_job(created);

CREATE INDEX CONCURRENTLY idx_main_jobevent_job_id 
  ON main_jobevent(job_id);

-- Vacuum configuration
ALTER TABLE main_job SET (autovacuum_vacuum_scale_factor = 0.05);
ALTER TABLE main_jobevent SET (autovacuum_vacuum_scale_factor = 0.02);

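-- Connection and memory limits
-- max_connections and shared_buffers take effect only after a PostgreSQL restart.
-- In a Patroni-managed cluster, change them through Patroni so every node agrees.
ALTER SYSTEM SET max_connections = 500;
ALTER SYSTEM SET shared_buffers = '8GB';
ALTER SYSTEM SET effective_cache_size = '24GB';

-- Query performance
ALTER SYSTEM SET random_page_cost = 1.1;  -- SSD storage
ALTER SYSTEM SET effective_io_concurrency = 200;

-- Applies the reloadable settings; the restart-only ones above still need a restart
SELECT pg_reload_conf();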

Controller Performance Settings

# /etc/tower/conf.d/performance.py

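# Job settings
# AWX_TASK_ENV holds environment variables passed to job runs, so confirm that
# your controller version actually reads these keys before relying on them.
AWX_TASK_ENV['JOB_EVENT_BUFFER_SECONDS'] = 1
AWX_TASK_ENV['MAX_CONCURRENT_JOBS'] = 200

# API rate limiting
# Set only the throttle rates: reassigning REST_FRAMEWORK as a whole would
# discard the controller's other Django REST Framework defaults. The rates
# apply only when matching throttle classes are enabled.
REST_FRAMEWORK['DEFAULT_THROTTLE_RATES'] = {
    'anon': '100/hour',
    'user': '1000/hour',
}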

# Websocket settings
CHANNEL_LAYERS = {
    'default': {
        'BACKEND': 'channels_redis.core.RedisChannelLayer',
        'CONFIG': {
            'hosts': [
                ('redis-1.example.com', 6379),
                ('redis-2.example.com', 6379),
                ('redis-3.example.com', 6379),
            ],
            'capacity': 10000,
            'expiry': 10,
        },
    },
}

# Caching
# Redis cluster mode supports only database 0, so the URL must end in /0
CACHES = {
    'default': {
        'BACKEND': 'django_redis.cache.RedisCache',
        'LOCATION': 'redis://redis-cluster:6379/0',
        'OPTIONS': {
            'CLIENT_CLASS': 'django_redis.client.DefaultClient',
            'CONNECTION_POOL_KWARGS': {'max_connections': 50}
        }
    }
}

Execution Environment Optimization

# execution-environment.yml - Optimized EE
---
version: 3

images:
  base_image:
    name: registry.redhat.io/ansible-automation-platform-24/ee-minimal-rhel9:latest

dependencies:
  galaxy: requirements.yml
  python: requirements.txt
  system: bindep.txt

additional_build_steps:
  prepend_galaxy:
    - RUN ansible-galaxy collection install community.general --force
  
  prepend_final: |
    # Optimize Python packages
    RUN pip install --no-cache-dir --upgrade pip

    # Remove bytecode caches in a single layer
    # (adjust the path to the Python version in your base image)
    RUN find /usr/local/lib/python3.9/site-packages -name "__pycache__" -type d -prune -exec rm -rf {} +

    # Set timezone
    ENV TZ=UTC

options:
  package_manager_path: /usr/bin/microdnf

Monitoring and Observability

Prometheus Metrics

---
- name: Configure Prometheus monitoring for AAP
  hosts: monitoring_server
  
  tasks:
    - name: Configure AAP scrape targets
      ansible.builtin.blockinfile:
        path: /etc/prometheus/prometheus.yml
        block: |
          - job_name: 'aap-controllers'
            # The controller serves its metrics over HTTPS on the API port
            scheme: https
            metrics_path: '/api/v2/metrics/'
            static_configs:
              - targets:
                - 'controller-1.example.com:443'
                - 'controller-2.example.com:443'
                - 'controller-3.example.com:443'
            basic_auth:
              username: prometheus
              password: "{{ prometheus_password }}"
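
Before wiring up Prometheus, it can help to fetch the metrics endpoint by hand and check what comes back. The sketch below parses the Prometheus text format into a dictionary. It assumes samples carry no trailing timestamp, and the metric names in the sample are made up for illustration rather than taken from a real controller.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text-format samples into a {series: value} dict."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The value is the last space-separated field; labels may contain spaces
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative sample only: these metric names and values are hypothetical
SAMPLE = """\
# HELP example_running_jobs Number of running jobs
# TYPE example_running_jobs gauge
example_running_jobs 3
example_jobs_total{status="successful"} 42
"""

print(parse_metrics(SAMPLE))
```

Feeding the function the body of a real `/api/v2/metrics/` response is a quick way to confirm which series names your controller version exposes before building dashboards on them.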

Grafana Dashboard

{
  "dashboard": {
    "title": "AAP Production Metrics",
    "panels": [
      {
        "title": "Job Success Rate",
        "targets": [
          {
            "expr": "rate(awx_job_successful_total[5m]) / rate(awx_job_launched_total[5m])"
          }
        ]
      },
      {
        "title": "Active Jobs",
        "targets": [
          {
            "expr": "awx_running_jobs"
          }
        ]
      },
      {
        "title": "Database Connections",
        "targets": [
          {
            "expr": "pg_stat_activity_count"
          }
        ]
      },
      {
        "title": "API Response Time",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(django_http_requests_latency_seconds_bucket[5m]))"
          }
        ]
      }
    ]
  }
}

Capacity Planning

Sizing Guidelines

Small Deployment (Up to 500 nodes):
  Controllers: 1 node
  CPU: 8 cores
  RAM: 16 GB
  Database: 4 cores, 8 GB RAM
  Disk: 500 GB SSD

Medium Deployment (500-2,000 nodes):
  Controllers: 2 nodes (HA)
  CPU: 16 cores each
  RAM: 32 GB each
  Database: 8 cores, 16 GB RAM
  Disk: 1 TB SSD

Large Deployment (2,000-5,000 nodes):
  Controllers: 3 nodes (HA)
  CPU: 24 cores each
  RAM: 64 GB each
  Database: 16 cores, 32 GB RAM
  Redis: 3 nodes, 8 GB each
  Disk: 2 TB SSD

Enterprise (5,000-10,000+ nodes):
  Controllers: 3+ nodes (HA)
  CPU: 32 cores each
  RAM: 128 GB each
  Database: 32 cores, 64 GB RAM, clustered
  Redis: 3 nodes, 16 GB each
  Automation Mesh: Regional execution nodes
  Disk: 4 TB SSD
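
The sizing tiers above can be expressed as a small lookup, which is useful for capacity scripts or pre-deployment checks. This is a sketch under two assumptions: each tier's upper bound is inclusive (500 nodes still counts as Small), and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    max_nodes: Optional[int]  # None means no upper bound
    controllers: int
    cpu_cores: int
    ram_gb: int


# Figures copied from the sizing guidelines above
TIERS = [
    Tier("Small", 500, 1, 8, 16),
    Tier("Medium", 2000, 2, 16, 32),
    Tier("Large", 5000, 3, 24, 64),
    Tier("Enterprise", None, 3, 32, 128),  # 3 is the minimum controller count
]


def recommend_tier(managed_nodes: int) -> Tier:
    """Return the first tier whose upper bound covers managed_nodes."""
    if managed_nodes < 0:
        raise ValueError("managed_nodes must not be negative")
    for tier in TIERS:
        if tier.max_nodes is None or managed_nodes <= tier.max_nodes:
            return tier
    raise AssertionError("unreachable: the last tier has no upper bound")


print(recommend_tier(2000).name)
```

For 2,000 managed nodes this selects the Medium tier. Sizing for expected growth rather than the current node count avoids an early re-architecture.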

Growth Planning

Metrics to Monitor:
  - Job queue depth (should be < 50)
  - Database connection pool usage (< 70%)
  - CPU utilization (< 60% average)
  - Memory usage (< 80%)
  - Disk I/O wait (< 10%)

Scale Up Triggers:
  - Job queue consistently > 100
  - Job execution time increases 50%+
  - Database locks/deadlocks occurring
  - CPU sustained > 80%
  - Memory swapping
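
The scale-up triggers above translate directly into a simple check. The sketch below assumes the inputs are already averaged over a sensible window (the list says "consistently" and "sustained"), and the function and parameter names are hypothetical.

```python
from typing import List


def scale_up_reasons(
    queue_depth: int,
    job_time_increase_percent: float,
    deadlocks_per_day: int,
    cpu_percent: float,
    swapping: bool,
) -> List[str]:
    """Return the scale-up triggers that currently fire."""
    reasons = []
    if queue_depth > 100:
        reasons.append("job queue above 100")
    if job_time_increase_percent >= 50:
        reasons.append("job execution time up 50% or more")
    if deadlocks_per_day > 0:
        reasons.append("database deadlocks occurring")
    if cpu_percent > 80:
        reasons.append("CPU above 80%")
    if swapping:
        reasons.append("memory swapping")
    return reasons


print(scale_up_reasons(
    queue_depth=120,
    job_time_increase_percent=10,
    deadlocks_per_day=0,
    cpu_percent=85,
    swapping=False,
))
```

An empty list means no trigger fired. Any non-empty result is a signal to revisit the sizing guidelines.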

Migration Strategies

Migrating from Ansible Tower to AAP

---
- name: Migrate Ansible Tower to AAP
  hosts: tower_server
  
  tasks:
    - name: Backup Tower database
      # shell (not command) is needed because the command line uses output redirection
      ansible.builtin.shell: >
        tower-cli receive --all > /backup/tower_backup_{{ ansible_date_time.iso8601 }}.json
    
    - name: Install AAP
      ansible.builtin.command:
        cmd: ./setup.sh
        chdir: /opt/ansible-automation-platform-setup-2.4
      environment:
        ANSIBLE_BECOME_PASSWORD: "{{ sudo_password }}"
    
    - name: Migrate data to AAP
      # The installer normally applies schema migrations itself; this explicit run is a safety net
      ansible.builtin.command: >
        awx-manage migrate --noinput

Best Practices Checklist

Architecture:
  ✅ High Availability (3+ controller nodes)
  ✅ Database clustering (PostgreSQL + Patroni)
  ✅ Load balancing (HAProxy/ALB)
  ✅ Redis cluster for caching
  ✅ Automation Mesh for multi-region

Security:
  ✅ SSL/TLS with strong ciphers
  ✅ RBAC with least privilege
  ✅ Credential encryption
  ✅ Secrets management (Vault/CyberArk)
  ✅ Audit logging to SIEM
  ✅ Regular security patching

Performance:
  ✅ Database tuning and indexing
  ✅ Connection pooling
  ✅ Optimized execution environments
  ✅ Job concurrency limits
  ✅ API rate limiting

Operations:
  ✅ Automated backups (daily)
  ✅ DR tested quarterly
  ✅ Monitoring with Prometheus/Grafana
  ✅ Capacity planning metrics
  ✅ Upgrade testing in non-prod

Documentation:
  ✅ Architecture diagrams
  ✅ Runbooks for common issues
  ✅ DR procedures documented
  ✅ Escalation paths defined

Key Takeaways

✅ High Availability is mandatory for production (3+ nodes)
✅ PostgreSQL clustering prevents database bottlenecks
✅ Automation Mesh enables multi-region scale
✅ Regular backups with tested DR procedures
✅ Security hardening with RBAC, encryption, audit logging
✅ Performance tuning for the database, caching, and execution environments
✅ Monitoring with Prometheus/Grafana to catch issues proactively
✅ Capacity planning to scale before hitting limits

Conclusion

You've completed the Ansible Automation Platform 101 series! You now have the knowledge to:

Architect enterprise AAP deployments
Implement automation workflows and RBAC
Build event-driven automation
Integrate with external systems
Optimize for performance and scale
Secure and harden production environments

What's Next?

Implement AAP in your environment
Join the Ansible community
Contribute to Ansible Galaxy
Pursue Red Hat Certified Specialist certification
Build advanced automation content

Series Complete! 🎉

Part of the Ansible Automation Platform 101 Series

Last updated 2 months ago
