AAP Architecture and Components

Designing AAP for 500 Servers Across Three Datacenters

Six months into my AAP journey, we faced a new challenge. Our infrastructure had grown from a single datacenter to three: one on-premises, one in AWS us-east-1, and one in Azure West Europe. We had 500+ servers, and network latency between regions was causing automation jobs to time out.

"Just add more Automation Controllers," someone suggested.

That's when I learned that AAP architecture isn't about throwing more servers at problems - it's about understanding the components, their interactions, and designing topology that matches your infrastructure reality.

After studying AAP's architecture and working with Red Hat support, we implemented Automation Mesh with a hub-and-spoke topology. Job execution time dropped by 70%, network bandwidth usage decreased by 60%, and we could scale to 1000+ nodes without breaking a sweat.

This article shares everything I learned about AAP architecture - the knowledge that turned our struggling multi-region deployment into a robust, scalable automation platform.

What You'll Learn

Detailed Automation Controller architecture (control plane vs execution plane)
Automation Hub components and content management
Event-Driven Ansible Controller architecture
Automation Mesh topology patterns and scaling
Database requirements and high availability
Integration architecture with external systems
Capacity planning and performance considerations

Automation Controller Architecture

Automation Controller (formerly Ansible Tower) is the central management component of AAP. Understanding its architecture is crucial for designing scalable deployments.

High-Level Architecture

Core Components Explained

1. Web UI (Frontend)

Technology: React-based single-page application

Responsibilities:

User interface for managing all AAP resources
Dashboard and reporting visualizations
Workflow visual designer
Real-time job output streaming

Real-world insight: The UI is just one interface. Power users often bypass it entirely, using the API or the CLI (the awx command) for automation.

2. REST API (Backend)

Technology: Django REST Framework

Responsibilities:

All AAP operations exposed via RESTful API
Authentication and authorization
Business logic and validation
Integration point for external systems

Example API call:

# Launch a job template via API
curl -X POST https://controller.example.com/api/v2/job_templates/10/launch/ \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"limit": "webservers"}'

Real-world insight: Every action in the UI translates to an API call. I automated our entire AAP configuration using the API, making our infrastructure reproducible via code.
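The same launch call can be built from Python when a script needs to drive the API. This is a minimal sketch using only the standard library; the function name build_launch_request is hypothetical, and the URL, template ID, and token are the placeholder values from the curl example above. It builds the request without sending it, so it can be read and tested without a live controller.

```python
import json
import urllib.request


def build_launch_request(controller_url, template_id, token, limit=None):
    """Build (but do not send) a POST request that launches a job template."""
    url = f"{controller_url.rstrip('/')}/api/v2/job_templates/{template_id}/launch/"
    payload = {}
    if limit is not None:
        payload["limit"] = limit
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_launch_request(
    "https://controller.example.com", 10, "YOUR_TOKEN", limit="webservers"
)

# Sending is left commented out on purpose:
# with urllib.request.urlopen(request) as response:
#     job = json.load(response)
```

Serializing the body with json.dumps avoids the shell-quoting problems that hand-written JSON strings tend to cause.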

3. Task Engine (Celery)

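Technology: Celery distributed task queue in older Ansible Tower releases; recent Automation Controller versions replaced Celery with a built-in dispatcher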

Responsibilities:

Background job processing
Scheduled task execution
Inventory synchronization
Project updates from SCM

Real-world insight: This is the workhorse. During peak hours, we process 50+ concurrent tasks across our controller cluster.

4. Job Dispatcher

Responsibilities:

Determining where to run jobs (which execution node)
Managing execution capacity
Load balancing across execution nodes
Container lifecycle management (Execution Environments)

Real-world insight: The dispatcher is smart about capacity. It won't start more jobs than your execution nodes can handle; anything beyond that waits in the pending state until capacity frees up.
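To make the capacity idea concrete, here is a simplified model of capacity-aware dispatch. It is an illustration only, not the controller's actual algorithm: the ExecutionNode class, the "impact" numbers, and the pick-the-emptiest-node rule are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class ExecutionNode:
    name: str
    capacity: int      # total capacity units the node offers
    consumed: int = 0  # units used by jobs already running

    @property
    def remaining(self):
        return self.capacity - self.consumed


def dispatch(job_impacts, nodes):
    """Assign each job to the node with the most remaining capacity.

    Returns (assignments, pending): assignments maps job name to node name,
    pending lists the jobs that must wait because no node has room.
    """
    assignments, pending = {}, []
    for job, impact in job_impacts.items():
        candidates = [n for n in nodes if n.remaining >= impact]
        if not candidates:
            pending.append(job)
            continue
        target = max(candidates, key=lambda n: n.remaining)
        target.consumed += impact
        assignments[job] = target.name
    return assignments, pending


nodes = [ExecutionNode("exec-1", capacity=10), ExecutionNode("exec-2", capacity=6)]
jobs = {"patch-web": 6, "patch-db": 6, "backup": 5}
assignments, pending = dispatch(jobs, nodes)
print(assignments)  # {'patch-web': 'exec-1', 'patch-db': 'exec-2'}
print(pending)      # ['backup']
```

The third job stays pending because neither node has five free units left, which is the behaviour described above: work waits rather than overloading a node.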

5. WebSocket Server

Technology: Django Channels with Redis backend

Responsibilities:

Real-time job output streaming to UI
Live status updates
Activity stream events

Real-world insight: This enables the "live tail" feature in the UI - watching playbook execution in real-time.

Control Plane vs Execution Plane

A critical AAP concept is the separation of control and execution:

Control Plane

Manages state and orchestration
Stores configuration in PostgreSQL
Handles API requests
Manages scheduling and dispatch
Does NOT run playbooks

Execution Plane

Runs the actual Ansible playbooks
Can be local or remote (via Automation Mesh)
Scales independently from control plane
Containerized via Execution Environments

Why this matters: You can scale execution capacity (add execution nodes) without touching the control plane. This is how AAP scales to thousands of managed nodes.

Database Architecture (PostgreSQL)

PostgreSQL is AAP's persistence layer and deserves careful attention.

What's Stored in the Database

Inventory: Hosts, groups, variables
Credentials: Encrypted with SECRET_KEY
Job history: All execution logs and results
Templates: Job and workflow templates
Projects: SCM configurations
Users and permissions: RBAC configuration
Activity stream: Audit trail

Database Sizing Considerations

From my experience managing AAP at scale:

Small deployment (< 100 nodes):

Database size: 10-20 GB
Queries/second: < 100
Standard PostgreSQL tuning sufficient

Medium deployment (100-500 nodes):

Database size: 50-100 GB
Queries/second: 200-500
Requires PostgreSQL tuning and connection pooling

Large deployment (500+ nodes):

Database size: 200+ GB (grows with job history)
Queries/second: 1000+
Dedicated database server, replicas for read scaling
Aggressive job history cleanup policies

Real-world insight: We had a database grow to 500 GB because we never cleaned up job history. After implementing a 90-day retention policy, it dropped to 80 GB and queries became 3x faster.
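A back-of-the-envelope estimate shows why retention dominates database size. The sketch below is an assumption-heavy model, not a measurement: the function name and every input (jobs per day, events per job, bytes per event) are hypothetical figures chosen only to show the arithmetic.

```python
def estimate_job_history_gib(jobs_per_day, events_per_job, bytes_per_event, retention_days):
    """Rough size of retained job events in GiB, ignoring indexes and other tables."""
    total_bytes = jobs_per_day * events_per_job * bytes_per_event * retention_days
    return total_bytes / 1024**3


# Hypothetical workload: 1000 jobs/day, 500 events per job, 2 KiB per event.
ninety_days = estimate_job_history_gib(1000, 500, 2048, retention_days=90)
one_year = estimate_job_history_gib(1000, 500, 2048, retention_days=365)

print(f"90 days:  {ninety_days:.1f} GiB")   # 90 days:  85.8 GiB
print(f"365 days: {one_year:.1f} GiB")      # 365 days: 348.1 GiB
```

Because the estimate is linear in retention_days, cutting retention from a year to 90 days shrinks the retained history by roughly a factor of four in this model.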

Database High Availability

For production deployments:

# Example: PostgreSQL HA with replication
Primary Database (Read/Write):
  - controller-db-primary.example.com
  
Read Replicas (Read-only):
  - controller-db-replica1.example.com
  - controller-db-replica2.example.com
  
Failover:
  - Automatic with Patroni or similar
  - Or manual failover with connection string update

Real-world insight: We use managed PostgreSQL (AWS RDS) in production with automated backups and read replicas. This eliminated the database as a single point of failure.

Automation Hub Architecture

Automation Hub is your private content repository for Ansible roles, collections, and Execution Environments.

Components

Content Types

1. Ansible Collections

Bundled roles, modules, and plugins
Versioned and dependency-managed
Source: Red Hat certified, community, or private

Example:

# Collections stored in Automation Hub
ansible.posix (certified)
community.general (community)
mycompany.infrastructure (private)

2. Execution Environments

Container images with Ansible + dependencies
Includes Python packages, system libraries, collections
Tagged and versioned

Example:

# Execution Environments in registry
hub.example.com/ee-minimal-rhel8:latest
hub.example.com/ee-aws:2.0
hub.example.com/ee-custom-network:1.5

Private vs Public Hub

Private Automation Hub:

Self-hosted in your environment
Full control over content
Curate and approve content
Air-gapped environment support

Public Automation Hub (console.redhat.com):

Red Hat certified content only
Always available
No infrastructure to manage
Requires internet connectivity

Real-world deployment: We run a private Automation Hub that hosts our custom content and syncs Red Hat certified collections from the public hub.

Event-Driven Ansible Controller Architecture

Event-Driven Ansible (EDA) adds reactive automation capabilities to AAP.

Architecture Components

How EDA Works

Event Sources: External systems send events to EDA Controller
Rulebooks: Define conditions and actions for events
Event Matching: EDA evaluates events against rule conditions
Action Execution: Trigger automation when conditions match

Example Rulebook:

---
- name: Auto-remediate high memory usage
  hosts: all
  sources:
    # ansible.eda.alertmanager receives Alertmanager webhook notifications
    - name: prometheus
      ansible.eda.alertmanager:
        host: 0.0.0.0
        port: 8000
  
  rules:
    - name: Restart service on high memory
      condition: event.alert.labels.severity == "critical" and event.alert.labels.alertname == "HighMemoryUsage"
      action:
        run_job_template:
          name: "Restart Application Services"
          organization: "Operations"

Real-world impact: This architecture enabled us to build self-healing infrastructure that automatically responds to 15 different failure scenarios without human intervention.
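The matching step is easier to reason about when written out as ordinary code. The following is a plain-Python sketch of what the rule above checks; it is not how EDA evaluates rulebooks internally, and the function names and the launch callback are hypothetical stand-ins.

```python
def matches_high_memory_rule(event):
    """Python equivalent of the rulebook condition for one incoming event."""
    labels = event.get("alert", {}).get("labels", {})
    return (
        labels.get("severity") == "critical"
        and labels.get("alertname") == "HighMemoryUsage"
    )


def handle_event(event, launch):
    """Call launch(name, organization) when the event satisfies the rule."""
    if matches_high_memory_rule(event):
        launch("Restart Application Services", "Operations")
        return True
    return False


launched = []
critical = {"alert": {"labels": {"severity": "critical", "alertname": "HighMemoryUsage"}}}
warning = {"alert": {"labels": {"severity": "warning", "alertname": "HighMemoryUsage"}}}

handle_event(critical, lambda name, org: launched.append((name, org)))
handle_event(warning, lambda name, org: launched.append((name, org)))
print(launched)  # [('Restart Application Services', 'Operations')]
```

Only the critical alert triggers the action; the warning passes through without launching anything, and an event with no alert key at all is simply ignored.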

Automation Mesh Architecture

Automation Mesh is AAP's solution for scaling across network boundaries, regions, and security zones.

The Problem Mesh Solves

Traditional AAP: All execution nodes must have direct network connectivity to the control plane.

Problem:

Firewall rules become complex
High latency across regions
Can't reach DMZ or isolated networks
Difficult to scale globally

Automation Mesh Solution: Multi-hop execution with intelligent routing.

Mesh Topology Patterns

Pattern 1: Hub and Spoke

Use case: This is the pattern we use. The controller sits in our primary datacenter, hop nodes sit in each region or cloud, and execution nodes are deployed close to the managed infrastructure.

Benefits:

Reduced network latency (execution local to managed nodes)
Simplified firewall rules (only hop nodes need controller access)
Easy to add new regions

Pattern 2: Peered Mesh

Use case: Highly distributed environments where regions need to communicate directly.

Benefits:

Redundant paths
No single point of failure
Lower latency between regions

Mesh Node Types

Control Nodes

Run the Automation Controller application
Handle API, UI, scheduling
Do not run playbook executions
Typically 2-3 nodes for HA

Execution Nodes

Run playbook jobs
Local to managed infrastructure
Scale independently
Can number in the hundreds

Hop Nodes

Route traffic between controller and execution nodes
Do not run jobs themselves
Bridge network segments
Enable multi-region deployments

Real-world configuration:

# Our production mesh topology
Control Nodes (Primary DC):
  - controller1.example.com
  - controller2.example.com
  
Hop Nodes:
  - hop-aws-us-east-1.example.com
  - hop-azure-westeu.example.com
  - hop-dmz.example.com
  
Execution Nodes:
  - exec-aws-[1-20].example.com (AWS region)
  - exec-azure-[1-20].example.com (Azure region)
  - exec-dmz-[1-5].example.com (DMZ)
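Multi-hop routing over a topology like this one can be pictured as a shortest-path search. The sketch below is a simplified model, not the mesh's actual routing code: the hostnames come from the listing above, but which nodes peer with which is an assumption, and the function names are hypothetical.

```python
from collections import deque

# Assumed peer links: the controller peers with each hop node,
# and each hop node peers with one execution node in its region.
MESH_LINKS = {
    "controller1.example.com": [
        "hop-aws-us-east-1.example.com",
        "hop-azure-westeu.example.com",
        "hop-dmz.example.com",
    ],
    "hop-aws-us-east-1.example.com": ["exec-aws-1.example.com"],
    "hop-azure-westeu.example.com": ["exec-azure-1.example.com"],
    "hop-dmz.example.com": ["exec-dmz-1.example.com"],
}


def build_graph(links):
    """Turn the one-directional link listing into an undirected adjacency map."""
    graph = {}
    for node, peers in links.items():
        for peer in peers:
            graph.setdefault(node, set()).add(peer)
            graph.setdefault(peer, set()).add(node)
    return graph


def find_route(graph, source, target):
    """Breadth-first search: return the shortest hop path, or None if unreachable."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


graph = build_graph(MESH_LINKS)
route = find_route(graph, "controller1.example.com", "exec-azure-1.example.com")
print(" -> ".join(route))
```

In this model the controller never talks to an execution node directly; every route passes through the regional hop node, which is why only the hop nodes need firewall access to the controller.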

Integration Architecture

AAP integrates with numerous external systems. Understanding integration patterns is crucial.

Common Integration Points

Integration Patterns

1. Webhook Integrations

Outbound: AAP sends notifications

# Example: Notify Slack on job failure
Notifications:
  - Type: Slack
    Trigger: Job Failed
    Webhook: https://hooks.slack.com/services/XXX
    Message: "Job {{ job_name }} failed in {{ organization }}"

Inbound: External systems trigger AAP jobs

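# Example: GitLab pipeline triggers AAP deployment
# Double quotes around the payload let the shell expand $CI_COMMIT_SHA
curl -X POST https://controller.example.com/api/v2/job_templates/50/launch/ \
  -H "Authorization: Bearer $AAP_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"extra_vars\": {\"git_commit\": \"$CI_COMMIT_SHA\"}}"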

2. ServiceNow Integration

Bi-directional integration:

Create ServiceNow tickets from AAP jobs
Launch AAP jobs from ServiceNow catalog items
Update tickets with job results

Real-world use: Our change management workflow creates ServiceNow change requests before production deployments, waits for approval, then executes automation.

3. Monitoring Integration

Pattern: Monitoring alerts trigger EDA rulebooks, which in turn trigger AAP jobs

# Prometheus → EDA → AAP flow
Prometheus Alert → EDA Rulebook → AAP Job Template → Remediation

Capacity Planning and Performance

Sizing Guidelines from Real Deployments

Small Deployment (< 100 nodes)

Infrastructure:

1 Automation Controller node
Integrated database on same host
4 vCPU, 16 GB RAM, 40 GB disk

Capacity:

~20 concurrent jobs
~50 jobs/hour sustained
Single point of failure (not HA)

Use case: POC, dev/test environments

Medium Deployment (100-500 nodes)

Infrastructure:

2 Automation Controller nodes (HA)
Separate PostgreSQL server (or managed DB)
Redis for HA
4-8 vCPU, 32 GB RAM per controller
4-6 dedicated execution nodes

Capacity:

~50 concurrent jobs
~200 jobs/hour sustained
High availability

Use case: Production for mid-sized environments

Large Deployment (500+ nodes)

Infrastructure:

3+ Control nodes
PostgreSQL HA cluster or managed DB
Redis HA cluster
Automation Mesh with regional hop nodes
20+ execution nodes distributed across regions
8+ vCPU, 64 GB RAM per control node

Capacity:

100+ concurrent jobs
500+ jobs/hour sustained
Multi-region, fully redundant

Use case: Enterprise production, multiple teams, global infrastructure

Our deployment (managing 1000+ nodes):

3 control nodes (8 vCPU, 64 GB each)
AWS RDS PostgreSQL (db.m5.2xlarge)
ElastiCache Redis (cache.m5.large)
3 hop nodes (one per region)
30 execution nodes (distributed)
Handles 600+ jobs/hour at peak
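The concurrency and throughput figures in the three sizing tiers are linked by simple queueing arithmetic (Little's law). The sketch below derives the average job duration each tier implies if every concurrent slot were kept busy; that full-utilisation assumption and the function names are mine, and the inputs are the approximate tier figures from above.

```python
def implied_avg_job_minutes(concurrent_jobs, jobs_per_hour):
    """Average job duration implied by fully used slots and a given throughput."""
    return concurrent_jobs * 60 / jobs_per_hour


def sustained_jobs_per_hour(concurrent_jobs, avg_job_minutes):
    """Throughput achievable when all slots stay busy with jobs of this length."""
    return concurrent_jobs * 60 / avg_job_minutes


tiers = {"small": (20, 50), "medium": (50, 200), "large": (100, 500)}
for name, (slots, throughput) in tiers.items():
    print(f"{name}: {implied_avg_job_minutes(slots, throughput):.0f} min per job")
# small: 24 min per job
# medium: 15 min per job
# large: 12 min per job
```

The same relation works in the other direction: if your own jobs average five minutes, sustained_jobs_per_hour(50, 5) suggests 50 slots could sustain about 600 jobs per hour, so measure your real job durations before sizing from a table.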

Performance Tuning Insights

Database tuning made the biggest impact:

# PostgreSQL tuning for AAP (postgresql.conf)
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 64MB
maintenance_work_mem = 2GB
max_connections = 1000
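The two memory values above are consistent with a widely used starting rule of thumb: about 25% of RAM for shared_buffers and about 75% for effective_cache_size. Whether these settings were actually derived that way is an assumption, and the helper below is a hypothetical sketch for picking starting points, not a substitute for measuring your own workload.

```python
def suggest_memory_settings(total_ram_gb):
    """Starting-point PostgreSQL memory settings from a common rule of thumb."""
    return {
        "shared_buffers": f"{total_ram_gb // 4}GB",            # about 25% of RAM
        "effective_cache_size": f"{total_ram_gb * 3 // 4}GB",  # about 75% of RAM
    }


settings = suggest_memory_settings(32)
for key, value in settings.items():
    print(f"{key} = {value}")
# shared_buffers = 8GB
# effective_cache_size = 24GB
```

For a database host with 32 GB of RAM this reproduces the 8GB and 24GB values in the listing; work_mem and max_connections depend on query patterns and connection counts, so they are left out of the sketch.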

Job history cleanup is essential:

# Delete job history older than 90 days
awx-manage cleanup_jobs --days=90

Result: Database size dropped 75%, query performance improved 3x.

Key Takeaways

✅ Control plane vs execution plane - understand the separation for proper scaling
✅ Database is critical - size appropriately, tune well, plan for HA
✅ Automation Mesh enables multi-region and network-isolated deployments
✅ Integration architecture connects AAP with enterprise ecosystem
✅ Capacity planning based on managed nodes and job volume
✅ Performance tuning focuses on database, job cleanup, and execution capacity

What's Next

Now that you understand AAP architecture, the next article walks through actually setting up your AAP environment - installation, initial configuration, authentication, and getting ready for production use.

Next Article: Setting Up Your AAP Environment →

Additional Resources

Part of the Ansible Automation Platform 101 Series


Last updated 1 month ago
