AAP Architecture and Components
Designing AAP for 500 Servers Across Three Datacenters
Six months into my AAP journey, we faced a new challenge. Our infrastructure had grown from a single datacenter to three: one on-premises, one in AWS us-east-1, and one in Azure West Europe. We had 500+ servers, and network latency between regions was causing automation jobs to time out.
"Just add more Automation Controllers," someone suggested.
That's when I learned that AAP architecture isn't about throwing more servers at problems - it's about understanding the components, their interactions, and designing topology that matches your infrastructure reality.
After studying AAP's architecture and working with Red Hat support, we implemented Automation Mesh with a hub-and-spoke topology. Job execution time dropped by 70%, network bandwidth usage decreased by 60%, and we could scale to 1000+ nodes without breaking a sweat.
This article shares everything I learned about AAP architecture - the knowledge that turned our struggling multi-region deployment into a robust, scalable automation platform.
What You'll Learn
Detailed Automation Controller architecture (control plane vs execution plane)
Automation Hub components and content management
Event-Driven Ansible Controller architecture
Automation Mesh topology patterns and scaling
Database requirements and high availability
Integration architecture with external systems
Capacity planning and performance considerations
Automation Controller Architecture
Automation Controller (formerly Ansible Tower) is the central management component of AAP. Understanding its architecture is crucial for designing scalable deployments.
High-Level Architecture
Core Components Explained
1. Web UI (Frontend)
Technology: React-based single-page application
Responsibilities:
User interface for managing all AAP resources
Dashboard and reporting visualizations
Workflow visual designer
Real-time job output streaming
Real-world insight: The UI is just one interface. Power users often bypass it entirely and drive AAP through the API or the CLI (the awx command).
2. REST API (Backend)
Technology: Django REST Framework
Responsibilities:
All AAP operations exposed via RESTful API
Authentication and authorization
Business logic and validation
Integration point for external systems
Example API call:
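A minimal sketch of such a call in Python, using only the standard library. It assumes a controller reachable at the placeholder URL https://aap.example.com, a token in EXAMPLE_TOKEN, and a job template with ID 42; all three are hypothetical. The /api/v2/job_templates/<id>/launch/ path follows the controller's v2 API, but the prefix can differ between AAP versions, so check it against your own instance. The request is built here but not sent.

```python
import json
import urllib.request


def build_launch_request(base_url: str, token: str, template_id: int,
                         extra_vars: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that launches a job template."""
    url = f"{base_url.rstrip('/')}/api/v2/job_templates/{template_id}/launch/"
    body = json.dumps({"extra_vars": extra_vars}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",   # token auth; value is a placeholder
            "Content-Type": "application/json",
        },
    )


# Hypothetical values for illustration only.
req = build_launch_request(
    "https://aap.example.com", "EXAMPLE_TOKEN", 42, {"target_env": "staging"}
)
print(req.get_method(), req.full_url)

# To actually launch the job you would send it, for example:
#   with urllib.request.urlopen(req) as resp:
#       job = json.load(resp)
```

Keeping request construction separate from sending makes the call easy to inspect and to reuse in scripts that configure AAP as code.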
Real-world insight: Every action in the UI translates to an API call. I automated our entire AAP configuration using the API, making our infrastructure reproducible via code.
3. Task Engine (Celery)
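Technology: Celery distributed task queue in older Ansible Tower releases; newer Automation Controller releases use a built-in dispatcher backed by PostgreSQL notifications instead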
Responsibilities:
Background job processing
Scheduled task execution
Inventory synchronization
Project updates from SCM
Real-world insight: This is the workhorse. During peak hours, we process 50+ concurrent tasks across our controller cluster.
4. Job Dispatcher
Responsibilities:
Determining where to run jobs (which execution node)
Managing execution capacity
Load balancing across execution nodes
Container lifecycle management (Execution Environments)
Real-world insight: The dispatcher is capacity-aware. It won't start more jobs than your execution nodes can handle; anything beyond that stays pending until capacity frees up.
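To make the capacity idea concrete, here is a toy model of capacity-aware dispatch. It is not the controller's actual algorithm. It assumes each node has a fixed number of capacity units, each job consumes as many units as it has forks, and a job goes to the node with the most free capacity. The node and job names are made up.

```python
from dataclasses import dataclass


@dataclass
class ExecutionNode:
    name: str
    capacity: int      # total capacity units (roughly: forks the node can run)
    consumed: int = 0  # units currently used by running jobs

    @property
    def remaining(self) -> int:
        return self.capacity - self.consumed


def dispatch(jobs: list[tuple[str, int]], nodes: list[ExecutionNode]):
    """Assign each (job_name, forks) to the node with the most free capacity.

    Jobs that do not fit on any node are returned as pending.
    """
    assignments: dict[str, str] = {}
    pending: list[str] = []
    for job_name, forks in jobs:
        best = max(nodes, key=lambda n: n.remaining)
        if best.remaining >= forks:
            best.consumed += forks
            assignments[job_name] = best.name
        else:
            pending.append(job_name)  # waits until capacity frees up
    return assignments, pending


nodes = [ExecutionNode("exec-1", 10), ExecutionNode("exec-2", 6)]
jobs = [("patch-web", 5), ("patch-db", 5), ("deploy-app", 5), ("audit", 4)]
assignments, pending = dispatch(jobs, nodes)
print(assignments, pending)
```

In this model the fourth job stays pending because neither node has four free units left, which mirrors the behaviour described above.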
5. WebSocket Server
Technology: Django Channels with Redis backend
Responsibilities:
Real-time job output streaming to UI
Live status updates
Activity stream events
Real-world insight: This enables the "live tail" feature in the UI - watching playbook execution in real-time.
Control Plane vs Execution Plane
A critical AAP concept is the separation of control and execution:
Control Plane
Manages state and orchestration
Stores configuration in PostgreSQL
Handles API requests
Manages scheduling and dispatch
Does NOT run playbooks
Execution Plane
Runs the actual Ansible playbooks
Can be local or remote (via Automation Mesh)
Scales independently from control plane
Containerized via Execution Environments
Why this matters: You can scale execution capacity (add execution nodes) without touching the control plane. This is how AAP scales to thousands of managed nodes.
Database Architecture (PostgreSQL)
PostgreSQL is AAP's persistence layer and deserves careful attention.
What's Stored in the Database
Inventory: Hosts, groups, variables
Credentials: Encrypted with SECRET_KEY
Job history: All execution logs and results
Templates: Job and workflow templates
Projects: SCM configurations
Users and permissions: RBAC configuration
Activity stream: Audit trail
Database Sizing Considerations
From my experience managing AAP at scale:
Small deployment (< 100 nodes):
Database size: 10-20 GB
Queries/second: < 100
Standard PostgreSQL tuning sufficient
Medium deployment (100-500 nodes):
Database size: 50-100 GB
Queries/second: 200-500
Requires PostgreSQL tuning and connection pooling
Large deployment (500+ nodes):
Database size: 200+ GB (grows with job history)
Queries/second: 1000+
Dedicated database server, replicas for read scaling
Aggressive job history cleanup policies
Real-world insight: We had a database grow to 500 GB because we never cleaned up job history. After implementing a 90-day retention policy, it dropped to 80 GB and queries became 3x faster.
Database High Availability
For production deployments:
Real-world insight: We use managed PostgreSQL (AWS RDS) in production with automated backups and read replicas. This eliminated the database as a single point of failure.
Automation Hub Architecture
Automation Hub is your private content repository for Ansible roles, collections, and Execution Environments.
Components
Content Types
1. Ansible Collections
Bundled roles, modules, and plugins
Versioned and dependency-managed
Source: Red Hat certified, community, or private
Example:
2. Execution Environments
Container images with Ansible + dependencies
Includes Python packages, system libraries, collections
Tagged and versioned
Example:
Private vs Public Hub
Private Automation Hub:
Self-hosted in your environment
Full control over content
Curate and approve content
Air-gapped environment support
Public Automation Hub (console.redhat.com):
Red Hat certified content only
Always available
No infrastructure to manage
Requires internet connectivity
Real-world deployment: We run a private Automation Hub for our custom content and sync Red Hat certified collections into it from the public hub.
Event-Driven Ansible Controller Architecture
Event-Driven Ansible (EDA) adds reactive automation capabilities to AAP.
Architecture Components
How EDA Works
Event Sources: External systems send events to EDA Controller
Rulebooks: Define conditions and actions for events
Event Matching: EDA evaluates events against rule conditions
Action Execution: Trigger automation when conditions match
Example Rulebook:
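Real rulebooks are written in YAML and evaluated by ansible-rulebook. The sketch below is only a Python model of the matching flow described above (event in, conditions evaluated, actions out), not rulebook syntax. The rule names, event fields, and job template names are hypothetical.

```python
from typing import Any

Event = dict[str, Any]

# Hypothetical rules: each pairs a condition with the action to trigger.
RULES: list[dict[str, Any]] = [
    {
        "name": "Restart web service on outage alert",
        "condition": lambda e: e.get("alert") == "service_down"
        and e.get("service") == "nginx",
        "action": {"run_job_template": "Restart nginx"},
    },
    {
        "name": "Free disk space when usage is high",
        "condition": lambda e: e.get("alert") == "disk_usage"
        and e.get("percent", 0) >= 90,
        "action": {"run_job_template": "Clean up disk"},
    },
]


def match_event(event: Event, rules: list[dict[str, Any]] = RULES) -> list[dict[str, Any]]:
    """Return the action of every rule whose condition matches the event."""
    return [rule["action"] for rule in rules if rule["condition"](event)]


# An event as a monitoring system might send it (fields are assumptions).
actions = match_event({"alert": "disk_usage", "host": "web01", "percent": 94})
print(actions)
```

An event that matches no rule simply produces no actions, which is why a noisy event source does not by itself trigger automation.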
Real-world impact: This architecture enabled us to build self-healing infrastructure that automatically responds to 15 different failure scenarios without human intervention.
Automation Mesh Architecture
Automation Mesh is AAP's solution for scaling across network boundaries, regions, and security zones.
The Problem Mesh Solves
Traditional AAP: All execution nodes must have direct network connectivity to the control plane.
Problem:
Firewall rules become complex
High latency across regions
Can't reach DMZ or isolated networks
Difficult to scale globally
Automation Mesh Solution: Multi-hop execution with intelligent routing.
Mesh Topology Patterns
Pattern 1: Hub and Spoke
Use case: We use this pattern. The controller sits in our primary datacenter, hop nodes sit in each region or cloud, and execution nodes are deployed close to the managed infrastructure.
Benefits:
Reduced network latency (execution local to managed nodes)
Simplified firewall rules (only hop nodes need controller access)
Easy to add new regions
Pattern 2: Peered Mesh
Use case: Highly distributed environments where regions need to communicate directly.
Benefits:
Redundant paths
No single point of failure
Lower latency between regions
Mesh Node Types
Control Nodes
Run the Automation Controller application
Handle API, UI, scheduling
Do not run playbook executions
Typically 2-3 nodes for HA
Execution Nodes
Run playbook jobs
Local to managed infrastructure
Scale independently
Can be 100s of nodes
Hop Nodes
Route traffic between controller and execution nodes
Do not run jobs themselves
Bridge network segments
Enable multi-region deployments
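The way hop nodes bridge the controller and remote execution nodes can be pictured as path-finding over a graph of peer links. The sketch below is a toy model using breadth-first search over a hypothetical hub-and-spoke topology. It does not reproduce the mesh's own routing logic, and every node name is invented for illustration.

```python
from collections import deque

# Hypothetical hub-and-spoke mesh: node -> directly peered nodes.
MESH = {
    "controller": ["hop-onprem", "hop-aws", "hop-azure"],
    "hop-onprem": ["controller", "exec-onprem-1"],
    "hop-aws": ["controller", "exec-aws-1", "exec-aws-2"],
    "hop-azure": ["controller", "exec-azure-1"],
    "exec-onprem-1": ["hop-onprem"],
    "exec-aws-1": ["hop-aws"],
    "exec-aws-2": ["hop-aws"],
    "exec-azure-1": ["hop-azure"],
}


def find_route(mesh: dict[str, list[str]], source: str, target: str):
    """Breadth-first search: return the shortest hop path, or None if unreachable."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for peer in mesh.get(node, []):
            if peer not in visited:
                visited.add(peer)
                queue.append(path + [peer])
    return None


route = find_route(MESH, "controller", "exec-azure-1")
print(route)
```

In this model the controller never talks to an execution node directly; every route passes through the regional hop node, which is what keeps the firewall rules simple.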
Real-world configuration:
Integration Architecture
AAP integrates with numerous external systems. Understanding integration patterns is crucial.
Common Integration Points
Integration Patterns
1. Webhook Integrations
Outbound: AAP sends notifications
Inbound: External systems trigger AAP jobs
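For inbound webhooks, the receiving side usually needs to confirm that a request really came from the expected system. The sketch below assumes the sender signs the request body with HMAC-SHA256 using a shared secret and sends the result as "sha256=<hex digest>", a scheme used by GitHub-style webhooks. Whether your sender uses this exact scheme is an assumption to verify; the secret and payload are placeholders.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Compute the signature a sender would attach to the payload."""
    return "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()


def is_valid_signature(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Check a received signature using a constant-time comparison."""
    expected = sign_payload(secret, payload)
    return hmac.compare_digest(expected, signature_header)


# Placeholder values for illustration only.
secret = b"example-shared-secret"
payload = b'{"event": "push", "ref": "refs/heads/main"}'
header = sign_payload(secret, payload)

print(is_valid_signature(secret, payload, header))
print(is_valid_signature(secret, payload + b" ", header))
```

The constant-time comparison matters because a plain string comparison can leak how many leading characters of a forged signature were correct.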
2. ServiceNow Integration
Bi-directional integration:
Create ServiceNow tickets from AAP jobs
Launch AAP jobs from ServiceNow catalog items
Update tickets with job results
Real-world use: Our change management workflow creates ServiceNow change requests before production deployments, waits for approval, then executes automation.
3. Monitoring Integration
Pattern: Monitoring alerts trigger EDA rulebooks, which in turn launch AAP jobs
Capacity Planning and Performance
Sizing Guidelines from Real Deployments
Small Deployment (< 100 nodes)
Infrastructure:
1 Automation Controller node
Integrated database on same host
4 vCPU, 16 GB RAM, 40 GB disk
Capacity:
~20 concurrent jobs
~50 jobs/hour sustained
Single point of failure (not HA)
Use case: POC, dev/test environments
Medium Deployment (100-500 nodes)
Infrastructure:
2 Automation Controller nodes (HA)
Separate PostgreSQL server (or managed DB)
Redis for HA
4-8 vCPU, 32 GB RAM per controller
4-6 dedicated execution nodes
Capacity:
~50 concurrent jobs
~200 jobs/hour sustained
High availability
Use case: Production for mid-sized environments
Large Deployment (500+ nodes)
Infrastructure:
3+ Control nodes
PostgreSQL HA cluster or managed DB
Redis HA cluster
Automation Mesh with regional hop nodes
20+ execution nodes distributed across regions
8+ vCPU, 64 GB RAM per control node
Capacity:
100+ concurrent jobs
500+ jobs/hour sustained
Multi-region, fully redundant
Use case: Enterprise production, multiple teams, global infrastructure
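The three tiers above can be condensed into a small lookup, which is handy as a first-pass check when planning a new environment. The function below only encodes the figures from this article. The function name and dictionary keys are my own, and since the article lists 500 nodes under both medium and large, treating exactly 500 as large is an assumption.

```python
def recommend_tier(managed_nodes: int) -> dict:
    """Map a managed-node count to the sizing tier described in this article."""
    if managed_nodes < 0:
        raise ValueError("managed_nodes must not be negative")
    if managed_nodes < 100:
        return {"tier": "small", "control_nodes": 1, "concurrent_jobs": 20,
                "jobs_per_hour": 50, "ha": False}
    if managed_nodes < 500:
        return {"tier": "medium", "control_nodes": 2, "concurrent_jobs": 50,
                "jobs_per_hour": 200, "ha": True}
    # 500 and above: control_nodes, concurrent_jobs and jobs_per_hour are minimums.
    return {"tier": "large", "control_nodes": 3, "concurrent_jobs": 100,
            "jobs_per_hour": 500, "ha": True}


for count in (40, 250, 1200):
    print(count, recommend_tier(count)["tier"])
```

Treat the result as a starting point only: job volume and playbook complexity matter as much as the raw node count.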
Our deployment (managing 1000+ nodes):
3 control nodes (8 vCPU, 64 GB each)
AWS RDS PostgreSQL (db.m5.2xlarge)
ElastiCache Redis (cache.m5.large)
3 hop nodes (one per region)
30 execution nodes (distributed)
Handles 600+ jobs/hour at peak
Performance Tuning Insights
Database tuning made the biggest impact:
Job history cleanup is essential:
Result: Database size dropped 75%, query performance improved 3x.
Key Takeaways
Control plane vs execution plane - understand the separation for proper scaling
Database is critical - size appropriately, tune well, plan for HA
Automation Mesh enables multi-region and network-isolated deployments
Integration architecture connects AAP with enterprise ecosystem
Capacity planning based on managed nodes and job volume
Performance tuning focuses on database, job cleanup, and execution capacity
What's Next
Now that you understand AAP architecture, the next article walks through actually setting up your AAP environment - installation, initial configuration, authentication, and getting ready for production use.
Next Article: Setting Up Your AAP Environment
Part of the Ansible Automation Platform 101 Series