AAP Architecture and Components

Designing AAP for 500 Servers Across Three Datacenters

Six months into my AAP journey, we faced a new challenge. Our infrastructure had grown from a single datacenter to three: one on-premises, one in AWS us-east-1, and one in Azure West Europe. We had 500+ servers and network latency between regions was causing automation jobs to time out.

"Just add more Automation Controllers," someone suggested.

That's when I learned that AAP architecture isn't about throwing more servers at problems - it's about understanding the components and their interactions, and designing a topology that matches your infrastructure reality.

After studying AAP's architecture and working with Red Hat support, we implemented Automation Mesh with a hub-and-spoke topology. Job execution time dropped by 70%, network bandwidth usage decreased by 60%, and we could scale to 1000+ nodes without breaking a sweat.

This article shares everything I learned about AAP architecture - the knowledge that turned our struggling multi-region deployment into a robust, scalable automation platform.

What You'll Learn

  • Detailed Automation Controller architecture (control plane vs execution plane)

  • Automation Hub components and content management

  • Event-Driven Ansible Controller architecture

  • Automation Mesh topology patterns and scaling

  • Database requirements and high availability

  • Integration architecture with external systems

  • Capacity planning and performance considerations

Automation Controller Architecture

Automation Controller (formerly Ansible Tower) is the central management component of AAP. Understanding its architecture is crucial for designing scalable deployments.

High-Level Architecture

Core Components Explained

1. Web UI (Frontend)

Technology: React-based single-page application

Responsibilities:

  • User interface for managing all AAP resources

  • Dashboard and reporting visualizations

  • Workflow visual designer

  • Real-time job output streaming

Real-world insight: The UI is just one interface. Power users often bypass it entirely, using the API or the CLI (the awx command) for automation.

2. REST API (Backend)

Technology: Django REST Framework

Responsibilities:

  • All AAP operations exposed via RESTful API

  • Authentication and authorization

  • Business logic and validation

  • Integration point for external systems

Example API call:
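As a minimal sketch in Python, this is roughly what launching a job template over the API looks like. The controller URL, template ID, and token are hypothetical placeholders, and the path assumes the classic /api/v2/ prefix (newer platform releases may expose the controller API under a different prefix). The request is built but not sent, so the snippet runs without a live controller.

```python
import json
import urllib.request

# Hypothetical values: replace with your controller URL, template ID, and token.
CONTROLLER_URL = "https://controller.example.com"
JOB_TEMPLATE_ID = 42
API_TOKEN = "REPLACE_ME"


def build_launch_request(base_url, template_id, token, extra_vars=None):
    """Build (but do not send) a POST request that launches a job template."""
    url = f"{base_url}/api/v2/job_templates/{template_id}/launch/"
    body = json.dumps({"extra_vars": extra_vars or {}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_launch_request(
    CONTROLLER_URL, JOB_TEMPLATE_ID, API_TOKEN, {"env": "staging"}
)
print(request.get_method(), request.full_url)
# To actually launch the job: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the call easy to inspect and test before pointing it at a real controller.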

Real-world insight: Every action in the UI translates to an API call. I automated our entire AAP configuration using the API, making our infrastructure reproducible via code.

3. Task Engine

Technology: Distributed task queue (Celery in older Ansible Tower releases; current Controller releases use a built-in dispatcher instead)

Responsibilities:

  • Background job processing

  • Scheduled task execution

  • Inventory synchronization

  • Project updates from SCM

Real-world insight: This is the workhorse. During peak hours, we process 50+ concurrent tasks across our controller cluster.

4. Job Dispatcher

Responsibilities:

  • Determining where to run jobs (which execution node)

  • Managing execution capacity

  • Load balancing across execution nodes

  • Container lifecycle management (Execution Environments)

Real-world insight: The dispatcher is smart about capacity. It won't start more jobs than your execution nodes can handle; the rest wait in the pending state until capacity frees up.
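The capacity-aware placement described above can be illustrated with a small Python sketch. This is a simplified model, not the controller's actual algorithm: the `ExecutionNode` class, the `pick_node` function, and the capacity numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExecutionNode:
    name: str
    capacity: int      # total capacity units the node offers
    consumed: int = 0  # capacity units used by running jobs

    @property
    def remaining(self) -> int:
        return self.capacity - self.consumed


def pick_node(nodes: List[ExecutionNode], job_impact: int) -> Optional[ExecutionNode]:
    """Return the node with the most free capacity that fits the job, else None."""
    candidates = [n for n in nodes if n.remaining >= job_impact]
    if not candidates:
        return None  # the job stays pending until capacity frees up
    chosen = max(candidates, key=lambda n: n.remaining)
    chosen.consumed += job_impact
    return chosen


nodes = [ExecutionNode("exec-a", capacity=10), ExecutionNode("exec-b", capacity=6)]
first = pick_node(nodes, job_impact=6)   # exec-a has the most room
second = pick_node(nodes, job_impact=6)  # exec-a has 4 left, so exec-b is used
third = pick_node(nodes, job_impact=6)   # nothing fits, so the job waits
print(first.name, second.name, third)
```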

5. WebSocket Server

Technology: Django Channels with Redis backend

Responsibilities:

  • Real-time job output streaming to UI

  • Live status updates

  • Activity stream events

Real-world insight: This enables the "live tail" feature in the UI - watching playbook execution in real-time.

Control Plane vs Execution Plane

A critical AAP concept is the separation of control and execution:

Control Plane

  • Manages state and orchestration

  • Stores configuration in PostgreSQL

  • Handles API requests

  • Manages scheduling and dispatch

  • Does NOT run playbooks

Execution Plane

  • Runs the actual Ansible playbooks

  • Can be local or remote (via Automation Mesh)

  • Scales independently from control plane

  • Containerized via Execution Environments

Why this matters: You can scale execution capacity (add execution nodes) without touching the control plane. This is how AAP scales to thousands of managed nodes.

Database Architecture (PostgreSQL)

PostgreSQL is AAP's persistence layer and deserves careful attention.

What's Stored in the Database

  • Inventory: Hosts, groups, variables

  • Credentials: Encrypted with SECRET_KEY

  • Job history: All execution logs and results

  • Templates: Job and workflow templates

  • Projects: SCM configurations

  • Users and permissions: RBAC configuration

  • Activity stream: Audit trail

Database Sizing Considerations

From my experience managing AAP at scale:

Small deployment (< 100 nodes):

  • Database size: 10-20 GB

  • Queries/second: < 100

  • Standard PostgreSQL tuning sufficient

Medium deployment (100-500 nodes):

  • Database size: 50-100 GB

  • Queries/second: 200-500

  • Requires PostgreSQL tuning and connection pooling

Large deployment (500+ nodes):

  • Database size: 200+ GB (grows with job history)

  • Queries/second: 1000+

  • Dedicated database server, replicas for read scaling

  • Aggressive job history cleanup policies

Real-world insight: We had a database grow to 500 GB because we never cleaned up job history. After implementing a 90-day retention policy, it dropped to 80 GB and queries became 3x faster.

Database High Availability

For production deployments:

Real-world insight: We use managed PostgreSQL (AWS RDS) in production with automated backups and read replicas. This eliminated the database as a single point of failure.

Automation Hub Architecture

Automation Hub is your private content repository for Ansible roles, collections, and Execution Environments.

Components

Content Types

1. Ansible Collections

  • Bundled roles, modules, and plugins

  • Versioned and dependency-managed

  • Source: Red Hat certified, community, or private

Example:
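Content inside a collection is addressed by a fully qualified collection name (FQCN) of the form namespace.collection.name. A minimal Python sketch of splitting one apart follows; the `FQCN` tuple and `parse_fqcn` helper are illustrative names, not part of any Ansible library.

```python
from typing import NamedTuple


class FQCN(NamedTuple):
    namespace: str
    collection: str
    name: str


def parse_fqcn(value: str) -> FQCN:
    """Split 'namespace.collection.name' into its three parts."""
    parts = value.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a fully qualified collection name: {value!r}")
    return FQCN(*parts)


module = parse_fqcn("ansible.posix.firewalld")
print(module.namespace, module.collection, module.name)
```

The namespace and collection parts together identify what has to be available in your hub; the last part is the module, role, or plugin inside that collection.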

2. Execution Environments

  • Container images with Ansible + dependencies

  • Includes Python packages, system libraries, collections

  • Tagged and versioned

Example:
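Because Execution Environments are ordinary container images, they are referenced as registry/repository:tag. The sketch below splits such a reference in Python. The hub hostname and image name are hypothetical, and digest references (`@sha256:...`) are deliberately not handled.

```python
from typing import Tuple


def split_image_reference(reference: str) -> Tuple[str, str, str]:
    """Split 'registry/path/name:tag' into (registry, repository, tag)."""
    registry, _, remainder = reference.partition("/")
    # Only a colon after the first slash is treated as the tag separator,
    # so a registry port such as 'hub.example.com:5000' is left alone.
    repository, sep, tag = remainder.rpartition(":")
    if not sep:
        repository, tag = remainder, "latest"
    return registry, repository, tag


print(split_image_reference("hub.example.com/team/ee-network:1.2.0"))
print(split_image_reference("hub.example.com:5000/team/ee-network"))
```

Pinning job templates to an explicit tag rather than relying on the implicit "latest" keeps runs reproducible when the image is rebuilt.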

Private vs Public Hub

Private Automation Hub:

  • Self-hosted in your environment

  • Full control over content

  • Curate and approve content

  • Air-gapped environment support

Public Automation Hub (console.redhat.com):

  • Red Hat certified content only

  • Always available

  • No infrastructure to manage

  • Requires internet connectivity

Real-world deployment: We run a private Automation Hub for our custom content and sync Red Hat certified collections into it from the public hub.

Event-Driven Ansible Controller Architecture

Event-Driven Ansible (EDA) adds reactive automation capabilities to AAP.

Architecture Components

How EDA Works

  1. Event Sources: External systems send events to EDA Controller

  2. Rulebooks: Define conditions and actions for events

  3. Event Matching: EDA evaluates events against rule conditions

  4. Action Execution: Trigger automation when conditions match

Example Rulebook:
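A real rulebook is written in YAML. As a rough stand-in, the event, condition, and action flow from the four steps above can be sketched in Python. Everything here is hypothetical: the `Rule` class, the event fields, and the job template names are illustrations, not ansible-rulebook syntax.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    condition: Callable[[Dict], bool]  # returns True when the event matches
    action: str                        # hypothetical job template to launch


# Hypothetical rules; a real rulebook expresses these in YAML.
RULES = [
    Rule(
        "restart web service",
        lambda e: e.get("service") == "nginx" and e.get("status") == "down",
        "Restart nginx",
    ),
    Rule(
        "free disk space",
        lambda e: e.get("disk_used_pct", 0) > 90,
        "Clean up disk",
    ),
]


def match_actions(event: Dict, rules: List[Rule]) -> List[str]:
    """Return the actions of every rule whose condition matches the event."""
    return [rule.action for rule in rules if rule.condition(event)]


print(match_actions({"service": "nginx", "status": "down"}, RULES))
print(match_actions({"disk_used_pct": 42}, RULES))
```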

Real-world impact: This architecture enabled us to build self-healing infrastructure that automatically responds to 15 different failure scenarios without human intervention.

Automation Mesh Architecture

Automation Mesh is AAP's solution for scaling across network boundaries, regions, and security zones.

The Problem Mesh Solves

Traditional AAP: All execution nodes must have direct network connectivity to the control plane.

Problem:

  • Firewall rules become complex

  • High latency across regions

  • Can't reach DMZ or isolated networks

  • Difficult to scale globally

Automation Mesh Solution: Multi-hop execution with intelligent routing.

Mesh Topology Patterns

Pattern 1: Hub and Spoke

Use case: This is the pattern we use: the controller sits in our primary datacenter, a hop node in each region or cloud, and execution nodes close to the managed infrastructure.

Benefits:

  • Reduced network latency (execution local to managed nodes)

  • Simplified firewall rules (only hop nodes need controller access)

  • Easy to add new regions

Pattern 2: Peered Mesh

Use case: Highly distributed environments where regions need to communicate directly.

Benefits:

  • Redundant paths

  • No single point of failure

  • Lower latency between regions

Mesh Node Types

Control Nodes

  • Run the Automation Controller application

  • Handle API, UI, scheduling

  • Do not run playbook executions

  • Typically 2-3 nodes for HA

Execution Nodes

  • Run playbook jobs

  • Local to managed infrastructure

  • Scale independently

  • Can be 100s of nodes

Hop Nodes

  • Route traffic between controller and execution nodes

  • Do not run jobs themselves

  • Bridge network segments

  • Enable multi-region deployments

Real-world configuration:
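To make the roles concrete, here is a minimal Python model of a hub-and-spoke mesh like the one described: one control node, one hop node per region, and execution nodes behind each hop. The node names are hypothetical, and the breadth-first search only illustrates multi-hop reachability; in a real deployment the platform handles routing for you.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical hub-and-spoke mesh: one control node, one hop node per region.
PEERS: Dict[str, List[str]] = {
    "control-1": ["hop-onprem", "hop-aws", "hop-azure"],
    "hop-onprem": ["exec-onprem-1"],
    "hop-aws": ["exec-aws-1", "exec-aws-2"],
    "hop-azure": ["exec-azure-1"],
}


def find_route(peers: Dict[str, List[str]], start: str, target: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest path from start to target."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbour in peers.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


print(find_route(PEERS, "control-1", "exec-aws-2"))
```

Note that only the hop nodes appear as direct peers of the control node, which is why firewall rules stay simple in this pattern.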

Integration Architecture

AAP integrates with numerous external systems. Understanding integration patterns is crucial.

Common Integration Points

Integration Patterns

1. Webhook Integrations

Outbound: AAP sends notifications

Inbound: External systems trigger AAP jobs
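For inbound webhooks, the receiving side should verify that a request really comes from the expected sender. One common scheme, assumed here for illustration, is an HMAC-SHA256 signature sent in a header as `sha256=<hexdigest>`. The secret and payload below are hypothetical, and the exact header name and format depend on the sending service.

```python
import hashlib
import hmac


def signature_is_valid(secret: bytes, payload: bytes, header_value: str) -> bool:
    """Check a 'sha256=<hexdigest>' signature header against the payload."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, header_value)


secret = b"hypothetical-webhook-key"
payload = b'{"ref": "refs/heads/main"}'
good_header = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()

print(signature_is_valid(secret, payload, good_header))
print(signature_is_valid(secret, payload, "sha256=deadbeef"))
```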

2. ServiceNow Integration

Bi-directional integration:

  • Create ServiceNow tickets from AAP jobs

  • Launch AAP jobs from ServiceNow catalog items

  • Update tickets with job results

Real-world use: Our change management workflow creates ServiceNow change requests before production deployments, waits for approval, then executes automation.

3. Monitoring Integration

Pattern: Monitoring alerts trigger EDA rulebooks, which in turn trigger AAP jobs

Capacity Planning and Performance

Sizing Guidelines from Real Deployments

Small Deployment (< 100 nodes)

Infrastructure:

  • 1 Automation Controller node

  • Integrated database on same host

  • 4 vCPU, 16 GB RAM, 40 GB disk

Capacity:

  • ~20 concurrent jobs

  • ~50 jobs/hour sustained

  • Single point of failure (not HA)

Use case: POC, dev/test environments

Medium Deployment (100-500 nodes)

Infrastructure:

  • 2 Automation Controller nodes (HA)

  • Separate PostgreSQL server (or managed DB)

  • Redis for HA

  • 4-8 vCPU, 32 GB RAM per controller

  • 4-6 dedicated execution nodes

Capacity:

  • ~50 concurrent jobs

  • ~200 jobs/hour sustained

  • High availability

Use case: Production for mid-sized environments

Large Deployment (500+ nodes)

Infrastructure:

  • 3+ Control nodes

  • PostgreSQL HA cluster or managed DB

  • Redis HA cluster

  • Automation Mesh with regional hop nodes

  • 20+ execution nodes distributed across regions

  • 8+ vCPU, 64 GB RAM per control node

Capacity:

  • 100+ concurrent jobs

  • 500+ jobs/hour sustained

  • Multi-region, fully redundant

Use case: Enterprise production, multiple teams, global infrastructure

Our deployment (managing 1000+ nodes):

  • 3 control nodes (8 vCPU, 64 GB each)

  • AWS RDS PostgreSQL (db.m5.2xlarge)

  • ElastiCache Redis (cache.m5.large)

  • 3 hop nodes (one per region)

  • 30 execution nodes (distributed)

  • Handles 600+ jobs/hour at peak
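When sizing execution nodes, it helps to estimate how many forks a node can sustain. The sketch below uses the fork-based capacity rules as I understand the controller's defaults: about 4 forks per CPU core, about 100 MB of memory per fork after reserving 2 GB for services, and a job impact of forks plus one. Treat these constants as assumptions and check them against the documentation for your version. The 4 vCPU / 16 GB node size is hypothetical.

```python
from typing import Tuple

MEM_RESERVED_MB = 2048  # assumed: memory held back for platform services
MEM_PER_FORK_MB = 100   # assumed: memory budgeted per fork
FORKS_PER_CPU = 4       # assumed: forks budgeted per CPU core


def node_capacity(cpus: int, mem_mb: int) -> Tuple[int, int]:
    """Return (cpu_capacity, mem_capacity) in forks for one execution node."""
    cpu_capacity = cpus * FORKS_PER_CPU
    mem_capacity = max(0, (mem_mb - MEM_RESERVED_MB) // MEM_PER_FORK_MB)
    return cpu_capacity, mem_capacity


def max_concurrent_jobs(capacity: int, forks_per_job: int) -> int:
    """Each job is assumed to consume its fork count plus one unit."""
    return capacity // (forks_per_job + 1)


cpu_cap, mem_cap = node_capacity(cpus=4, mem_mb=16384)  # hypothetical node size
print(cpu_cap, mem_cap)
print(max_concurrent_jobs(cpu_cap, forks_per_job=5))
```

Under these assumptions the CPU-based figure is the tighter limit for this node, so adding memory alone would not raise its job throughput.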

Performance Tuning Insights

Database tuning made the biggest impact:
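The exact values depend on your workload, so the following is only a generic starting point, not our production configuration. It applies two widely used PostgreSQL rules of thumb for a dedicated database server: roughly 25% of RAM for shared_buffers and roughly 75% for effective_cache_size. The 32 GB input is a hypothetical example.

```python
def starting_postgres_settings(ram_gb: int) -> dict:
    """Rule-of-thumb starting values for a dedicated PostgreSQL server."""
    return {
        "shared_buffers": f"{ram_gb // 4}GB",            # ~25% of RAM
        "effective_cache_size": f"{ram_gb * 3 // 4}GB",  # ~75% of RAM
    }


for key, value in starting_postgres_settings(32).items():
    print(f"{key} = {value}")
```

Measure query latency before and after each change; these numbers are where tuning begins, not where it ends.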

Job history cleanup is essential:
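The logic behind a 90-day retention policy is simple: anything that finished before the cutoff is eligible for deletion. A minimal Python sketch, with hypothetical job records and a fixed "now" so the result is reproducible:

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # the retention window used in this article


def stale_job_ids(jobs, now, retention_days=RETENTION_DAYS):
    """Return IDs of jobs that finished before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return [job["id"] for job in jobs if job["finished"] < cutoff]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
jobs = [
    {"id": 1, "finished": now - timedelta(days=120)},
    {"id": 2, "finished": now - timedelta(days=30)},
    {"id": 3, "finished": now - timedelta(days=91)},
]
print(stale_job_ids(jobs, now))
```

In practice you would not delete rows yourself: the controller ships a scheduled cleanup management job for job details, and setting its retention period achieves the same effect safely.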

Result: Database size dropped 75%, query performance improved 3x.

Key Takeaways

  ✅ Control plane vs execution plane - understand the separation for proper scaling

  ✅ Database is critical - size appropriately, tune well, plan for HA

  ✅ Automation Mesh enables multi-region and network-isolated deployments

  ✅ Integration architecture connects AAP with enterprise ecosystem

  ✅ Capacity planning based on managed nodes and job volume

  ✅ Performance tuning focuses on database, job cleanup, and execution capacity

What's Next

Now that you understand AAP architecture, the next article walks through actually setting up your AAP environment - installation, initial configuration, authentication, and getting ready for production use.


Next Article: Setting Up Your AAP Environment →

Part of the Ansible Automation Platform 101 Series
