Introduction to Event-Driven Ansible

The 3 AM Alert That Changed Everything

At 3:17 AM, the alert fired: "Application server memory usage critical." I was on-call. Half asleep, I SSHed into the server, restarted the application service, confirmed recovery, updated the incident ticket, and went back to bed.

This happened 2-3 times per week. Same alert. Same fix. Same manual intervention at ungodly hours.

Then I discovered Event-Driven Ansible. I built a rulebook that listens for Prometheus alerts, automatically restarts the service when memory is high, verifies recovery, and creates a ServiceNow incident - all without human intervention.

Mean Time to Recovery: 45 minutes → 5 minutes. Pages at 3 AM: 12 per month → 0. Sleep quality: Significantly improved.

This article teaches you how to build self-healing infrastructure with Event-Driven Ansible.

What You'll Learn

Event-Driven Ansible architecture and concepts
Event sources (webhooks, Kafka, Prometheus, etc.)
Rulebook anatomy and syntax
Conditions and actions
Integration with Automation Controller
Self-healing infrastructure patterns
Real-world use cases

What is Event-Driven Ansible?

Traditional Ansible: Scheduled or manually triggered automation

# Run every hour via schedule
Schedule: Every 1 hour
Action: Check server health, fix if needed
Problem: Waits up to 1 hour to respond

Event-Driven Ansible: Push-based, reactive automation

# Respond immediately to events
Event: Server health alert received
Action: Immediately fix the issue
Benefit: Sub-minute response time

EDA Architecture

Components

Event-Driven Ansible Controller:
  - Manages rulebooks
  - Receives events from sources
  - Evaluates conditions
  - Triggers actions
  
Event Sources:
  - Webhooks (generic HTTP)
  - Prometheus Alertmanager (alerts)
  - Kafka (message streams)
  - Azure Event Grid
  - AWS CloudWatch
  - ServiceNow
  - Custom plugins
  
Actions:
  - Run Ansible playbook
  - Launch AAP job template
  - Call a webhook/API (through a module or playbook)
  - Log event
  - Debug print

Event Flow

1. Event Source → Generates event
   Example: Prometheus fires alert

2. EDA Controller → Receives event
   Via webhook, Kafka consumer, etc.

3. Rulebook Engine → Evaluates rules
   Checks if event matches conditions

4. Action Executor → Triggers response
   Launches job template in AAP

5. Result → Feedback/logging
   Updates tracking system
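
The five steps above can be sketched in a few lines of Python. This is only an illustration of the event-condition-action loop, not how the EDA rule engine is implemented; the `Rule` class, the `process` function, and the event fields are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable

Event = dict[str, Any]


@dataclass
class Rule:
    """One rule: a name, a condition over the event, and an action to run."""
    name: str
    condition: Callable[[Event], bool]
    action: Callable[[Event], str]


def process(event: Event, rules: list[Rule]) -> list[str]:
    """Steps 3-5: evaluate each rule, run matching actions, collect results."""
    results = []
    for rule in rules:
        if rule.condition(event):
            results.append(rule.action(event))
    return results


rules = [
    Rule(
        name="Restart service on high memory",
        condition=lambda e: e.get("severity") == "critical",
        action=lambda e: f"launch job template for {e['host']}",
    )
]

# Steps 1-2: two events arrive from a source; only the critical one matches.
log = process({"severity": "critical", "host": "app01"}, rules)
log += process({"severity": "warning", "host": "app02"}, rules)
print(log)
```

The warning event is received and evaluated but triggers nothing, which is the behavior you want from a rule that should only act on critical alerts.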

Rulebook Anatomy

Basic Rulebook Structure

---
- name: Self-healing infrastructure
  hosts: all
  
  sources:
    - name: prometheus_alerts
      ansible.eda.alertmanager:
        host: 0.0.0.0
        port: 8000
  
  rules:
    - name: Restart service on high memory
      condition: event.alert.labels.severity == "critical"
      action:
        run_playbook:
          name: remediate_memory.yml

Components Explained

Sources: Where events come from

sources:
  - name: webhook_source
    ansible.eda.webhook:
      host: 0.0.0.0
      port: 5000
      
  - name: kafka_source
    ansible.eda.kafka:
      host: kafka.example.com
      port: 9092
      topic: server_events

Rules: Conditions and actions

rules:
  - name: Rule name
    condition: |
      event.type == "alert" and
      event.severity in ["critical", "high"]
    action:
      # What to do when condition matches

Conditions: Boolean expressions

# Simple equality
condition: event.alert_name == "HighMemoryUsage"

# Multiple conditions
condition: >
  event.severity == "critical" and
  event.environment == "production" and
  event.service_name == "web-app"

# Pattern matching
condition: event.message is match("ERROR.*database.*")

# List membership
condition: event.status in ["down", "degraded"]
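
To get a feel for how these four condition styles behave, here is a rough Python equivalent evaluated against a sample event. The event contents are made up for illustration, and `is_match` is a hypothetical helper that approximates `is match` by anchoring the pattern at the start of the string.

```python
import re

# Hypothetical event, shaped to exercise each condition style above.
event = {
    "alert_name": "HighMemoryUsage",
    "severity": "critical",
    "environment": "production",
    "service_name": "web-app",
    "message": "ERROR: lost connection to database primary",
    "status": "degraded",
}


def is_match(value: str, pattern: str) -> bool:
    """Approximate `is match`: the pattern must match from the start."""
    return re.match(pattern, value) is not None


checks = {
    "simple equality": event["alert_name"] == "HighMemoryUsage",
    "multiple conditions": (
        event["severity"] == "critical"
        and event["environment"] == "production"
        and event["service_name"] == "web-app"
    ),
    "pattern matching": is_match(event["message"], r"ERROR.*database.*"),
    "list membership": event["status"] in ["down", "degraded"],
}

for name, result in checks.items():
    print(f"{name}: {result}")
```

Note that a message such as "WARN: ERROR in database" would not satisfy the pattern check here, because the match is anchored at the beginning of the string.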

Actions: What to execute

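actions:
  # Run local playbook
  - run_playbook:
      name: restart_service.yml

  # Launch AAP job template
  - run_job_template:
      name: "Emergency Restart"
      organization: "Operations"

  # Send webhook (post_event only posts events back into a ruleset,
  # so an outbound HTTP call goes through the uri module)
  - run_module:
      name: ansible.builtin.uri
      module_args:
        url: https://hooks.slack.com/XXX
        method: POST
        body_format: json
        body:
          text: "Event received"

  # Log for debugging
  - debug:
      msg: "Event received: {{ event }}"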

Real-World Use Cases

Use Case 1: Auto-Remediate High Memory

Problem: Application servers run out of memory, require manual restart

Solution: EDA rulebook with Prometheus integration

Rulebook:

---
- name: Auto-remediate memory issues
  hosts: all
  
  sources:
    - name: prometheus
      ansible.eda.alertmanager:
        host: 0.0.0.0
        port: 8000
  
  rules:
    - name: Restart app on high memory
      condition: >
        event.alert.labels.alertname == "HighMemoryUsage" and
        event.alert.labels.severity == "critical" and
        event.alert.status == "firing"
      
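      action:
        run_job_template:
          name: "Restart Application Service"
          organization: "Operations"
          job_args:
            # job_args are passed to the job template launch API. The template
            # must prompt for inventory on launch, and the API identifies
            # inventories by ID, so map this label to an ID if it holds a name.
            inventory: "{{ event.alert.labels.environment }}"
            extra_vars:
              target_host: "{{ event.alert.labels.instance }}"
              service_name: "{{ event.alert.labels.service }}"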

Prometheus Alert:

groups:
  - name: application_alerts
    rules:
      - alert: HighMemoryUsage
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) < 10
        for: 5m
        labels:
          severity: critical
          service: web-app
          environment: production
        annotations:
          description: "Memory usage above 90% on {{ $labels.instance }}"
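
The alert expression measures available memory, while the description talks about usage. A short Python check makes the relationship explicit: less than 10% available is the same as more than 90% used. The function names and the 16 GiB host are assumptions for the example.

```python
GIB = 1024 ** 3


def memory_available_pct(mem_available_bytes: int, mem_total_bytes: int) -> float:
    """Mirror of the PromQL expression: MemAvailable / MemTotal * 100."""
    return mem_available_bytes / mem_total_bytes * 100


def alert_condition(mem_available_bytes: int, mem_total_bytes: int,
                    threshold_pct: float = 10.0) -> bool:
    """True when available memory drops below the threshold."""
    return memory_available_pct(mem_available_bytes, mem_total_bytes) < threshold_pct


total = 16 * GIB

# 1 GiB free out of 16 GiB: 6.25% available, 93.75% used.
print(memory_available_pct(1 * GIB, total))
print(alert_condition(1 * GIB, total))

# 4 GiB free out of 16 GiB: 25% available, 75% used.
print(alert_condition(4 * GIB, total))
```

The `for: 5m` clause means Prometheus only fires the alert after the condition has held for five minutes, so any remediation time is measured from that point.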

Result: Automatic service restart within 30 seconds of alert

Use Case 2: Security Incident Response

Problem: Repeated failed SSH login attempts may indicate a brute-force attack

Solution: Automated blocking and notification

Rulebook:

---
- name: Security incident response
  hosts: all
  
  sources:
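    # Assumes a syslog source plugin that emits parsed events carrying
    # message, failed_attempts, and source_ip fields
    - name: syslog
      ansible.eda.syslog:
        host: 0.0.0.0
        port: 514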
  
  rules:
    - name: Block IP after failed logins
      condition: >
        event.message is match(".*Failed password.*") and
        event.failed_attempts > 5
      
      action:
        run_job_template:
          name: "Block Malicious IP"
          organization: "Operations"  # assumed organization name
          job_args:
            extra_vars:
              source_ip: "{{ event.source_ip }}"
              reason: "Failed login attempts"
            
    - name: Notify security team
      condition: event.security_event == true
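      action:
        # post_event only posts events back into a ruleset, so the outbound
        # HTTP call goes through the uri module
        run_module:
          name: ansible.builtin.uri
          module_args:
            url: https://security-siem.example.com/api/incidents
            method: POST
            headers:
              Authorization: "Bearer {{ token }}"
            body_format: json
            body:
              incident_type: "security"
              severity: "high"
              details: "{{ event }}"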
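
The rulebook relies on a `failed_attempts` count and a `source_ip` field being present on each event. One way such values could be derived upstream is sketched below in Python. It assumes the usual OpenSSH "Failed password ... from IP port N" log wording; the helper names and sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   Failed password for invalid user admin from 203.0.113.5 port 50000 ssh2
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+"
)


def count_failed_logins(lines: list[str]) -> Counter:
    """Count failed password attempts per source IP."""
    counts: Counter = Counter()
    for line in lines:
        found = FAILED_LOGIN.search(line)
        if found:
            counts[found.group(2)] += 1
    return counts


def ips_to_block(lines: list[str], threshold: int = 5) -> list[str]:
    """Return IPs with more than `threshold` failures, like the rule's `> 5`."""
    counts = count_failed_logins(lines)
    return sorted(ip for ip, n in counts.items() if n > threshold)


sample = [
    f"sshd[1234]: Failed password for invalid user admin from 203.0.113.5 port {50000 + i} ssh2"
    for i in range(6)
] + [
    "sshd[1234]: Failed password for root from 198.51.100.7 port 40000 ssh2",
    "sshd[1234]: Accepted publickey for deploy from 192.0.2.10 port 40001 ssh2",
]

print(ips_to_block(sample))
```

A production version would also bound the count to a time window so that a handful of typos spread over a week does not get a legitimate user blocked.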

Use Case 3: Cloud Cost Optimization

Problem: Dev environments left running overnight waste money

Solution: Auto-shutdown based on time

Rulebook:

---
- name: Cloud cost optimization
  hosts: all
  
  sources:
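    - name: schedule
      # tick events carry only a counter (event.i); the hour and day_of_week
      # fields used in the rule below assume a source or filter that adds them
      ansible.eda.tick:
        delay: 3600  # Every hour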
  
  rules:
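    - name: Shutdown dev instances after hours
      # Weekdays only (day_of_week 0-4). The parentheses apply the weekday
      # check to both the evening and the early-morning range.
      condition: >
        (event.hour >= 19 or event.hour <= 7) and
        event.day_of_week in [0, 1, 2, 3, 4]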
      
      action:
        run_job_template:
          name: "Stop Development Instances"
          organization: "Engineering"
          job_args:
            extra_vars:
              environment: "development"
              action: "stop"
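
An after-hours check that mixes `or` and `and` is sensitive to grouping. The Python sketch below uses the same operators to show the difference, assuming `and` binds tighter than `or` (as it does in Python) and that `day_of_week` counts from 0 for Monday. Both function names are hypothetical.

```python
WEEKDAYS = [0, 1, 2, 3, 4]  # assumed: 0 = Monday ... 4 = Friday


def after_hours_ungrouped(hour: int, day_of_week: int) -> bool:
    # Parsed as: hour >= 19 or (hour <= 7 and day_of_week in WEEKDAYS)
    return hour >= 19 or hour <= 7 and day_of_week in WEEKDAYS


def after_hours_grouped(hour: int, day_of_week: int) -> bool:
    # The weekday check applies to both time ranges.
    return (hour >= 19 or hour <= 7) and day_of_week in WEEKDAYS


# Saturday (5) at 22:00: the ungrouped form still matches.
print(after_hours_ungrouped(22, 5))
print(after_hours_grouped(22, 5))

# Tuesday (1) at 06:00: both forms match.
print(after_hours_ungrouped(6, 1), after_hours_grouped(6, 1))
```

With the ungrouped form, the job template would also launch every weekend evening, which may or may not be what you intend for a rule labelled "Weekdays".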

Use Case 4: Self-Healing Kubernetes

Problem: Pods crash and need restart

Solution: Watch Kubernetes events, auto-recover

Rulebook:

---
- name: Kubernetes self-healing
  hosts: all
  
  sources:
    - name: k8s_events
      kubernetes.eda.events:
        api_server: https://k8s-api.example.com
        namespace: production
        event_types:
          - Pod
  
  rules:
    - name: Restart crashed pods
      condition: >
        event.type == "Warning" and
        event.reason == "BackOff" and
        event.metadata.namespace == "production"
      
      action:
        run_playbook:
          name: kubernetes_restart_pod.yml
          extra_vars:
            pod_name: "{{ event.metadata.name }}"
            namespace: "{{ event.metadata.namespace }}"

Integration with Automation Controller

Launching AAP Job Templates

rules:
  - name: Launch deployment
    condition: event.deployment_requested == true
    
    action:
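      run_job_template:
        name: "Deploy Application"
        organization: "Engineering"
        job_args:
          # job_args are passed to the job template launch API. The template
          # must prompt for inventory on launch, and the API identifies
          # inventories by ID rather than by name.
          inventory: "Production Servers"
          extra_vars:
            app_version: "{{ event.version }}"
            deploy_region: "{{ event.region }}"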
        wait: false  # Don't wait for completion

Passing Event Data

# Event from webhook
{
  "service": "web-app",
  "environment": "production",
  "action": "scale_up",
  "instances": 10
}

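# Rulebook passes to AAP
# (ansible.eda.webhook places the request body under event.payload)
action:
  run_job_template:
    name: "Scale Service"
    organization: "Operations"  # assumed organization name
    job_args:
      extra_vars:
        service_name: "{{ event.payload.service }}"
        environment: "{{ event.payload.environment }}"
        instance_count: "{{ event.payload.instances }}"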
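
The mapping from webhook body to extra vars is a plain field-by-field copy with renamed keys. A small Python sketch of that step, using the same sample body; `build_extra_vars` is a hypothetical helper for illustration and not part of Ansible.

```python
import json

# The same JSON body shown in the webhook example above.
raw_body = """
{
  "service": "web-app",
  "environment": "production",
  "action": "scale_up",
  "instances": 10
}
"""


def build_extra_vars(payload: dict) -> dict:
    """Pick the payload fields the job template needs and rename them."""
    return {
        "service_name": payload["service"],
        "environment": payload["environment"],
        "instance_count": payload["instances"],
    }


extra_vars = build_extra_vars(json.loads(raw_body))
print(extra_vars)
```

Only the fields named in the mapping reach the job template. The `action` field in the payload is dropped unless you add it explicitly, which keeps the template's inputs predictable.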

Event Sources

Webhook (Generic)

sources:
  - name: generic_webhook
    ansible.eda.webhook:
      host: 0.0.0.0
      port: 5000
      
# Send a test event to the webhook:
curl -X POST http://eda-controller:5000/endpoint \
  -H "Content-Type: application/json" \
  -d '{"event": "deployment", "status": "success"}'
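
If you would rather send the test event from a script than from curl, the same request can be built with Python's standard library. This is a sketch under the same assumptions as the curl command (host `eda-controller`, port 5000, path `/endpoint`); `build_event_request` is a hypothetical helper, and the actual send is left commented out so the snippet runs without a listener.

```python
import json
import urllib.request


def build_event_request(base_url: str, endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request equivalent to the curl command above."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_event_request(
    "http://eda-controller:5000",
    "endpoint",
    {"event": "deployment", "status": "success"},
)

print(request.get_method(), request.full_url)

# Uncomment to send the event to a running webhook source:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(response.status)
```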

Kafka

sources:
  - name: kafka_stream
    ansible.eda.kafka:
      host: kafka.example.com
      port: 9092
      topic: application_events
      group_id: eda_consumers

Azure Event Grid

sources:
  - name: azure_events
    azure.eda.event_grid:
      subscription_id: "{{ azure_subscription }}"
      resource_group: production-rg
      topic_name: automation-events

Best Practices

1. Event Filtering

# Filter early to reduce processing
rules:
  - name: Only critical production events
    condition: >
      event.severity == "critical" and
      event.environment == "production" and
      not event.test_event

2. Rate Limiting

# Prevent action flooding
rules:
  - name: Restart with cooldown
    condition: event.alert_name == "ServiceDown"
    throttle:
      once_within: 5 minutes  # Only once per 5 minutes
      group_by_attributes:
        - event.alert_name
    action:
      run_job_template:
        name: "Restart Service"
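
The cooldown idea behind this rule can be sketched in Python: remember when each kind of alert last triggered an action and suppress repeats inside the window. This illustrates the concept only and is not how the rule engine implements throttling; the `OnceWithin` class and its method names are hypothetical.

```python
class OnceWithin:
    """Allow one action per key within a sliding cooldown window."""

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self._last_fired: dict[str, float] = {}  # key -> time of last allowed action

    def allow(self, key: str, now: float) -> bool:
        last = self._last_fired.get(key)
        if last is not None and now - last < self.window:
            return False  # still cooling down, drop this trigger
        self._last_fired[key] = now
        return True


throttle = OnceWithin(window_seconds=300)

# Five "ServiceDown" alerts arriving at these offsets (in seconds).
decisions = [throttle.allow("ServiceDown", t) for t in (0, 60, 299, 300, 310)]
print(decisions)
```

Keying the cooldown by alert name (or by host) matters: a single global timer would let one noisy service suppress restarts for every other service.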

3. Error Handling

rules:
  - name: Handle failures gracefully
    condition: event.type == "error"
    actions:
      - run_playbook:
          name: remediate.yml
      - run_module:  # Notify after the remediation attempt
          name: ansible.builtin.uri
          module_args:
            url: https://slack.webhook
            method: POST
            body_format: json
            body:
              message: "Remediation attempted for {{ event }}"
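
The pattern here is "attempt the fix, then notify regardless of the outcome". A minimal Python sketch of that control flow follows; `remediate_and_notify` and the callables passed to it are hypothetical stand-ins for the playbook run and the notification call.

```python
from typing import Callable


def remediate_and_notify(
    event: dict,
    remediate: Callable[[dict], None],
    notify: Callable[[str], None],
) -> str:
    """Run the remediation, then send a notification whether or not it worked."""
    try:
        remediate(event)
        outcome = "succeeded"
    except Exception as exc:
        outcome = f"failed ({exc})"
    notify(f"Remediation {outcome} for event type {event['type']}")
    return outcome


def broken_remediation(event: dict) -> None:
    raise RuntimeError("service not found")


sent: list[str] = []
remediate_and_notify({"type": "error"}, lambda e: None, sent.append)
remediate_and_notify({"type": "error"}, broken_remediation, sent.append)
print(sent)
```

Including the outcome in the message is the useful part: a notification that only says "attempted" leaves the on-call engineer guessing whether they still need to act.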

4. Logging and Debugging

rules:
  - name: Log all events
    condition: true  # Catch everything
    action:
      debug:
        msg: |
          Event received:
          Type: {{ event.type }}
          Source: {{ event.source }}
          Data: {{ event | to_json }}

Key Takeaways

✅ Event-Driven Ansible enables reactive automation
✅ Rulebooks define event-condition-action logic
✅ Multiple event sources (Prometheus, Kafka, webhooks)
✅ Integration with AAP for centralized job execution
✅ Self-healing infrastructure patterns reduce MTTR
✅ Best practices include filtering, rate limiting, error handling

What's Next

The next article dives deeper into building advanced event-driven automation with complex rulebooks, multi-source correlation, stateful logic, and production deployment patterns.

Next Article: Building Event-Driven Automation with Rulebooks →

Part of the Ansible Automation Platform 101 Series
