Building Event-Driven Automation with Rulebooks

The Multi-System Cascade Failure

It started innocuously: A database backup job ran longer than expected, causing increased I/O wait. This slowed down the API servers. API slowness triggered more retries from the web tier. Web tier overload caused health check failures. Health check failures triggered auto-scaling. Auto-scaling spawned 50 new instances. 50 new instances overwhelmed the database even more.

Total duration: 6 minutes from backup job to complete service outage.

The post-mortem revealed we needed correlation across multiple event sources: monitoring alerts, application logs, cloud events, and database metrics - all evaluated together to prevent cascade failures.

I built an Event-Driven Ansible rulebook that monitors five different event sources simultaneously, correlates patterns, and implements circuit-breaker logic. Now when database I/O spikes, it automatically pauses non-critical background jobs, throttles API request rates, and prevents cascade failures.

This article teaches you advanced Event-Driven Ansible patterns for complex, multi-source automation scenarios.

What You'll Learn

Advanced rulebook patterns (stateful, multi-source)
Event correlation across sources
Stateful logic with conditions
Complex conditional expressions
Error handling and recovery
Testing and debugging rulebooks
Production deployment strategies

Advanced Rulebook Patterns

Pattern 1: Multi-Source Correlation

Monitor multiple event sources, trigger only when patterns align.

---
- name: Cascade failure prevention
  hosts: all

  sources:
    # Source 1: Prometheus alerts, delivered by Alertmanager
    - name: prometheus
      ansible.eda.alertmanager:
        host: 0.0.0.0
        port: 8000

    # Source 2: Application logs from the systemd journal
    # (assumes the app logs with the syslog identifier "web-app")
    - name: app_logs
      ansible.eda.journald:
        match: "SYSLOG_IDENTIFIER=web-app"

    # Source 3: AWS events, assumed to be forwarded to an SQS queue
    # (the queue name is illustrative)
    - name: aws_events
      ansible.eda.aws_sqs_queue:
        region: us-east-1
        name: api-handler-events

  rules:
    - name: Detect cascade failure pattern
      # All three events must arrive within the timeout window
      condition:
        all:
          - event.alert.labels.alertname == "HighDatabaseIO"
          - event.journald.message is search("slow query")
          - event.body.detail.errorCount > 100
        timeout: 5 minutes

      action:
        run_job_template:
          name: "Emergency Circuit Breaker"
          organization: Default
          job_args:
            extra_vars:
              action: "enable_circuit_breaker"
              duration: "300"  # 5 minutes
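
The correlation idea itself does not depend on rulebook syntax. The following is a minimal Python sketch, assuming every event is tagged with the name of its source and a timestamp in seconds. The `Correlator` class and its `observe` method are hypothetical names used only to illustrate "fire when every required source has reported within one window"; they are not part of Event-Driven Ansible.

```python
from dataclasses import dataclass, field


@dataclass
class Correlator:
    """Fires once every required source has reported within `window` seconds."""

    required: frozenset
    window: float
    last_seen: dict = field(default_factory=dict)  # source name -> latest timestamp

    def observe(self, source: str, timestamp: float) -> bool:
        """Record an event; return True when all required sources are fresh."""
        if source in self.required:
            self.last_seen[source] = timestamp
        # Every required source needs an event no older than the window.
        fresh = all(
            src in self.last_seen and timestamp - self.last_seen[src] <= self.window
            for src in self.required
        )
        if fresh:
            self.last_seen.clear()  # start over so one pattern fires only once
        return fresh


correlator = Correlator(
    required=frozenset({"prometheus", "app_logs", "aws_events"}), window=300
)
results = [
    correlator.observe("prometheus", 0),
    correlator.observe("app_logs", 40),
    correlator.observe("aws_events", 90),
]
print(results)
```

Only the third call returns `True`, because that is the first moment all three sources have reported inside the same five-minute window.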

Pattern 2: Stateful Event Tracking

Track state across multiple events before taking action.

---
- name: Stateful deployment monitoring
  hosts: all

  sources:
    - name: gitlab_webhook
      ansible.eda.webhook:
        host: 0.0.0.0
        port: 5000

  rules:
    - name: Track deployment lifecycle
      condition: event.payload.object_kind == "deployment"

      # Store state as a fact in the rule engine's working memory
      action:
        set_fact:
          fact:
            deployment_id: "{{ event.payload.deployment_id }}"
            deployment_status: "{{ event.payload.status }}"

    - name: Rollback if deployment takes too long
      # Fires when "running" and "success" events do not both arrive within
      # 10 minutes. Conditions cannot read the clock, so the timeout replaces
      # timestamp arithmetic. Simplification: the two events are not matched
      # on deployment_id here.
      condition:
        not_all:
          - event.payload.status == "running"
          - event.payload.status == "success"
        timeout: 10 minutes

      action:
        run_job_template:
          name: "Rollback Deployment"
          organization: Default
          # The matched event reaches the job in the ansible_eda extra variable

Pattern 3: Event Aggregation and Windowing

Aggregate events over time windows before acting.

---
- name: Alert aggregation
  hosts: all

  sources:
    - name: monitoring_alerts
      ansible.eda.webhook:
        host: 0.0.0.0
        port: 8000

  rules:
    - name: Aggregate alerts over 5 minutes
      condition: event.payload.type == "alert"

      # Act once 10 alerts with the same name arrive within 5 minutes.
      # Assumption: your ansible-rulebook release supports accumulate_within;
      # check the throttle documentation for the version you run.
      throttle:
        accumulate_within: 5 minutes
        threshold: 10
        group_by_attributes:
          - event.payload.alert_name

      action:
        run_playbook:
          name: notify_ops_team.yml
          extra_vars:
            alert_name: "{{ event.payload.alert_name }}"
            threshold: 10
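
Windowed aggregation comes down to counting events inside a sliding time window. The Python sketch below shows that logic in isolation, assuming timestamps in seconds; `WindowCounter` is a hypothetical helper written for this illustration and is not how the rule engine implements throttling.

```python
from collections import deque


class WindowCounter:
    """Counts events inside a sliding time window and reports threshold hits."""

    def __init__(self, window_seconds: float, threshold: int) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.timestamps = deque()

    def add(self, timestamp: float) -> bool:
        """Record one event; return True when the window reaches the threshold."""
        self.timestamps.append(timestamp)
        # Drop events that have aged out of the window.
        while self.timestamps and timestamp - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.threshold:
            self.timestamps.clear()  # reset so the next burst is counted afresh
            return True
        return False


counter = WindowCounter(window_seconds=300, threshold=10)
# Ten alerts arriving 20 seconds apart: the tenth one trips the threshold.
fired = [counter.add(t * 20) for t in range(10)]
print(fired.index(True))
```

Alerts that arrive one per minute never trip the same counter, because no five-minute window ever holds ten of them.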

Pattern 4: Circuit Breaker Logic

Prevent cascading actions when system is unstable.

---
- name: Circuit breaker pattern
  hosts: all

  sources:
    # Assumption: a log shipper forwards application log records as JSON
    - name: app_errors
      ansible.eda.webhook:
        host: 0.0.0.0
        port: 5000

  rules:
    # This sketch approximates the breaker with throttling rather than a
    # mutable error counter: remediation runs at most once per window, so a
    # flood of requests cannot trigger a flood of jobs.
    - name: Remediate at most once per window
      condition: event.payload.remediation_requested == true

      throttle:
        once_within: 5 minutes
        group_by_attributes:
          - event.payload.remediation_requested

      action:
        run_job_template:
          name: "Auto Remediation"
          organization: Default

    # 50 errors within 5 minutes count as a burst
    - name: Notify on error burst
      condition: event.payload.level == "ERROR"

      throttle:
        accumulate_within: 5 minutes
        threshold: 50
        group_by_attributes:
          - event.payload.level

      action:
        debug:
          msg: "Error burst detected - auto-remediation is rate-limited"
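
The open and closed states behind a circuit breaker can be sketched in plain Python. This is a minimal illustration, assuming the breaker opens after more than 50 errors and closes again after a cool-down. The cool-down period and the `CircuitBreaker` class are assumptions added for the example, not something the rulebook above defines.

```python
class CircuitBreaker:
    """Opens after too many errors and closes again after a cool-down."""

    def __init__(self, max_errors: int, cooldown_seconds: float) -> None:
        self.max_errors = max_errors
        self.cooldown_seconds = cooldown_seconds
        self.error_count = 0
        self.opened_at = None  # timestamp when the circuit opened, or None

    def record_error(self, now: float) -> None:
        """Count one error and open the circuit once the limit is exceeded."""
        self.error_count += 1
        if self.error_count > self.max_errors and self.opened_at is None:
            self.opened_at = now

    def allows_remediation(self, now: float) -> bool:
        """Return True while closed, or once the cool-down has elapsed."""
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_seconds:
            # Cool-down elapsed: close the circuit and start counting again.
            self.opened_at = None
            self.error_count = 0
            return True
        return False


breaker = CircuitBreaker(max_errors=50, cooldown_seconds=300)
for second in range(51):
    breaker.record_error(now=second)
print(breaker.allows_remediation(now=60))
print(breaker.allows_remediation(now=400))
```

The first check happens while the circuit is open, so remediation is refused; the second happens after the cool-down, so the circuit closes and remediation is allowed again.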

Real-World Advanced Scenarios

Scenario 1: Progressive Remediation

Escalate remediation based on event severity and repetition.

---
- name: Progressive remediation strategy
  hosts: all

  sources:
    - name: monitoring
      ansible.eda.alertmanager:
        host: 0.0.0.0
        port: 8000

  rules:
    # Level 1: Warning - Just log
    - name: Level 1 - Warning
      condition: >
        event.alert.labels.severity == "warning" and
        event.alert.labels.alertname == "HighCPU"

      action:
        debug:
          msg: "Warning: High CPU on {{ event.alert.labels.instance }}"

    # Level 2: 3 warnings for one instance within 10 minutes - Investigate
    # (accumulate_within is assumed to be available in your release)
    - name: Level 2 - Repeated warnings
      condition: >
        event.alert.labels.severity == "warning" and
        event.alert.labels.alertname == "HighCPU"

      throttle:
        accumulate_within: 10 minutes
        threshold: 3
        group_by_attributes:
          - event.alert.labels.instance

      action:
        run_job_template:
          name: "Investigate High CPU"
          organization: Default
          job_args:
            extra_vars:
              instance: "{{ event.alert.labels.instance }}"

    # Level 3: Critical - Auto-remediate
    - name: Level 3 - Critical
      condition: >
        event.alert.labels.severity == "critical" and
        event.alert.labels.alertname == "HighCPU"

      action:
        run_job_template:
          name: "Emergency CPU Remediation"
          organization: Default
          job_args:
            extra_vars:
              instance: "{{ event.alert.labels.instance }}"
              action: "restart_service"

    # Level 4: 2 critical alerts for one instance within 30 minutes - Escalate
    - name: Level 4 - Repeated critical
      condition: >
        event.alert.labels.severity == "critical" and
        event.alert.labels.alertname == "HighCPU"

      throttle:
        accumulate_within: 30 minutes
        threshold: 2
        group_by_attributes:
          - event.alert.labels.instance

      actions:
        - run_job_template:
            name: "Scale Out Infrastructure"
            organization: Default
        # Rulebooks have no HTTP action, so the page goes out via the uri module
        - run_module:
            name: ansible.builtin.uri
            module_args:
              url: https://pagerduty.webhook
              method: POST
              body_format: json
              body:
                incident_key: "cpu-crisis-{{ event.alert.labels.instance }}"
                description: "Repeated critical CPU alerts"

Scenario 2: Intelligent Auto-Scaling

Scale based on multiple metrics and business hours.

---
- name: Intelligent auto-scaling
  hosts: all

  sources:
    # Assumption: CloudWatch metric events are forwarded to this SQS queue
    # (the queue name is illustrative)
    - name: cloudwatch
      ansible.eda.aws_sqs_queue:
        region: us-east-1
        name: elb-metric-events

  rules:
    # Conditions cannot read the wall clock, so the business-hours check
    # (Mon-Fri, 09:00-17:59) is assumed to run inside the job template.
    # business_hours_only and after_hours_only are hypothetical extra vars.
    - name: Scale up during business hours
      condition: >
        event.body.MetricName == "TargetResponseTime" and
        event.body.value > 2.0

      action:
        run_job_template:
          name: "Scale Web Tier"
          organization: Default
          job_args:
            extra_vars:
              direction: "up"
              increment: 3
              business_hours_only: true
              reason: "High response time during business hours"

    - name: Scale down after hours (conservative)
      condition: >
        event.body.MetricName == "TargetResponseTime" and
        event.body.value < 0.5

      throttle:
        once_within: 30 minutes  # Only once per 30 minutes
        group_by_attributes:
          - event.body.MetricName

      action:
        run_job_template:
          name: "Scale Web Tier"
          organization: Default
          job_args:
            extra_vars:
              direction: "down"
              decrement: 1  # More conservative
              after_hours_only: true
              reason: "Low traffic after hours"
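
The scaling decision combines a metric threshold with a business-hours check. A minimal Python sketch of that decision, assuming the thresholds from this scenario (scale up above 2.0 seconds during Monday to Friday, 09:00 through 17:59; scale down below 0.5 seconds outside those hours). The function names `in_business_hours` and `scaling_decision` are hypothetical.

```python
from datetime import datetime


def in_business_hours(moment: datetime) -> bool:
    """Monday to Friday (weekday 0-4), hour 9 through 17 inclusive."""
    return moment.weekday() < 5 and 9 <= moment.hour <= 17


def scaling_decision(response_time: float, moment: datetime) -> str:
    """Return "up", "down", or "none" for one response-time sample."""
    if response_time > 2.0 and in_business_hours(moment):
        return "up"
    # Scale-down only looks at the hour, matching the scenario's second rule.
    if response_time < 0.5 and not (9 <= moment.hour <= 17):
        return "down"
    return "none"


monday_morning = datetime(2024, 1, 8, 10, 0)
monday_night = datetime(2024, 1, 8, 22, 0)
print(scaling_decision(2.5, monday_morning))
print(scaling_decision(0.3, monday_night))
```

Keeping the time check in ordinary code like this makes it easy to unit test, whichever component ends up owning it.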

Scenario 3: Security Incident Correlation

Correlate security events across firewalls, IDS, and application logs.

---
- name: Security incident correlation
  hosts: all

  sources:
    # Assumption: firewall logs are forwarded as JSON by a log shipper
    - name: firewall_logs
      ansible.eda.webhook:
        host: 0.0.0.0
        port: 5002

    - name: ids_alerts
      ansible.eda.webhook:
        host: 0.0.0.0
        port: 5001

    # Assumption: the app writes source_ip and status_code as journal fields
    - name: app_access_logs
      ansible.eda.journald:
        match: "SYSLOG_IDENTIFIER=web-app"

  rules:
    - name: Detect coordinated attack
      # The same source IP must appear in all three events within 2 minutes.
      # Counting repeats per IP is left to the job template.
      condition:
        all:
          - events.firewall << event.payload.action == "blocked"
          - events.ids << event.payload.signature_id in [1001, 1002, 1003] and event.payload.source_ip == events.firewall.payload.source_ip
          - events.app << event.journald.status_code == "401" and event.journald.source_ip == events.firewall.payload.source_ip
        timeout: 2 minutes

      actions:
        - run_job_template:
            name: "Block Malicious IP"
            organization: Default
            job_args:
              extra_vars:
                ip_address: "{{ events.firewall.payload.source_ip }}"
                reason: "Coordinated attack detected"

        - run_job_template:
            name: "Create Security Incident"
            organization: Default
            job_args:
              extra_vars:
                severity: "high"
                attack_vector: "coordinated"
                source_ip: "{{ events.firewall.payload.source_ip }}"
                # The matched events reach the job in the ansible_eda extra variable
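
Correlating on a shared attribute is a join on `source_ip` across sources. The Python sketch below illustrates that join, assuming each event is a dictionary with `source`, `source_ip`, and `timestamp` keys. The `correlated_ips` function is hypothetical and keeps only the latest timestamp per source, which is a simplification.

```python
from collections import defaultdict


def correlated_ips(events, required_sources, window_seconds):
    """Return IPs reported by every required source within the time window."""
    seen = defaultdict(dict)  # ip -> {source name: latest timestamp}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        seen[event["source_ip"]][event["source"]] = event["timestamp"]

    hits = []
    for ip, by_source in seen.items():
        # The IP must appear in all required sources...
        if set(by_source) >= set(required_sources):
            times = [by_source[name] for name in required_sources]
            # ...and those sightings must fall inside one window.
            if max(times) - min(times) <= window_seconds:
                hits.append(ip)
    return sorted(hits)


SOURCES = ["firewall_logs", "ids_alerts", "app_access_logs"]
sample = [
    {"source": "firewall_logs", "source_ip": "203.0.113.7", "timestamp": 0},
    {"source": "ids_alerts", "source_ip": "203.0.113.7", "timestamp": 30},
    {"source": "app_access_logs", "source_ip": "203.0.113.7", "timestamp": 75},
    {"source": "firewall_logs", "source_ip": "198.51.100.9", "timestamp": 10},
]
print(correlated_ips(sample, SOURCES, window_seconds=120))
```

The second IP is ignored because only the firewall saw it, which is exactly the noise that single-source rules would act on.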

Scenario 4: Database Failover Automation

Automatic failover with health checks and validation.

---
- name: Database failover automation
  hosts: all

  sources:
    - name: db_monitoring
      ansible.eda.alertmanager:
        host: 0.0.0.0
        port: 8000

  rules:
    - name: Primary database failure detected
      condition: >
        event.alert.labels.alertname == "DatabaseDown" and
        event.alert.labels.role == "primary" and
        event.alert.status == "firing"

      actions:
        - debug:
            msg: "Primary database failure - initiating failover"

        # Step 1: Verify secondary is healthy. post_events feeds the job's
        # results back into this ruleset as a new event.
        - run_job_template:
            name: "Check Secondary DB Health"
            organization: Default
            post_events: true

    # Actions cannot branch on an earlier action's result, so the remaining
    # steps live in a second rule. Assumption: the health-check job publishes
    # secondary_health and secondary_host (for example via set_stats).
    - name: Fail over to healthy secondary
      condition: event.secondary_health == "healthy"

      actions:
        # Step 2: Promote secondary
        - run_job_template:
            name: "Promote Secondary to Primary"
            organization: Default

        # Step 3: Update application config
        - run_job_template:
            name: "Update App DB Connection"
            organization: Default
            job_args:
              extra_vars:
                new_primary: "{{ event.secondary_host }}"

        # Step 4: Notify teams
        - run_module:
            name: ansible.builtin.uri
            module_args:
              url: https://slack.webhook
              method: POST
              body_format: json
              body:
                channel: "#database-alerts"
                message: "Database failover completed to {{ event.secondary_host }}"

Complex Conditional Logic

Using Jinja2 Filters

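# Conditions are not Jinja2 templates: filters such as regex_search are not
# available there, and there is no now() for recency checks. Use the built-in
# tests in conditions and keep Jinja2 filters for action arguments.
rules:
  - name: Advanced filtering
    condition: >
      event.message is search("ERROR.*database.*timeout") and
      event.server_name in ["app-web-1", "app-web-2"]
    action:
      debug:
        msg: "Database timeout on {{ event.server_name | upper }}"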

Boolean Operators

# AND
condition: >
  event.type == "error" and
  event.severity == "high" and
  event.environment == "production"

# OR
condition: >
  event.alert_name in ["HighCPU", "HighMemory", "HighDisk"]

# NOT
condition: >
  event.type == "deployment" and
  event.environment != "test"

# Complex combinations
condition: >
  (event.severity == "critical" or 
   (event.severity == "high" and event.repeat_count > 3)) and
  event.environment == "production" and
  not event.acknowledged

Pattern Matching

# Regex match
condition: event.message is match("^ERROR.*")

# Regex search
condition: event.log_line is search("database connection failed")

# String contains
condition: event.error_message is search("timeout")

# List membership
condition: event.status_code in [500, 502, 503, 504]
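
The difference between `is match` and `is search` is easy to trip over: a match is anchored at the start of the string, while a search succeeds anywhere in it. The Python sketch below illustrates the distinction with `re.match` and `re.search`. Treating the two as equivalent to the rulebook tests is an assumption made for illustration, and `is_match` and `is_search` are hypothetical helper names.

```python
import re


def is_match(value: str, pattern: str) -> bool:
    """True when the pattern matches at the beginning of the string."""
    return re.match(pattern, value) is not None


def is_search(value: str, pattern: str) -> bool:
    """True when the pattern matches anywhere in the string."""
    return re.search(pattern, value) is not None


line = "2024-01-01 ERROR database connection failed"
print(is_match(line, "ERROR"))   # the line starts with a date, not "ERROR"
print(is_search(line, "ERROR"))  # "ERROR" appears later in the line
print(is_match("ERROR: disk full", "^ERROR.*"))
```

If log lines carry a timestamp prefix, an anchored pattern such as `^ERROR.*` will not fire, and a search is usually the safer choice.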

Error Handling and Recovery

Retry Logic

rules:
  - name: Retry with fixed delay
    condition: event.type == "failure"

    action:
      run_job_template:
        name: "Remediation Playbook"
        organization: Default
        retries: 3
        # Seconds between retries. The delay is fixed; an exponential
        # schedule (30s, 60s, 120s) would have to be implemented in the job.
        delay: 30
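
An exponential schedule of 30s, 60s, 120s comes from multiplying the base delay by the backoff factor on each retry. A minimal Python sketch of that arithmetic and of a retry loop that uses it follows. `backoff_delays` and `run_with_retry` are hypothetical helpers, and the `sleep` parameter is injectable so the loop can be exercised without waiting.

```python
import time


def backoff_delays(attempts: int, delay: float, backoff: float) -> list:
    """Return the wait before each retry: delay * backoff ** n."""
    return [delay * backoff ** n for n in range(attempts)]


def run_with_retry(action, attempts, delay, backoff, sleep=time.sleep) -> bool:
    """Run `action`; on failure retry up to `attempts` times with growing waits."""
    if action():
        return True
    for wait in backoff_delays(attempts, delay, backoff):
        sleep(wait)
        if action():
            return True
    return False


print(backoff_delays(attempts=3, delay=30, backoff=2))

# Simulated action that fails twice and then succeeds; waits are recorded
# instead of slept so the example finishes immediately.
outcomes = iter([False, False, True])
waits = []
succeeded = run_with_retry(lambda: next(outcomes), 3, 30, 2, sleep=waits.append)
print(succeeded, waits)
```

Growing the wait between retries gives a struggling dependency time to recover instead of hitting it again at a constant rate.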

Fallback Actions

rules:
  - name: Primary action with fallback
    condition: event.alert == "ServiceDown"

    # Actions cannot branch on an earlier action's result, so the fallback
    # path lives in a Controller workflow (the workflow name is illustrative):
    #   Restart Service --on failure--> Failover to Standby --> page on-call
    action:
      run_workflow_template:
        name: "Restart Service With Fallback"
        organization: Default

Dead Letter Queue

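rules:
  - name: Handle unprocessable events
    condition: event.parse_error == true

    # Rulebooks have no HTTP action, so the event is forwarded via the uri module
    action:
      run_module:
        name: ansible.builtin.uri
        module_args:
          url: https://dead-letter-queue.example.com
          method: POST
          body_format: json
          body:
            original_event: "{{ event }}"
            error: "Unable to parse event"
            timestamp: "{{ event.meta.received_at }}"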

Testing and Debugging

Local Testing with ansible-rulebook CLI

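# Install ansible-rulebook (the rule engine also needs a Java runtime)
pip install ansible-rulebook
ansible-galaxy collection install ansible.eda

# Run rulebook locally (inventory.yml is a local inventory file)
ansible-rulebook \
  --rulebook my-rulebook.yml \
  --inventory inventory.yml \
  --verbose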

# Send test event
curl -X POST http://localhost:5000/endpoint \
  -H "Content-Type: application/json" \
  -d '{
    "alert": "TestAlert",
    "severity": "critical",
    "instance": "web-1.example.com"
  }'
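
The same test event can be sent from a script, which is handy once you have more than a few cases. Below is a minimal Python sketch using only the standard library. It assumes the webhook source listens on `localhost:5000`, as in the curl example, and `build_test_request` is a hypothetical helper. The actual send is left commented out so the snippet runs without a rulebook listening.

```python
import json
import urllib.request


def build_test_request(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for a webhook event source."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_test_request(
    "http://localhost:5000/endpoint",
    {
        "alert": "TestAlert",
        "severity": "critical",
        "instance": "web-1.example.com",
    },
)
print(request.get_method(), request.full_url)

# Uncomment while the rulebook is running locally:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(response.status)
```

Building the request separately from sending it lets you check the payload in a unit test before pointing it at a live rulebook.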

Debug Mode

rules:
  - name: Debug all events
    # Every event from a source plugin carries a meta block, so this
    # matches everything
    condition: event.meta is defined

    action:
      debug:
        msg: |
          Event Source: {{ event.meta.source.name }}
          Received At: {{ event.meta.received_at }}
          Full Event: {{ event }}

Dry Run Mode

rules:
  - name: Test without executing
    condition: event.type == "deployment"
    
    action:
      debug:
        msg: "Would execute: Deploy to {{ event.environment }}"
      # Actual action commented out for testing
      # run_job_template:
      #   name: "Deploy Application"

Production Deployment

EDA Controller Configuration

# Rulebook activation settings, as entered in the EDA Controller UI or sent
# to its API (illustrative layout, not a file the controller reads)
---
activation:
  - name: self-healing-infrastructure
    rulebook: self-healing.yml
    decision_environment: default-decision-environment
    restart_policy: always
    log_level: info
    
  - name: security-incident-response
    rulebook: security-response.yml
    decision_environment: security-de
    restart_policy: on-failure
    log_level: debug

High Availability Setup

# Load balancer configuration for webhook sources (haproxy.cfg)
frontend eda_webhooks
    bind *:8000
    mode http
    default_backend eda_controllers

backend eda_controllers
    mode http
    balance roundrobin
    server eda-controller-1 eda-controller-1:8000 check
    server eda-controller-2 eda-controller-2:8000 check
    server eda-controller-3 eda-controller-3:8000 check

Monitoring EDA Controllers

# Prometheus metrics for EDA, read through the Prometheus HTTP API
# (metric names are illustrative; use the ones your deployment exports)
---
- name: Monitor rulebook performance
  hosts: eda_controllers

  tasks:
    - name: Collect metrics
      ansible.builtin.uri:
        url: "http://localhost:9090/api/v1/query?query={{ item.query | urlencode }}"
        return_content: true
      loop:
        - name: events_processed_total
          query: 'eda_events_processed_total{rulebook="self-healing"}'

        - name: action_execution_time
          query: 'eda_action_duration_seconds{action_type="run_job_template"}'

        - name: rule_match_rate
          query: 'rate(eda_rule_matches_total[5m])'
      register: eda_metrics

Best Practices

1. Start Simple, Add Complexity

# Start with basic rule
rules:
  - name: Basic alert response
    condition: event.alert == "HighCPU"
    action:
      debug:
        msg: "Alert received"

# Then add conditions
rules:
  - name: Filtered alert response
    condition: >
      event.alert == "HighCPU" and
      event.severity == "critical"

# Then add actions
rules:
  - name: Full alert response
    condition: >
      event.alert == "HighCPU" and
      event.severity == "critical"
    action:
      run_job_template:
        name: "CPU Remediation"

2. Use Descriptive Rule Names

# Bad
rules:
  - name: rule1
  - name: rule2

# Good
rules:
  - name: Restart application on high memory
  - name: Scale web tier based on response time

3. Version Control Rulebooks

git-book/
└── automation/
    └── event-driven-ansible/
        ├── rulebooks/
        │   ├── production/
        │   │   ├── self-healing.yml
        │   │   └── security-response.yml
        │   └── development/
        │       └── test-rulebook.yml
        └── README.md

4. Test Before Production

# Development environment
ansible-rulebook --rulebook test.yml --verbose

# Staging environment with real events
ansible-rulebook --rulebook staging.yml --inventory staging

# Production (monitor the process with your existing tooling)
ansible-rulebook --rulebook production.yml \
  --inventory production

Key Takeaways

✅ Multi-source correlation enables complex automation patterns
✅ Stateful logic tracks events over time
✅ Circuit breakers prevent cascade failures
✅ Progressive remediation escalates based on severity
✅ Error handling with retries and fallbacks
✅ Testing with ansible-rulebook CLI
✅ Production deployment requires HA and monitoring

What's Next

The next article explores Ansible Lightspeed with IBM watsonx Code Assistant - AI-powered automation content generation that writes playbooks, rulebooks, and roles for you.

Next Article: Ansible Lightspeed with IBM watsonx →

Part of the Ansible Automation Platform 101 Series


Last updated 1 month ago
