Article 5: Playbooks and Human-in-the-Loop Approvals
Introduction
The rule engine decides that something needs to happen. Playbooks define exactly what "something" is, and whether a human needs to sign off before any action is taken.
When I first thought about automated remediation, the obvious approach was: event fires, playbook runs, problem solved. I rejected that design immediately. The failure mode of "playbook runs at the wrong time" in production is much worse than the failure mode of "playbook requires approval and takes 2 extra minutes". This is why simple-ai-agent has a built-in approval gate: every action with non-trivial consequences must go through human approval before execution.
This article covers src/aiops/playbooks.py and src/services/approval_manager.py.
The Problem with Full Automation
I've seen the following happen in production at workplaces where I've worked:
- An automation script restarts a pod that's in CrashLoopBackOff because of a broken config. The script restarts it 40 times before anyone notices.
- A memory-limit increase playbook fires during a traffic spike and raises limits so high that the node overcommits memory and other pods get OOMKilled.
- A scale-up playbook fires based on a false alert and spins up 10 replicas of a service that was already overloaded at the API gateway layer, doing nothing except wasting money.
I don't want an agent that acts without context. I want an agent that does the analysis, presents its findings and proposed action, and waits for me to say yes.
That's the design: LLM does the thinking, human does the approving, playbook does the executing.
Playbook Structure
Playbooks are defined in config/playbooks.yml. Each playbook is a sequence of steps.
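To make the shape concrete, here is a sketch of what a playbook entry might look like. The field names and template syntax are illustrative assumptions, not the actual schema of `config/playbooks.yml`; only the action names and the `restart-crashed-pod` playbook come from the project.

```yaml
# Illustrative shape only — field names are assumptions, not the real schema
playbooks:
  - name: restart-crashed-pod
    risk: medium
    steps:
      - action: k8s:delete_pod
        params: { pod: "{{ event.pod }}", namespace: "{{ event.namespace }}" }
      - action: k8s:wait_for_ready
        params: { timeout_seconds: 120 }
      - action: notify
        params: { message: "Restarted {{ event.pod }}" }
    rollback:
      - action: notify
        params: { message: "Restart of {{ event.pod }} failed; manual intervention needed" }
```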
Step Actions
Each step action maps to a function in the MCP tool layer. The pattern is `layer:function_name`:

| Action | Description |
| --- | --- |
| `k8s:delete_pod` | Delete a pod by name (triggers Deployment recreation) |
| `k8s:wait_for_ready` | Poll until pod is Running/Ready or timeout |
| `k8s:patch_memory_limit` | Patch the Deployment container memory limit |
| `k8s:get_deployment` | Fetch current Deployment spec into execution context |
| `k8s:scale_deployment` | Set replica count on a Deployment |
| `notify` | Send message to configured notification channel |
Risk Classification
Every playbook has a risk level. This controls whether the playbook executes immediately or requires approval.
| Risk level | Behavior |
| --- | --- |
| `low` | Execute immediately, notify after |
| `medium` | Post approval request to chat channel, hold execution, 10-minute timeout |
| `high` | Warn only: provide runbook steps, do not execute automatically under any circumstances |
The high risk level never executes. The playbook executor posts what it would do if executed (the steps, the parameters, the expected outcome) so the operator can do it manually with full context. This is intentional. High-risk operations on production infrastructure should never be delegated to an automated system.
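The gating logic reduces to a three-way decision. A minimal sketch, with function and enum names that are my own illustration rather than the project's API:

```python
from enum import Enum

class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

def gate(risk: Risk) -> str:
    """Map a playbook's risk level to an execution decision.

    Returns one of: "execute", "await_approval", "warn_only".
    """
    if risk is Risk.LOW:
        return "execute"          # run immediately, notify after
    if risk is Risk.MEDIUM:
        return "await_approval"   # hold until a human approves (10-minute timeout)
    return "warn_only"            # high risk: post runbook steps, never execute
```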
The Approval Manager
The approval manager stores pending approval requests in Redis with a TTL. When approval times out, the request expires automatically β no cleanup needed.
Why 8-Character IDs
When an approval request is posted to Telegram, the user needs to type something like `approve a3f7b2c1`.
Short IDs matter. A full UUID is impractical to type on a phone. Eight characters from uuid4() give enough uniqueness for the small number of concurrent pending approvals a homelab cluster generates.
Chat-Native Approve and Reject
The approval workflow integrates with the chat adapter at the intent detection layer. When the bot receives a message, the intent detector checks for approval commands before any other processing:
This runs before any LLM call. If the message is `approve a3f7b2c1`, no LLM is invoked; the approval resolves directly. This keeps approval latency low and avoids confusing the LLM with approval syntax.
Playbook Executor Implementation
The executor runs steps sequentially. If any step fails, it runs the rollback steps and stops.
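The control flow can be sketched as follows. Here `dispatch` stands in for the step-dispatch function described below, and the shapes of steps and results are my assumptions:

```python
def run_playbook(steps, rollback, dispatch):
    """Run steps in order; on the first failure, run rollback steps and stop.

    `steps` and `rollback` are (action, params) pairs; `dispatch` raises on
    failure. Illustrative sketch, not the project's actual executor.
    """
    completed = []
    for action, params in steps:
        try:
            dispatch(action, params)
            completed.append(action)
        except Exception:
            # Best-effort rollback: one failing rollback step must not block the rest
            for rb_action, rb_params in rollback:
                try:
                    dispatch(rb_action, rb_params)
                except Exception:
                    pass
            return {"status": "failed", "failed_step": action, "completed": completed}
    return {"status": "ok", "completed": completed}
```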
Step Dispatch
Step actions are dispatched by prefix:
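A minimal version of that dispatch, assuming an async MCP client with a `call_tool` method and a notifier with a `send` method (both interfaces are assumptions, not the project's actual API):

```python
async def dispatch_step(action: str, params: dict, mcp_client, notifier):
    """Route a step action by its prefix — illustrative sketch only."""
    if action.startswith("k8s:"):
        tool = action.split(":", 1)[1]        # e.g. "k8s:delete_pod" -> "delete_pod"
        return await mcp_client.call_tool(tool, params)
    if action == "notify":
        return await notifier.send(params["message"])
    raise ValueError(f"Unknown step action: {action}")
```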
Kubernetes actions are dispatched through the MCP client layer rather than calling kubernetes-asyncio directly. This keeps the playbook executor decoupled from Kubernetes API details and reuses the same tool surface exposed to the LLM.
Real Playbooks from the Project
The playbooks I actually run in simple-ai-agent:
I only have two enabled playbooks. This is intentional. More playbooks means more surface area for mistakes. I add a new playbook only when I've seen the manual remediation steps more than three times.
What I Learned
Approval timeout should be long enough to actually respond. I started with 5 minutes. I missed several approvals because I was away from my phone. 10 minutes is better for a personal homelab. In a production team context you might want 15–30 minutes with escalation.
The approval message must include everything needed to make the decision. My earliest approval messages looked like: Approval required: restart-crashed-pod (ID: a3f7). That's useless. The operator needs: which pod, which namespace, what event triggered this, what the proposed action is, and what the rollback is. I reformatted to:
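A message carrying all of that context might look like the following. The pod name, namespace, and exact wording are illustrative; only the playbook name and ID format come from the article:

```
Approval required: restart-crashed-pod (ID: a3f7b2c1)

Namespace: media (illustrative)
Pod: app-7d9f8-xk2p4 (illustrative)
Trigger: CrashLoopBackOff detected by the rule engine
Proposed action: delete pod (Deployment recreates it), wait for Ready
Rollback: notify and stop if the pod is not Ready within the timeout

Reply: approve a3f7b2c1  /  reject a3f7b2c1
```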
Reject must also be meaningful. When I reject a playbook, the agent logs the rejection with a timestamp and includes it in the next RCA context if the same event fires again. "This event was manually rejected 20 minutes ago" is useful signal.
Don't gate LOW risk actions with approval. Early on I gated everything. This turned the agent into an answering machine β it asked permission to do things that had no meaningful failure mode. notify actions and read-only operations should be LOW risk and execute without approval.