Article 5: Playbooks and Human-in-the-Loop Approvals
Introduction
The rule engine decides that something needs to happen. Playbooks define exactly what "something" is, and whether a human needs to sign off before any action is taken.
When I first thought about automated remediation, the obvious approach was: event fires, playbook runs, problem solved. I rejected that design immediately. The failure mode of "playbook runs at the wrong time" in production is much worse than the failure mode of "playbook requires approval and takes 2 extra minutes". This is why simple-ai-agent has a built-in approval gate: every action with non-trivial consequences must go through human approval before execution.
This article covers src/aiops/playbooks.py and src/services/approval_manager.py.
The Problem with Full Automation
I've seen the following happen in production at workplaces where I've worked:
- An automation script restarts a pod that's in CrashLoopBackOff because of a broken config. The script restarts it 40 times before anyone notices.
- A memory-limit increase playbook fires during a traffic spike and raises limits so high that the node overcommits memory and other pods get OOMKilled.
- A scale-up playbook fires based on a false alert and spins up 10 replicas of a service that was already overloaded at the API gateway layer, doing nothing except wasting money.
I don't want an agent that acts without context. I want an agent that does the analysis, presents its findings and proposed action, and waits for me to say yes.
That's the design: LLM does the thinking, human does the approving, playbook does the executing.
Playbook Structure
Playbooks are defined in config/playbooks.yml. Each playbook is a sequence of steps.
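To make the shape concrete, here is a sketch of what a playbook entry might look like. The field names and template syntax are illustrative assumptions, not the actual schema of `config/playbooks.yml`; only the action names and the `restart-crashed-pod` playbook come from the project.

```yaml
# Illustrative shape only — field names are assumptions, not the real schema
playbooks:
  - name: restart-crashed-pod
    risk: medium
    steps:
      - action: k8s:delete_pod
        params: { pod: "{{ event.pod }}", namespace: "{{ event.namespace }}" }
      - action: k8s:wait_for_ready
        params: { timeout_seconds: 120 }
      - action: notify
        params: { message: "Restarted {{ event.pod }}" }
    rollback:
      - action: notify
        params: { message: "Restart of {{ event.pod }} failed; manual intervention needed" }
```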
Step Actions
Each step action maps to a function in the MCP tool layer. The pattern is `layer:function_name`:

| Action | Description |
| --- | --- |
| `k8s:delete_pod` | Delete a pod by name (triggers Deployment recreation) |
| `k8s:wait_for_ready` | Poll until pod is Running/Ready or timeout |
| `k8s:patch_memory_limit` | Patch the Deployment container memory limit |
| `k8s:get_deployment` | Fetch current Deployment spec into execution context |
| `k8s:scale_deployment` | Set replica count on a Deployment |
| `notify` | Send message to configured notification channel |
Risk Classification
Every playbook has a risk level. This controls whether the playbook executes immediately or requires approval.
| Risk level | Behavior |
| --- | --- |
| `low` | Execute immediately, notify after |
| `medium` | Post approval request to chat channel, hold execution, 10-minute timeout |
| `high` | Warn only: provide runbook steps, do not execute automatically under any circumstances |
The high risk level never executes. The playbook executor posts what it would do if executed (the steps, the parameters, the expected outcome) so the operator can do it manually with full context. This is intentional. High-risk operations on production infrastructure should never be delegated to an automated system.
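The gating logic reduces to a three-way decision. A minimal sketch, with function and enum names that are my own illustration rather than the project's API:

```python
from enum import Enum

class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

def gate(risk: Risk) -> str:
    """Map a playbook's risk level to an execution decision.

    Returns one of: "execute", "await_approval", "warn_only".
    """
    if risk is Risk.LOW:
        return "execute"          # run immediately, notify after
    if risk is Risk.MEDIUM:
        return "await_approval"   # hold until a human approves (10-minute timeout)
    return "warn_only"            # high risk: post runbook steps, never execute
```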
The Approval Manager
The approval manager stores pending approval requests in Redis with a TTL. When approval times out, the request expires automatically β no cleanup needed.
Why 8-Character IDs
When an approval request is posted to Telegram, the user needs to type something like `approve a3f7b2c1`.
Short IDs matter. A full UUID is impractical to type on a phone. Eight characters from uuid4() give enough uniqueness for the small number of concurrent pending approvals a homelab cluster generates.
Chat-Native Approve and Reject
The approval workflow integrates with the chat adapter at the intent detection layer. When the bot receives a message, the intent detector checks for approval commands before any other processing:
This runs before any LLM call. If the message is `approve a3f7b2c1`, no LLM is invoked; the approval resolves directly. This keeps approval latency low and avoids confusing the LLM with approval syntax.
Playbook Executor Implementation
The executor runs steps sequentially. If any step fails, it runs the rollback steps and stops.
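The control flow can be sketched as follows. Here `dispatch` stands in for the step-dispatch function described below, and the shapes of steps and results are my assumptions:

```python
def run_playbook(steps, rollback, dispatch):
    """Run steps in order; on the first failure, run rollback steps and stop.

    `steps` and `rollback` are (action, params) pairs; `dispatch` raises on
    failure. Illustrative sketch, not the project's actual executor.
    """
    completed = []
    for action, params in steps:
        try:
            dispatch(action, params)
            completed.append(action)
        except Exception:
            # Best-effort rollback: one failing rollback step must not block the rest
            for rb_action, rb_params in rollback:
                try:
                    dispatch(rb_action, rb_params)
                except Exception:
                    pass
            return {"status": "failed", "failed_step": action, "completed": completed}
    return {"status": "ok", "completed": completed}
```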
Step Dispatch
Step actions are dispatched by prefix:
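A minimal version of that dispatch, assuming an async MCP client with a `call_tool` method and a notifier with a `send` method (both interfaces are assumptions, not the project's actual API):

```python
async def dispatch_step(action: str, params: dict, mcp_client, notifier):
    """Route a step action by its prefix — illustrative sketch only."""
    if action.startswith("k8s:"):
        tool = action.split(":", 1)[1]        # e.g. "k8s:delete_pod" -> "delete_pod"
        return await mcp_client.call_tool(tool, params)
    if action == "notify":
        return await notifier.send(params["message"])
    raise ValueError(f"Unknown step action: {action}")
```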
Kubernetes actions are dispatched through the MCP client layer rather than calling kubernetes-asyncio directly. This keeps the playbook executor decoupled from Kubernetes API details and reuses the same tool surface exposed to the LLM.
Real Playbooks from the Project
The playbooks I actually run in simple-ai-agent:
I only have two enabled playbooks. This is intentional. More playbooks means more surface area for mistakes. I add a new playbook only when I've seen the manual remediation steps more than three times.
What I Learned
Approval timeout should be long enough to actually respond. I started with 5 minutes. I missed several approvals because I was away from my phone. 10 minutes is better for a personal homelab. In a production team context you might want 15–30 minutes with escalation.
The approval message must include everything needed to make the decision. My earliest approval messages looked like: Approval required: restart-crashed-pod (ID: a3f7). That's useless. The operator needs: which pod, which namespace, what event triggered this, what the proposed action is, and what the rollback is. I reformatted to:
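A message carrying all of that context might look like the following. The pod name, namespace, and exact wording are illustrative; only the playbook name and ID format come from the article:

```
Approval required: restart-crashed-pod (ID: a3f7b2c1)

Namespace: media (illustrative)
Pod: app-7d9f8-xk2p4 (illustrative)
Trigger: CrashLoopBackOff detected by the rule engine
Proposed action: delete pod (Deployment recreates it), wait for Ready
Rollback: notify and stop if the pod is not Ready within the timeout

Reply: approve a3f7b2c1  /  reject a3f7b2c1
```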
Reject must also be meaningful. When I reject a playbook, the agent logs the rejection with a timestamp and includes it in the next RCA context if the same event fires again. "This event was manually rejected 20 minutes ago" is useful signal.
Don't gate LOW risk actions with approval. Early on I gated everything. This turned the agent into an answering machine β it asked permission to do things that had no meaningful failure mode. notify actions and read-only operations should be LOW risk and execute without approval.