On-call automation with runbook bots
Use chat bots, runbook fragments, and guardrails to turn noisy alerts into guided fixes.
Alert fatigue is usually a process problem, not a people problem. We reduce toil by letting a bot stitch together the right runbook steps as soon as an alert fires.
Keep actions declarative
We store every actionable runbook step in Git as a declarative definition: the command to run, the output we expect, and a safety policy.
id: clear-cache
command: "redis-cli FLUSHDB"
expect:
  - pattern: "OK"
  - pattern: "(empty list or set)"
    allow_failure: true
safety:
  requires_approval: true
  scopes: ["staging"]
# lint the runbook catalog in CI
runbookctl lint ./runbooks
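To make the idea of linting a catalog concrete, here is a minimal Python sketch of the kind of checks a linter could run on one step. It is not how `runbookctl lint` actually works: `lint_step`, `REQUIRED_FIELDS`, and the rules themselves are assumptions for illustration, and the step is passed in as an already-parsed dictionary.

```python
# Hypothetical lint rules; the real `runbookctl lint` checks are not shown in this post.
REQUIRED_FIELDS = ("id", "command", "expect", "safety")


def lint_step(step: dict) -> list[str]:
    """Return a list of human-readable problems found in one runbook step."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in step]

    # Every expectation needs a pattern to match against the command output.
    for index, expectation in enumerate(step.get("expect", [])):
        if "pattern" not in expectation:
            problems.append(f"expect[{index}] has no pattern")

    safety = step.get("safety", {})
    # Assumed rule: a step must state where it may run.
    if not safety.get("scopes"):
        problems.append("safety.scopes must list at least one environment")
    # Assumed rule: approval must be an explicit boolean, never implied.
    if not isinstance(safety.get("requires_approval"), bool):
        problems.append("safety.requires_approval must be true or false")
    return problems


clear_cache = {
    "id": "clear-cache",
    "command": "redis-cli FLUSHDB",
    "expect": [{"pattern": "OK"}],
    "safety": {"requires_approval": True, "scopes": ["staging"]},
}
print(lint_step(clear_cache))
```

Returning a list of problems rather than raising on the first one lets a CI job report every issue in a step in a single run.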
Build conversation-first flows
When PagerDuty triggers an incident, the bot posts the incident context and the next safe action in chat.
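// Look up the incident behind the page, then ask the catalog for matching steps.
const incident = await pagerduty.incident(triggerId);
const steps = await runbook.nextSteps(incident.service, incident.symptom);

// Post the incident summary together with the first suggested step.
await chat.post({
  channel: incident.channel,
  blocks: renderer.summary(incident, steps[0]),
});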
# dry-run the flow for a synthetic alert
runbookctl simulate --service payments --symptom "elevated 5xx"
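The lookup behind `runbook.nextSteps` is not shown in this post. One way it might work is sketched below in Python: match the catalog on service and symptom, then put steps that need no approval first. The `Step` type, the `CATALOG` entries, and the ordering rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    id: str
    service: str
    symptom: str
    requires_approval: bool


# Hypothetical in-memory catalog; a real one would be loaded from the Git repository.
CATALOG = [
    Step("restart-worker", "payments", "elevated 5xx", requires_approval=True),
    Step("check-upstream-health", "payments", "elevated 5xx", requires_approval=False),
    Step("rotate-logs", "search", "disk full", requires_approval=False),
]


def next_steps(service: str, symptom: str, catalog=CATALOG) -> list[Step]:
    """Return matching steps, with steps that need no approval first."""
    matches = [s for s in catalog if s.service == service and s.symptom == symptom]
    # sorted() is stable, so catalog order is preserved within each group.
    return sorted(matches, key=lambda s: s.requires_approval)


for step in next_steps("payments", "elevated 5xx"):
    print(step.id, "(needs approval)" if step.requires_approval else "(safe)")
```

An unmatched alert yields an empty list, so a caller that reads the first element should handle that case, for example by posting the incident context without a suggested action.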
Guard every command
A human must approve every action unless it is explicitly tagged as safe for automation.
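import shlex
import subprocess

from runbook.guard import require_confirmation, allowed_envs

@require_confirmation(message="Clear cache?", timeout=120)
@allowed_envs(["staging", "dev"])
def clear_cache(cmd):
    # Run without a shell so the command string cannot inject extra commands.
    return subprocess.run(shlex.split(cmd), check=True, capture_output=True, text=True)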
# roll out protected commands to prod only after validation
runbookctl promote --action clear-cache --env prod --evidence-url https://reports.internal/cache-validation
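The internals of `runbook.guard` are not shown here, so the following is only a sketch of how guards like these could be built from plain Python decorators. The `RUNBOOK_ENV` variable, the `GuardError` type, and the injectable `ask` callback are assumptions, and the sketch leaves out the `timeout` handling.

```python
import functools
import os


class GuardError(RuntimeError):
    """Raised when a guarded action is blocked."""


def allowed_envs(envs):
    """Block the call unless RUNBOOK_ENV (an assumed variable) is in `envs`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            current = os.environ.get("RUNBOOK_ENV", "")
            if current not in envs:
                raise GuardError(f"{func.__name__} is not allowed in {current!r}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def require_confirmation(message, ask=input):
    """Ask a human before running; `ask` is injectable so tests can stub it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            answer = ask(f"{message} [y/N] ").strip().lower()
            if answer != "y":
                raise GuardError(f"{func.__name__} was not confirmed")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@require_confirmation("Clear cache?", ask=lambda prompt: "y")  # stubbed approval
@allowed_envs(["staging", "dev"])
def clear_cache(cmd):
    return f"would run: {cmd}"  # placeholder instead of executing anything


os.environ["RUNBOOK_ENV"] = "staging"
print(clear_cache("redis-cli FLUSHDB"))
```

Decorators run outermost first, so in this order the prompt appears before the environment check. Swapping the two would reject a disallowed environment without prompting anyone.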
Close the loop with telemetry
Every automated step emits metrics so we know when automation is helping or hurting.
metrics.Counter("runbook.steps.executed", map[string]string{
	"service": service,
	"action":  action,
	"result":  result,
}).Inc()
-- find actions that consistently need human override
select action, count(*) as overrides
from automation_events
where status = 'requires_manual'
group by action
order by overrides desc;
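A raw override count favors actions that simply run often, so it can help to look at the share of runs that needed a human. The Python sketch below computes that rate per action. The rows are made-up sample data shaped like `automation_events`, and `succeeded` is an assumed status value, since only `requires_manual` appears in the query.

```python
from collections import Counter

# Hypothetical sample rows: (action, status). Not real telemetry.
events = [
    ("clear-cache", "succeeded"),
    ("clear-cache", "requires_manual"),
    ("restart-worker", "requires_manual"),
    ("restart-worker", "requires_manual"),
    ("restart-worker", "succeeded"),
    ("restart-worker", "requires_manual"),
]


def override_rates(rows):
    """Return {action: share of events that needed a manual override}."""
    totals = Counter(action for action, _ in rows)
    overrides = Counter(action for action, status in rows if status == "requires_manual")
    return {action: overrides[action] / totals[action] for action in totals}


rates = override_rates(events)
# Highest override rate first: these are the candidates for rework.
for action, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{action}: {rate:.0%}")
```

An action with a high override rate is a candidate for tighter guardrails or a rewritten runbook step, while a low rate suggests the automation is pulling its weight.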
Bots do not replace good operations; they enforce them. Start with low-risk runbooks, keep approvals explicit, and let the chat history become living documentation for the next incident.