On-call automation with runbook bots

Use chat bots, runbook fragments, and guardrails to turn noisy alerts into guided fixes.

Alert fatigue is usually a process problem, not a people problem. We reduce toil by letting a bot stitch together the right runbook steps as soon as an alert fires.

Keep actions declarative

We store every actionable runbook step in Git with a safety policy and expected output.

id: clear-cache
command: "redis-cli FLUSHDB"
expect:
  - pattern: "OK"
  - pattern: "(empty list or set)"
    allow_failure: true
safety:
  requires_approval: true
  scopes: ["staging"]

# lint the runbook catalog in CI
runbookctl lint ./runbooks
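
One way a bot could use the `expect` block is to compare a command's output against each pattern once the step has run. The sketch below is an illustration under stated assumptions, not a description of how runbookctl behaves: it treats patterns as literal substrings, treats a pattern marked `allow_failure` as optional, and assumes the step has already been parsed into a dict. `check_output` is a hypothetical helper.

```python
from typing import Any


def check_output(step: dict[str, Any], output: str) -> list[str]:
    """Return the required patterns that are missing from the output.

    Assumption: patterns are literal substrings, and entries marked
    allow_failure are optional, so they never count as missing.
    """
    missing = []
    for expectation in step.get("expect", []):
        pattern = expectation["pattern"]
        optional = expectation.get("allow_failure", False)
        if pattern not in output and not optional:
            missing.append(pattern)
    return missing


# The clear-cache step from the catalog, as it might look after parsing.
step = {
    "id": "clear-cache",
    "expect": [
        {"pattern": "OK"},
        {"pattern": "(empty list or set)", "allow_failure": True},
    ],
}

print(check_output(step, "OK\n"))   # []
print(check_output(step, "ERR\n"))  # ['OK']
```

An empty list means every required pattern was found; anything else is a reason to stop and hand the incident to a human.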

Build conversation-first flows

When a PagerDuty incident triggers, the bot posts the incident context and the next safe action in chat.

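// Fetch the incident, then look up the runbook steps that match it.
const incident = await pagerduty.incident(triggerId);
const steps = await runbook.nextSteps(incident.service, incident.symptom);

// Post only when a step matched; steps[0] is undefined for an empty list.
if (steps.length > 0) {
  await chat.post({
    channel: incident.channel,
    blocks: renderer.summary(incident, steps[0]),
  });
}

# dry-run the flow for a synthetic alert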
runbookctl simulate --service payments --symptom "elevated 5xx"
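
The lookup behind "the next safe action" could be as simple as filtering catalog steps by environment. The sketch below assumes an in-memory list of step dicts shaped like the catalog entry shown earlier. `next_safe_step` and the sample data are hypothetical, and "safe" here means only that the incident's environment appears in the step's `scopes`.

```python
from typing import Optional


def next_safe_step(steps: list[dict], env: str) -> Optional[dict]:
    """Return the first step whose safety scopes include env, else None."""
    for step in steps:
        if env in step.get("safety", {}).get("scopes", []):
            return step
    return None


# Hypothetical catalog entries for illustration.
catalog = [
    {"id": "restart-workers", "safety": {"scopes": ["dev"]}},
    {"id": "clear-cache", "safety": {"scopes": ["staging"], "requires_approval": True}},
]

step = next_safe_step(catalog, "staging")
print(step["id"] if step else "no safe step; escalate to a human")  # clear-cache
```

Returning `None` rather than raising keeps the "nothing is safe here" case explicit, so the bot can escalate instead of guessing.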

Guard every command

A human must approve every action unless it is explicitly tagged as safe for automation.

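import shlex
import subprocess

from runbook.guard import require_confirmation, allowed_envs

@require_confirmation(message="Clear cache?", timeout=120)
@allowed_envs(["staging", "dev"])
def clear_cache(cmd):
    # Run without a shell so the command string is never shell-interpreted.
    result = subprocess.run(shlex.split(cmd), capture_output=True, text=True, check=True)
    return result.stdout

# roll out protected commands to prod only after validation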
runbookctl promote --action clear-cache --env prod --evidence-url https://reports.internal/cache-validation
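
To make the guard pattern concrete, here is a minimal sketch of how two such decorators might work. It is not the `runbook.guard` implementation: it assumes the current environment is read from a `RUNBOOK_ENV` variable, it replaces the chat prompt and timeout with an injected `ask` callable, and `GuardError` is a hypothetical exception type.

```python
import functools
import os
from typing import Callable


class GuardError(RuntimeError):
    """Raised when a guard refuses to run the wrapped action."""


def allowed_envs(envs: list[str]) -> Callable:
    """Refuse to run outside the listed environments.

    Assumption: the current environment comes from RUNBOOK_ENV.
    """
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            env = os.environ.get("RUNBOOK_ENV", "")
            if env not in envs:
                raise GuardError(f"{func.__name__} is not allowed in {env!r}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def require_confirmation(message: str, ask: Callable[[str], bool]) -> Callable:
    """Run the action only if ask(message) returns True.

    Assumption: ask is injected by the caller; a real bot would post the
    prompt in chat and enforce a timeout, which this sketch omits.
    """
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not ask(message):
                raise GuardError(f"{func.__name__} was not confirmed")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@require_confirmation("Clear cache?", ask=lambda prompt: True)  # auto-approve for the demo
@allowed_envs(["staging", "dev"])
def clear_cache(cmd: str) -> str:
    # Stand-in body: report the command instead of executing it.
    return f"would run: {cmd}"


os.environ["RUNBOOK_ENV"] = "staging"
print(clear_cache("redis-cli FLUSHDB"))  # would run: redis-cli FLUSHDB
```

Because the confirmation decorator is outermost, the human is asked first and the environment check runs second; either guard failing raises before the command body executes.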

Close the loop with telemetry

Every automated step emits metrics so we know when automation is helping or hurting.

metrics.Counter("runbook.steps.executed", map[string]string{
  "service": service,
  "action": action,
  "result": result,
}).Inc()

-- find actions that consistently need human override
select action, count(*) as overrides
from automation_events
where status = 'requires_manual'
group by action
order by overrides desc;
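
A raw override count favors actions that simply run often, so a rate can be a useful companion metric. The sketch below computes the share of events per action that ended in `requires_manual`. The event shape mirrors the `action` and `status` columns in the query; the sample rows, the `success` status value, and the `override_rates` helper are hypothetical.

```python
from collections import Counter


def override_rates(events: list[dict]) -> dict[str, float]:
    """Map each action to the fraction of its events that needed manual override."""
    totals = Counter(e["action"] for e in events)
    overrides = Counter(
        e["action"] for e in events if e["status"] == "requires_manual"
    )
    return {action: overrides[action] / total for action, total in totals.items()}


# Made-up events for illustration only.
events = [
    {"action": "clear-cache", "status": "success"},
    {"action": "clear-cache", "status": "requires_manual"},
    {"action": "restart-workers", "status": "success"},
]

print(override_rates(events))  # {'clear-cache': 0.5, 'restart-workers': 0.0}
```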

Bots do not replace good operations; they enforce them. Start with low-risk runbooks, keep approvals explicit, and let the chat history become living documentation for the next incident.