CASE 171 · TENDRIL · 2025
Incident playbooks that run themselves while you’re still opening your laptop.
A fintech’s incident response playbooks were a Notion page the on-call read while diagnosing. Steps like "restart the unhealthy ECS service" or "drain the affected AZ" were manual, with the on-call’s attention split between reading and doing. We automated the high-confidence remediation steps with SSM Automation runbooks.
Fintech
PLATFORM
2025
RESULTS
What changed, by the numbers.
MTTR (AUTOMATED CASES)
< 3m
PLAYBOOKS AUTOMATED
14 / 22
HUMAN-IN-THE-LOOP
PRESERVED
ON-CALL STRESS
LOWER
HOW IT WENT
The playbooks were good: clear steps, tested instructions. The problem was the gap between "reading the playbook" and "executing the steps." The on-call engineer would read step three, run the command, wait for it to finish, then read step four. Each gap added time when minutes mattered.
SSM Automation runbooks executed the high-confidence steps automatically. PagerDuty incidents triggered specific runbooks via EventBridge based on alert metadata. Human-in-the-loop steps stayed for the judgement calls — "should we fail over the region?" remained a human decision.
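The routing described above can be sketched as a small Lambda-style handler sitting behind an EventBridge rule. This is an illustration under stated assumptions, not the engagement's actual code: the event shape (`detail.alert_class`, `detail.resource_id`), the runbook document names, and the `ResourceId` parameter are all hypothetical. The only real AWS call used is boto3's `start_automation_execution` on the SSM client.

```python
from typing import Any, Mapping, Optional

# Hypothetical routing table: alert class -> SSM Automation document name.
# Only high-confidence, deterministic remediations are listed here.
RUNBOOKS = {
    "ecs-service-unhealthy": "Example-RestartEcsService",
    "az-degraded": "Example-DrainAvailabilityZone",
}


def select_runbook(event: Mapping[str, Any]) -> Optional[str]:
    """Return the runbook for this alert, or None if a human must decide."""
    detail = event.get("detail", {})
    return RUNBOOKS.get(detail.get("alert_class"))


def handler(event: Mapping[str, Any], context: Any = None, ssm: Any = None) -> dict:
    """Start the matching runbook, or leave the incident to the on-call."""
    runbook = select_runbook(event)
    if runbook is None:
        # Judgement calls (for example a region failover) are never automated.
        return {"action": "page_human"}

    if ssm is None:
        # Imported lazily so the routing logic can be tested without AWS access.
        import boto3

        ssm = boto3.client("ssm")

    response = ssm.start_automation_execution(
        DocumentName=runbook,
        # SSM Automation parameters are a mapping of name -> list of strings.
        Parameters={"ResourceId": [event["detail"]["resource_id"]]},
    )
    return {
        "action": "automated",
        "runbook": runbook,
        "execution_id": response["AutomationExecutionId"],
    }
```

Keeping the routing table explicit means an alert class with no entry falls through to a human by default. That matches the split described here: automation for the deterministic steps, people for the judgement calls.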
MTTR on automated incident classes dropped from 20-45 minutes to under 3 minutes. 14 of 22 playbooks now have their deterministic steps automated end to end. The 8 that aren’t automated are the ones that genuinely need human judgement. On-call stress, per the quarterly survey, dropped noticeably.