Zhivko Todorov

CASE 171 · TENDRIL · 2025

INCIDENT RESPONSE · AUTOMATION · SSM RUNBOOKS · PAGERDUTY

Incident playbooks that run themselves while you’re still opening your laptop.

A fintech’s incident response playbooks lived on a Notion page that the on-call engineer read while diagnosing. Steps like "restart the unhealthy ECS service" or "drain the affected AZ" were manual, so the engineer’s attention was split between reading and doing. We automated the high-confidence remediation steps with SSM Automation runbooks.

INDUSTRY

Fintech

DOMAIN

PLATFORM

DELIVERED

2025

STACK

SSM AUTOMATION·PAGERDUTY·EVENTBRIDGE·LAMBDA·STEP FUNCTIONS·NOTION (RETIRED RUNBOOKS)

RESULTS

What changed, by the numbers.

MTTR (AUTOMATED CASES)

< 3m

WAS 20–45m

PLAYBOOKS AUTOMATED

14 / 22

HIGH-CONFIDENCE STEPS

HUMAN-IN-THE-LOOP

PRESERVED

FOR JUDGEMENT CALLS

ON-CALL STRESS

LOWER

SURVEY-MEASURED

HOW IT WENT

The playbooks were good: clear steps, tested instructions. The problem was the gap between reading the playbook and executing the steps. The on-call engineer would read step three, run the command, wait for it to finish, then read step four. Each handoff added time when minutes mattered.

SSM Automation runbooks now execute the high-confidence steps automatically. PagerDuty incidents trigger specific runbooks via EventBridge, selected by alert metadata. Human-in-the-loop steps remain for the judgement calls: "should we fail over the region?" is still a human decision.
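The routing described above might look roughly like the following minimal sketch. It is not the client's actual code. The event shape (`detail.alert_class`, `detail.incident_id`), the runbook names, and the `IncidentId` runbook parameter are all hypothetical placeholders. The only real API used is the SSM `start_automation_execution` call, reached through whatever client object is passed in.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Route:
    runbook: str        # SSM Automation document name (hypothetical)
    auto_execute: bool  # False means a judgement call that stays with a human


# Hypothetical routing table keyed on an alert-class label in the event.
ROUTES = {
    "ecs-service-unhealthy": Route("Example-RestartEcsService", True),
    "az-degraded": Route("Example-DrainAvailabilityZone", True),
    "region-failover": Route("Example-FailoverRegion", False),
}


def select_route(event: dict) -> Optional[Route]:
    """Pick a route from alert metadata; None if no playbook matches."""
    alert_class = event.get("detail", {}).get("alert_class")
    return ROUTES.get(alert_class)


def handle(event: dict, ssm_client: Any) -> str:
    """Start the matching runbook, or leave the incident to the on-call."""
    route = select_route(event)
    if route is None:
        return "no-runbook"
    if not route.auto_execute:
        return "needs-human"
    incident_id = str(event["detail"].get("incident_id", ""))
    # SSM expects each parameter value as a list of strings.
    # Assumes the runbook declares an IncidentId parameter.
    ssm_client.start_automation_execution(
        DocumentName=route.runbook,
        Parameters={"IncidentId": [incident_id]},
    )
    return "started"
```

In a Lambda target for the EventBridge rule, `ssm_client` would typically be `boto3.client("ssm")`. Passing the client in keeps the routing logic testable without AWS credentials, and the `auto_execute` flag keeps the human-in-the-loop boundary explicit in one place.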

MTTR on automated incident classes dropped from 20–45 minutes to under 3 minutes. For 14 of 22 playbooks, the deterministic part now runs end to end without manual steps. The remaining 8 genuinely need human judgement. On-call stress, per the quarterly survey, dropped noticeably.
