CASE 22 · TIMBER · 2024
Chaos engineering that the on-call team actually wanted.
A streaming platform had monthly post-incident reviews that were starting to repeat themselves. The same three failure modes kept resurfacing. We introduced a chaos engineering practice that the on-call team welcomed — because the experiments were aimed at the things they were already worried about, not arbitrary fault injection.
Streaming platform
RELIABILITY
2024
RESULTS
What changed, by the numbers.
RECURRING FAILURE MODES
3 → 0
GAME DAYS
12
AVAILABILITY (CUSTOMER-FACING)
99.985%
ON-CALL PAGES (NIGHT)
−74%
HOW IT WENT
The on-call team had a list of "things that scare me" that they’d quietly maintained for over a year. Most teams have one. We started there — the first chaos experiment was the failure mode at the top of the list (an availability-zone-level network partition during peak traffic).
We used AWS Fault Injection Simulator in a staging environment that mirrored production load. Each Game Day surfaced two or three actionable findings; we shipped fixes in the following two weeks. Experiments moved to production carefully — one experiment per week, in a 30-minute blast-radius-limited window.
After a quarter, the recurring failure modes were gone. The on-call team’s list was shorter and different. The on-call rotation now runs Game Days themselves, monthly, against the new list.
RELATED · SAME DOMAIN
Other engagements in this space.
READY WHEN YOU ARE
Let's get your AWS bill (and architecture) in order.
The discovery call is free. You walk away with at least one concrete idea — even if we never work together.