Zhivko Todorov

CASE 22 · TIMBER · 2024

CHAOS ENGINEERING · FIS · GAME DAYS · SLO

Chaos engineering that the on-call team actually wanted.

A streaming platform had monthly post-incident reviews that were starting to repeat themselves. The same three failure modes kept resurfacing. We introduced a chaos engineering practice that the on-call team welcomed — because the experiments were aimed at the things they were already worried about, not arbitrary fault injection.

INDUSTRY

Streaming platform

DOMAIN

RELIABILITY

DELIVERED

2024

STACK

AWS FAULT INJECTION SIMULATOR · EKS · CLOUDWATCH SYNTHETICS · X-RAY · PROMETHEUS · GAME DAY RUNBOOKS

RESULTS

What changed, by the numbers.

RECURRING FAILURE MODES

3 → 0

IN 90 DAYS

GAME DAYS

12

OVER A QUARTER

AVAILABILITY (CUSTOMER-FACING)

99.985%

WAS 99.91%

ON-CALL PAGES (NIGHT)

−74%

AFTER REMEDIATION CYCLE
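
To put the availability figures in concrete terms, the short sketch below converts each percentage into implied downtime. It assumes a 30-day measurement window, which the case study does not state. Under that assumption, 99.91% corresponds to roughly 39 minutes of customer-facing downtime per month and 99.985% to about 6.5 minutes. The function name and the window constant are illustrative only.

```python
# Assumed measurement window: a 30-day month. The case study does not
# say which window the availability figures were measured over.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def downtime_minutes(availability_pct: float,
                     window_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the downtime implied by an availability percentage over a window."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1.0 - availability_pct / 100.0) * window_minutes


before = downtime_minutes(99.91)   # availability before the engagement
after = downtime_minutes(99.985)   # availability after the engagement
print(f"before: {before:.2f} min/month, after: {after:.2f} min/month")
```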

HOW IT WENT

The on-call team had a list of "things that scare me" that they’d quietly maintained for over a year. Most teams have one. We started there — the first chaos experiment was the failure mode at the top of the list (an availability-zone-level network partition during peak traffic).

We used AWS Fault Injection Simulator in a staging environment that mirrored production load. Each Game Day surfaced two or three actionable findings; we shipped fixes in the following two weeks. Experiments moved to production carefully — one experiment per week, in a 30-minute blast-radius-limited window.
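
As a rough illustration of what a blast-radius-limited experiment can look like, the sketch below builds an experiment template body for an availability-zone network partition capped at 30 minutes. It is not the template used in this engagement. The choice of the `aws:network:disrupt-connectivity` action, the `chaos-ready` subnet tag, the target and action names, and every ARN are assumptions made for the example. The stop condition points at a CloudWatch alarm, so the experiment halts early if a customer-facing signal degrades.

```python
import json
from datetime import timedelta


def iso8601_duration(window: timedelta) -> str:
    """Format a whole-minute timedelta as an ISO 8601 duration such as PT30M."""
    total_seconds = int(window.total_seconds())
    if total_seconds <= 0 or total_seconds % 60:
        raise ValueError("window must be a positive whole number of minutes")
    return f"PT{total_seconds // 60}M"


def az_partition_template(az: str, role_arn: str, stop_alarm_arn: str,
                          window: timedelta = timedelta(minutes=30)) -> dict:
    """Build a hypothetical FIS experiment template body for an AZ partition."""
    duration = iso8601_duration(window)
    return {
        "description": f"Network partition of {az}, limited to {duration}",
        "roleArn": role_arn,
        "targets": {
            # Blast radius: only subnets in one AZ that carry an opt-in tag.
            # The tag key and value are assumptions for this sketch.
            "az-subnets": {
                "resourceType": "aws:ec2:subnet",
                "resourceTags": {"chaos-ready": "true"},
                "filters": [{"path": "AvailabilityZone", "values": [az]}],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "partition-az": {
                "actionId": "aws:network:disrupt-connectivity",
                # "scope" is a judgment call; "all" blocks traffic in and out.
                "parameters": {"duration": duration, "scope": "all"},
                "targets": {"Subnets": "az-subnets"},
            }
        },
        # Abort early if the (assumed) customer-facing alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": stop_alarm_arn}
        ],
        "tags": {"purpose": "game-day"},
    }


template = az_partition_template(
    az="us-east-1a",                                               # placeholder AZ
    role_arn="arn:aws:iam::123456789012:role/example-fis-role",    # placeholder
    stop_alarm_arn="arn:aws:cloudwatch:us-east-1:123456789012:alarm:example-slo",
)
print(json.dumps(template, indent=2))
```

With credentials and the referenced resources in place, a body like this could be passed to the boto3 `fis` client's `create_experiment_template` call. The sketch stops at building and printing the dictionary so it can be reviewed before anything touches an AWS account.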

After a quarter, the recurring failure modes were gone. The on-call team’s list was shorter and different. The on-call rotation now runs its own Game Days, monthly, against the new list.
