CASE 88 · SIENNA · 2024
A "lose an AZ" playbook the team has actually run.
A B2B SaaS platform claimed multi-AZ on every architecture diagram but had never tested losing an AZ. The first chaos drill, draining a production AZ at 14:00 on a Wednesday, surfaced six distinct failure modes. We worked through each and turned the drill into a quarterly exercise.
B2B SaaS
LANDING ZONE
2024
RESULTS
What changed, by the numbers.
AZ-LOSS RECOVERY TIME
< 90s
FAILURE MODES SURFACED
6 → 0
QUARTERLY DRILLS
ACTIVE
ARCHITECTURE-DIAGRAM CLAIM
PROVED
HOW IT WENT
The first drill was the awkward one. We drained us-east-1c for forty minutes during business hours, in coordination with leadership and the on-call team. Six failure modes surfaced: an undersized RDS standby that couldn’t absorb the load, a Redis cluster pinned to one AZ, two services whose pods were unevenly distributed across AZs, an external API that resolved to a single-AZ endpoint, and one config flag nobody knew existed.
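The pinned-cluster and uneven-pod findings are the kind a pre-drill check can flag. Below is a minimal sketch of such a check, not the tooling used in this engagement. The service names, the replica inventory, the zones other than us-east-1c, and the min_replicas threshold are all hypothetical; in practice the inventory would come from your orchestrator or cloud API.

```python
from collections import Counter

def az_skew(placements, zones):
    """Difference between the most and least populated zone for one service."""
    counts = Counter(placements)
    per_zone = [counts.get(zone, 0) for zone in zones]
    return max(per_zone) - min(per_zone)

def survives_zone_loss(placements, zones, min_replicas):
    """True if draining any single zone still leaves at least min_replicas."""
    counts = Counter(placements)
    total = len(placements)
    return all(total - counts.get(zone, 0) >= min_replicas for zone in zones)

# Assumption: a three-AZ region. Only us-east-1c appears in the case study.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

# Hypothetical inventory: service name -> AZ of each running replica.
inventory = {
    "billing-api": ["us-east-1a", "us-east-1b", "us-east-1c"],
    "report-worker": ["us-east-1c", "us-east-1c", "us-east-1a"],
    "session-cache": ["us-east-1c", "us-east-1c"],
}

# For each service, record the skew and whether it tolerates losing one AZ.
findings = {
    name: (az_skew(replicas, ZONES), survives_zone_loss(replicas, ZONES, min_replicas=2))
    for name, replicas in inventory.items()
}

for name, (skew, ok) in findings.items():
    status = "ok" if ok else "AT RISK"
    print(f"{name}: skew={skew} {status}")
```

A service can be spread over two zones and still fail the second check, which is why the sketch tests survivability directly rather than skew alone.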
Each finding got a remediation ticket with the drill recording attached. The next drill, six weeks later, surfaced two new, much smaller findings. The third drill surfaced none: the architecture had genuinely caught up with the claim.
AZ-loss recovery time, customer-measured, is now under 90 seconds. The drill runs quarterly and targets a different AZ each time. The architecture diagram says "multi-AZ" and it’s now a verifiable property, not an aspiration.
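One way to put a number on customer-measured recovery is to run an external probe during the drill and time the outage window. The sketch below assumes a hypothetical probe log of (timestamp in seconds, success) pairs sorted by time. The 90-second budget matches the figure above; the sample data is invented for illustration.

```python
def recovery_seconds(probes):
    """Seconds from the first failed probe to the first success after the last failure.

    Returns 0.0 if no probe failed, or None if the log ends before recovery.
    """
    failed = [i for i, (_, ok) in enumerate(probes) if not ok]
    if not failed:
        return 0.0
    first, last = failed[0], failed[-1]
    if last + 1 >= len(probes):
        return None  # still failing when the log ends
    # The probe right after the last failure is, by construction, a success.
    return float(probes[last + 1][0] - probes[first][0])

RECOVERY_BUDGET_SECONDS = 90  # target taken from the case study

# Hypothetical probe log from outside the platform, one probe every 10 seconds.
drill_log = [(0, True), (10, True), (20, False), (30, False),
             (40, False), (50, True), (60, True)]

elapsed = recovery_seconds(drill_log)
within_budget = elapsed is not None and elapsed < RECOVERY_BUDGET_SECONDS
print(f"recovery={elapsed}s within_budget={within_budget}")
```

The measurement is only as fine as the probe interval, so a 10-second probe gives a result accurate to roughly 10 seconds.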