Zhivko Todorov
CASE 88 · SIENNA · 2024

AZ FAILOVER·PLAYBOOK·CHAOS·MULTI-AZ

A "lose an AZ" playbook the team has actually run.

A B2B SaaS platform claimed multi-AZ on every architecture diagram but had never tested losing one. The first chaos drill — drain a production AZ at 14:00 on a Wednesday — surfaced six different failure modes. We worked through each and turned the drill into a quarterly exercise.

INDUSTRY

B2B SaaS

DOMAIN

LANDING ZONE

DELIVERED

2024

STACK

AWS FAULT INJECTION SIMULATOR·ROUTE 53·CLOUDWATCH ALARMS·EKS (MULTI-AZ)·RDS MULTI-AZ·ELASTICACHE

RESULTS

What changed, by the numbers.

AZ-LOSS RECOVERY TIME

< 90s

CUSTOMER-MEASURED

FAILURE MODES SURFACED

6 → 0

ALL REMEDIATED

QUARTERLY DRILLS

ACTIVE

ROTATING AZ

ARCHITECTURE-DIAGRAM CLAIM

PROVED

MULTI-AZ FOR REAL

HOW IT WENT

The first drill was the awkward one. We drained AZ us-east-1c for forty minutes during business hours, in coordination with leadership and the on-call team. Six failure modes surfaced: an undersized RDS standby that couldn’t absorb the load, a Redis cluster pinned to one AZ, two services whose pods were unevenly distributed across AZs, an external API that resolved to a single-AZ endpoint, and one config flag nobody knew existed.
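The uneven pod distribution finding is the kind of thing that can be checked before a drill rather than during one. The following is a minimal sketch, not the tooling used in this engagement: the service names, the placements, and the `min_ready` threshold are all hypothetical, and in practice the zone list per pod would come from the cluster rather than a hard-coded dictionary.

```python
from collections import Counter

def zone_skew(pod_zones, all_zones):
    """Max minus min pod count across all_zones (a zone with no pods counts as 0)."""
    counts = Counter(pod_zones)
    per_zone = [counts.get(zone, 0) for zone in all_zones]
    return max(per_zone) - min(per_zone)

def survives_zone_loss(pod_zones, lost_zone, min_ready):
    """True if at least min_ready pods remain once lost_zone is drained."""
    remaining = sum(1 for zone in pod_zones if zone != lost_zone)
    return remaining >= min_ready

# Hypothetical inputs: one zone entry per running pod of each service.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]
placements = {
    "checkout": ["us-east-1a", "us-east-1b", "us-east-1c"],
    "reports": ["us-east-1c", "us-east-1c", "us-east-1a"],
}

for service, zones in placements.items():
    skew = zone_skew(zones, ZONES)
    ok = survives_zone_loss(zones, "us-east-1c", min_ready=2)
    print(f"{service}: skew={skew} survives_us-east-1c_loss={ok}")
```

A skew of 0 or 1 means the pods are spread as evenly as the replica count allows. A larger skew, or a failed survival check for any single zone, flags a service worth fixing before the next drain.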

Each finding got a remediation ticket with the drill recording attached. The next drill, six weeks later, surfaced two new and much smaller findings. The third drill surfaced none: the architecture had genuinely caught up with the claim.

AZ-loss recovery time, customer-measured, is now under 90 seconds. The drill rotates AZs quarterly. The architecture diagram says "multi-AZ" and it’s now a verifiable property, not an aspiration.
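The case study does not say how the customer-measured recovery time is computed. One plausible definition, sketched here under that assumption, is the time from the first failed health probe to the first successful probe after which no further failures occur. The probe data and the `recovery_seconds` helper are hypothetical.

```python
def recovery_seconds(probes):
    """Compute recovery time from time-ordered (seconds_since_drill_start, ok) probes.

    Returns the seconds between the first failed probe and the first successful
    probe that follows the last failure. Returns 0.0 if nothing failed, and
    None if the final probe is still failing (recovery not observed).
    """
    first_fail_time = None
    last_fail_index = None
    for index, (timestamp, ok) in enumerate(probes):
        if not ok:
            if first_fail_time is None:
                first_fail_time = timestamp
            last_fail_index = index
    if first_fail_time is None:
        return 0.0
    if last_fail_index == len(probes) - 1:
        return None
    recovered_at = probes[last_fail_index + 1][0]
    return recovered_at - first_fail_time

# Hypothetical probe log: one brief flap at t=20 before the service settles.
probes = [(0, True), (5, False), (10, False), (15, True),
          (20, False), (25, True), (30, True)]

elapsed = recovery_seconds(probes)
print(f"recovery: {elapsed}s, within 90s target: {elapsed is not None and elapsed < 90}")
```

Counting from the first failure to the last sustained recovery deliberately includes flapping, so a service that bounces between healthy and unhealthy does not look recovered too early.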
