CASE 65 · BRIAR · 2026
A bad deploy that only one customer notices.
A SaaS platform with 800 tenants on shared infrastructure had suffered two "every customer affected" outages in twelve months. We adopted a cell-based architecture with shuffle-sharded tenant placement; the next bad deploy affected one cell of 32 tenants, not all 800.
SaaS platform
RELIABILITY
2026
RESULTS
What changed, by the numbers.
BLAST RADIUS
32 / 800
CELLS
25
INCIDENT POSTMORTEMS
−71%
COST PER TENANT
+9%
HOW IT WENT
The "everyone affected" outages had a common shape: a single fault (a bad deploy, a database lock, a failing third-party dependency) reached every tenant because all 800 shared one stack. The remediation was always "be more careful." The architecture made being careful the only line of defence.
We split the platform into 25 cells, each a self-contained EKS cluster plus Aurora cluster handling 32 tenants. Shuffle-sharding placed each tenant on a primary cell and two replica cells, so a cell loss affected at most that cell’s 32 tenants and the replicas absorbed the load. Route 53 health checks handled the cutover.
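To make the placement concrete, here is a minimal sketch of how a scheme like this could be expressed. It is an illustration under stated assumptions, not the production implementation: the function names (`place_tenant`, `serving_cell`), the tenant ID format, the round-robin choice of primary cell and the hash-seeded choice of replicas are all hypothetical. In the real system the cutover was handled by Route 53 health checks; `serving_cell` only models the outcome of that decision.

```python
import hashlib
import random
from typing import List, Optional, Set, Tuple

NUM_CELLS = 25            # from the case study
TOTAL_TENANTS = 800       # from the case study
REPLICAS_PER_TENANT = 2   # from the case study


def place_tenant(tenant_index: int, tenant_id: str) -> Tuple[int, List[int]]:
    """Return (primary_cell, replica_cells) for one tenant.

    Assumption: primaries are assigned round-robin so that every cell
    ends up with exactly 800 / 25 = 32 primary tenants.
    """
    primary = tenant_index % NUM_CELLS
    # Seed a per-tenant RNG from a stable hash of the tenant ID so the
    # replica choice is reproducible (hypothetical scheme).
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Replicas are drawn from the other 24 cells, so tenants that share a
    # primary cell are spread across different replica pairs.
    candidates = [c for c in range(NUM_CELLS) if c != primary]
    replicas = rng.sample(candidates, REPLICAS_PER_TENANT)
    return primary, replicas


def serving_cell(primary: int, replicas: List[int], healthy: Set[int]) -> Optional[int]:
    """Pick the first healthy cell, preferring the primary.

    Stands in for the health-check-driven cutover; returns None if the
    primary and both replicas are all unhealthy.
    """
    for cell in [primary, *replicas]:
        if cell in healthy:
            return cell
    return None


# Hypothetical tenant IDs, for illustration only.
placements = {
    f"tenant-{i:03d}": place_tenant(i, f"tenant-{i:03d}")
    for i in range(TOTAL_TENANTS)
}
```

With this sketch, losing one cell moves each of its 32 primary tenants to that tenant's own first replica, so the displaced load is scattered across several cells instead of landing on a single neighbour.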
The rate of customer-visible postmortems dropped 71%. The next bad deploy affected one cell: its 32 tenants saw a 90-second degradation while the deploy rolled back. The other 768 tenants noticed nothing. The 9% increase in cost per tenant is well-spent insurance.
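The blast-radius figures follow directly from the cell count. As a quick worked check, using only the numbers quoted in this case study (the variable names are ours):

```python
TOTAL_TENANTS = 800
NUM_CELLS = 25

# Each cell hosts an equal share of the tenants.
tenants_per_cell = TOTAL_TENANTS // NUM_CELLS                # 32

# A fault confined to one cell leaves everyone else untouched.
unaffected = TOTAL_TENANTS - tenants_per_cell                # 768

# Share of customers exposed to a single-cell fault, in percent.
# On the old shared stack the same fault reached 100% of tenants.
blast_radius_pct = 100 * tenants_per_cell / TOTAL_TENANTS    # 4.0
```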