Zhivko Todorov
CASE 65 · BRIAR · 2026

CELL-BASED · SHUFFLE-SHARDING · EKS · ROUTE 53

A bad deploy that only one customer notices.

A SaaS platform with 800 tenants on shared infrastructure had suffered two "every customer affected" outages in twelve months. We adopted a cell-based architecture with shuffle-sharded tenant placement; the next bad deploy affected a single cell of 32 tenants instead of all 800.

INDUSTRY

SaaS platform

DOMAIN

RELIABILITY

DELIVERED

2026

STACK

EKS (PER CELL)·AURORA POSTGRES (PER CELL)·ROUTE 53·SHUFFLE-SHARDING·TERRAFORM·ARGOCD

RESULTS

What changed, by the numbers.

BLAST RADIUS

32 / 800

PER FAILURE EVENT

CELLS

25

CAPACITY-PLANNED PER CELL

INCIDENT POSTMORTEMS

−71%

CUSTOMER-VISIBLE

COST PER TENANT

+9%

WORTHWHILE OVERHEAD

HOW IT WENT

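The "everyone affected" outages had a common shape: a single fault, whether a bad deploy, a database lock, or a failing third-party dependency, reached every tenant because everything was shared. The remediation was always "be more careful." With nothing in the architecture to contain a fault, being careful was the only line of defence.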

We split the platform into 25 cells, each a self-contained EKS cluster and Aurora cluster serving 32 tenants. Shuffle-sharding placed each tenant on a primary cell and two replica cells, so losing a cell affected at most the 32 tenants homed on it, and their replica cells absorbed the load. Route 53 health checks handled the cutover.
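A minimal sketch of the placement idea, not the production code. It assumes tenants are numbered 0 to 799, fills primary cells in order so each cell is home to exactly 32 tenants, and picks each tenant's two replica cells from a hash-seeded shuffle. The names `place_tenant` and `serving_cell` are hypothetical, and `serving_cell` only stands in for the decision that Route 53 health checks made in the real system.

```python
import hashlib
import random
from typing import Optional

NUM_CELLS = 25          # from the case study
TENANTS_PER_CELL = 32   # from the case study
REPLICAS = 2            # "two replica cells"


def place_tenant(tenant_index: int) -> list[int]:
    """Return [primary, replica, replica] cell ids for one tenant.

    Assumption: tenants are numbered 0..799 and primaries are filled in
    order, so every cell is the primary for exactly 32 tenants.
    """
    primary = tenant_index // TENANTS_PER_CELL
    # Seed a private RNG from a hash of the tenant so the replica choice is
    # deterministic but differs from tenant to tenant (the "shuffle").
    digest = hashlib.sha256(f"tenant-{tenant_index}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    candidates = [c for c in range(NUM_CELLS) if c != primary]
    return [primary] + rng.sample(candidates, REPLICAS)


def serving_cell(placement: list[int], unhealthy: set[int]) -> Optional[int]:
    """Pick the first healthy cell in placement order, or None if all are down."""
    for cell in placement:
        if cell not in unhealthy:
            return cell
    return None


placements = {t: place_tenant(t) for t in range(NUM_CELLS * TENANTS_PER_CELL)}
```

With this layout, marking one cell unhealthy moves only the 32 tenants whose primary is that cell, and because replica pairs differ per tenant, their load spreads across several cells instead of landing on one neighbour.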

The rate of customer-visible postmortems dropped 71%. The next bad deploy affected one cell: its 32 tenants saw a 90-second degradation while the deploy rolled back. The other 768 tenants noticed nothing. The 9% cost overhead is well-spent insurance.
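The blast-radius figures follow directly from the cell count. A quick check of the arithmetic, using only the numbers quoted above (the variable names are ours):

```python
TOTAL_TENANTS = 800
NUM_CELLS = 25

tenants_per_cell = TOTAL_TENANTS // NUM_CELLS       # tenants homed on one cell
unaffected = TOTAL_TENANTS - tenants_per_cell       # tenants on the other 24 cells
blast_radius = tenants_per_cell / TOTAL_TENANTS     # share of tenants hit by one cell failure

print(f"{tenants_per_cell} affected, {unaffected} unaffected, {blast_radius:.0%} blast radius")
# prints: 32 affected, 768 unaffected, 4% blast radius
```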
