Zhivko Todorov
CASE 04 · STERLING · 2024

DR·AURORA GLOBAL·PROD-READINESS·MULTI-REGION

Cross-region DR that survives a production hour, not just a drill.

A publicly-traded SaaS company had a documented DR plan and had never actually tested it under real traffic. Their auditors had stopped accepting "we have a runbook" as evidence. We rebuilt the DR posture around Aurora Global Database and ran four real cutovers with paying customers on the line.

INDUSTRY

Public SaaS

DOMAIN

RELIABILITY

DELIVERED

2024

STACK

AURORA POSTGRES GLOBAL·ROUTE 53 ARC·CLOUDFRONT·TRANSIT GATEWAY·EKS (TWO REGIONS)·BACKUP·CHAOS ENGINEERING

RESULTS

What changed, by the numbers.

RTO

4m

CUSTOMER-MEASURED

RPO

< 1s

AURORA GLOBAL REPLICATION

DR DRILLS

4/yr

REAL TRAFFIC, NOT TABLETOP

COST OVERHEAD

+11%

ON A $180K BASELINE

HOW IT WENT

The old DR plan was a wiki page and a half-tested Lambda. It claimed a 5-minute RPO, but when we tested an actual restore from backup, it took 4 hours. Auditors were polite but skeptical.

We rebuilt around Aurora Global Database with a managed failover path, Route 53 Application Recovery Controller for the DNS flip, and a warm-standby EKS cluster in us-west-2 that ran at 20% capacity. Transit Gateway peered the two regions; Backup handled the rest of the state.
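A minimal sketch of how that cutover sequence could be scripted, not the runbook used on this engagement. It assumes boto3-style clients are passed in: an `rds` client (for `failover_global_cluster` and `describe_global_clusters`) and a `route53-recovery-cluster` client (for `update_routing_control_states`). The `CutoverPlan` fields, the polling defaults, and the order of steps are illustrative assumptions.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class CutoverPlan:
    # Every identifier here is a placeholder, not a value from the engagement.
    global_cluster_id: str
    target_cluster_arn: str            # secondary Aurora cluster (us-west-2)
    primary_routing_control_arn: str   # ARC routing control for the old region
    standby_routing_control_arn: str   # ARC routing control for us-west-2


def wait_until_writer(rds: Any, plan: CutoverPlan, attempts: int,
                      delay_s: float, sleep: Callable[[float], None]) -> bool:
    """Poll the global cluster until the target member reports IsWriter."""
    for _ in range(attempts):
        resp = rds.describe_global_clusters(
            GlobalClusterIdentifier=plan.global_cluster_id)
        members = resp["GlobalClusters"][0]["GlobalClusterMembers"]
        if any(m["DBClusterArn"] == plan.target_cluster_arn and m["IsWriter"]
               for m in members):
            return True
        sleep(delay_s)
    return False


def run_cutover(rds: Any, arc: Any, plan: CutoverPlan, attempts: int = 60,
                delay_s: float = 5.0,
                sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Promote the secondary cluster, wait for it, then flip traffic."""
    steps: List[str] = []
    # 1. Ask Aurora Global Database to promote the target cluster.
    rds.failover_global_cluster(
        GlobalClusterIdentifier=plan.global_cluster_id,
        TargetDbClusterIdentifier=plan.target_cluster_arn,
    )
    steps.append("aurora_failover_requested")
    # 2. Do not move traffic until the target actually accepts writes.
    if not wait_until_writer(rds, plan, attempts, delay_s, sleep):
        raise TimeoutError("target cluster did not become writer in time")
    steps.append("target_is_writer")
    # 3. Flip both routing controls in a single request.
    arc.update_routing_control_states(UpdateRoutingControlStateEntries=[
        {"RoutingControlArn": plan.standby_routing_control_arn,
         "RoutingControlState": "On"},
        {"RoutingControlArn": plan.primary_routing_control_arn,
         "RoutingControlState": "Off"},
    ])
    steps.append("routing_controls_flipped")
    return steps
```

Passing the clients in as arguments keeps the sequence testable with stubs, so the script itself can be rehearsed before it ever touches a real cluster.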

Then we ran cutovers. Real ones. During business hours. The first took 17 minutes and surfaced three bugs, all fixed by the second. By the fourth drill we were at 4 minutes from "trigger" to "customers transacting in us-west-2." The auditors signed off the same quarter.
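The page does not say how the trigger-to-transacting interval was instrumented. One hypothetical way to time a drill is to start a clock at the trigger and poll a customer-path probe until it first succeeds. In this sketch `measure_rto`, the `probe` callable, and the timeout defaults are all assumptions.

```python
import time
from typing import Callable, Optional


def measure_rto(probe: Callable[[], bool],
                timeout_s: float = 1800.0,
                interval_s: float = 1.0,
                clock: Callable[[], float] = time.monotonic,
                sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Return seconds from the call (the "trigger") until probe() first
    returns True, or None if timeout_s elapses first.

    probe is expected to exercise a real customer path in the standby
    region (for example a test transaction) and to return False, not
    raise, when that path is unavailable.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Using a monotonic clock avoids skew from wall-clock adjustments during the drill, and injecting `clock` and `sleep` lets the timer be exercised without waiting in real time.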
