CASE 04 · STERLING · 2024
Cross-region DR that survives a production hour, not just a drill.
A publicly traded SaaS company had a documented DR plan that had never been tested under real traffic. Their auditors had stopped accepting "we have a runbook" as evidence. We rebuilt the DR posture around Aurora Global Database and ran four real cutovers with paying customers on the line.
Public SaaS
RELIABILITY
2024
RESULTS
What changed, by the numbers.
RTO: 4m
RPO: < 1s
DR DRILLS: 4/yr
COST OVERHEAD: +11%
HOW IT WENT
The old DR plan was a wiki page and a half-tested Lambda. It claimed an RPO of 5 minutes, but when a restore from backup was actually tested, it took 4 hours. Auditors were polite but skeptical.
We rebuilt around Aurora Global Database with a managed failover path, Route 53 Application Recovery Controller for the DNS flip, and a warm-standby EKS cluster in us-west-2 running at 20% capacity. Transit Gateway peering connected the two regions; AWS Backup covered the rest of the state.
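The cutover path described above can be sketched as three ordered steps: promote the Aurora secondary, scale the warm standby up from 20%, then flip the routing controls. This is a minimal sketch under stated assumptions, not the tooling used in the engagement. `CutoverTarget`, `run_cutover`, and every identifier and ARN are hypothetical; the three clients are assumed to be boto3 clients for `rds`, `eks`, and `route53-recovery-cluster`, passed in so the sequence can be rehearsed against stubs; and the EKS step assumes managed node groups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CutoverTarget:
    """Hypothetical identifiers for the standby region (us-west-2)."""
    global_cluster_id: str
    standby_db_cluster_arn: str
    eks_cluster: str
    eks_nodegroup: str
    primary_routing_control_arn: str
    standby_routing_control_arn: str
    standby_nodes: int          # nodes running while on warm standby
    standby_percent: int = 20   # warm standby size as a percent of full


def full_capacity_nodes(standby_nodes: int, standby_percent: int) -> int:
    """Node count at 100%, rounded up, using integer arithmetic only."""
    return -(-standby_nodes * 100 // standby_percent)


def run_cutover(rds, eks, arc, target: CutoverTarget) -> list[str]:
    """Issue the three cutover calls in order and return the step names.

    The calls only *start* each operation; real tooling would also poll
    for completion between steps (omitted here).
    """
    steps = []

    # 1. Planned switchover of the Aurora global database to the standby.
    #    An unplanned failover would instead use failover_global_cluster
    #    with AllowDataLoss=True.
    rds.switchover_global_cluster(
        GlobalClusterIdentifier=target.global_cluster_id,
        TargetDbClusterIdentifier=target.standby_db_cluster_arn,
    )
    steps.append("db_switchover_started")

    # 2. Scale the warm-standby node group from 20% to full capacity.
    desired = full_capacity_nodes(target.standby_nodes, target.standby_percent)
    eks.update_nodegroup_config(
        clusterName=target.eks_cluster,
        nodegroupName=target.eks_nodegroup,
        scalingConfig={"minSize": desired, "maxSize": desired, "desiredSize": desired},
    )
    steps.append("standby_scaling_started")

    # 3. Flip both routing controls in one request: primary Off, standby On.
    arc.update_routing_control_states(
        UpdateRoutingControlStateEntries=[
            {"RoutingControlArn": target.primary_routing_control_arn,
             "RoutingControlState": "Off"},
            {"RoutingControlArn": target.standby_routing_control_arn,
             "RoutingControlState": "On"},
        ]
    )
    steps.append("traffic_flipped")
    return steps
```

Passing the clients in, rather than constructing them inside the function, keeps the ordering testable without touching AWS. In a real run the `route53-recovery-cluster` client would be pointed at one of the ARC cluster endpoints.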
Then we ran cutovers. Real ones. During business hours. The first took 17 minutes and surfaced three bugs, all fixed by the second. By the fourth drill we were at 4 minutes from "trigger" to "customers transacting in us-west-2." The auditors signed off the same quarter.
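One way to turn "trigger" to "customers transacting" into audit evidence is to record both timestamps for each drill and compute the elapsed time against a stated target. The sketch below illustrates that idea and is not the record format used in the engagement: `DrillRecord`, `meets_target`, and `format_rto` are hypothetical names, and the date is a placeholder. Only the 17-minute and 4-minute durations come from the account above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillRecord:
    """One cutover drill, timed from trigger to first standby transaction."""
    name: str
    triggered_at: datetime
    first_standby_transaction_at: datetime

    @property
    def measured_rto(self) -> timedelta:
        return self.first_standby_transaction_at - self.triggered_at


def meets_target(drill: DrillRecord, target: timedelta) -> bool:
    """True when the measured recovery time is within the target."""
    return drill.measured_rto <= target


def format_rto(duration: timedelta) -> str:
    """Render a duration as minutes and seconds, e.g. '4m00s'."""
    total_seconds = int(duration.total_seconds())
    return f"{total_seconds // 60}m{total_seconds % 60:02d}s"


# Placeholder trigger time; only the durations reflect the drills described.
trigger = datetime(2024, 1, 1, 10, 0, 0)
first_drill = DrillRecord("drill-1", trigger, trigger + timedelta(minutes=17))
fourth_drill = DrillRecord("drill-4", trigger, trigger + timedelta(minutes=4))

for drill in (first_drill, fourth_drill):
    print(drill.name, format_rto(drill.measured_rto))
```

Keeping the raw timestamps, rather than only the computed duration, lets an auditor check the figure against application logs from the standby region.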