CASE 144 · RAPID · 2023
A DNS failover that the customer never sees.
An e-commerce platform had a hot-standby second region but had never tested a failover under real traffic. Their previous attempt at DNS failover had taken 4 minutes to converge and had pointed half the traffic at a stale endpoint. We rebuilt it around Route 53 health checks and tight TTLs.
E-commerce
RELIABILITY
2023
RESULTS
What changed, by the numbers.
FAILOVER TIME
< 60s
STALE ROUTING WINDOWS
0
DRILLS / QUARTER
1
TRAFFIC IMPACT
INVISIBLE
HOW IT WENT
The previous failover attempt had a TTL of 300 seconds and a health check that polled every 30 seconds with a two-of-three threshold. On top of the time needed to detect the failure, resolvers could keep serving the cached record for up to five minutes. That is too slow for the kind of regional event the team was trying to insure against.
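The arithmetic behind that window can be sketched in a few lines. This is a simplified model, not the team's own calculation: it assumes the "two-of-three" threshold means two failed polls are enough to mark the endpoint unhealthy, and it ignores resolvers that disregard TTLs. The function name is made up for illustration.

```python
def worst_case_convergence_s(ttl_s: int, interval_s: int, failures_needed: int) -> int:
    """Rough upper bound on DNS failover convergence, in seconds.

    Detection time is modelled as `failures_needed` polls at `interval_s`
    spacing; after that, a resolver that cached the old record just before
    the flip may keep serving it for a full TTL.
    """
    detection_s = interval_s * failures_needed
    return detection_s + ttl_s


# Old settings from the case: TTL 300 s, 30 s polls.
# failures_needed=2 is an assumption about the two-of-three threshold.
old_bound = worst_case_convergence_s(ttl_s=300, interval_s=30, failures_needed=2)
print(f"old worst case: {old_bound} s")
```

Under these assumptions the bound is dominated by the TTL term, which is why shortening the TTL matters more than tuning the health check alone.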
We dropped the TTL on the failover record to 60 seconds and tightened the health check to 10-second intervals with a 2-of-2 threshold. Aurora Global Database handled the data side, and CloudFront origin failover handled the static assets.
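As a rough illustration, the DNS settings above might map onto Route 53 request payloads like this. It is a sketch, not the configuration used in the engagement. The hostnames, path, IP addresses, and helper names are placeholders, and reading "2-of-2" as `FailureThreshold=2` is an assumption. The code only builds dictionaries and does not call AWS.

```python
from typing import Any, Optional


def health_check_config(fqdn: str, path: str) -> dict[str, Any]:
    """Payload shaped like Route 53's HealthCheckConfig."""
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": fqdn,
        "Port": 443,
        "ResourcePath": path,
        "RequestInterval": 10,   # seconds; Route 53 accepts 10 or 30
        "FailureThreshold": 2,   # assumed reading of the 2-of-2 threshold
    }


def failover_change(name: str, role: str, ip: str,
                    health_check_id: Optional[str] = None) -> dict[str, Any]:
    """One UPSERT change for a failover record; role is PRIMARY or SECONDARY."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError(f"unknown failover role: {role}")
    record: dict[str, Any] = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-region",
        "Failover": role,
        "TTL": 60,  # the tightened TTL from the case
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# Placeholder values (documentation domain, IP ranges, and check ID).
check = health_check_config("primary.example.com", "/healthz")
change_batch = {
    "Changes": [
        failover_change("shop.example.com.", "PRIMARY", "192.0.2.10", "hc-placeholder-id"),
        failover_change("shop.example.com.", "SECONDARY", "198.51.100.10"),
    ]
}
```

With boto3, these would be passed to `create_health_check(CallerReference=..., HealthCheckConfig=check)` and `change_resource_record_sets(HostedZoneId=..., ChangeBatch=change_batch)`. The health check ID returned by the first call would replace the placeholder.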
Drills now run quarterly: at 14:00 the team shuts down the primary region's health-check endpoint. Customers see under a minute of impact, well below the threshold that customer-experience monitoring flags. The team has run four drills and had zero unplanned failovers, which is the right ratio.
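One way to put a number on drill impact is to probe the public endpoint at a fixed interval and report the longest run of failed probes. The sketch below assumes that approach. The probe samples are invented for illustration and are not measurements from the engagement.

```python
from datetime import datetime, timedelta


def longest_outage(samples: list[tuple[datetime, bool]]) -> timedelta:
    """Longest span from a first failed probe to the next successful one.

    `samples` must be sorted by timestamp; each item is (timestamp, ok).
    An outage that never recovers within the samples is not counted.
    """
    longest = timedelta(0)
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts            # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None          # outage ends
    return longest


# Invented probe log: one probe every 10 s starting at 14:00:00,
# failing from 14:00:20 through 14:00:50.
start = datetime(2023, 1, 1, 14, 0, 0)
samples = [
    (start + timedelta(seconds=s), not (20 <= s <= 50))
    for s in range(0, 100, 10)
]
impact = longest_outage(samples)
print(f"longest outage: {impact.total_seconds():.0f} s")
```

The resolution of this measurement is limited by the probe interval, so a 10-second probe can misjudge the true outage by up to one interval on either side.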