Zhivko Todorov

CASE 144 · RAPID · 2023

ROUTE 53 · HEALTH CHECKS · DNS FAILOVER · MULTI-REGION

A DNS failover that the customer never sees.

An e-commerce platform had a hot-standby second region but had never tested a failover under real traffic. Their previous attempt at DNS failover had taken 4 minutes to converge and had pointed half the traffic at a stale endpoint. We rebuilt it around Route 53 health checks and tight TTLs.

INDUSTRY

E-commerce

DOMAIN

RELIABILITY

DELIVERED

2023

STACK

ROUTE 53 HEALTH CHECKS · ROUTE 53 FAILOVER ROUTING · CLOUDWATCH ALARMS · MULTI-REGION CLOUDFRONT · AURORA GLOBAL

RESULTS

What changed, by the numbers.

FAILOVER TIME

< 60s

CUSTOMER-MEASURED

STALE ROUTING WINDOWS

0

HEALTH CHECKS CLEAN

DRILLS / QUARTER

1

ROTATING

TRAFFIC IMPACT

INVISIBLE

BELOW NOTICE THRESHOLD

HOW IT WENT

The previous failover attempt used a 300-second TTL and a health check that polled every 30 seconds with a two-of-three threshold. From the TTL alone, the math gave a five-minute floor on convergence, which is too slow for the kind of regional event the team was trying to insure against.
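The arithmetic behind that floor can be sketched in a few lines. This is a simplified model, not the team's own calculation: it assumes detection takes roughly the polling interval multiplied by the number of failed checks required, and that a resolver may then serve its cached answer for one full TTL. Reading "two-of-three" as two failed checks is an assumption, and the function name is hypothetical.

```python
def worst_case_convergence(ttl_s: int, interval_s: int, failures_needed: int) -> int:
    """Rough upper bound, in seconds, on how long a client can keep
    resolving to a dead primary under this simplified model."""
    # Time for the health check to declare the endpoint unhealthy.
    detection_s = interval_s * failures_needed
    # A resolver that cached the old answer just before the record flipped
    # keeps serving it for one full TTL.
    return detection_s + ttl_s


# Previous setup: 300 s TTL, 30 s polling, two failures assumed.
previous = worst_case_convergence(ttl_s=300, interval_s=30, failures_needed=2)
print(previous)  # 360 (60 s of detection plus the 300 s TTL)
```

The TTL term dominates here, which is why lowering it matters more than tuning the health check on its own. Measured impact is usually below this bound, since most resolver caches are already partway through their TTL when the record changes.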

We dropped the TTL on the failover record to 60 seconds and tightened the health check to 10-second intervals with a 2-of-2 threshold. Aurora Global handled the data side. CloudFront origin failover handled the static assets.
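As an illustration only, the settings above map onto the request shapes that Route 53's `CreateHealthCheck` and `ChangeResourceRecordSets` operations accept. The sketch below builds those payloads as plain dictionaries. The domain, path, IP addresses, and helper names are hypothetical, plain A records are an assumption (the case study does not say which record type was used), and mapping the "2-of-2 threshold" to `FailureThreshold=2` is also an assumption.

```python
from typing import Any, Dict, Optional


def health_check_config(fqdn: str, path: str = "/healthz") -> Dict[str, Any]:
    """HealthCheckConfig payload for a fast-interval HTTPS check."""
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": fqdn,
        "Port": 443,
        "ResourcePath": path,        # hypothetical path
        "RequestInterval": 10,       # Route 53 accepts 10 or 30 seconds
        "FailureThreshold": 2,       # assumed reading of "2-of-2"
    }


def failover_change(name: str, role: str, ip: str,
                    health_check_id: Optional[str] = None) -> Dict[str, Any]:
    """One UPSERT change for a PRIMARY or SECONDARY failover record."""
    record: Dict[str, Any] = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-region",
        "Failover": role,            # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


config = health_check_config("primary.shop.example.com")
changes = [
    failover_change("shop.example.com.", "PRIMARY", "192.0.2.10", "hc-id-placeholder"),
    failover_change("shop.example.com.", "SECONDARY", "198.51.100.10"),
]
# With boto3 these would be passed roughly as:
#   route53.create_health_check(CallerReference=..., HealthCheckConfig=config)
#   route53.change_resource_record_sets(HostedZoneId=...,
#                                       ChangeBatch={"Changes": changes})
```

Attaching the health check only to the primary record is one common arrangement; whether the secondary also carried a check is not stated in the case study.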

Quarterly drills now: shut down the primary region’s health check endpoint at 14:00. Customers measure under a minute of impact, well below the threshold that customer-experience monitoring flags. The team has run four drills and had zero unplanned failovers, which is the right ratio.
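One way to put a number on "under a minute of impact" during a drill is to probe the public endpoint at a fixed cadence and look for the longest run of failures. The sketch below assumes a list of hypothetical `(timestamp_seconds, ok)` samples collected by some external prober. It is not the customer's monitoring, only an illustration of the measurement.

```python
from typing import Iterable, Optional, Tuple


def longest_impact_window(samples: Iterable[Tuple[float, bool]]) -> float:
    """Longest span, in seconds, from the first failed probe of a run
    to the next successful probe. A run that never recovers is ignored."""
    longest = 0.0
    run_start: Optional[float] = None
    for ts, ok in samples:
        if not ok and run_start is None:
            run_start = ts                      # a failure run begins
        elif ok and run_start is not None:
            longest = max(longest, ts - run_start)
            run_start = None                    # service recovered
    return longest


# Hypothetical drill: probes every 5 s, failures from t=10 through t=50.
drill = [(float(t), not (10 <= t <= 50)) for t in range(0, 65, 5)]
print(longest_impact_window(drill))  # 45.0
```

The resolution of the result is limited by the probe cadence, so a 5-second prober can understate or overstate the true window by up to one interval.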
