CASE 151 · YACHT · 2025
Targets that fail the right way, not the wrong way.
A streaming media company had an ALB target group where unhealthy instances took 90 seconds to drain and healthy-but-slow instances kept serving traffic. The combination caused measurable customer impact during deploys. We retuned the health-check thresholds, the check interval, and connection draining, and added detection for slow targets.
Streaming media
RELIABILITY
2025
RESULTS
What changed, by the numbers.
DEPLOY-WINDOW INCIDENTS
−92%
UNHEALTHY DRAIN TIME
15s
SLOW-TARGET DETECTION
ENABLED
HEALTHY-THRESHOLD
2
HOW IT WENT
The defaults that came with the target group had been "what the wizard suggested when we set this up." They had not been revisited as the application matured. An unhealthy-threshold of 5 meant 50 seconds (five consecutive failed checks, which implies a 10-second interval) before a misbehaving target got pulled; a healthy-threshold of 5 meant 50 more seconds before a recovered target rejoined.
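The arithmetic behind those windows is threshold times interval. A minimal sketch, assuming the 10-second check interval that the 50-second figure implies (the interval is not stated outright) and ignoring per-check timeouts; the function name is hypothetical:

```python
def detection_window_seconds(threshold: int, interval_seconds: int) -> int:
    """Approximate time for enough consecutive checks to flip a target's state."""
    return threshold * interval_seconds


# Original settings: threshold of 5. The 10-second interval is inferred
# from the 50-second figure, not quoted from the configuration.
before = detection_window_seconds(threshold=5, interval_seconds=10)

# Tuned settings: threshold of 2 with a 5-second interval.
after = detection_window_seconds(threshold=2, interval_seconds=5)

print(before, after)
```

The same formula applies in both directions: the unhealthy-threshold governs how long a bad target keeps serving, and the healthy-threshold governs how long a recovered target waits to rejoin.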
We set the unhealthy-threshold to 2 with a 5-second interval (10-second detection), the healthy-threshold to 2 (10-second rejoin), and connection draining (the target group's deregistration delay) to 15 seconds. For slow-target detection, we added a CloudWatch alarm on p99 response time that flags slow-but-healthy targets for restart via the ECS deployment circuit breaker.
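As a rough illustration, the tuned values could be expressed as keyword arguments for boto3's `modify_target_group`, `modify_target_group_attributes`, and `put_metric_alarm` calls. This is a sketch, not the configuration used in the engagement: the helper names, the 2-second check timeout, the alarm name, the 1-second p99 threshold, and the alarm periods are all assumptions.

```python
from typing import Any


def health_check_settings(target_group_arn: str) -> dict[str, Any]:
    """Keyword arguments for elbv2.modify_target_group with the tuned values."""
    return {
        "TargetGroupArn": target_group_arn,
        "HealthCheckIntervalSeconds": 5,
        "HealthCheckTimeoutSeconds": 2,  # assumption: not given in the case study
        "UnhealthyThresholdCount": 2,    # 2 x 5s = 10-second detection
        "HealthyThresholdCount": 2,      # 2 x 5s = 10-second rejoin
    }


def draining_settings(target_group_arn: str) -> dict[str, Any]:
    """Keyword arguments for elbv2.modify_target_group_attributes (15s drain)."""
    return {
        "TargetGroupArn": target_group_arn,
        "Attributes": [
            {"Key": "deregistration_delay.timeout_seconds", "Value": "15"},
        ],
    }


def slow_target_alarm(load_balancer_dim: str, target_group_dim: str) -> dict[str, Any]:
    """Keyword arguments for cloudwatch.put_metric_alarm on p99 response time."""
    return {
        "AlarmName": "slow-target-p99",  # hypothetical name
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        "Dimensions": [
            {"Name": "LoadBalancer", "Value": load_balancer_dim},
            {"Name": "TargetGroup", "Value": target_group_dim},
        ],
        "ExtendedStatistic": "p99",
        "Period": 60,                    # assumption
        "EvaluationPeriods": 3,          # assumption
        "Threshold": 1.0,                # assumption: seconds of p99 latency
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

With credentials configured, these would be passed along the lines of `boto3.client("elbv2").modify_target_group(**health_check_settings(arn))`. Building the arguments separately from the call keeps the values reviewable and testable without touching AWS.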
Deploy-window incidents dropped 92% year-over-year. The remaining 8% are caught faster, because slow targets are now detected before they accumulate enough customer impact to be noticed. Drain time fell from 90 seconds to 15; healthy-rejoin fell from 50 seconds to 10. The customer experience during deploys is finally what the team always claimed it was.