Zhivko Todorov
CASE 151 · YACHT · 2025

ALB · TARGET GROUP · HEALTH CHECKS · TUNING

Targets that fail the right way, not the wrong way.

A streaming media company had an ALB target group where unhealthy instances took 90 seconds to drain and healthy-but-slow instances kept serving traffic. The combination caused measurable customer impact during deploys. We tuned the health-check thresholds, the check interval, and the drain time.

INDUSTRY

Streaming media

DOMAIN

RELIABILITY

DELIVERED

2025

STACK

APPLICATION LOAD BALANCER · TARGET GROUPS · CLOUDWATCH METRICS · ECS FARGATE · DEPLOYMENT CIRCUIT BREAKER

RESULTS

What changed, by the numbers.

DEPLOY-WINDOW INCIDENTS

−92%

YEAR-OVER-YEAR

UNHEALTHY DRAIN TIME

15s

WAS 90s

SLOW-TARGET DETECTION

ENABLED

RESPONSE-TIME ALARM

HEALTHY-THRESHOLD

2

TUNED FROM 5

HOW IT WENT

The target group was still running on "what the wizard suggested when we set this up." Those settings had not been revisited as the application matured. At a 10-second check interval, an unhealthy threshold of 5 meant 50 seconds before a misbehaving target was pulled; a healthy threshold of 5 meant another 50 seconds before a recovered target rejoined.
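The 50-second figures follow from simple arithmetic: consecutive checks required multiplied by the check interval. A minimal sketch of that calculation is below. The function name is ours for illustration, and the 10-second interval is inferred from the 50-second figures rather than stated outright.

```python
def seconds_to_transition(threshold: int, interval_s: int) -> int:
    """Approximate time for a target to change state.

    A target flips state after `threshold` consecutive check results,
    and checks run every `interval_s` seconds.
    """
    return threshold * interval_s


# Thresholds come from the write-up; the interval is an inferred assumption.
OLD_INTERVAL_S = 10
old_detect_s = seconds_to_transition(threshold=5, interval_s=OLD_INTERVAL_S)
old_rejoin_s = seconds_to_transition(threshold=5, interval_s=OLD_INTERVAL_S)

print(f"pulled after ~{old_detect_s}s, rejoined after ~{old_rejoin_s}s")
```

This is a lower bound in practice: the timing of the first failing check and the per-check timeout can add a few seconds on top.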

We set the unhealthy threshold to 2 on a 5-second interval (10-second detection), the healthy threshold to 2 (10-second rejoin), and connection draining (the target group's deregistration delay) to 15 seconds. For slow-target detection, we added a CloudWatch alarm on p99 response time that flags slow-but-healthy targets for restart via the ECS deployment circuit breaker.
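For readers who want to see the shape of these settings, here is a hedged sketch using boto3 request parameters. It is not the client's actual tooling. The thresholds, interval, and drain time come from the write-up; the health-check timeout, the alarm name, the alarm threshold, the period, and the evaluation window are assumptions, as are all ARNs and dimension values passed in.

```python
from typing import Any


def health_check_settings(target_group_arn: str) -> dict[str, Any]:
    """Keyword arguments for elbv2.modify_target_group."""
    return {
        "TargetGroupArn": target_group_arn,
        "HealthCheckIntervalSeconds": 5,
        # Assumption: the write-up does not state the timeout.
        # It is kept below the 5-second interval here.
        "HealthCheckTimeoutSeconds": 3,
        "HealthyThresholdCount": 2,
        "UnhealthyThresholdCount": 2,
    }


def drain_settings(target_group_arn: str) -> dict[str, Any]:
    """Keyword arguments for elbv2.modify_target_group_attributes."""
    return {
        "TargetGroupArn": target_group_arn,
        "Attributes": [
            {"Key": "deregistration_delay.timeout_seconds", "Value": "15"},
        ],
    }


def slow_target_alarm(
    target_group_dim: str,
    load_balancer_dim: str,
    threshold_s: float = 1.0,  # assumption: no threshold is given in the write-up
) -> dict[str, Any]:
    """Keyword arguments for cloudwatch.put_metric_alarm on p99 response time."""
    return {
        "AlarmName": "tg-p99-response-time",  # hypothetical name
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        "Dimensions": [
            {"Name": "TargetGroup", "Value": target_group_dim},
            {"Name": "LoadBalancer", "Value": load_balancer_dim},
        ],
        "ExtendedStatistic": "p99",
        "Period": 60,            # assumption
        "EvaluationPeriods": 3,  # assumption
        "Threshold": threshold_s,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def apply_settings(target_group_arn: str, tg_dim: str, lb_dim: str) -> None:
    """Send the three requests. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders above work without it

    elbv2 = boto3.client("elbv2")
    cloudwatch = boto3.client("cloudwatch")
    elbv2.modify_target_group(**health_check_settings(target_group_arn))
    elbv2.modify_target_group_attributes(**drain_settings(target_group_arn))
    cloudwatch.put_metric_alarm(**slow_target_alarm(tg_dim, lb_dim))
```

Keeping the parameter builders separate from the API calls makes the values reviewable and testable without touching an AWS account. Note that this sketch dimensions the alarm by target group and load balancer; how the alarm maps to an individual slow target and to the circuit breaker is not detailed in the write-up and is left out here.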

Deploy-window incidents dropped 92% year-over-year. The remaining 8% are now caught faster, because slow targets are detected before their customer impact builds up enough to be noticed. Drain time fell from 90 seconds to 15; healthy rejoin now takes 10 seconds. The customer experience during deploys is finally what the team always claimed it was.