Zhivko Todorov

CASE 64 · VISTA · 2023

CIRCUIT BREAKER·RESILIENCE4J·BULKHEAD·TIMEOUTS

Cascading failures that stop at the boundary.

A travel booking platform's architecture let any third-party API slowdown cascade into a full-platform incident. Hotel-search outages caused car-rental outages, which in turn caused payment outages. We rolled out circuit breakers, bulkheads, and timeouts across the boundary calls.

INDUSTRY

Travel booking

DOMAIN

RELIABILITY

DELIVERED

2023

STACK

RESILIENCE4J·API GATEWAY·CLOUDWATCH METRICS·X-RAY·EVENTBRIDGE·STEP FUNCTIONS

RESULTS

What changed, by the numbers.

CASCADING INCIDENTS · 0 · 120 DAYS POST-ROLLOUT

THIRD-PARTY OUTAGES SURVIVED · 7 · NO CUSTOMER IMPACT

MEDIAN BOUNDARY TIMEOUT · 3s · WAS UNBOUNDED

P99 LATENCY (HEALTHY) · −14% · NO HEAD-OF-LINE BLOCKING

HOW IT WENT

The pattern showed up in every postmortem: a third-party API got slow, the calling threads piled up waiting on it, the application's connection pool was exhausted, and unrelated requests started failing. The third-party slowdown was the trigger; the cascade was the architecture.

We added Resilience4j to the service boundaries: circuit breakers with sensible thresholds, bulkheads to isolate connection pools per downstream, timeouts set against measured p99 latencies. Step Functions orchestrated the fallback flows where degraded service was better than no service.
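To make the three patterns concrete, here is a minimal sketch in plain Java, using only the standard library. It is not the Resilience4j API and not the configuration used on this project. The class name `GuardedBoundary`, its parameters, and the consecutive-failure rule are all assumptions chosen for illustration; Resilience4j itself decides on a failure rate over a sliding window and has a proper half-open state.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical guard around ONE downstream: bulkhead + timeout + circuit breaker. */
final class GuardedBoundary<T> {
    private final ExecutorService pool;   // bulkhead: threads reserved for this downstream only
    private final Duration timeout;       // upper bound on how long a caller waits
    private final int failureThreshold;   // consecutive failures that open the breaker
    private final Duration openFor;       // cool-down before calls are let through again
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAtNanos = 0L;

    GuardedBoundary(int maxConcurrent, Duration timeout, int failureThreshold, Duration openFor) {
        // No queue: when all threads are busy, submit() is rejected instead of piling up.
        this.pool = new ThreadPoolExecutor(maxConcurrent, maxConcurrent, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(), r -> { Thread t = new Thread(r); t.setDaemon(true); return t; });
        this.timeout = timeout;
        this.failureThreshold = failureThreshold;
        this.openFor = openFor;
    }

    /** Runs the remote call under all three guards; returns the fallback if any guard trips. */
    T call(Callable<T> remote, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get();            // breaker open: do not touch the downstream
        }
        Future<T> future;
        try {
            future = pool.submit(remote);
        } catch (RejectedExecutionException bulkheadFull) {
            return fallback.get();            // bulkhead full: shed load, keep the caller free
        }
        try {
            T result = future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            recordSuccess();
            return result;
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true);              // interrupt the worker so its thread is released
            recordFailure();
            return fallback.get();
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        if (open && System.nanoTime() - openedAtNanos >= openFor.toNanos()) {
            open = false;                               // cool-down elapsed: allow a trial call
            consecutiveFailures = failureThreshold - 1; // one more failure re-opens immediately
        }
        return open;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    private synchronized void recordFailure() {
        if (++consecutiveFailures >= failureThreshold) {
            open = true;
            openedAtNanos = System.nanoTime();
        }
    }
}
```

One guard per downstream is the point of the design: a hypothetical `new GuardedBoundary<String>(20, Duration.ofSeconds(3), 5, Duration.ofSeconds(30))` for hotel search can fill up or trip without consuming threads that car rental or payments depend on. The 3-second timeout mirrors the median boundary timeout reported above; the other values are placeholders, not the project's settings.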

In the 120 days after rollout, the platform survived seven distinct third-party outages without customer-visible impact. Healthy-state p99 latency also improved by 14%: head-of-line blocking from slow downstreams had been silently degrading healthy-state latency.
