CASE 64 · VISTA · 2023
Cascading failures that stop at the boundary.
A travel booking platform had an architecture in which any third-party API slowdown cascaded into a full-platform incident: a hotel-search outage took down car rental, which in turn took down payments. We rolled out circuit breakers, bulkheads, and timeouts across the boundary calls.
Travel booking
RELIABILITY
2023
RESULTS
What changed, by the numbers.
CASCADING INCIDENTS
0
THIRD-PARTY OUTAGES SURVIVED
7
MEDIAN BOUNDARY TIMEOUT
3s
P99 LATENCY (HEALTHY)
−14%
HOW IT WENT
The pattern showed up in every postmortem: a third-party API got slow, the calling threads piled up waiting on it, the application’s connection pool was exhausted, and unrelated requests started failing. The third-party slowdown was the trigger; the cascade was the architecture.
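A back-of-the-envelope sketch of why a slow dependency drains a shared pool, using Little's law (average calls in flight = arrival rate × time each call holds a thread). Every number here (call rate, latencies, pool size) and the class name `PoolExhaustion` are hypothetical illustrations, not measurements from this engagement.

```java
public class PoolExhaustion {

    /**
     * Little's law: average calls in flight = arrival rate x time each call
     * holds a thread. Latency is given in milliseconds.
     */
    static double inFlight(long callsPerSecond, long latencyMillis) {
        return callsPerSecond * latencyMillis / 1000.0;
    }

    public static void main(String[] args) {
        // All values below are hypothetical, chosen only to show the shape of the problem.
        int sharedPoolSize = 200;       // threads shared by every request type
        long callsPerSecond = 50;       // calls to one third-party API
        long healthyLatencyMs = 200;    // normal response time
        long degradedLatencyMs = 10_000; // response time during a slowdown

        double healthy = inFlight(callsPerSecond, healthyLatencyMs);
        double degraded = inFlight(callsPerSecond, degradedLatencyMs);

        System.out.println("Threads held when healthy:  " + healthy);
        System.out.println("Threads held when degraded: " + degraded);
        System.out.println("Shared pool exhausted:      " + (degraded > sharedPoolSize));
    }
}
```

Under these assumed numbers, one slow dependency asks for more threads than the whole pool has, which is why requests that never touch that dependency start failing too.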
We added Resilience4j to the service boundaries: circuit breakers with sensible thresholds, bulkheads to isolate connection pools per downstream, timeouts set against measured p99 latencies. Step Functions orchestrated the fallback flows where degraded service was better than no service.
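To make the three patterns concrete, here is a minimal standard-library sketch of how a timeout, a bulkhead, and a circuit breaker can combine around one downstream call. It is an illustration only: it is not Resilience4j's API and not the configuration used on this platform. The class `BoundaryGuard`, its parameters, and the consecutive-failure rule are all assumptions made for the example; Resilience4j's own circuit breaker is configured differently (failure-rate thresholds over a sliding window, plus a half-open state).

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical guard for ONE downstream. Illustration only, not Resilience4j's API. */
public class BoundaryGuard {
    private final ExecutorService pool;   // bulkhead: threads reserved for this downstream
    private final long timeoutMillis;     // how long a caller waits before giving up
    private final int failureThreshold;   // consecutive failures that open the circuit
    private final long openNanos;         // how long the circuit stays open
    private int consecutiveFailures = 0;  // guarded by "this"
    private boolean open = false;         // guarded by "this"
    private long openedAtNanos = 0;       // guarded by "this"

    public BoundaryGuard(int maxConcurrent, long timeoutMillis, int failureThreshold, long openMillis) {
        // SynchronousQueue means no queueing: a call is rejected unless a thread is free.
        this.pool = new ThreadPoolExecutor(maxConcurrent, maxConcurrent, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(), r -> {
                    Thread t = new Thread(r);
                    t.setDaemon(true); // do not keep the JVM alive for stuck calls
                    return t;
                });
        this.timeoutMillis = timeoutMillis;
        this.failureThreshold = failureThreshold;
        this.openNanos = TimeUnit.MILLISECONDS.toNanos(openMillis);
    }

    /** Runs the remote call, or returns the fallback if the call is blocked, slow, or failing. */
    public <T> T call(Supplier<T> remote, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get();                 // circuit open: fail fast
        }
        Future<T> future;
        try {
            future = pool.submit(() -> remote.get());
        } catch (RejectedExecutionException e) {
            return fallback.get();                 // bulkhead full: fail fast
        }
        try {
            T result = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            recordSuccess();
            return result;
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true);                   // interrupt the stuck worker if possible
            recordFailure();
            return fallback.get();
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();    // preserve the caller's interrupt status
            return fallback.get();
        }
    }

    private synchronized boolean isOpen() {
        if (open && System.nanoTime() - openedAtNanos >= openNanos) {
            open = false;                          // cool-down elapsed: let calls through again
            consecutiveFailures = 0;
        }
        return open;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    private synchronized void recordFailure() {
        if (++consecutiveFailures >= failureThreshold) {
            open = true;
            openedAtNanos = System.nanoTime();
        }
    }

    public static void main(String[] args) {
        // 3000 ms mirrors the 3s median timeout reported above; the other values are hypothetical.
        BoundaryGuard hotelSearch = new BoundaryGuard(8, 3000, 2, 30_000);
        Supplier<String> broken = () -> { throw new IllegalStateException("hotel API down"); };
        for (int i = 0; i < 3; i++) {
            // The third iteration never reaches the remote call: the circuit is already open.
            System.out.println(hotelSearch.call(broken, () -> "cached results"));
        }
    }
}
```

The design point is that each downstream gets its own guard and its own threads. A stuck hotel-search call can then occupy only hotel-search threads, and callers get a fallback quickly instead of waiting in line behind it.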
In the 120 days after rollout, the platform survived seven distinct third-party outages without customer-visible impact. Healthy-state p99 latency also improved by 14%: head-of-line blocking from slow downstreams had been silently degrading tail latency.