CASE 146 · TROVE · 2024
Workflows that fail well, not just fail.
A document workflow company had 30 Step Functions workflows where the error-handling pattern was "if anything fails, the workflow fails." Failures landed in CloudWatch and waited for a human. We refactored each workflow to use Retry blocks, Catch blocks, and dead-letter queues (DLQs).
Document workflow
RELIABILITY
2024
RESULTS
What changed, by the numbers.
WORKFLOW FAILURE RATE
−84%
TRANSIENT-ERROR RETRIES
100%
DLQ COVERAGE
100%
ON-CALL PAGES
−72%
HOW IT WENT
The workflows had been written with optimistic error handling — the happy path was clear, the unhappy path was "let it fail." When a downstream Lambda timed out or an SQS queue throttled, the whole workflow died and a human had to investigate.
We added Retry blocks with exponential backoff for transient errors (timeouts, throttles, 5xx responses), Catch blocks that route genuinely unrecoverable errors to DLQs, and task-level error classification (`States.TaskFailed` vs application errors). Alerts, triggered through EventBridge, fired on DLQ messages rather than on every workflow failure.
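As a rough illustration of that pattern, the sketch below builds an Amazon States Language Task state with a Retry block for transient errors and a Catch block that hands everything else to a state writing to an SQS DLQ. It is not the workflow definition from this engagement: the state names, the Lambda ARN, the queue URL, the retry numbers, and the choice of which error names count as transient are all assumptions made for the example.

```python
import json

# Error names treated as transient in this sketch (an assumption, not the
# list used in the engagement). All are documented Step Functions or
# Lambda error names.
TRANSIENT_ERRORS = [
    "States.Timeout",
    "Lambda.ServiceException",
    "Lambda.TooManyRequestsException",
]


def retry_delays(interval_seconds, backoff_rate, max_attempts):
    """Seconds waited before each retry: interval * rate ** (attempt index)."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


def task_with_error_handling(resource_arn, next_state, dlq_state):
    """Build a Task state that retries transient errors and catches the rest."""
    return {
        "Type": "Task",
        "Resource": resource_arn,
        "Retry": [
            {
                "ErrorEquals": TRANSIENT_ERRORS,
                "IntervalSeconds": 2,   # illustrative values only
                "BackoffRate": 2.0,
                "MaxAttempts": 4,
            }
        ],
        "Catch": [
            {
                # States.ALL must appear alone and as the last catcher.
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",  # keep the original input, add error info
                "Next": dlq_state,
            }
        ],
        "Next": next_state,
    }


def send_to_dlq_state(queue_url):
    """Build a Task state that parks the failed input on an SQS queue."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::sqs:sendMessage",
        "Parameters": {"QueueUrl": queue_url, "MessageBody.$": "$"},
        "End": True,  # ending here vs. a Fail state is a design choice
    }


# Hypothetical names, ARN, and queue URL, for illustration only.
states = {
    "ExtractText": task_with_error_handling(
        "arn:aws:lambda:us-east-1:123456789012:function:extract-text",
        next_state="StoreResult",
        dlq_state="SendToDLQ",
    ),
    "SendToDLQ": send_to_dlq_state(
        "https://sqs.us-east-1.amazonaws.com/123456789012/workflow-dlq"
    ),
}

print(json.dumps(states, indent=2))
print(retry_delays(2, 2.0, 4))
```

With these example settings, the waits before the four retries are 2, 4, 8, and 16 seconds. Only an error that exhausts its retries, or one not listed as transient, reaches the catcher and lands on the queue, which is what lets alerting key off DLQ messages instead of every failed attempt.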
Customer-visible workflow failure rate dropped 84%. Transient errors now self-heal; only the genuinely broken cases page anyone. On-call pages related to workflow failures dropped 72%. The team has time to investigate the real problems instead of restarting workflows.