Zhivko Todorov

CASE 146 · TROVE · 2024

STEP FUNCTIONS·ERROR HANDLING·RETRY·DLQ

Workflows that fail well, not just fail.

A document workflow company had 30 Step Functions workflows where the error-handling pattern was "if anything fails, the workflow fails." Failures landed in CloudWatch and waited for a human. We refactored each workflow with proper retry, catch, and DLQ patterns.

INDUSTRY

Document workflow

DOMAIN

RELIABILITY

DELIVERED

2024

STACK

STEP FUNCTIONS·SQS DLQ·EVENTBRIDGE·CLOUDWATCH ALARMS·LAMBDA (RETRY HELPERS)

RESULTS

What changed, by the numbers.

WORKFLOW FAILURE RATE

−84%

CUSTOMER-VISIBLE

TRANSIENT-ERROR RETRIES

100%

NOW HANDLED

DLQ COVERAGE

100%

ALL WORKFLOWS

ON-CALL PAGES

−72%

WORKFLOW-RELATED

HOW IT WENT

The workflows had been written with optimistic error handling — the happy path was clear, the unhappy path was "let it fail." When a downstream Lambda timed out or an SQS queue throttled, the whole workflow died and a human had to investigate.

We added Retry blocks with exponential backoff for transient errors (timeouts, throttles, 5xx responses) and Catch blocks that route genuinely unrecoverable errors to DLQs. We also classified errors at the task level, separating the generic `States.TaskFailed` from named application errors. EventBridge now triggers alerts on DLQ messages, not on every workflow failure.
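To make the pattern concrete, here is a minimal sketch of one task with a Retry block, a Catch block, and a DLQ hand-off, written as Python that emits an Amazon States Language definition. It is an illustration, not the project's actual definition: the state names (`ProcessDocument`, `SendToDlq`, `MarkFailed`), the list of retried error names, and the interval, backoff rate, and attempt count are all assumptions.

```python
import json

# Error names treated as transient here. These are real Step Functions and
# Lambda error names, but which ones the project retried is an assumption.
TRANSIENT_ERRORS = [
    "States.Timeout",
    "Lambda.ServiceException",
    "Lambda.TooManyRequestsException",
    "Lambda.SdkClientException",
]


def backoff_schedule(interval_seconds, backoff_rate, max_attempts):
    """Waits (in seconds) before each retry attempt, ignoring jitter."""
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]


def build_definition(function_arn, dlq_url):
    """Build a state machine definition as a plain dict (hypothetical names)."""
    return {
        "StartAt": "ProcessDocument",
        "States": {
            "ProcessDocument": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": function_arn, "Payload.$": "$"},
                # Transient errors: retry with exponential backoff.
                "Retry": [
                    {
                        "ErrorEquals": TRANSIENT_ERRORS,
                        "IntervalSeconds": 2,
                        "BackoffRate": 2.0,
                        "MaxAttempts": 4,
                    }
                ],
                # Everything else, or retries exhausted: route to the DLQ.
                "Catch": [
                    {
                        "ErrorEquals": ["States.ALL"],
                        "ResultPath": "$.error",
                        "Next": "SendToDlq",
                    }
                ],
                "End": True,
            },
            "SendToDlq": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sqs:sendMessage",
                # The message carries the original input plus the caught error.
                "Parameters": {"QueueUrl": dlq_url, "MessageBody.$": "$"},
                "Next": "MarkFailed",
            },
            "MarkFailed": {
                "Type": "Fail",
                "Error": "SentToDlq",
                "Cause": "Unrecoverable error; input and error were sent to the DLQ.",
            },
        },
    }


if __name__ == "__main__":
    definition = build_definition(
        "arn:aws:lambda:us-east-1:123456789012:function:example",  # placeholder ARN
        "https://sqs.us-east-1.amazonaws.com/123456789012/example-dlq",  # placeholder URL
    )
    print(json.dumps(definition, indent=2))
    print(backoff_schedule(2, 2.0, 4))
```

With these assumed values, a transient failure is retried after roughly 2, 4, 8, and 16 seconds before the Catch block takes over. Setting `ResultPath` on the catcher keeps the original input alongside the error, so whoever reads the DLQ message can replay the work.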

Customer-visible workflow failure rate dropped 84%. Transient errors now self-heal; only the genuinely broken cases page anyone. On-call pages related to workflow failures dropped 72%. The team has time to investigate the real problems instead of restarting workflows.
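Paging only on DLQ traffic could be wired up in several ways; the sketch below shows one plausible route and should be read as an assumption, not the project's configuration. It builds the parameters for a CloudWatch alarm that fires when a DLQ holds any visible message, plus an EventBridge event pattern that matches that alarm entering the `ALARM` state. The queue names, the `-not-empty` suffix, and the SNS topic ARN are hypothetical.

```python
def dlq_alarm_params(queue_name, sns_topic_arn):
    """Keyword arguments for CloudWatch put_metric_alarm (not called here)."""
    return {
        "AlarmName": f"{queue_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # An empty, idle queue should not count as a problem.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def dlq_alarm_event_pattern(alarm_names):
    """EventBridge pattern matching the given alarms entering ALARM."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": list(alarm_names),
            "state": {"value": ["ALARM"]},
        },
    }


if __name__ == "__main__":
    # Hypothetical queue names, one DLQ per workflow.
    queues = ["ingest-dlq", "ocr-dlq", "export-dlq"]
    topic = "arn:aws:sns:us-east-1:123456789012:oncall"  # placeholder ARN
    alarms = [dlq_alarm_params(q, topic) for q in queues]
    pattern = dlq_alarm_event_pattern(a["AlarmName"] for a in alarms)
    print(pattern)
```

In a real deployment each parameter dict would be passed to boto3's `put_metric_alarm(**params)` and the pattern attached to an EventBridge rule. Alarming on queue depth, not on execution status, is what keeps retried-and-recovered failures from paging anyone.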
