Workflows that fail well, not just fail.

A document workflow company had 30 Step Functions workflows where the error-handling pattern was "if anything fails, the workflow fails." Failures landed in CloudWatch and waited for a human. We refactored each workflow with explicit retry, catch, and dead-letter queue (DLQ) patterns.

INDUSTRY

Document workflow

DOMAIN

RELIABILITY

DELIVERED

2024

STACK

STEP FUNCTIONS · SQS DLQ · EVENTBRIDGE · CLOUDWATCH ALARMS · LAMBDA (RETRY HELPERS)

RESULTS

What changed, by the numbers.

WORKFLOW FAILURE RATE

−84%

CUSTOMER-VISIBLE

TRANSIENT-ERROR RETRIES

100%

NOW HANDLED

DLQ COVERAGE

100%

ALL WORKFLOWS

ON-CALL PAGES

−72%

WORKFLOW-RELATED

HOW IT WENT

The workflows had been written with optimistic error handling: the happy path was clear, and the unhappy path was "let it fail." When a downstream Lambda timed out or an SQS queue throttled, the whole workflow died and a human had to investigate.

We added Retry blocks with exponential backoff for transient errors (timeouts, throttles, 5xx responses), Catch blocks that route genuinely unrecoverable errors to DLQs, and proper task-level error classification (`States.TaskFailed` vs application errors). EventBridge now triggers alerts on DLQ messages, not on every workflow failure.
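As a rough illustration of the Retry and Catch pattern described above, here is a minimal Python sketch that builds an Amazon States Language definition for one task. It is not the client's actual definition: the state names (`ExtractText`, `SendToDLQ`), the Lambda ARN, the queue URL, the choice of which errors count as transient, and the interval and attempt numbers are all assumptions made for the example. The `Retry`, `Catch`, `IntervalSeconds`, `BackoffRate`, and `MaxAttempts` fields and the `arn:aws:states:::sqs:sendMessage` integration are standard Step Functions features.

```python
import json

# Error names treated as transient in this sketch. The States.* and Lambda.*
# names are standard; which ones to retry is an assumption for illustration.
TRANSIENT_ERRORS = [
    "States.Timeout",
    "Lambda.ServiceException",
    "Lambda.TooManyRequestsException",
    "Lambda.SdkClientException",
]


def task_with_retry_and_catch(resource_arn, next_state, dlq_state):
    """Build a Task state that retries transient errors, then catches the rest."""
    return {
        "Type": "Task",
        "Resource": resource_arn,
        "Retry": [
            {
                "ErrorEquals": TRANSIENT_ERRORS,
                "IntervalSeconds": 2,   # delay before the first retry
                "MaxAttempts": 4,       # number of retries, not total attempts
                "BackoffRate": 2.0,     # each delay is multiplied by this rate
            }
        ],
        "Catch": [
            {
                # Anything not retried, or still failing after retries.
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",  # keep the original input next to the error
                "Next": dlq_state,
            }
        ],
        "Next": next_state,
    }


def send_to_dlq_state(queue_url):
    """Build a Task state that writes the failed input and error to an SQS queue."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::sqs:sendMessage",
        "Parameters": {"QueueUrl": queue_url, "MessageBody.$": "$"},
        "End": True,
    }


def retry_delays(interval_seconds, backoff_rate, max_attempts):
    """Return the delay in seconds before each retry under exponential backoff."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


# Hypothetical ARN and queue URL, used only to make the example complete.
definition = {
    "StartAt": "ExtractText",
    "States": {
        "ExtractText": task_with_retry_and_catch(
            "arn:aws:lambda:us-east-1:123456789012:function:extract-text",
            next_state="Done",
            dlq_state="SendToDLQ",
        ),
        "SendToDLQ": send_to_dlq_state(
            "https://sqs.us-east-1.amazonaws.com/123456789012/workflow-dlq"
        ),
        "Done": {"Type": "Succeed"},
    },
}

print(json.dumps(definition, indent=2))
print(retry_delays(2, 2.0, 4))  # [2.0, 4.0, 8.0, 16.0]
```

With these example values a transient failure is retried after 2, 4, 8, and 16 seconds. Only when those retries are exhausted, or when the error is not in the transient list, does the `Catch` block hand the input and error details to the DLQ state, so alerting can key off queue messages rather than every failed execution.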

Customer-visible workflow failure rate dropped 84%. Transient errors now self-heal; only the genuinely broken cases page anyone. On-call pages related to workflow failures dropped 72%. The team has time to investigate the real problems instead of restarting workflows.

RELATED · SAME DOMAIN

Other engagements in this space.

PILLAR · 2026

An active-active payments API that runs everywhere, all the time.

< 30s RTO (REGION LOSS)

STERLING · 2024

Cross-region DR that survives a production hour, not just a drill.

4m RTO
