Zhivko Todorov
CASE 63 · HEARTH · 2024

SQS · LAMBDA · LOAD LEVELLING · DLQ

Marketing send-storms that don’t blow up the downstream.

A marketing automation product would tip its downstream over once a quarter when a customer triggered a 200,000-recipient campaign. The downstream — third-party email providers and the platform’s own webhook receivers — couldn’t absorb the burst. We introduced SQS-based load levelling with bounded concurrency.

INDUSTRY

Marketing automation

DOMAIN

RELIABILITY

DELIVERED

2024

STACK

SQS · LAMBDA · SQS DLQ · EVENTBRIDGE · CLOUDWATCH ALARMS · TOKEN BUCKETS

RESULTS

What changed, by the numbers.

DOWNSTREAM FAILURES

−100%

BURST-INDUCED

P95 SEND LATENCY

+7s

ACCEPTABLE TRADEOFF

THROUGHPUT

UNCHANGED

OVER 24h WINDOW

DLQ DEPTH (PEAK)

12

WAS 14,000

HOW IT WENT

The fan-out had been instant: a campaign started, every recipient was queued, and every queued message immediately fired a downstream request. The downstream rate-limited, then errored, then tripped its circuit breakers. A campaign that finished in eight minutes also caused 90 minutes of webhook delivery failures.

We placed an SQS queue between the campaign engine and the downstream callers, with a Lambda consumer whose concurrency was capped at a rate the downstream could sustain. A token bucket per third-party provider kept sends within that provider's published rate limit. A DLQ captured the few permanent failures for human review.
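To make the per-provider token bucket concrete, here is a minimal sketch in Python. It is not the code from this engagement: the class name, the provider names, and the rate figures are all hypothetical, chosen only to show how a bucket turns a published rate limit into a send-or-defer decision.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Permits `rate` sends per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = capacity      # start full
        self.updated = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last call, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self, n: float = 1.0) -> bool:
        """Take n tokens if available; return False without blocking otherwise."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


# One bucket per provider. Names and limits are illustrative assumptions,
# not the providers or limits from the case study.
BUCKETS: Dict[str, TokenBucket] = {
    "provider_a": TokenBucket(rate=100.0, capacity=100.0),
    "provider_b": TokenBucket(rate=25.0, capacity=50.0),
}


def send_or_defer(provider: str, send: Callable[[], None]) -> bool:
    """Call `send` if the provider's bucket has a token; otherwise report
    False so the caller can leave the message on the queue for a later retry."""
    if BUCKETS[provider].try_acquire():
        send()
        return True
    return False
```

An in-process bucket like this only limits a single worker. With several concurrent Lambda invocations, the bucket state would have to live in a shared store, a detail this write-up does not cover.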

Burst-induced failures dropped to zero. P95 send latency went up by about 7 seconds, and the campaign now takes 15 minutes instead of 8, a difference the marketing customers did not notice. The 24-hour throughput is the same; we just spread it over the available downstream capacity.
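The figures above imply average send rates that are easy to check. The sketch below uses only the 200,000 recipients and the 8- and 15-minute run times quoted in this case study. The per-call duration used to size consumer concurrency is a hypothetical value, not a measurement from the project.

```python
import math

RECIPIENTS = 200_000  # campaign size quoted in the case study


def average_rate(recipients: int, minutes: float) -> float:
    """Average sends per second if `recipients` are spread evenly over `minutes`."""
    return recipients / (minutes * 60)


def required_concurrency(rate_per_second: float, call_seconds: float) -> int:
    """Little's law: in-flight calls = arrival rate x time per call, rounded up."""
    return math.ceil(rate_per_second * call_seconds)


before = average_rate(RECIPIENTS, 8)    # about 417 sends/s over the 8-minute run
after = average_rate(RECIPIENTS, 15)    # about 222 sends/s over the 15-minute run

# Hypothetical assumption: each downstream call takes 200 ms end to end.
ASSUMED_CALL_SECONDS = 0.2
workers = required_concurrency(after, ASSUMED_CALL_SECONDS)  # 45 under that assumption

print(f"before: {before:.0f}/s, after: {after:.0f}/s, workers: {workers}")
```

These are averages over the whole run. The original instant fan-out would have peaked well above its 8-minute average, which is the part the queue smooths out.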
