CASE 63 · HEARTH · 2024
Marketing send-storms that don’t blow up the downstream.
A marketing automation product tipped over its downstream systems once a quarter, whenever a customer triggered a 200,000-recipient campaign. The downstream (third-party email providers and the platform’s own webhook receivers) couldn’t absorb the burst. We introduced SQS-based load levelling with bounded consumer concurrency.
Marketing automation
RELIABILITY
2024
RESULTS
What changed, by the numbers.
DOWNSTREAM FAILURES: −100%
P95 SEND LATENCY: +7s
THROUGHPUT: UNCHANGED
DLQ DEPTH (PEAK): 12
HOW IT WENT
The fan-out had been instant: campaign starts, every recipient gets queued, every queued message immediately fires a downstream request. The downstream rate-limits, then errors, then circuit-breaks. The campaign that finished in eight minutes also caused 90 minutes of webhook delivery failures.
We placed an SQS queue between the campaign engine and the downstream callers, with a Lambda consumer whose concurrency was capped at a rate the downstream could sustain. Per-provider token buckets kept calls within each third-party provider’s published rate limit. A dead-letter queue (DLQ) captured the few permanent failures for human review.
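A per-provider token bucket can be sketched in a few lines. This is a minimal illustration, not the engagement's actual code: the class name, the `PROVIDER_LIMITS` table, and the provider names and rates in it are hypothetical. It also assumes a single process. With several concurrent Lambda instances, each instance would need a share of the provider's limit, or the bucket state would have to live in a shared store.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so the bucket can be tested
        self._tokens = capacity      # start full
        self._last = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take `tokens` if available; return False instead of blocking."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False


# Hypothetical published limits: provider -> (calls per second, burst size).
PROVIDER_LIMITS: Dict[str, Tuple[float, float]] = {
    "provider_a": (50.0, 100.0),
    "provider_b": (10.0, 10.0),
}


def make_buckets(limits: Dict[str, Tuple[float, float]],
                 clock: Callable[[], float] = time.monotonic) -> Dict[str, TokenBucket]:
    """Build one bucket per provider from a limits table."""
    return {name: TokenBucket(rate, cap, clock) for name, (rate, cap) in limits.items()}
```

A consumer would call `try_acquire()` on the bucket for the target provider before each send. When it returns `False`, the message can be left on the queue to be retried after its visibility timeout rather than sent anyway.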
Burst-induced failures dropped to zero. P95 send latency went up about 7 seconds, and the campaign now takes 15 minutes instead of 8, a difference the marketing customers did not notice. The 24-hour throughput is the same; we just spread the load over the available downstream capacity.
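To put rough numbers on the levelling, the sketch below works out the average send rate implied by each campaign duration. It assumes the 200,000-recipient campaign size from the summary and a uniform drain, which the original instant fan-out was not. The 0.2-second downstream call time used for the concurrency estimate is purely hypothetical; the estimate itself is Little's law (in-flight calls = arrival rate × time per call).

```python
import math


def drain_rate(recipients: int, minutes: float) -> float:
    """Average messages per second needed to finish in `minutes`."""
    return recipients / (minutes * 60.0)


def required_concurrency(rate_per_s: float, call_seconds: float) -> int:
    """Little's law: concurrent in-flight calls = rate x time per call."""
    return math.ceil(rate_per_s * call_seconds)


RECIPIENTS = 200_000           # campaign size from the summary
ASSUMED_CALL_SECONDS = 0.2     # hypothetical average downstream call time

before = drain_rate(RECIPIENTS, 8)    # about 417 messages/s on average
after = drain_rate(RECIPIENTS, 15)    # about 222 messages/s on average

print(f"before: {before:.0f}/s, after: {after:.0f}/s")
print("concurrency needed at the levelled rate:",
      required_concurrency(after, ASSUMED_CALL_SECONDS))
```

Under these assumptions the levelled campaign needs roughly half the average send rate, and a consumer concurrency cap in the mid-40s would sustain it. The real cap depends on measured call latency and the providers' limits.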