CASE 67 · KNOT · 2024
Outages we hear about from monitoring, not customers.
A B2B logistics platform had suffered three outages in twelve months in which customers reported the problem before monitoring did. The platform had monitoring (Prometheus, CloudWatch alarms, the works), but it watched the components, not the customer-perceptible behaviour. We added CloudWatch Synthetics canaries that run the critical user journeys every minute.
B2B logistics
RELIABILITY
2024
RESULTS
What changed, by the numbers.
MTTD (CUSTOMER JOURNEY)
< 2m
CUSTOMER-REPORTED FIRST
0
JOURNEYS MONITORED
14
FALSE POSITIVES
2 / mo
HOW IT WENT
The Prometheus metrics had been green during each of the three outages. Components were healthy. The problem was an interaction — the API was up, the database was up, but a flag-gating service was misbehaving and the combination broke checkout. No single component alarm had fired.
We wrote CloudWatch Synthetics canaries: Node.js scripts that drive a headless browser through fourteen critical, high-value user journeys, including login, search, checkout, and settings update. Each ran every minute from three regions. Alerts routed through EventBridge to PagerDuty.
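To make the idea of a journey check concrete, here is a minimal TypeScript sketch of a runner that executes steps in order and reports the first one that fails. It is an illustration, not the canary code from this engagement: the names `Step`, `JourneyResult`, `runJourney`, and the `checkout` steps are hypothetical, and plain async functions stand in for the headless-browser actions a real canary would perform.

```typescript
// One step of a journey: a name plus an async action that throws on failure.
// In a real canary the action would drive a headless browser page (assumption:
// here it is a plain function so the sketch runs without any dependencies).
type Step = { name: string; run: () => Promise<void> };

type JourneyResult = {
  journey: string;
  ok: boolean;
  failedStep?: string;
  error?: string;
  durationMs: number;
};

// Run the steps in order and stop at the first failure, as a user would.
async function runJourney(journey: string, steps: Step[]): Promise<JourneyResult> {
  const started = Date.now();
  for (const step of steps) {
    try {
      await step.run();
    } catch (err) {
      return {
        journey,
        ok: false,
        failedStep: step.name,
        error: err instanceof Error ? err.message : String(err),
        durationMs: Date.now() - started,
      };
    }
  }
  return { journey, ok: true, durationMs: Date.now() - started };
}

// Hypothetical checkout journey: the early steps pass, but the final step
// fails, which is the kind of breakage only an end-to-end run can see.
const checkout: Step[] = [
  { name: "login", run: async () => {} },
  { name: "add to cart", run: async () => {} },
  {
    name: "submit order",
    run: async () => {
      throw new Error("checkout button never became enabled");
    },
  },
];
```

Calling `runJourney("checkout", checkout)` resolves to a result with `ok: false` and `failedStep: "submit order"`. Reporting the failing step by name is a design choice that lets the page tell the on-call engineer where in the journey the customer would have been stopped.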
Mean time to detection for customer-journey breakage dropped from 38 minutes to under 2 minutes. In the 90 days after rollout, no customer reported a problem before monitoring did. False positives held at about 2 per month, with the canaries tuned against the noise floor.
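For readers who want to track the same metric, mean time to detection is the average gap between when a breakage began and when it was first detected. The TypeScript sketch below shows that arithmetic. The `Incident` type, the `meanTimeToDetectMinutes` function, and the sample timestamps are all hypothetical and are not taken from the engagement's incident log.

```typescript
// An incident with the time the breakage began and the time it was first detected.
type Incident = { startedAt: Date; detectedAt: Date };

// Mean time to detection in minutes; returns NaN when there are no incidents.
function meanTimeToDetectMinutes(incidents: Incident[]): number {
  if (incidents.length === 0) return NaN;
  const totalMs = incidents.reduce(
    (sum, i) => sum + (i.detectedAt.getTime() - i.startedAt.getTime()),
    0,
  );
  return totalMs / incidents.length / 60000;
}

// Hypothetical sample data: detected after 1 minute and after 2 minutes.
const sample: Incident[] = [
  { startedAt: new Date("2024-01-01T10:00:00Z"), detectedAt: new Date("2024-01-01T10:01:00Z") },
  { startedAt: new Date("2024-01-02T10:00:00Z"), detectedAt: new Date("2024-01-02T10:02:00Z") },
];

const mttd = meanTimeToDetectMinutes(sample); // 1.5 for the sample above
```

With a one-minute canary schedule, the check interval itself sets a floor on this number, since a breakage that starts just after a run is not seen until the next run.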