Zhivko Todorov

CASE 67 · KNOT · 2024

CLOUDWATCH SYNTHETICS·CANARY·SLO·MONITORING

Outages we hear about from monitoring, not customers.

A B2B logistics platform had suffered three outages in twelve months in which customers reported the problem before monitoring did. The platform was monitored (Prometheus, CloudWatch alarms, the works), but the monitoring watched individual components, not customer-perceptible behaviour. We added CloudWatch Synthetics canaries that run the critical user journeys every minute.

INDUSTRY

B2B logistics

DOMAIN

RELIABILITY

DELIVERED

2024

STACK

CLOUDWATCH SYNTHETICS·CLOUDWATCH ALARMS·EVENTBRIDGE·PAGERDUTY·STATUSPAGE

RESULTS

What changed, by the numbers.

MTTD (CUSTOMER JOURNEY)

< 2m

WAS UP TO 38 MIN

CUSTOMER-REPORTED FIRST

0

90 DAYS POST-ROLLOUT

JOURNEYS MONITORED

14

CRITICAL PATH

FALSE POSITIVES

2 / mo

WELL-TUNED

HOW IT WENT

The Prometheus metrics had been green during each of the three outages. Components were healthy. The problem was an interaction — the API was up, the database was up, but a flag-gating service was misbehaving and the combination broke checkout. No single component alarm had fired.

We wrote CloudWatch Synthetics canaries — actual Node.js scripts that drove a headless browser through fourteen critical user journeys: login, search, checkout, settings update, the high-value workflows. Each ran every minute from three regions. Alerts routed through EventBridge to PagerDuty.
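The routing step can be sketched as a pure mapping from an alarm event to a PagerDuty event. This is a minimal illustration, not the code from the engagement: it assumes each canary has a CloudWatch alarm whose state changes arrive through EventBridge, and that the target forwards to the PagerDuty Events API v2. The function name `toPagerDutyEvent`, the choice of the alarm name as `dedup_key`, and the fixed `critical` severity are all hypothetical.

```typescript
// Subset of the EventBridge "CloudWatch Alarm State Change" event read here.
interface AlarmStateChangeEvent {
  source: string;          // expected: "aws.cloudwatch"
  "detail-type": string;   // expected: "CloudWatch Alarm State Change"
  region: string;
  time: string;            // ISO 8601 timestamp of the state change
  detail: {
    alarmName: string;
    state: { value: "OK" | "ALARM" | "INSUFFICIENT_DATA"; reason: string };
  };
}

// Body for a POST to the PagerDuty Events API v2 enqueue endpoint.
interface PagerDutyEvent {
  routing_key: string;
  event_action: "trigger" | "resolve";
  dedup_key: string;
  payload?: {
    summary: string;
    source: string;
    severity: "critical";
    timestamp: string;
  };
}

// Hypothetical helper: returns the event to send, or null when nothing
// should be sent (unrelated event, or the alarm lacks data).
function toPagerDutyEvent(
  event: AlarmStateChangeEvent,
  routingKey: string
): PagerDutyEvent | null {
  if (
    event.source !== "aws.cloudwatch" ||
    event["detail-type"] !== "CloudWatch Alarm State Change"
  ) {
    return null;
  }

  const { alarmName, state } = event.detail;

  if (state.value === "ALARM") {
    return {
      routing_key: routingKey,
      event_action: "trigger",
      // Assumption: one open incident per alarm, so the alarm name dedups repeats.
      dedup_key: alarmName,
      payload: {
        summary: `Canary alarm ${alarmName}: ${state.reason}`,
        source: event.region,
        severity: "critical",
        timestamp: event.time,
      },
    };
  }

  if (state.value === "OK") {
    // Resolve the incident opened by the matching trigger.
    return { routing_key: routingKey, event_action: "resolve", dedup_key: alarmName };
  }

  // INSUFFICIENT_DATA: neither page nor resolve.
  return null;
}
```

Keeping the mapping free of network calls makes it easy to unit test. Reusing the same `dedup_key` for trigger and resolve lets a recovered canary close its own incident.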

Time to detect customer-journey breakage dropped from as long as 38 minutes to under 2 minutes. In the 90 days after rollout, there were no incidents that customers reported before monitoring did. The false-positive rate stayed at about 2 per month, which we consider well-tuned.
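The case study does not say how the paging rule was tuned. One common way to keep false positives low when a journey runs from several regions is to page only when a quorum of regions agrees. The sketch below shows that idea purely as an assumption: the `RegionRuns` type, the `shouldPage` function, and the default quorum of 2 are hypothetical and not taken from the engagement.

```typescript
// Recent results for one journey in one region, oldest first.
// true means the run passed, false means it failed.
interface RegionRuns {
  region: string;
  history: boolean[];
}

// True when the last `n` runs all failed. With fewer than `n` runs on
// record there is not enough evidence, so this returns false.
function failedLastN(history: boolean[], n: number): boolean {
  if (n < 1 || history.length < n) return false;
  return history.slice(-n).every((passed) => !passed);
}

// Hypothetical paging rule: page only when at least `quorum` regions have
// each failed their last `consecutive` runs. A single-region blip (network
// glitch, slow cold start) then does not wake anyone up.
function shouldPage(
  runs: RegionRuns[],
  quorum: number = 2,
  consecutive: number = 1
): boolean {
  const failingRegions = runs.filter((r) =>
    failedLastN(r.history, consecutive)
  ).length;
  return failingRegions >= quorum;
}

// Two of three regions failed their latest run: page.
console.log(
  shouldPage([
    { region: "us-east-1", history: [true, false] },
    { region: "eu-west-1", history: [true, false] },
    { region: "ap-southeast-1", history: [true, true] },
  ])
); // true

// Only one region failed: treat as noise.
console.log(
  shouldPage([
    { region: "us-east-1", history: [true, false] },
    { region: "eu-west-1", history: [true, true] },
    { region: "ap-southeast-1", history: [true, true] },
  ])
); // false
```

Raising `consecutive` trades detection speed for fewer false alarms: with one-minute runs, each extra required failure adds roughly a minute before the page fires.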
