CASE 23 · ZENITH · 2024
SLOs that survive contact with quarterly planning.
A B2B logistics platform had monitoring, dashboards, and a "99.9% uptime" promise on their marketing site. They had no SLOs, no error budgets, and no way to make engineering trade-offs against reliability. We rolled out an SLO framework that survived its first quarterly planning cycle.
B2B logistics
RELIABILITY
2024
RESULTS
What changed, by the numbers.
SERVICES WITH SLOs: 17 → 28
ERROR BUDGET CONSUMED: 63%
RELIABILITY-WORK PRIORITISATION: 4x
INCIDENT FREQUENCY: −41%
HOW IT WENT
The hardest part wasn’t the technology; it was the conversation about what "reliable" meant for the business. We ran a workshop per critical-path service, built around three questions: What does the customer feel when this breaks? What latency do they notice? What error rate is tolerable? The SLOs that came out of those workshops looked nothing like the marketing page.
Instrumentation was already mostly there in Prometheus and X-Ray. The framework added one Grafana dashboard per service showing burn rate, error budget consumed, and projected exhaustion. EventBridge alerted at the four-week burn-rate threshold so teams had a quarter to react.
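For readers unfamiliar with the three dashboard figures, here is a minimal Python sketch of how burn rate, error budget consumed, and projected exhaustion are commonly derived. It is an illustration, not the engagement's actual queries: the function name `budget_status`, the 90-day window, the sample numbers, and the assumption of roughly uniform traffic across the window are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BudgetStatus:
    burn_rate: float                     # 1.0 = budget lasts exactly one SLO window
    budget_consumed: float               # fraction of the window's budget already spent
    days_to_exhaustion: Optional[float]  # None when nothing is currently burning


def budget_status(
    slo_target: float,          # e.g. 0.999 (hypothetical target)
    window_days: float,         # SLO window length; 90 is an assumption, not from the case
    days_elapsed: float,        # how far into the window we are
    total_events: int,          # requests seen so far in the window
    bad_events: int,            # requests that violated the SLO so far
    recent_error_ratio: float,  # error ratio over a short lookback (e.g. last hour)
) -> BudgetStatus:
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0 or window_days <= 0:
        raise ValueError("total_events and window_days must be positive")

    # The error budget is the share of events the SLO allows to fail.
    allowed_error_ratio = 1.0 - slo_target

    # Burn rate: how fast the budget is being spent right now,
    # relative to the pace that would spend it exactly over the window.
    burn_rate = recent_error_ratio / allowed_error_ratio

    # Budget consumed so far: average burn rate since the window started,
    # scaled by the fraction of the window that has elapsed.
    # Assumes traffic is roughly uniform across the window.
    average_burn = (bad_events / total_events) / allowed_error_ratio
    budget_consumed = average_burn * (days_elapsed / window_days)

    # Projected exhaustion: remaining budget divided by the current daily spend.
    remaining = max(0.0, 1.0 - budget_consumed)
    if burn_rate <= 0.0:
        days_to_exhaustion = None
    else:
        days_to_exhaustion = remaining * window_days / burn_rate

    return BudgetStatus(burn_rate, budget_consumed, days_to_exhaustion)


if __name__ == "__main__":
    # Made-up inputs: halfway through a 90-day window, currently burning fast.
    status = budget_status(
        slo_target=0.999,
        window_days=90,
        days_elapsed=45,
        total_events=1_000_000,
        bad_events=600,
        recent_error_ratio=0.002,
    )
    print(status)
```

With those made-up inputs the sketch reports a burn rate of about 2.0, roughly 30% of the budget consumed, and about 31.5 days until exhaustion at the current pace. In a Prometheus setup the same ratios would typically be computed in recording rules rather than application code.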
The first quarterly planning under the new framework was the test. Two teams negotiated to spend their remaining error budget on a risky migration; one team proactively asked for an extra week on a release because their burn rate was already at 80%. The framework had become a real planning tool, not a slide deck.