Zhivko Todorov

CASE 23 · ZENITH · 2024

SLO·SLI·ERROR BUDGETS·OBSERVABILITY

SLOs that survive contact with quarterly planning.

A B2B logistics platform had monitoring, dashboards, and a "99.9% uptime" promise on their marketing site. They had no SLOs, no error budgets, and no way to make engineering trade-offs against reliability. We rolled out an SLO framework that survived its first quarterly planning cycle.

INDUSTRY

B2B logistics

DOMAIN

RELIABILITY

DELIVERED

2024

STACK

PROMETHEUS·GRAFANA·CLOUDWATCH METRICS·X-RAY·EVENTBRIDGE·NOBL9 (LATER GRAFANA SLO)

RESULTS

What changed, by the numbers.

SERVICES WITH SLOs

17 → 28

CRITICAL PATH COVERED

ERROR BUDGET CONSUMED

63%

AT END OF Q1

RELIABILITY-WORK PRIORITISATION

4x

AGAINST PRIOR QUARTER

INCIDENT FREQUENCY

−41%

AFTER FIRST FULL CYCLE

HOW IT WENT

The hardest part wasn’t the technology; it was the conversation about what "reliable" meant for the business. We ran one workshop per critical-path service and asked three questions: What does the customer feel when this breaks? What latency do they notice? What error rate is tolerable? The SLOs that came out of those workshops looked nothing like the marketing page.

Most of the instrumentation was already in place in Prometheus and X-Ray. The framework added one Grafana dashboard per service showing burn rate, error budget consumed, and projected exhaustion. EventBridge raised alerts when the four-week burn rate crossed its threshold, giving teams a quarter to react.
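For readers new to the three dashboard figures, here is a minimal Python sketch of how they relate. It is an illustration only, not the queries used on the engagement. The `SloSnapshot` type, the function names, and the sample numbers are all hypothetical, and the sketch assumes traffic is spread evenly across the SLO window.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SloSnapshot:
    """Hypothetical snapshot of one service's SLO window (illustrative only)."""
    target: float        # SLO target as a fraction below 1, e.g. 0.999
    total_events: int    # requests observed so far in the window
    bad_events: int      # requests that violated the SLI
    elapsed_days: float  # days since the window opened
    window_days: float   # full length of the SLO window


def burn_rate(s: SloSnapshot) -> float:
    """Observed error rate divided by the allowed error rate.

    A value of 1.0 means the budget runs out exactly at the end of the window.
    """
    if s.total_events == 0:
        return 0.0
    error_rate = s.bad_events / s.total_events
    return error_rate / (1.0 - s.target)


def budget_consumed(s: SloSnapshot) -> float:
    """Fraction of the window's error budget already spent.

    Assumes even traffic, so the share of the budget used so far is the
    burn rate scaled by the share of the window that has elapsed.
    """
    return burn_rate(s) * s.elapsed_days / s.window_days


def days_until_exhaustion(s: SloSnapshot) -> Optional[float]:
    """Days left before the budget hits zero at the current burn rate.

    Returns None when nothing is burning, and 0.0 when the budget is gone.
    """
    rate = burn_rate(s)
    if rate == 0:
        return None
    return max(0.0, s.window_days / rate - s.elapsed_days)


# Made-up example: 99.9% target, 30 days into a 90-day window.
example = SloSnapshot(
    target=0.999,
    total_events=1_000_000,
    bad_events=1_500,
    elapsed_days=30,
    window_days=90,
)
# Burn rate is about 1.5, roughly half the budget is spent, and the budget
# would run out in about 30 more days if nothing changes.
```

A burn rate above 1 is the signal worth alerting on: at that pace the budget will not last the window, which is the situation the burn-rate alerts described above are meant to surface early.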

The first quarterly planning under the new framework was the test. Two teams negotiated to spend their remaining error budget on a risky migration; one team proactively asked for an extra week on a release because their burn rate was already at 80%. The framework had become a real planning tool, not a slide deck.
