CASE 158 · GRAVEL · 2024
SLOs for batch jobs, not just synchronous APIs.
A data analytics company had SLOs for their synchronous API endpoints but nothing equivalent for their 22 batch pipelines. Pipeline freshness, completeness, and latency were operationally important but not measured. We introduced batch SLOs against freshness and completeness, with burn-rate alerting.
Data analytics
RELIABILITY
2024
RESULTS
What changed, by the numbers.
BATCH SLO COVERAGE
22 / 22
STALENESS INCIDENTS
−74%
BURN-RATE ALERTS
ACTIVE
BATCH-RELIABILITY PRIORITISATION
3×
HOW IT WENT
The synchronous SLOs worked well: they measured what customers experienced minute by minute. The batch side was opaque. A pipeline that should have produced its output by 06:00 might deliver it at 11:00, and nobody would notice until a downstream team complained. There was no shared definition of "broken."
We defined two SLOs per pipeline: freshness (output produced by a target time) and completeness (a target percentage of input data made it through). Custom metrics in CloudWatch tracked both, and burn-rate alerts routed through EventBridge flagged trouble before the error budget was exhausted.
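As a rough illustration of the two SLOs and the burn-rate check, here is a minimal Python sketch. It is not the engagement's actual implementation: the `PipelineRun` record, the function names, and the 0.99 completeness ratio are all hypothetical, and publishing the results to CloudWatch or routing alerts through EventBridge is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class PipelineRun:
    """One scheduled run of a batch pipeline (hypothetical record shape)."""
    deadline: datetime               # target time for the output, e.g. 06:00
    finished_at: Optional[datetime]  # None if the run never produced output
    rows_in: int                     # input records read
    rows_out: int                    # records that made it through


def is_fresh(run: PipelineRun) -> bool:
    """Freshness: the output exists and landed by the target time."""
    return run.finished_at is not None and run.finished_at <= run.deadline


def is_complete(run: PipelineRun, min_ratio: float = 0.99) -> bool:
    """Completeness: enough of the input made it into the output."""
    if run.rows_in == 0:
        return True  # assumption: an empty input counts as complete
    return run.rows_out / run.rows_in >= min_ratio


def burn_rate(good: int, total: int, slo_target: float) -> float:
    """Observed failure rate divided by the failure rate the SLO allows.

    1.0 means the error budget is being spent exactly as fast as the
    SLO permits; values above 1.0 mean it will run out early.
    """
    if total == 0:
        return 0.0
    failure_rate = (total - good) / total
    allowed_failure_rate = 1.0 - slo_target
    return failure_rate / allowed_failure_rate


if __name__ == "__main__":
    deadline = datetime(2024, 1, 1, 6, 0)
    on_time = PipelineRun(deadline, deadline - timedelta(minutes=5), 100, 100)
    late = PipelineRun(deadline, deadline + timedelta(hours=5), 100, 100)

    window = [on_time] * 9 + [late]  # ten runs, one of them five hours late
    fresh = sum(is_fresh(run) for run in window)
    rate = burn_rate(fresh, len(window), slo_target=0.90)
    print(f"freshness burn rate: {rate:.2f}")
```

An alert would fire when the burn rate over a chosen window crosses a threshold. Both the window length and the threshold are tuning decisions that the source does not specify.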
All 22 pipelines now have SLOs. Staleness incidents dropped 74% year-over-year; most of the earlier incidents had been silent. Batch-reliability work received 3× the prioritisation in planning because the SLOs showed where the real cost was.