Zhivko Todorov
CASE 158 · GRAVEL · 2024

SLO · BATCH PIPELINES · FRESHNESS

SLOs for batch jobs, not just synchronous APIs.

A data analytics company had SLOs for their synchronous API endpoints but nothing equivalent for their 22 batch pipelines. Pipeline freshness, completeness, and latency were operationally important but not measured. We introduced batch SLOs against freshness and completeness, with burn-rate alerting.

INDUSTRY

Data analytics

DOMAIN

RELIABILITY

DELIVERED

2024

STACK

CLOUDWATCH METRICS · STEP FUNCTIONS · EVENTBRIDGE · GRAFANA · CUSTOM SLO COMPUTATION

RESULTS

What changed, by the numbers.

BATCH SLO COVERAGE

22 / 22

ALL PIPELINES

STALENESS INCIDENTS

−74%

YEAR-OVER-YEAR

BURN-RATE ALERTS

ACTIVE

BEFORE EXHAUSTION

BATCH-RELIABILITY PRIORITISATION

3x

AGAINST PRIOR QUARTER

HOW IT WENT

The synchronous SLOs worked well: they measured what customers experienced minute by minute. The batch side was opaque. A pipeline due to deliver its output by 06:00 might deliver at 11:00, and nobody would notice until a downstream team complained. There was no shared definition of "broken."

We defined two SLOs per pipeline: freshness (output produced by a target time) and completeness (a target percentage of input data made it through). Custom metric computation against CloudWatch tracked both. Burn-rate alerts via EventBridge surfaced trouble before the budget was exhausted.
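To make the two SLOs and the burn-rate idea concrete, here is a minimal Python sketch. It is not the implementation used on this project: the record shape, function names, deadline, and thresholds are all hypothetical. It also leaves out publishing the values to CloudWatch and routing alerts through EventBridge.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class PipelineRun:
    """One run of a batch pipeline (hypothetical record shape)."""
    finished_at: datetime  # when the output landed
    records_in: int        # input records read
    records_out: int       # records that reached the output


def is_fresh(run: PipelineRun, deadline: time) -> bool:
    """Freshness: output produced by the target time of day.

    Assumes the run finishes on the calendar day it was scheduled for.
    """
    return run.finished_at.time() <= deadline


def is_complete(run: PipelineRun, min_ratio: float) -> bool:
    """Completeness: enough of the input made it through."""
    if run.records_in == 0:
        return True  # assumption: empty input counts as complete
    return run.records_out / run.records_in >= min_ratio


def burn_rate(good: list[bool], target: float) -> float:
    """Observed bad-run fraction divided by the allowed bad-run fraction.

    1.0 means the error budget is being spent at exactly the sustainable
    pace; higher values mean it runs out before the SLO period ends.
    """
    allowed = 1.0 - target
    if allowed <= 0:
        raise ValueError("target must be below 1.0 to leave an error budget")
    if not good:
        return 0.0
    bad = sum(1 for ok in good if not ok)
    return (bad / len(good)) / allowed


# Illustrative data: four daily runs, one late and one incomplete.
runs = [
    PipelineRun(datetime(2024, 3, 1, 5, 40), 1000, 1000),
    PipelineRun(datetime(2024, 3, 2, 11, 0), 1000, 1000),  # late
    PipelineRun(datetime(2024, 3, 3, 5, 55), 1000, 990),   # incomplete
    PipelineRun(datetime(2024, 3, 4, 5, 50), 1000, 1000),
]
freshness = [is_fresh(r, time(6, 0)) for r in runs]
completeness = [is_complete(r, 0.995) for r in runs]

# Hypothetical 95% target: one bad run in four burns budget at about 5x.
freshness_burn = burn_rate(freshness, target=0.95)
completeness_burn = burn_rate(completeness, target=0.95)
```

An alert rule would then compare each burn rate against a threshold over one or more lookback windows. The threshold and window lengths are tuning choices that this sketch does not prescribe.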

All 22 pipelines have SLOs now. Staleness incidents dropped 74% year-over-year — most of the previous incidents had been silent. Batch-reliability work got 3x the prioritisation in planning because the SLOs surfaced where the real cost was.
