Zhivko Todorov

CASE 66 · QUARRY · 2023

RUNBOOKS · ON-CALL · DOCUMENTATION · INCIDENT RESPONSE

On-call runbooks that the next person on rotation can actually use.

A B2B SaaS platform had eighteen services, eighteen different on-call rotations, and eighteen different runbook formats — most of them outdated or missing. New rotation members spent their first quarter in survival mode. We standardised the runbook format and the on-call onboarding.

INDUSTRY

B2B SaaS

DOMAIN

RELIABILITY

DELIVERED

2023

STACK

BACKSTAGE · PAGERDUTY · STATUSPAGE · NOTION → BACKSTAGE TEMPLATES · CLOUDWATCH DASHBOARDS

RESULTS

What changed, by the numbers.

NEW ROTATION RAMP-UP

< 2w

WAS 8–12 WEEKS

RUNBOOK COVERAGE

100%

EVERY ON-CALL SERVICE

ESCALATIONS / WEEK

−54%

JUNIOR → SENIOR

POSTMORTEM "RUNBOOK MISSING"

0

WAS A REGULAR LINE

HOW IT WENT

The first on-call shift for a new engineer was always rough. Some services had wiki pages from 2020. Some had Notion docs nobody could find. Some just had "ask the team lead." Postmortems regularly cited "runbook was missing/outdated/wrong" as a contributing factor.

We built a Backstage template for service runbooks with required sections: the four most common alerts, their causes, their first-step mitigations, and the relevant dashboards. Each service owner ran a 90-minute workshop to fill in the template for their service. CloudWatch dashboards were embedded by URL.
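As an illustration only, the required sections can be thought of as a small schema with a completeness check. The sketch below is not the Backstage template itself; the names `AlertEntry`, `Runbook`, and `validate` are hypothetical, and it assumes each documented alert carries a cause, a first-step mitigation, and a dashboard URL.

```python
from dataclasses import dataclass, field

# Assumption: the template asks for the four most common alerts per service.
REQUIRED_ALERTS = 4


@dataclass
class AlertEntry:
    """One documented alert (hypothetical shape, not a Backstage type)."""
    name: str
    cause: str
    first_step: str
    dashboard_url: str


@dataclass
class Runbook:
    service: str
    alerts: list[AlertEntry] = field(default_factory=list)


def validate(runbook: Runbook) -> list[str]:
    """Return human-readable problems; an empty list means the runbook is complete."""
    problems = []
    if len(runbook.alerts) < REQUIRED_ALERTS:
        problems.append(
            f"{runbook.service}: {len(runbook.alerts)} alerts documented, "
            f"need {REQUIRED_ALERTS}"
        )
    for alert in runbook.alerts:
        # Every alert needs all three supporting fields filled in.
        for attr in ("cause", "first_step", "dashboard_url"):
            if not getattr(alert, attr).strip():
                problems.append(f"{runbook.service}/{alert.name}: missing {attr}")
    return problems
```

A check along these lines could run when a runbook is submitted, so that a half-filled template is caught before the service goes on rotation rather than during an incident.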

New rotation members now reach competence in under two weeks instead of the 8–12 weeks it used to take. Escalations from junior on-call to senior dropped 54%: most alerts now have an actionable runbook attached to the first page. The "runbook missing" line item is gone from postmortems.
