Zhivko Todorov
CASE 173 · VERNAL · 2023

MONITORING · SLA · PLATFORM TEAM · INTERNAL METRICS

The platform team’s SLA, made measurable.

A B2B SaaS platform team had an "internal SLA" with its application-team customers: uptime for shared services such as the CI cluster, the artifact registry, and the secrets store. The SLA was claimed but never measured. We built the measurement and an internal-public dashboard.

INDUSTRY

B2B SaaS

DOMAIN

PLATFORM

DELIVERED

2023

STACK

CLOUDWATCH SYNTHETICS · AMAZON MANAGED GRAFANA · BACKSTAGE · CLOUDWATCH ALARMS · PAGERDUTY

RESULTS

What changed, by the numbers.

SHARED SERVICES MEASURED

14

PLATFORM-OWNED

SLA TRANSPARENCY

INTERNAL-PUBLIC

EVERY APP TEAM SEES

SLA-VIOLATION RESPONSE

< 1h

AT BREACH

TRUST IN PLATFORM

+22 NPS

TEAM SURVEY

HOW IT WENT

The platform team was frustrated that application teams routinely distrusted it. "Is CI down?" was a recurring Slack question even when it wasn’t. The honest answer was that nobody, the platform team included, could see whether CI was up or down at any given moment.

CloudWatch Synthetics ran scripted health checks against the 14 shared services. The Grafana dashboard was internal-public: every engineer could see the current state of every platform service. When a check breached its alarm, PagerDuty routed the alert to the platform team’s rotation.
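To make "measured" concrete, here is a minimal sketch of how pass/fail results from scripted checks can be turned into a per-service availability figure and compared against an SLA target. It is an illustration under stated assumptions, not the team's implementation: the `CheckResult` type, the service names, and the 99.9% target are all hypothetical, and in the real setup the check results would come from the monitoring stack rather than an in-memory list.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CheckResult:
    """One scripted health-check run against a shared service (hypothetical type)."""
    service: str   # e.g. "ci-cluster"; names here are illustrative only
    passed: bool   # True if the check succeeded


def availability(results: Iterable[CheckResult]) -> dict[str, float]:
    """Return the fraction of passing checks per service, from 0.0 to 1.0."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    for r in results:
        totals[r.service] = totals.get(r.service, 0) + 1
        passes[r.service] = passes.get(r.service, 0) + (1 if r.passed else 0)
    return {svc: passes[svc] / totals[svc] for svc in totals}


def breaches(avail: dict[str, float], target: float) -> list[str]:
    """Return the services whose measured availability is below the target."""
    return sorted(svc for svc, a in avail.items() if a < target)


if __name__ == "__main__":
    # Illustrative data: 200 checks per service, one CI failure.
    runs = (
        [CheckResult("ci-cluster", True)] * 199
        + [CheckResult("ci-cluster", False)]
        + [CheckResult("artifact-registry", True)] * 200
    )
    measured = availability(runs)
    # The 0.999 target is an assumed value, not one stated in the case study.
    for svc in breaches(measured, target=0.999):
        print(f"SLA breach: {svc} at {measured[svc]:.3%}")
```

The point of the design is that the breach decision is a pure function of recorded check results, so the same number can drive both the dashboard and the paging rule.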

Internal NPS for the platform team improved 22 points in the quarter following rollout. Response to an SLA violation came within an hour of the breach. The "is CI down?" Slack pattern stopped, because the answer was a Grafana link away.
