Zhivko Todorov

CASE 21 · MOSAIC · 2025

RDS · POSTMORTEM · BLUE/GREEN · PARAMETER GROUPS

The four-hour Postgres outage that came back as a one-page runbook.

A real-estate platform had a four-hour customer-visible Postgres outage on a Tuesday afternoon — a runaway query, a connection storm, an auto-failover that didn’t complete cleanly. We ran the postmortem, shipped the four highest-impact remediations, and turned the lessons into a one-page operations runbook.

INDUSTRY: Real-estate tech

DOMAIN: RELIABILITY

DELIVERED: 2025

STACK: RDS POSTGRES · RDS PROXY · PERFORMANCE INSIGHTS · AURORA BLUE/GREEN · CLOUDWATCH ALARMS · PGBADGER

RESULTS

What changed, by the numbers.

RECURRENCE: 0 (90 DAYS POST-FIX)

CONNECTION HEADROOM: +340% (VIA RDS PROXY)

FAILOVER TIME: 54s (WAS 11m DURING INCIDENT)

MTTD: 90s (WAS 23m)

HOW IT WENT

The postmortem identified three independent failures stacked on top of one another: an N+1 query a developer had added two weeks earlier, an undersized connection pool that amplified the load into a connection storm, and a parameter group setting (`statement_timeout`) that hadn’t been propagated from staging to production.
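The third failure, a setting present in staging but missing in production, is the kind of drift a small comparison script can flag. The following is a minimal sketch, not the check used on this engagement: the two snapshots are hypothetical `{name: value}` dictionaries with illustrative values, and `parameter_drift` is a name introduced here for the example. In practice the snapshots could be built from the RDS parameter group listing for each environment.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshots with illustrative values (assumptions, not the
# client's real settings). In Postgres, statement_timeout = 0 disables it.
STAGING = {"statement_timeout": "30000", "max_connections": "500"}
PRODUCTION = {"statement_timeout": "0", "max_connections": "500"}


def parameter_drift(
    reference: Dict[str, str], target: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (name, reference_value, target_value) for every mismatch."""
    missing = "<unset>"
    drift = []
    # Walk the union of names so a parameter absent on one side is reported too.
    for name in sorted(set(reference) | set(target)):
        ref_value = reference.get(name, missing)
        tgt_value = target.get(name, missing)
        if ref_value != tgt_value:
            drift.append((name, ref_value, tgt_value))
    return drift


if __name__ == "__main__":
    for name, staging_value, prod_value in parameter_drift(STAGING, PRODUCTION):
        print(f"{name}: staging={staging_value} production={prod_value}")
```

Running a comparison like this on a schedule, or as a deploy gate, turns a silent difference between environments into a visible one before it matters.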

The remediations were not glamorous. RDS Proxy in front of the writer absorbed the connection storms. Performance Insights baseline alarms caught query regressions before they cascaded. The Blue/Green deployment pattern handled parameter group changes without restarts. pgBadger ran weekly against the logs.
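To give the alarm remediation some shape, here is a hedged sketch of a CloudWatch alarm on the `DBLoad` metric that Performance Insights publishes to the `AWS/RDS` namespace. It is a simplification: it uses a static threshold at the instance's vCPU count (a common rule of thumb) rather than the baseline alarms described above, and the function name, instance identifier, alarm name, periods, and SNS topic are all assumptions made for illustration.

```python
from typing import Any, Dict


def db_load_alarm(instance_id: str, vcpus: int, sns_topic_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch PutMetricAlarm on DBLoad."""
    return {
        "AlarmName": f"{instance_id}-db-load-high",  # hypothetical naming scheme
        "Namespace": "AWS/RDS",
        "MetricName": "DBLoad",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,            # one-minute datapoints (assumed)
        "EvaluationPeriods": 2,  # two consecutive breaches before alarming (assumed)
        "Threshold": float(vcpus),  # static threshold: average active sessions > vCPUs
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# Usage with boto3 (not executed here; requires AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **db_load_alarm("prod-writer", 8, "arn:aws:sns:...")
#   )
```

Keeping the alarm definition as a plain dictionary makes it easy to review in a pull request and to reuse across instances with different vCPU counts.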

We left a one-page runbook for "when Postgres is misbehaving": three commands, two dashboards, one Slack channel. The platform team has used it once since, on a much smaller incident. That time, detection took 90 seconds and mitigation took four minutes.
