CASE 21 · MOSAIC · 2025
The four-hour Postgres outage that came back as a one-page runbook.
A real-estate platform had a four-hour customer-visible Postgres outage on a Tuesday afternoon — a runaway query, a connection storm, an auto-failover that didn’t complete cleanly. We ran the postmortem, shipped the four highest-impact remediations, and turned the lessons into a one-page operations runbook.
Real-estate tech
RELIABILITY
2025
RESULTS
What changed, by the numbers.
RECURRENCE
0
CONNECTION HEADROOM
+340%
FAILOVER TIME
54s
MTTD
90s
HOW IT WENT
The postmortem identified three independent failures stacked: an N+1 query a developer had added two weeks earlier, an undersized connection pool that storm-amplified the load, and a parameter group setting (`statement_timeout`) that hadn’t been propagated from staging.
The remediations were not glamorous. RDS Proxy in front of the writer absorbed the connection storms. Performance Insights baseline alarms caught query regressions before they cascaded. The Blue/Green deployment pattern handled parameter group changes without restarts. pgBadger ran weekly against the logs.
We left a one-page runbook for "when Postgres is misbehaving" — three commands, two dashboards, one Slack channel. The platform team has used it once since, on a much smaller incident. Time to detection was 90 seconds; time to mitigation was four minutes.
RELATED · SAME DOMAIN
Other engagements in this space.
READY WHEN YOU ARE
Let's get your AWS bill (and architecture) in order.
The discovery call is free. You walk away with at least one concrete idea — even if we never work together.