CASE 156 · EDDY · 2025
Half a million open connections, no surprises.
A social platform’s chat feature ran on a self-managed WebSocket gateway that fell over once it crossed 80k concurrent connections. The team had been scaling vertically (bigger instances) and praying. We rebuilt it on API Gateway’s WebSocket API with DynamoDB-backed connection state.
Social platform
RELIABILITY
2025
RESULTS
What changed, by the numbers.
CONCURRENT CONNECTIONS: 520K
CONNECTION-DROP RATE: < 0.05%
OPERATIONAL HOURS: −94%
p99 MESSAGE LATENCY: < 90ms
HOW IT WENT
The self-managed gateway had been a clever piece of engineering when the platform was small. As the platform grew, its operational model did not keep up: connection state lived in-process, broadcasts required cross-instance messaging the team had built by hand, and fault tolerance amounted to "if the box dies, all clients reconnect at once."
Moving to API Gateway WebSocket API handed connection management to a managed service. Connection state moved to DynamoDB so any Lambda invocation could look up the right connection. Broadcasts fanned out via SQS so that the broadcaster’s latency stayed independent of subscriber count.
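To make the shape of that design concrete, here is a minimal Python sketch, not the code from the engagement. It assumes a DynamoDB table keyed on a hypothetical `connectionId` attribute and a hypothetical chunk size of 100 connection IDs per queued job. The `table` and `sqs` arguments are expected to behave like a boto3 DynamoDB `Table` resource and a boto3 SQS client (`put_item`, `delete_item`, `send_message_batch`); the function names are illustrative.

```python
import json
from typing import Any, Iterator

SQS_BATCH_LIMIT = 10    # SQS accepts at most 10 entries per SendMessageBatch call
IDS_PER_MESSAGE = 100   # assumption: connection IDs carried by one queued job


def on_connect(table: Any, event: dict) -> dict:
    """$connect route: record the connection so any later invocation can find it."""
    connection_id = event["requestContext"]["connectionId"]
    table.put_item(Item={"connectionId": connection_id})
    return {"statusCode": 200}


def on_disconnect(table: Any, event: dict) -> dict:
    """$disconnect route: forget the connection."""
    connection_id = event["requestContext"]["connectionId"]
    table.delete_item(Key={"connectionId": connection_id})
    return {"statusCode": 200}


def chunked(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enqueue_broadcast(sqs: Any, queue_url: str,
                      connection_ids: list, payload: dict) -> int:
    """Split one broadcast into queued delivery jobs.

    The broadcaster only enqueues work; separate workers would drain the
    queue and push to each connection. Returns the number of batch calls made.
    """
    jobs = [
        json.dumps({"connectionIds": ids, "payload": payload})
        for ids in chunked(connection_ids, IDS_PER_MESSAGE)
    ]
    calls = 0
    for batch in chunked(jobs, SQS_BATCH_LIMIT):
        entries = [
            {"Id": str(i), "MessageBody": body}  # Id must be unique within a batch
            for i, body in enumerate(batch)
        ]
        sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        calls += 1
    return calls
```

In a setup like this, the workers consuming the queue would typically deliver each payload through the API Gateway Management API (`post_to_connection` in boto3) and delete table entries for connections that turn out to be gone. The broadcaster's work grows with the number of queued jobs rather than with per-subscriber delivery time, which is the property the paragraph above describes.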
Peak concurrent connections hit 520k during a viral event without triggering a single alert. Connection-drop rate stayed under 0.05% per hour. Operational hours dropped 94%, and there’s no more "WebSocket gateway on-call." In steady state, p99 message latency stays under 90ms.