WORK · SELECTED CASE STUDIES

A few engagements, in detail.

Names anonymised in public copy where the client asked; numbers are real. Filter by domain or stack on the left, sort however helps, then open any brief for the full architecture and narrative.

176 CASE STUDIES

01
VANTAGE · 2026 · COST
A $400k monthly bill, with the right commitments for once.
A consumer SaaS company had a $400k/month AWS bill and a Savings Plans portfolio that someone had purchased in 2022 and never revisited. Coverage was 41%; the effective savings rate was around 18%. We rebalanced the portfolio, shifted a chunk of the commitment to Compute Savings Plans, and moved 35% of compute to Graviton.
SAVINGS PLANS · RI · COST EXPLORER · GRAVITON
37%
EFFECTIVE SAVINGS
WAS 18%
02
ARCO · 2026 · SECURITY
Retired eighty IAM users in three weeks.
An engineering SaaS company had eighty IAM users with long-lived access keys, half of them belonging to people who no longer worked there. We rolled out IAM Identity Center federated to Okta, migrated every human and machine identity, and deleted the IAM users without breaking a single workflow.
IAM IDENTITY CENTER · SSO · IAM USERS · OKTA
80 → 0
IAM USERS RETIRED
INCLUDING SERVICE ACCOUNTS
03
PILLAR · 2026 · RELIABILITY
An active-active payments API that runs everywhere, all the time.
A payments processor needed an active-active multi-region architecture for their core authorisation API — not warm-standby, not failover, but real concurrent serving from both regions with sub-second cross-region writes. We rebuilt the data layer on DynamoDB Global Tables and the routing on Route 53 latency records.
ACTIVE-ACTIVE · MULTI-REGION · DYNAMODB GLOBAL · ROUTE 53
< 30s
RTO (REGION LOSS)
AUTOMATIC HEALTH CHECKS
04
LUMEN · 2026 · COST
Model training on Spot, checkpoint-resumed.
A computer vision team trained models on SageMaker with on-demand p3.16xlarge instances at roughly $24/hour each. A full training run took 96 hours. Their monthly training bill ran $42k. We moved training to managed Spot with checkpoint-resume and brought it under $11k.
SAGEMAKER · SPOT · TRAINING · CHECKPOINTING
−74%
TRAINING COST
$42K → $11K / MONTH
05
RIPTIDE · 2026 · PLATFORM
An infrastructure portal the application teams actually use.
An AI startup’s platform team had been the bottleneck for every new piece of infrastructure: every S3 bucket, every Aurora instance, every SQS queue. We built a self-service portal on Backstage that engineers now use to provision infra as Crossplane-managed AWS resources.
SELF-SERVICE · INFRASTRUCTURE · CROSSPLANE · BACKSTAGE
−88%
PLATFORM-TEAM TICKETS
INFRA REQUESTS
06
PINION · 2026 · SECURITY
Bots stopped, humans didn’t notice.
A ticketing platform was losing high-demand event ticket releases to scalper bots. Blocking outright was too aggressive (false positives killed legitimate buyers); CAPTCHA was too rude (drop-off was measurable). AWS WAF’s Challenge action — a silent client-side cryptographic puzzle — let us stop the bots without showing a CAPTCHA to humans.
WAF · BOT CONTROL · CHALLENGE · CAPTCHA
−94%
BOT TICKET PURCHASES
PER RELEASE
07
WISTERIA · 2026 · MIGRATION
Marketplace payouts, run by Stripe, not by our compliance team.
A multi-vendor marketplace had been handling vendor payouts via a homegrown ACH pipeline with manual KYC review. As the marketplace grew, the compliance overhead grew faster. We migrated to Stripe Connect Custom — Stripe handles KYC, payout rails, and 1099 reporting; we handle marketplace logic.
STRIPE CONNECT · PAYOUTS · COMPLIANCE · MARKETPLACE
< 2d
KYC TIME-TO-PAYOUT
WAS 9–14 DAYS
08
BRIAR · 2026 · RELIABILITY
A bad deploy that only one customer notices.
A SaaS platform with 800 tenants on a shared infrastructure had had two "every customer affected" outages in twelve months. We adopted a cell-based architecture with shuffle-sharded tenant placement; the next bad deploy affected one cell of 32 tenants, not all 800.
CELL-BASED · SHUFFLE-SHARDING · EKS · ROUTE 53
32 / 800
BLAST RADIUS
PER FAILURE EVENT
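A minimal sketch of the placement arithmetic behind this brief, assuming 800 tenants and fixed cells of 32 (so 25 cells). It uses a seeded shuffle to spread tenants across cells, which is a simplification for illustration and not the engagement's actual placement logic; the tenant IDs, the seed, and the `place_tenants` helper are all hypothetical.

```python
import random

def place_tenants(tenant_ids, cell_size, seed=0):
    """Shuffle tenants deterministically, then cut them into fixed-size cells."""
    shuffled = list(tenant_ids)
    random.Random(seed).shuffle(shuffled)  # seeded so placement is reproducible
    return [shuffled[i:i + cell_size] for i in range(0, len(shuffled), cell_size)]

tenants = [f"tenant-{n:03d}" for n in range(800)]  # hypothetical tenant IDs
cells = place_tenants(tenants, cell_size=32)

# A bad deploy confined to one cell reaches at most cell_size tenants.
blast_radius = len(cells[0]) / len(tenants)
print(len(cells), f"{blast_radius:.0%}")
```

With these numbers a single-cell failure touches 32 of 800 tenants, which is the 32 / 800 figure quoted above.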
09
WILDER · 2026 · SECURITY
Developers who can create IAM roles, safely.
A developer tools company had a "no, we won’t give you iam:CreateRole permissions" stance from the security team and a "we file two tickets a week for new IAM roles" stance from the engineering team. We resolved it with IAM permission boundaries: engineers can create roles, but only roles bounded by a security-approved policy.
IAM · PERMISSION BOUNDARIES · DELEGATION · SELF-SERVICE
−92%
IAM-TICKET QUEUE
TO SECURITY TEAM
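As a rough illustration of the delegation pattern described above, the sketch below builds an IAM policy document that lets engineers create roles only when a specific permissions boundary is attached. The account ID, the boundary policy name, and the statement IDs are hypothetical; the brief does not publish the actual policy.

```python
import json

# Hypothetical ARN for the security-approved boundary policy.
BOUNDARY_ARN = "arn:aws:iam::111122223333:policy/engineer-boundary"

delegation_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Engineers may create and modify roles only when the boundary is attached.
            "Sid": "CreateRoleOnlyWithBoundary",
            "Effect": "Allow",
            "Action": ["iam:CreateRole", "iam:PutRolePolicy", "iam:AttachRolePolicy"],
            "Resource": "*",
            "Condition": {"StringEquals": {"iam:PermissionsBoundary": BOUNDARY_ARN}},
        },
        {
            # Holders of this policy cannot strip the boundary off afterwards.
            "Sid": "NoBoundaryRemoval",
            "Effect": "Deny",
            "Action": "iam:DeleteRolePermissionsBoundary",
            "Resource": "*",
        },
    ],
}

print(json.dumps(delegation_policy, indent=2))
```

The condition key `iam:PermissionsBoundary` is what ties role creation to the approved boundary; a real rollout would also need to scope `Resource` and decide who may edit the boundary policy itself.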
10
MAVEN · 2026 · SECURITY
Service-to-service mTLS without a service mesh.
A healthcare platform had been told they needed Istio to do mutual TLS between services. The team had tried Istio twice and walked away both times. We delivered the same security property with VPC Lattice and IAM auth in three weeks.
VPC LATTICE · mTLS · IAM AUTH · SERVICE MESH
100%
mTLS COVERAGE
SERVICE-TO-SERVICE
11
GLADE · 2026 · PLATFORM
Twelve EKS clusters, one application surface.
An AI infrastructure company ran 12 EKS clusters across regions and GPU classes. Deploying anything across the whole estate was 12 separate kubectl applies plus a spreadsheet to track them. We adopted Karmada for cluster federation and turned cross-cluster operations into a single declaration.
EKS · KARMADA · MULTI-CLUSTER · FEDERATION
−83%
CROSS-CLUSTER DEPLOY TIME
90m → 15m
12
CARDINAL · 2026 · PLATFORM
Build provenance that the supply chain can actually verify.
An open-source vendor shipping binaries and container images to thousands of users had had two near-misses with dependency-substitution attacks. Customers had started asking, in writing, for SLSA-level provenance. We rebuilt the build pipeline to emit SLSA L3-compliant attestations.
SLSA · COSIGN · PROVENANCE · OIDC
L3
SLSA LEVEL
BUILD PROVENANCE
13
LARCH · 2026 · LANDING ZONE
Guardrails that fail the deploy, not the audit.
A B2B fintech enforced guardrails detectively — Config rules caught misconfigurations after they shipped. Audit kept finding ten-minute-old non-compliance. We moved the most-violated controls to CloudFormation Hooks so they fail the deploy synchronously.
CFN HOOKS · PREVENTIVE CONTROLS · IaC · GUARDRAILS
−96%
POST-DEPLOY VIOLATIONS
AGAINST TARGETED CONTROLS
14
HAZEL · 2026 · COST
Training commitments that match the model release cadence.
A computer vision company shipped a new model version every six weeks, with each release requiring 80+ hours of GPU training. The training was on-demand; the bill was lumpy and expensive. We added SageMaker-aware Savings Plans matched to the release cadence and brought average effective rate to 41% off list.
SAGEMAKER · SAVINGS PLANS · TRAINING · COMPUTE
−41%
EFFECTIVE TRAINING RATE
OFF LIST
15
ZINNIA · 2026 · PLATFORM
Engineering metrics the CTO can show the board.
A public SaaS CTO needed engineering health metrics for board reporting — beyond DORA. We assembled a comprehensive site-reliability metric set covering golden signals, error-budget burn, deploy quality, on-call load, and platform reliability into a single quarterly report.
SRE · METRICS · GOLDEN SIGNALS · OBSERVABILITY
14
BOARD-LEVEL METRICS
AUTOMATED
16
LINDEN · 2026 · SECURITY
Container images you can prove the lineage of.
A devtools company shipping container images to enterprise customers had to answer "what’s in this image" for every customer security questionnaire. The answers were assembled by hand each time. We built a supply-chain pipeline that ships SBOMs and signatures with every image.
SBOM · COSIGN · ECR · SLSA
100%
SBOM COVERAGE
OF SHIPPED IMAGES
17
EMBER · 2026 · LANDING ZONE
AI model access, scoped to who is allowed to use which.
An AI products company let any engineer call any Bedrock model in any region. Compliance was uncomfortable; finance was alarmed at the spend per individual experiment. We rolled out SCPs that scoped model access per OU, with named-model approvals and region restrictions.
SCP · BEDROCK · AI GOVERNANCE · MODEL ACCESS
−72%
AI BILL VOLATILITY
MONTH-OVER-MONTH
18
TRELLIS · 2026 · SECURITY
Secrets that rotate themselves, even the third-party ones.
A travel tech platform had Secrets Manager but used it as "Parameter Store with a fancier name" — no rotation, no audit. The first audit finding said so. We turned on rotation for every secret in scope, including the third-party API keys that everyone had assumed couldn’t be rotated.
SECRETS MANAGER · ROTATION · RDS · LAMBDA
100%
SECRETS WITH ROTATION
IN SCOPE
19
BEACON · 2026 · LANDING ZONE
Guardrails that stop drift before HIPAA notices.
A digital health company expanded from one clinical product to three, ran into HIPAA audit findings on configuration drift, and asked for guardrails strict enough to stop drift but loose enough to ship. We built a tiered Service Control Policy framework, tested it against three months of historical CloudTrail, and rolled it to ninety accounts.
ORGANIZATIONS · HIPAA · SCP · GUARDRAILS
−100%
HIPAA FINDINGS
AT QUARTERLY AUDIT
20
VECTOR · 2026 · PLATFORM
Backstage, but only the parts that earned their keep.
An enterprise SaaS company had eighty services, fifty engineers, and three different ways to provision new infrastructure depending on which platform engineer you asked. We built a Backstage IDP with three golden paths — each one ten minutes from idea to running service — and resisted the temptation to add more.
BACKSTAGE · IDP · GOLDEN PATHS · TERRAFORM
10m
NEW SERVICE PROVISIONING
WAS 2 BUSINESS DAYS
21
ATLAS · 2026 · LANDING ZONE
A data lake that the legal team actually signed off on.
A streaming media company had four analytics teams, three of them writing to the same S3 bucket with overlapping prefixes, and a legal team that had quietly stopped reading the architecture diagrams. We rebuilt the analytics estate as a producer-consumer data lake with fifteen accounts, Lake Formation governance, and one row-level access policy per consumer.
ORGANIZATIONS · DATA LAKE · LAKE FORMATION · S3
−96%
CROSS-TEAM INCIDENTS
COLLISIONS IN SHARED PREFIXES
22
HALCYON · 2026 · LANDING ZONE
A network that doesn’t need a senior engineer to debug.
A cross-border payments company had grown from one region to four with ad-hoc VPC peering between every pair of VPCs. The mesh had 38 connections and one person who understood it. We collapsed it to a Transit Gateway hub-and-spoke per region, with inter-region peering and a route table per traffic class.
TRANSIT GATEWAY · NETWORK · MULTI-REGION · DIRECT CONNECT
38 → 11
VPC CONNECTIONS
INTER-VPC PATHS
23
ACME · 2026 · LANDING ZONE
Cut the AWS bill by 44%, kept the SLA.
A Series-B fintech on a single AWS account, growing 30% MoM, with auditors three weeks out. We rebuilt the landing zone on Control Tower and shipped the SOC 2 Type II remediation work in nine weeks — without a single migration weekend.
CONTROL TOWER · SOC 2 · MULTI-REGION · AURORA
−44%
AWS BILL
$61K → $34K / MONTH
24
MURMUR · 2025 · PLATFORM
Self-hosted observability that actually saves money.
A climate analytics company had been on a fully-managed observability vendor at $52k/mo. They had the engineering capacity to operate their own but had assumed it would be more expensive once you priced in the time. We built a self-hosted Grafana + Tempo + Loki stack on EKS and brought the all-in cost (including operational time) under $14k/mo.
GRAFANA · TEMPO · LOKI · OPENTELEMETRY
−73%
ALL-IN COST
$52K → $14K / MONTH
25
LATTICE · 2025 · COST
Four petabytes of S3, half the bill.
A geospatial analytics company had four petabytes of raster imagery in S3, growing 50TB a month. Everything was Standard. The bill was $94k a month and rising. We mapped actual access patterns, ran a lifecycle migration, and dropped storage cost 51% without touching application code.
S3 · INTELLIGENT TIERING · LIFECYCLE · GLACIER
−51%
STORAGE BILL
$94K → $46K / MONTH
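For a sense of what a lifecycle migration like this looks like in configuration, here is a hedged sketch of a single rule, shaped like the `LifecycleConfiguration` argument that boto3's `put_bucket_lifecycle_configuration` accepts. The rule ID, prefix, and day thresholds are hypothetical and not the values used in the engagement; only the two bill figures come from the brief.

```python
import json

# Hypothetical rule: ID, prefix, and day thresholds are illustrative only.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-raster-imagery",
            "Status": "Enabled",
            "Filter": {"Prefix": "raster/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

before, after = 94_000, 46_000  # monthly storage bill from the brief, USD
saving = (before - after) / before

print(json.dumps(lifecycle_configuration, indent=2))
print(f"{saving:.0%}")
```

Because the rule is applied at the bucket level, objects move between storage classes without any change to how the application reads or writes them.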
26
TEMPO · 2025 · MIGRATION
Self-hosted Kafka, retired without losing a partition.
A real-time bidding platform ran 28 self-hosted Kafka brokers across three AZs, with a team that had become reluctant ZooKeeper experts. We migrated to MSK Serverless for the variable workloads and MSK Provisioned for the steady ones, retired the EC2 cluster, and reclaimed the team’s attention.
KAFKA · MSK · CONNECT · SCHEMA REGISTRY
−92%
OPERATIONAL HOURS / WEEK
TEAM RECLAIMED 38h/wk
27
FOSTER · 2025 · SECURITY
Two million users migrated, no password resets.
A consumer marketplace had two million users on a custom-rolled auth service that the team didn’t want to maintain anymore. We migrated to Cognito User Pools without forcing any user to reset their password — using Cognito’s lazy-migration trigger to verify credentials against the legacy store on first sign-in.
COGNITO · MIGRATION · CUSTOM AUTH · PASSWORDS
0
PASSWORD RESETS REQUIRED
TRANSPARENT MIGRATION
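The lazy-migration flow above can be sketched as a Lambda handler for Cognito's user-migration trigger. This is an illustration under assumptions, not the engagement's code: the in-memory `LEGACY_USERS` store, the PBKDF2 hashing scheme, and the `verify_legacy_password` helper are hypothetical stand-ins for the legacy auth service.

```python
import hashlib
import hmac

# Hypothetical stand-in for the legacy credential store: username -> (salt, hash, email).
_SALT = b"demo-salt"
LEGACY_USERS = {
    "ada": (_SALT, hashlib.pbkdf2_hmac("sha256", b"correct horse", _SALT, 100_000), "ada@example.com"),
}

def verify_legacy_password(username, password):
    """Return the user's email if the password matches the legacy hash, else None."""
    record = LEGACY_USERS.get(username)
    if record is None:
        return None
    salt, stored_hash, email = record
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return email if hmac.compare_digest(candidate, stored_hash) else None

def handler(event, context):
    """Cognito user-migration trigger: runs when a sign-in names a user the pool lacks."""
    if event["triggerSource"] == "UserMigration_Authentication":
        email = verify_legacy_password(event["userName"], event["request"]["password"])
        if email is None:
            raise Exception("Bad credentials")  # Cognito rejects the sign-in
        event["response"]["userAttributes"] = {"email": email, "email_verified": "true"}
        event["response"]["finalUserStatus"] = "CONFIRMED"  # keep the existing password
        event["response"]["messageAction"] = "SUPPRESS"     # no welcome message
    return event
```

Returning `finalUserStatus` as `CONFIRMED` is what lets the user keep their existing password. The sketch leaves out the separate `UserMigration_ForgotPassword` trigger source, which a full migration would also need to handle.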
28
CALYPSO · 2025 · SECURITY
Customer keys, in the customer’s custody.
A B2B SaaS company had been losing enterprise deals over a recurring objection: "we want our data encrypted with keys we control, not keys you control." We added BYOK support via AWS KMS External Key Store and unblocked $3.6M in pipeline.
BYOK · KMS · XKS · COMPLIANCE
$3.6M
PIPELINE UNBLOCKED
PREVIOUSLY BLOCKED BY KEY OWNERSHIP
29
TENDRIL · 2025 · PLATFORM
Incident playbooks that run themselves while you’re still opening your laptop.
A fintech’s incident response playbooks were a Notion page the on-call read while diagnosing. Steps like "restart the unhealthy ECS service" or "drain the affected AZ" were manual, with the on-call’s attention split between reading and doing. We automated the high-confidence remediation steps with SSM Automation runbooks.
INCIDENT RESPONSE · AUTOMATION · SSM RUNBOOKS · PAGERDUTY
< 3m
MTTR (AUTOMATED CASES)
WAS 20–45m
30
HEXA · 2025 · COST
A real-time pipeline that paid for itself in six weeks.
A healthcare data company was running a near-real-time pipeline on six always-on EC2 instances behind an Application Load Balancer, processing roughly 40 events a second at peak. The bill was $8,200 a month. We rebuilt it serverless and brought it to $640.
LAMBDA · EVENTBRIDGE · COST · SERVERLESS
−92%
COMPUTE BILL
$8.2K → $640 / MONTH
31
TUNDRA · 2025 · SECURITY
SOC 2 Type II evidence that gathers itself.
A B2B SaaS company had passed SOC 2 Type I in 2023 and was due for Type II in 2025. The Type I prep had consumed two senior engineers for three months. We automated 78% of the evidence collection so Type II prep took two engineer-weeks total.
SOC 2 · AUDIT MANAGER · CONFIG · AUTOMATION
78%
EVIDENCE AUTOMATED
OF 167 CONTROLS
32
LYRIC · 2025 · RELIABILITY
Deploys that the customer doesn’t notice. Not even the canary one.
A consumer mobile app had a deployment process that took the API into degraded mode for 20–30 seconds during each release, three times a day. Players noticed; reviews complained. We replaced it with ALB-weighted canaries and blue/green deploys that don’t touch the served traffic shape.
ALB WEIGHTED · CANARY · CODEDEPLOY · BLUE/GREEN
0s
DEPLOY-INDUCED DEGRADATION
WAS 25s × 3 / DAY
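A hedged sketch of what an ALB-weighted canary step can look like: the helper below builds a forward action of the shape the ELBv2 listener APIs accept, splitting traffic between two target groups by weight. The ARNs, the `weighted_forward` helper, and the 5/25/50/100 step schedule are hypothetical and only illustrate the mechanism.

```python
def weighted_forward(blue_arn, green_arn, green_percent):
    """Build a listener forward action sending green_percent of requests to green."""
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")
    return {
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": blue_arn, "Weight": 100 - green_percent},
                {"TargetGroupArn": green_arn, "Weight": green_percent},
            ]
        },
    }

# Hypothetical target group ARNs and canary steps.
BLUE = "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/api-blue/0123456789abcdef"
GREEN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/api-green/fedcba9876543210"

for step in (5, 25, 50, 100):
    action = weighted_forward(BLUE, GREEN, step)
    print([tg["Weight"] for tg in action["ForwardConfig"]["TargetGroups"]])
```

Each step shifts weight without deregistering the old targets, so traffic served by the blue group stays steady until the final cutover.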
33
TAMARIND · 2025 · PLATFORM
CI builds that finish before the engineer remembers they pushed.
A developer infrastructure company had CI pipelines averaging 14 minutes per build, dominated by dependency installation and container layer rebuilds. We added an S3-backed build cache for the dependency steps and ECR layer caching for the container builds. Median build time landed at 92 seconds.
BUILD CACHE · S3 · ECR · CI
92s
MEDIAN BUILD TIME
WAS 14m
34
BURROW · 2025 · SECURITY
Key custody that the regulator certifies, not promises.
A financial market infrastructure firm needed FIPS 140-2 Level 3 key custody for signing trade settlement messages. KMS Level 3 hadn’t quite landed for this customer’s region. We deployed a CloudHSM cluster with a custom integration layer and got regulator sign-off in twelve weeks.
CLOUDHSM · KEY CUSTODY · FIPS 140-2 · HSM
L3
FIPS 140-2 LEVEL
CERTIFIED
35
MAGNOLIA · 2025 · COST
Images optimised at request, not stored in twelve sizes.
A fashion e-commerce site stored each product image in twelve pre-sized variants (thumbnail, mobile, tablet, desktop, retina × 2-3 use cases). The S3 bill was dominated by image storage. We moved to on-the-fly resizing at the edge using Lambda@Edge plus CloudFront, deleting the pre-sized variants.
CLOUDFRONT · LAMBDA@EDGE · IMAGE OPTIMISATION · S3
−82%
S3 STORAGE BILL
IMAGES ONLY
36
EDDY · 2025 · RELIABILITY
Half a million open connections, no surprises.
A social platform’s chat feature ran on a self-managed WebSocket gateway that fell over once it crossed 80k concurrent connections. The team had been scaling vertically (bigger instances) and praying. We rebuilt on API Gateway WebSocket API with DynamoDB-backed connection state.
WEBSOCKET · API GATEWAY · DYNAMODB · SCALING
520K
CONCURRENT CONNECTIONS
PEAK MEASURED
37
HONEYCOMB · 2025 · PLATFORM
Dev environments that match production, on the engineer’s first day.
A B2B SaaS company onboarded new engineers with a multi-day "set up your dev environment" experience that produced different states on different laptops. We standardised on devcontainers, ran them through GitHub Codespaces for new hires, and got everyone to parity.
DEVCONTAINERS · CODESPACES · DEV ENV · PARITY
< 30m
TIME TO FIRST RUNNING APP
NEW HIRE
38
XYSTUS · 2025 · PLATFORM
Monorepo builds that finish in seconds, not minutes.
A developer tools company had a 40-package TypeScript monorepo with full-rebuild times of 14 minutes on CI and roughly 6 minutes locally. Most changes touched two or three packages, not 40. We adopted Turborepo with a remote cache on S3 and got incremental builds to the seconds they should be.
MONOREPO · TURBOREPO · BUILD CACHE · S3
< 8s
INCREMENTAL BUILD TIME
PER PACKAGE CHANGE
39
COBALT · 2025 · COST
Karpenter, Spot, and the EKS bill that fell sixty percent.
An AdTech company ran 14 EKS clusters with Cluster Autoscaler, a fixed set of node groups, and an on-demand-only policy. The compute bill was $180k/month and the autoscaler routinely ran clusters at 38% utilisation. We replaced it with Karpenter, opened the clusters to Spot, and pulled utilisation past 70%.
EKS · KARPENTER · SPOT · BIN-PACKING
−60%
COMPUTE BILL
$180K → $72K / MONTH
40
VINTAGE · 2025 · COST
Compute that runs the same workload for thirty percent less.
A consumer publishing platform ran their entire fleet on Amazon Linux 2 with x86 instances, despite roughly two-thirds of the workloads being ARM-compatible. We migrated to Amazon Linux 2023 on Graviton-based instances, one workload class at a time.
AMAZON LINUX 2023 · GRAVITON · EC2 · COMPUTE
−31%
COMPUTE BILL
ON MIGRATED FLEET
41
GRANITE · 2025 · SECURITY
Malware caught at the volume, not at the customer.
A manufacturing tech company had a customer-uploaded file scanning gap — uploads landed in S3 and were processed by EKS pods without antivirus inspection. After a near-miss with an infected upload, we deployed GuardDuty Malware Protection for both EBS and EKS, with an automated quarantine flow.
GUARDDUTY · MALWARE PROTECTION · EBS · EKS
100%
SCAN COVERAGE
OF CUSTOMER UPLOADS
42
TALISMAN · 2025 · SECURITY
S3 endpoints scoped to the buckets they’re allowed to talk to.
A B2B SaaS company’s S3 Gateway Endpoints were configured with the default open policy — any S3 bucket, any operation. A compromised IAM principal would have had a clean exfiltration path. We tightened the endpoint policies to a per-environment allowlist of buckets and operations.
VPC ENDPOINTS · ENDPOINT POLICIES · S3 · EXFIL DEFENSE
−98%
EXFIL ATTACK SURFACE
AGAINST DEFAULT ALLOW
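To make the allowlist idea concrete, the sketch below generates an S3 gateway endpoint policy that permits only named buckets and operations, in place of the default allow-everything policy. The bucket names, the action set, and the `endpoint_policy` helper are hypothetical; the real per-environment lists are not part of this brief.

```python
import json

def endpoint_policy(buckets, actions):
    """Build an S3 gateway endpoint policy that allows only the listed buckets and actions."""
    resources = []
    for bucket in buckets:
        resources.append(f"arn:aws:s3:::{bucket}")    # bucket-level calls such as ListBucket
        resources.append(f"arn:aws:s3:::{bucket}/*")  # object-level calls
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowlistedBucketsOnly",
                "Effect": "Allow",
                "Principal": "*",
                "Action": sorted(actions),
                "Resource": resources,
            }
        ],
    }

# Hypothetical per-environment allowlist.
prod_policy = endpoint_policy(
    buckets=["example-prod-uploads", "example-prod-exports"],
    actions={"s3:GetObject", "s3:PutObject", "s3:ListBucket"},
)
print(json.dumps(prod_policy, indent=2))
```

Anything not matched by the statement is implicitly denied at the endpoint, so a compromised principal inside the VPC cannot use the endpoint to reach a bucket outside the list.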
43
ANCHOR · 2025 · COST
Origin Shield, sized to actually pay for itself.
A software distribution company served 14PB/month from CloudFront with origins in S3 across two regions. Cache hit rate at the edge was good (84%); origin fetches were still expensive because they hit S3 from every edge location independently. We added Origin Shield and brought origin fetches down 73%.
CLOUDFRONT · ORIGIN SHIELD · CACHE · EGRESS
−73%
ORIGIN FETCHES
S3 + EGRESS COMBINED
44
BRINK · 2025 · MIGRATION
Search relevance the team can tune themselves.
A job board was paying Algolia $7,600/month for search that served 18M monthly queries against 800k job listings. The pricing tier was tight; the relevance tuning was opaque. We migrated to OpenSearch Service with a careful relevance reconciliation.
ALGOLIA → OPENSEARCH · SEARCH · RELEVANCE · COST
−63%
SEARCH BILL
$7.6K → $2.8K / MONTH
45
SABER · 2025 · RELIABILITY
Critical-path Lambdas that throttle other workloads, not the customer.
A B2B platform had 80+ Lambda functions sharing the regional concurrency limit. A spike in batch-processing Lambdas had once throttled the customer-facing checkout flow. We added reserved concurrency for the critical path and provisioned concurrency for the latency-sensitive piece.
LAMBDA · CONCURRENCY · RESERVED · THROTTLING
0
CHECKOUT THROTTLE EVENTS
90 DAYS POST-FIX
46
CONDUIT · 2025 · RELIABILITY
Indexes that match the queries, not the wishes.
An IoT telemetry platform’s DynamoDB tables had been designed for the queries the team imagined they would need. A year in, the actual access patterns had drifted. Read latency was up, costs were up, and one GSI was scanning 80% of the table on every query. We redesigned the indexes around the real access patterns.
DYNAMODB · GSI · ACCESS PATTERN · SCHEMA
−63%
READ COST
GSI-DRIVEN
47
OASIS · 2025 · PLATFORM
A design system the front-end team actually uses.
A B2C marketplace had a Figma design system that didn’t map cleanly to the React component library, and a React component library that didn’t cleanly match what was shipping. We rebuilt around Storybook with Chromatic visual regression tests, making the design system the single source of truth.
STORYBOOK · DESIGN SYSTEM · COMPONENTS · CHROMATIC
0
COMPONENT DRIFT INCIDENTS
CHROMATIC-CAUGHT
48
HELIX · 2025 · SECURITY
Eight regions, one key strategy, no manual rotations.
A biotech research platform encrypted everything — S3, RDS, EBS, Secrets Manager — but had grown organically to 340 KMS keys across eight regions with no consistent rotation policy. Some keys were untagged. Two were orphaned. We rebuilt the key strategy from scratch.
KMS · MULTI-REGION · ENCRYPTION · KEY ROTATION
340 → 47
KEYS MANAGED
AFTER CONSOLIDATION
49
SLATE · 2025 · COST
Observability cost, brought back to earth.
A mobile gaming studio had a $61k/month CloudWatch Logs bill, primarily driven by Lambda execution logs at the most verbose log level, retained for 90 days "just in case." We applied a tiered retention policy, sampled the noisy log streams, and got the bill to $14k without losing investigative power.
CLOUDWATCH LOGS · RETENTION · CONTRIBUTOR INSIGHTS · COST
−77%
LOGS BILL
$61K → $14K / MONTH
50
TASSEL · 2025 · MIGRATION
Off Cloudflare, onto the cloud the rest of the stack already lives on.
A SaaS analytics company had Cloudflare in front of their AWS-hosted application — paying a Cloudflare bill, terminating TLS twice, and operating two CDN configurations. We migrated edge, WAF, and DNS to CloudFront + AWS WAF + Route 53 with zero customer-visible disruption.
CLOUDFLARE → CLOUDFRONT · WAF · EDGE · DNS
−$8.2K/mo
EDGE STACK BILL
NET CDN+WAF+DNS
51
GLYPH · 2025 · MIGRATION
WAF in front of the application, instead of in front of the firewall.
A government services provider had Imperva Cloud WAF on a separate edge-stack contract with a five-figure monthly bill. AWS WAF on CloudFront, with managed rule groups, did most of what Imperva did at a quarter of the cost. We migrated the WAF and added Shield Advanced for the DDoS protection.
IMPERVA → AWS WAF · WAF · DDoS · SHIELD
−74%
WAF + DDoS BILL
NET OF SHIELD ADVANCED
52
YACHT · 2025 · RELIABILITY
Targets that fail the right way, not the wrong way.
A streaming media company had an ALB target group where unhealthy instances took 90 seconds to drain and healthy-but-slow instances kept serving traffic. The combination caused a measurable customer-impact event during deploys. We tuned the health-check parameters carefully.
ALB · TARGET GROUP · HEALTH CHECKS · TUNING
−92%
DEPLOY-WINDOW INCIDENTS
YEAR-OVER-YEAR
53
CIPHER · 2025 · SECURITY
No public ingress, no VPN, no internet egress from prod.
A fintech with strict regulator expectations had a VPC topology that "worked but felt wrong" — a public ALB, a third-party WAF appliance, an open egress NAT. Their CISO wanted defensible boundaries. We redesigned the production VPC for zero-trust: no public ingress, no internet egress, every service-to-service call authenticated.
VPC · PRIVATELINK · WAF · ZERO TRUST
0
PUBLIC INGRESS
EVERYTHING VIA CLOUDFRONT
54
CASCADE · 2025 · COST
Per-tenant cost, calculated from the CUR.
A B2B SaaS company knew their gross margin but not their per-customer margin. The largest customer was suspected to be unprofitable, the smallest was suspected to be highly profitable, and nobody could prove either. We built a cost-allocation pipeline that produces per-tenant cost reports nightly.
MULTI-TENANT · CUR · CHARGEBACK · ATHENA
100%
TENANT MARGIN VISIBILITY
WAS GROSS-ONLY
55
GLIMMER · 2025 · MIGRATION
Off VMware, with an Outposts cushion for the workloads that needed it.
A broadcast media company had 60 VMs on VMware running everything from front-office IT to broadcast control systems. Most could move to EC2 outright; a handful needed sub-millisecond latency to studio hardware. We rehosted the bulk to EC2 and ran the broadcast-adjacent workloads on AWS Outposts in their facility.
VMWARE · EC2 · OUTPOSTS · MGN
60
VMS MIGRATED
ZERO OUTAGES
56
SPARTAN · 2025 · COST
Aurora I/O-Optimized for the workload, Standard for the rest.
A trading systems company ran a heavy Aurora Postgres cluster — high IOPS, predictable I/O cost. The team had stayed on Aurora Standard because they didn’t know what Aurora I/O-Optimized was. We modelled the workloads and migrated the right clusters to I/O-Optimized.
AURORA · IO-OPTIMIZED · STORAGE · COST
−27%
AURORA COST
ON MIGRATED CLUSTERS
57
MAGMA · 2025 · MIGRATION
HashiCorp Vault, retired in favour of the AWS-native equivalent.
A real-estate tech company ran self-hosted HashiCorp Vault as the secrets backbone for AWS-hosted workloads. Vault worked well but operating it had become a steady tax. We migrated secrets to AWS Secrets Manager and SSM Parameter Store, on a path that retired Vault entirely.
VAULT · SECRETS MANAGER · MIGRATION · COST
GONE
VAULT OPERATIONAL TAX
~6h/wk RECLAIMED
58
VORTEX · 2025 · RELIABILITY
Events you can replay, even from a Tuesday last quarter.
An e-commerce platform had a Lambda consumer behind EventBridge that had silently been throwing on a malformed event for six hours during a deploy. The downstream order-update emails for that window had never been sent. We turned on EventBridge Archive and Replay so the next such incident would be a recoverable one.
EVENTBRIDGE · EVENT REPLAY · ARCHIVE · INCIDENT RECOVERY
90d
EVENT REPLAY WINDOW
ARCHIVED
59
KRAIT · 2025 · PLATFORM
Internal docs that the new engineer can actually find.
An engineering org had documentation across Notion, Backstage, GitHub READMEs, and a dozen old Confluence pages nobody had migrated. Searching for an answer meant guessing where it might live. We deployed Amazon Kendra against all four sources with a unified search interface.
SEARCH · OPENSEARCH · INTERNAL DOCS · KENDRA
−68%
TIME-TO-FIND-ANSWER
NEW-HIRE SURVEY
60
POLARIS · 2025 · PLATFORM
GitOps for infrastructure, not just for Kubernetes manifests.
A climate-tech company had GitOps for their EKS workloads via ArgoCD but managed their AWS infrastructure through a mix of Terraform, the AWS Console, and one rogue Pulumi project. We unified everything under GitOps with Crossplane — including the AWS infrastructure.
GITOPS · ARGOCD · CROSSPLANE · EKS
0
INFRA OUTSIDE GITOPS
WAS ~40%
61
HALYARD · 2025 · MIGRATION
Forty terabytes of Spark, off GCP in nine weeks.
A marketing analytics company ran a 40TB nightly Spark pipeline on GCP Dataproc with BigQuery storage. Their largest customer’s preferred-cloud clause triggered a forced migration. We rebuilt the pipeline on EMR + Redshift Spectrum + S3 without rewriting a single transformation.
GCP → AWS · DATAPROC → EMR · BIGQUERY → REDSHIFT · SPARK
9w
TIMELINE
KICKOFF → CUTOVER
62
BOREAL · 2025 · LANDING ZONE
One network, many accounts — without VPC peering hairballs.
A robotics infrastructure company had 19 accounts, each with its own VPC, and a peering mesh that took 90 minutes to draw on a whiteboard. We collapsed the architecture with shared VPCs via Resource Access Manager and a single Transit Gateway hub.
RAM · VPC SHARING · TRANSIT GATEWAY · NETWORK
24 → 0
PEERING CONNECTIONS
COLLAPSED
63
LUSTRE · 2025 · SECURITY
Encrypted replication that doesn’t need cross-region role gymnastics.
A digital health platform replicated PHI buckets across regions with a cross-region trust dance that nobody fully trusted. Auditors had questions. We rebuilt the encryption on KMS multi-region keys so the same key material exists in both regions, eliminating the trust path.
KMS · MULTI-REGION KEYS · S3 REPLICATION · ENCRYPTION
−100%
CROSS-REGION TRUST PATHS
ELIMINATED
64
HARBOR · 2025 · LANDING ZONE
New AWS account in twelve minutes. No tickets.
A logistics tech company had grown to forty engineering teams. Spinning up a new AWS account took six business days and three tickets across IT, Finance, and Platform. The bottleneck was the Platform team. We replaced the process with a self-service account vending pipeline.
ACCOUNT VENDING · CONTROL TOWER · BACKSTAGE · SCP
12m
TIME-TO-ACCOUNT
FROM 6 BUSINESS DAYS
65
SABLE · 2025 · LANDING ZONE
Configuration drift, caught before audit found it.
An insurance carrier had AWS Config running in 47 accounts but the dashboard hadn’t been opened in months — and the auditors had started flagging the drift Config could have caught. We wired up an aggregator, defined a baseline of 23 conformance rules, and turned on auto-remediation for the safe ones.
AWS CONFIG · AGGREGATOR · AUTO-REMEDIATION · SSM AUTOMATION
−94%
DRIFT FINDINGS (AUDIT)
YEAR-OVER-YEAR
66
CORAL · 2025 · SECURITY
PII in S3, before it lands in the wrong bucket.
An education platform discovered student PII in three S3 buckets that hadn’t been intended to hold it — uncovered by a junior engineer running an ad-hoc Athena query for an unrelated reason. We rolled out Macie across the org and built a DLP pipeline that catches new PII drops before they’re queryable.
MACIE · DLP · S3 · PII
441
BUCKETS SCANNED
ACROSS 12 ACCOUNTS
67
CINDER · 2025 · LANDING ZONE
Account baselines that stay applied.
An insurance carrier with 64 accounts had baselines that worked at provision time and drifted within a quarter. We rebuilt the governance layer with CloudFormation StackSets driven from the management account, with auto-update on baseline changes.
STACKSETS · GOVERNANCE · CONTROL TOWER · AUTOMATION
0
BASELINE DRIFT
AGGREGATED ACROSS ACCOUNTS
68
ARGON · 2025 · SECURITY
Third-party SaaS, pulled inside the VPC boundary.
A B2B finance company sent customer data to four third-party SaaS vendors over the public internet — analytics, observability, error tracking, fraud signals. Their security team had been quietly uncomfortable. We moved every integration where the vendor supported it to PrivateLink, with audit trails per direction.
VENDOR RISK · PRIVATELINK · SAAS · INTEGRATION
−81%
EGRESS TO PUBLIC INTERNET
CUSTOMER DATA FLOWS
69
MOSAIC · 2025 · RELIABILITY
The four-hour Postgres outage that came back as a one-page runbook.
A real-estate platform had a four-hour customer-visible Postgres outage on a Tuesday afternoon — a runaway query, a connection storm, an auto-failover that didn’t complete cleanly. We ran the postmortem, shipped the four highest-impact remediations, and turned the lessons into a one-page operations runbook.
RDS · POSTMORTEM · BLUE/GREEN · PARAMETER GROUPS
0
RECURRENCE
90 DAYS POST-FIX
70
YARROW · 2025 · LANDING ZONE
A region-expansion playbook that runs in four weeks, not four months.
A consumer marketplace expanded from US-East to four additional regions over eighteen months. The first expansion took four months. We worked on the second and wrote a playbook; the third and fourth took four weeks each, with the same engineering team.
MULTI-REGION · TERRAFORM · PLAYBOOK · GLOBAL ACCELERATOR
4w
TIME PER REGION
WAS 4 MONTHS
71
WEXLEY · 2025 · SECURITY
WAF rules that drop the right traffic, not the rest.
A B2C subscription business had AWS WAF in front of CloudFront with three managed rule groups and 14% of legitimate signup traffic getting falsely blocked. We rebuilt the WAF configuration with custom rules, AWS Bot Control, and a careful tuning loop using sampled requests.
AWS WAF · RULES · BOT CONTROL · CLOUDFRONT
−93%
FALSE POSITIVES
WAS 14% OF SIGNUPS
READ
72
TARN · 2025 · LANDING ZONE
AWS Marketplace, with procurement in the loop again.
A defence systems integrator had 47 active AWS Marketplace subscriptions, most of them purchased by engineers with the corporate card permissions Marketplace grants. Procurement found out about each one only on the invoice. We built a governance layer with private offers, subscription approvals, and an org-level marketplace block.
MARKETPLACE · GOVERNANCE · PROCUREMENT · SCP
0
SHADOW PROCUREMENT
PROCUREMENT-VISIBLE BY DEFAULT
READ
73
KARST · 2025 · LANDING ZONE
Sharing without IAM acrobatics.
A genomics research org shared a Transit Gateway, a Glue Catalog, and three private hosted zones across nineteen accounts. The setup worked but every share required custom cross-account IAM and the platform team owned every change. We replaced the bespoke sharing with AWS Resource Access Manager.
RAM · CROSS-TEAM SHARING · TRANSIT GATEWAY · GLUE CATALOG
−93%
CUSTOM CROSS-ACCT IAM POLICIES
REPLACED BY RAM
READ
74
AURUM · 2025 · MIGRATION
Azure to AWS, twenty-two services, no rewrite.
A healthcare ISV with twenty-two production services on Azure (AKS, Azure SQL, App Configuration, Service Bus) needed to leave Azure inside seven months — driven by their largest customer’s "AWS-only" mandate. We migrated everything to equivalent AWS services without rewriting application code.
AZURE → AWS · AKS → EKS · AZURE SQL → AURORA · HIPAA
6mo
TIMELINE
7-MONTH HARD DEADLINE
READ
75
ORBIT · 2025 · SECURITY
PCI DSS on a multi-tenant platform, without forking the cluster.
A B2B payments platform needed PCI DSS Level 1 for their largest customer — but their architecture team had been told it would require a separate cluster and six months of work. We delivered it in eleven weeks on the existing EKS estate.
EKS · PCI DSS · MULTI-TENANT · KUBERNETES
−72%
PCI SCOPE
CARDHOLDER DATA ENVIRONMENT
READ
76
RIVET · 2025 · PLATFORM
An internal API gateway that engineers prefer over service URLs.
An enterprise SaaS company had 40 internal services with a wall of internal load balancer URLs that nobody could remember. We built an internal API gateway with custom domain names, IAM auth, and a self-service registration flow.
API GATEWAY · CUSTOM DOMAINS · IAM AUTH · INTERNAL APIs
40
INTERNAL URLS REMEMBERED
NICE NAMES, NOT IPs
READ
77
CRESTA · 2025 · LANDING ZONE
One observability plane across thirty accounts.
A logistics company had thirty production accounts and an on-call rotation that toggled between four separate Grafana instances depending on the alert. We unified observability with CloudWatch cross-account sharing and a single Grafana fronted by IAM Identity Center.
CLOUDWATCH · CROSS-ACCOUNT · GRAFANA · OBSERVABILITY
−100%
ON-CALL TOOL SWITCHES
PER INCIDENT
READ
78
STERLING · 2024 · RELIABILITY
Cross-region DR that survives a production hour, not just a drill.
A publicly-traded SaaS company had a documented DR plan and had never actually tested it under real traffic. Their auditors had stopped accepting "we have a runbook" as evidence. We rebuilt the DR posture around Aurora Global Database and ran four real cutovers with paying customers on the line.
DR · AURORA GLOBAL · PROD-READINESS · MULTI-REGION
4m
RTO
CUSTOMER-MEASURED
READ
79
SOLSTICE · 2024 · PLATFORM
Feature flags, on AWS, at a tenth of the price.
An edtech platform had been on LaunchDarkly for three years, paying $9,400/month for a usage tier that mostly covered seats they didn’t need. The flag features they actually used were a subset. We migrated to AWS AppConfig with a thin SDK and got the bill to $920/month without losing capability.
APPCONFIG · FEATURE FLAGS · LAUNCHDARKLY · COST
−90%
FF PLATFORM COST
$9.4K → $0.9K / MONTH
READ
80
BRISTLE · 2024 · PLATFORM
CI throughput that scales with the team, not against it.
A fintech with 60 engineers had GitHub Actions throughput problems. Peak hour saw 40-minute queue times before a job even started running. The bill on GitHub-hosted runners was $14k/month. We migrated to self-hosted runners on EC2 Spot with intelligent autoscaling.
GITHUB ACTIONS · SELF-HOSTED · PARALLELISM · EC2 SPOT
< 30s
QUEUE TIME (PEAK)
WAS 40m
READ
81
VELVET · 2024 · MIGRATION
Auth0 to Cognito, with social logins and password resets intact.
A consumer marketplace was paying Auth0 $11,200/month for an authentication service that did mostly what Cognito does for less than $400/month. We migrated 480k active users without forcing a single password reset.
AUTH0 → COGNITO · IDENTITY · SOCIAL LOGIN · COST
−96%
AUTH BILL
$11.2K → $0.4K / MONTH
READ
82
QUARTZ · 2024 · COST
Redshift, sized for what the team actually queries.
An analytics platform had been on Redshift DC2 nodes since 2019. The cluster was provisioned for the peak Monday-morning load and ran at 11% utilisation the rest of the week. We migrated to RA3 with managed storage, added Reserved capacity at the right tier, and let Concurrency Scaling absorb the peaks.
REDSHIFT · RA3 · RESERVED · CONCURRENCY SCALING
−55%
CLUSTER COST
$34K → $15K / MONTH
READ
83
HOLLOW · 2024 · MIGRATION
Splunk to OpenSearch, without losing the SOC team.
A fintech company had a $1.4M annual Splunk contract up for renewal at a 22% price increase. The security operations team depended on Splunk’s search ergonomics. We migrated to OpenSearch + OpenSearch Ingestion, preserving the SPL → DSL translation work for the queries that mattered.
SPLUNK · OPENSEARCH · INGESTION · COST
−83%
ANNUAL LICENCE
$1.4M → $240K
READ
84
SENTRY · 2024 · COST
Logs we retain for a year, queried for the price of S3.
A B2B SaaS company had been paying Datadog for full-fidelity log indexing across the entire fleet, with 90-day retention. The logs cost $42k/month. We split the path: keep the most recent 14 days in Datadog for incident response, and archive everything to S3 with Athena for long-tail forensics.
LOG ARCHIVE · ATHENA · DATADOG · COST
−74%
LOGS BILL
$42K → $11K / MONTH
READ
85
SANDPIPER · 2024 · COST
CI on Spot, with the wallet to prove it.
An open-source vendor ran GitHub Actions on a fleet of EC2 self-hosted runners — entirely on-demand, sized for peak. Off-peak utilisation was 12%. We rebuilt the runner fleet on Spot with Karpenter-driven scaling, and brought CI compute spend down 82%.
CI/CD · SPOT · GITHUB ACTIONS · AUTO SCALING
−82%
CI COMPUTE BILL
$18K → $3.2K / MONTH
READ
86
PEARL · 2024 · RELIABILITY
Customers in Asia who get the Asia region, automatically.
A B2C mobile app served customers across five continents from a single us-east-1 deployment. Asian customers had been quietly complaining about latency for a year. We added a multi-region active-active deployment with Route 53 latency-based routing.
ROUTE 53 LATENCY · MULTI-REGION · GLOBAL · CDN
−78%
APAC p95 LATENCY
740ms → 160ms
READ
87
MORTAR · 2024 · PLATFORM
Dependency upgrades that just appear, ready to merge.
A B2B SaaS company had 90 repos with dependencies that drifted out of date monotonically. The annual "we need to upgrade everything" project was a known horror. We rolled out Renovate with sensible defaults and let upgrades flow continuously.
RENOVATE · DEPENDENCY UPGRADES · AUTOMATION · SECURITY
< 14d
OUTDATED DEPS (MEDIAN AGE)
WAS 18 MONTHS
READ
88
PARALLEL · 2024 · MIGRATION
Off Heroku, onto a bill that scales.
A B2B marketplace had outgrown Heroku — both the bill ($38k/month for Performance dynos and Heroku Postgres) and the operational ceiling. We moved a six-year-old Rails monolith plus three microservices to ECS Fargate and Aurora Postgres without rewriting the deploy pipeline.
HEROKU → AWS · ECS FARGATE · RAILS · RDS
−66%
INFRASTRUCTURE BILL
$38K → $13K / MONTH
READ
89
MARLOW · 2024 · LANDING ZONE
Sandbox accounts that clean themselves.
An engineering org had 140 sandbox AWS accounts and a $42k/month sandbox bill. Most of the spend was in twelve accounts whose owners had left the company. We built a lifecycle pipeline that watches for ownership, sets budgets, and decommissions silently.
SANDBOX · BUDGETS · LAMBDA · AUTO-CLEANUP
−68%
SANDBOX BILL
$42K → $13K / MONTH
READ
90
ASPEN · 2024 · SECURITY
EU customer data that stays in the EU, demonstrably.
A B2B SaaS expanding into Germany hit a Data Processing Addendum requirement: EU customer data must be stored, processed, and backed up exclusively in EU regions, with cryptographic enforcement and verifiable evidence. We re-architected the data plane for verifiable residency.
GDPR · DATA RESIDENCY · EU-CENTRAL-1 · KMS
100%
DATA-RESIDENCY CONFIDENCE
CRYPTO-ENFORCED
READ
91
FORGE · 2024 · COST
DynamoDB on-demand was cheaper. Until it wasn’t.
A gaming backend ran 140 DynamoDB tables, all on on-demand capacity, because the team had read "start with on-demand" three years ago and never revisited. Half the tables had stable, predictable traffic. The DynamoDB bill was $48k a month. We rebalanced and brought it to $19k.
DYNAMODB · CAPACITY · GSI · COST
−60%
DYNAMO BILL
$48K → $19K / MONTH
READ
92
REVERB · 2024 · MIGRATION
Off Magento, onto something the team can actually maintain.
A speciality e-commerce site ran on Magento 2 for seven years, accumulating 140 third-party modules and a deployment process nobody trusted. We migrated to a headless commerce architecture on AWS with Next.js on the front and commercetools handling catalog + checkout.
MAGENTO · HEADLESS · NEXT.JS · COMMERCETOOLS
−61%
PAGE LOAD (p75)
4.2s → 1.6s
READ
93
SAFFRON · 2024 · LANDING ZONE
A GovCloud footprint, established before the contract started.
A federal subcontractor had been awarded a contract requiring all workload data to reside in GovCloud (US) by Q1. The team had never operated in GovCloud and the AWS account-vetting process was already in flight. We delivered the GovCloud landing zone, the ITAR controls baseline, and a working pilot workload in eight weeks.
GOVCLOUD · FEDRAMP · ITAR · CONTROL TOWER
ON TIME
TIME TO CONTRACT START
Q1 DEADLINE HIT
READ
94
MARQUEE · 2024 · SECURITY
Egress inspection that actually stops the data exfil chains.
A crypto exchange ran egress inspection on a legacy third-party appliance that filtered traffic by IP address but could not recognise TLS-encrypted command-and-control patterns. After a near-miss with a compromised dependency, we deployed AWS Network Firewall with managed rule groups and got coverage that matched the modern threat model.
NETWORK FIREWALL · IDS/IPS · MANAGED RULES · EGRESS
1,143
C2 DOMAINS BLOCKED
IN FIRST 30 DAYS
READ
95
PLUME · 2024 · SECURITY
Threat-hunting that the SOC can actually finish before their shift ends.
An online gaming company’s SOC was investigating GuardDuty findings by hand — pulling CloudTrail, VPC Flow Logs, and DNS data into Athena queries and assembling the picture manually. A medium-severity investigation took two hours. We rolled out Amazon Detective with its prebuilt investigation graphs.
DETECTIVE · GUARDDUTY · THREAT HUNTING · INVESTIGATION
2h → 18m
INVESTIGATION TIME
MEDIAN
READ
96
CREST · 2024 · COST
Batch compute on Spot, with interruption you don’t notice.
A drug discovery company ran molecular dynamics simulations on AWS Batch — 6,000 vCPU-hours a day, all on-demand because earlier Spot attempts had been "too unstable." We rebuilt the Spot Fleet with proper diversification, capacity-optimised allocation, and a tolerant job runner.
EC2 SPOT · SPOT FLEET · BATCH · DIVERSIFICATION
−78%
COMPUTE COST
VS ON-DEMAND
READ
97
HAWK · 2024 · MIGRATION
Image hosting, brought back inside the AWS account.
An online publishing company paid Cloudinary $14,800/month for image hosting and on-the-fly transformations. The application was AWS-hosted; every image request went out of the AWS network and back. We replicated the transformation capability on CloudFront + Lambda@Edge and moved storage to S3.
CLOUDINARY → S3 · CLOUDFRONT · IMAGE OPTIMISATION · COST
−78%
IMAGE HOSTING BILL
$14.8K → $3.2K / MONTH
READ
98
DAYBREAK · 2024 · RELIABILITY
Bad deploys that roll themselves back.
A SaaS commerce platform had a deploy procedure where a human watched dashboards for ten minutes after each deploy to decide if it had gone well. About once a month they missed something subtle and customers noticed. We enabled the ECS deployment circuit breaker with appropriate health alarms.
ECS · DEPLOYMENT CIRCUIT BREAKER · CANARY · AUTO-ROLLBACK
< 2m
BAD-DEPLOY DETECTION
WAS UP TO 10 MIN
READ
99
PEBBLE · 2024 · PLATFORM
Database migrations the team trusts, even at 2am.
A B2B SaaS company ran database migrations through a homegrown shell-script orchestration that occasionally failed in surprising ways. Production migrations had become a five-engineer ceremony. We migrated to Atlas + Flyway with a proper schema-change workflow.
DATABASE MIGRATIONS · FLYWAY · ATLAS · POSTGRES
5 → 1
MIGRATION CEREMONY
ENGINEERS PRESENT
READ
100
COVE · 2024 · PLATFORM
A preview environment per pull request, without bankrupting the team.
An education tech company wanted preview environments for every pull request, but their first attempt had spun up a full-stack copy per PR and blown the budget in two weeks. We rebuilt it as namespace-scoped previews on a single shared cluster, with on-demand databases and aggressive teardown.
EPHEMERAL ENV · EKS · NAMESPACE · PREVIEW URLS
$0.18/hr
COST PER PREVIEW
WAS $4.20/hr
READ
101
MARROW · 2024 · COST
The NAT gateway bill that nobody had been looking at.
An adtech platform’s monthly network charges had quietly grown to $34k — most of it NAT Gateway egress for traffic that should never have been leaving the VPC. We mapped the egress flows, added VPC Endpoints for the AWS-bound traffic, and used PrivateLink for the third-party flows.
NETWORK · NAT GATEWAY · PRIVATELINK · ENDPOINTS
−71%
NETWORK BILL
$34K → $10K / MONTH
READ
102
KINDRED · 2024 · MIGRATION
Forty workloads out of one data centre, before the lease ended.
A logistics tech company had a single colo with 40 production workloads, six months left on the lease, and no extension option. We ran the assessment, built the landing zone, and migrated everything with thirteen days of slack at the end.
DATACENTRE EXIT · MGN · 6R · TRANSIT GATEWAY
40
WORKLOADS MIGRATED
IN 5 MONTHS
READ
103
HEARTH · 2024 · RELIABILITY
Marketing send-storms that don’t blow up the downstream.
A marketing automation product would tip its downstream over once a quarter when a customer triggered a 200,000-recipient campaign. The downstream — third-party email providers and the platform’s own webhook receivers — couldn’t absorb the burst. We introduced SQS-based load levelling with bounded concurrency.
SQS · LAMBDA · LOAD LEVELLING · DLQ
−100%
DOWNSTREAM FAILURES
BURST-INDUCED
READ
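As a rough illustration of the load levelling described in the HEARTH brief, the sketch below drains a burst with a fixed cap on in-flight sends. It is a stand-in under stated assumptions: an in-memory `asyncio.Queue` plays the role of SQS, and `drain`, `send`, and `max_in_flight` are hypothetical names, not the engagement's implementation.

```python
import asyncio


async def drain(queue: asyncio.Queue, send, max_in_flight: int) -> int:
    """Drain the queue with at most `max_in_flight` sends running at once.

    Returns the peak number of concurrent sends that was observed.
    """
    semaphore = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0

    async def worker(message) -> None:
        nonlocal in_flight, peak
        async with semaphore:  # blocks once the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await send(message)
            finally:
                in_flight -= 1

    tasks = []
    while not queue.empty():
        tasks.append(asyncio.create_task(worker(queue.get_nowait())))
    await asyncio.gather(*tasks)
    return peak


async def demo():
    # Hypothetical burst: 200 queued messages, downstream capped at 5 at a time.
    queue: asyncio.Queue = asyncio.Queue()
    for i in range(200):
        queue.put_nowait(i)
    delivered = []

    async def send(message) -> None:
        await asyncio.sleep(0)  # stand-in for a call to the downstream
        delivered.append(message)

    peak = await drain(queue, send, max_in_flight=5)
    return peak, len(delivered)


peak, delivered_count = asyncio.run(demo())
```

The burst is still delivered in full; only the rate at which the downstream sees it changes.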
104
RESIN · 2024 · SECURITY
A lockdown SCP rehearsed before the incident.
A crypto custody firm’s incident response runbook said "in case of confirmed breach, lock down the affected accounts." Nobody had ever tested how. We built an emergency-lockdown SCP, rehearsed it in a tabletop exercise, and added it to the response playbook with a documented activation path.
SCP · INCIDENT RESPONSE · EMERGENCY LOCKDOWN · BREAK-GLASS
< 90s
ACTIVATION TIME
REHEARSED
READ
105
RAMBLE · 2024 · COST
Lambda memory tuned for cost, not for vibe.
A logistics tech company had 320 Lambda functions, all sized at 1024 MB because that was the company default someone had set in 2019. Some functions were under-utilised; some were starved. We ran AWS Lambda Power Tuning across the fleet and right-sized everything.
LAMBDA · POWER TUNING · COMPUTE · COST
−41%
LAMBDA COMPUTE BILL
$12.4K → $7.3K / MONTH
READ
106
NIMBUS · 2024 · MIGRATION
A 14-year-old LDAP server, gracefully retired.
A university network ran a 14-year-old OpenLDAP server as the authentication backbone for 30+ internal applications. It worked. It was also a single point of failure with no maintainer. We migrated identity to Cognito and let LDAP retire with dignity.
LDAP · COGNITO · IDENTITY · AUTHENTICATION
YES
SPOF ELIMINATED
LDAP RETIRED
READ
107
WHARF · 2024 · PLATFORM
Engineering cost ownership, by team, by day.
A B2B platform had an aggregate AWS bill the CFO saw monthly and zero per-team visibility into who spent what. Cost decisions were centralised in a small platform team. We built per-team cost dashboards and gave the teams the data.
COST DASHBOARDS · TEAM-LEVEL · CUR · FINOPS
DAILY
PER-TEAM COST VISIBILITY
WAS QUARTERLY (NEVER)
READ
108
MERIDIAN · 2024 · LANDING ZONE
One account, six years of debt, refactored without an outage.
A profitable e-commerce platform had been running everything in a single AWS account since 2018. Production, staging, dev, marketing experiments, the founder’s side project — all in one IAM blast radius. We split it across an organisation without a single hour of downtime.
ORGANIZATIONS · IAM · REFACTOR · BLAST RADIUS
−87%
BLAST RADIUS
PROD ISOLATED FROM EVERYTHING
READ
109
ASPER · 2024 · SECURITY
Privileged sessions on the record, by default.
A defence contractor had to demonstrate to a government auditor that every privileged shell session on a production host was logged with full transcript. They had been doing this through screen-recording on Workspaces. We replaced it with SSM Session Manager logging and a tamper-evident archive.
SSM SESSION MANAGER · AUDIT · CLOUDTRAIL · KMS
100%
SSH KEYS RETIRED
NO MORE BASTION
READ
110
KNOT · 2024 · RELIABILITY
Outages we hear about from monitoring, not customers.
A B2B logistics platform had had three outages in twelve months where customers reported the problem before monitoring did. The platform had monitoring — Prometheus, CloudWatch alarms, the works — but it monitored the components, not the customer-perceptible behaviour. We added CloudWatch Synthetics canaries running the critical user journeys every minute.
CLOUDWATCH SYNTHETICS · CANARY · SLO · MONITORING
< 2m
MTTD (CUSTOMER JOURNEY)
WAS UP TO 38 MIN
READ
111
ONYX · 2024 · LANDING ZONE
A self-service catalog the central team isn’t a bottleneck for.
A manufacturing company’s cloud team was running infrastructure as a ticket queue. Every new database, every new VPC, every new IAM role was a ticket. We launched AWS Service Catalog with curated portfolios and let the engineering teams self-serve from approved patterns.
SERVICE CATALOG · PORTFOLIOS · CFN · SELF-SERVICE
−81%
TICKET QUEUE
TO CLOUD TEAM
READ
112
SIENNA · 2024 · LANDING ZONE
A "lose an AZ" playbook the team has actually run.
A B2B SaaS platform claimed multi-AZ on every architecture diagram but had never tested losing one. The first chaos drill — drain a production AZ at 14:00 on a Wednesday — surfaced six different failure modes. We worked through each and turned the drill into a quarterly exercise.
AZ FAILOVER · PLAYBOOK · CHAOS · MULTI-AZ
< 90s
AZ-LOSS RECOVERY TIME
CUSTOMER-MEASURED
READ
113
LEDGER · 2024 · SECURITY
Lambda secrets that never landed in an environment variable.
An accounting SaaS company had 340 Lambda functions, most of them with credentials in environment variables. The credentials were Parameter Store references, sure — but the resolved values were sitting in `aws lambda get-function-configuration` output, readable by anyone with `lambda:GetFunctionConfiguration`. We moved every secret out of env vars and into the Lambda extension pattern.
LAMBDA · SECRETS MANAGER · PARAMETERS EXTENSION · ROTATION
0
SECRETS IN ENV VARS
WAS 340 FUNCTIONS
READ
114
FORAGER · 2024 · SECURITY
Transactional email that lands, and that nobody can spoof.
A B2B SaaS company sent transactional email through SES with DKIM half-configured and no DMARC policy. Phishing emails impersonating their domain had hit two enterprise customers. We rolled out full DKIM + DMARC enforcement at p=reject, with a careful warm-up.
SES · DKIM · DMARC · EMAIL AUTHENTICATION
p=reject
DMARC POLICY
FULL ENFORCEMENT
READ
115
BIRCH · 2024 · SECURITY
A Singapore region, opened in eight weeks.
A B2B SaaS company expanding into APAC had two enterprise customers conditioning their renewal on Singapore data residency. The team had never operated outside us-east-1 and us-west-2. We built the ap-southeast-1 deployment, retrofitted the application for region-aware routing, and certified it in time for the renewal.
DATA RESIDENCY · AP-SOUTHEAST-1 · EXPANSION · COMPLIANCE
8w
TIME TO REGION
KICKOFF → PRODUCTION
READ
116
WEND · 2024 · COST
Logs we keep forever, on storage that knows we’re lying.
A B2B platform had three S3 buckets holding CloudTrail logs, VPC Flow Logs, and ALB access logs. All three were on Standard storage, indefinitely. The buckets had grown to 380TB and were costing $9k/month. We applied lifecycle policies sized to the actual access pattern.
S3 LIFECYCLE · GLACIER · CLOUDTRAIL LOGS · COST
−86%
LOG STORAGE BILL
$9K → $1.3K / MONTH
READ
117
TROVE · 2024 · RELIABILITY
Workflows that fail well, not just fail.
A document workflow company had 30 Step Functions workflows where the error-handling pattern was "if anything fails, the workflow fails." Failures landed in CloudWatch and waited for a human. We refactored each workflow with proper retry, catch, and DLQ patterns.
STEP FUNCTIONS · ERROR HANDLING · RETRY · DLQ
−84%
WORKFLOW FAILURE RATE
CUSTOMER-VISIBLE
READ
118
UNDERTOW · 2024 · PLATFORM
Feature flags that retire themselves when they should.
A B2C streaming company had 480 active feature flags in their feature-flag service. About 60% had been at 100% rollout for over a year — flags that should have been removed but were technical debt nobody scheduled. We built a governance layer that flagged stale flags and automated the cleanup.
FEATURE FLAGS · GOVERNANCE · APPCONFIG · CLEANUP
247
STALE FLAGS RETIRED
IN 90 DAYS
READ
119
KESTREL · 2024 · MIGRATION
Oracle to Aurora Postgres, no leftover PL/SQL.
An order management platform had a 2.4TB Oracle database with 380 stored procedures and a five-figure monthly licence. We migrated to Aurora Postgres with the AWS Schema Conversion Tool plus a careful refactor of the PL/SQL, and finished the engagement with zero remaining Oracle dependencies.
ORACLE · AURORA POSTGRES · DMS · SCT
$340K/yr
LICENCE SAVINGS
NET OF AURORA COST
READ
120
FORTE · 2024 · RELIABILITY
Read replicas the application actually uses.
A B2B analytics platform had four Aurora read replicas that the application sent zero traffic to. Every query went to the writer. The writer had been scaled up four times in two years. We introduced RDS Proxy with read-only endpoints and the application started using the replicas the next day.
RDS PROXY · AURORA · READ-REPLICA · CONNECTION POOLING
−63%
WRITER LOAD
CPU + IOPS
READ
121
VESPER · 2024 · LANDING ZONE
Two engineering orgs, one AWS organisation, zero customer surprises.
A health insurance tech company acquired a smaller competitor with eleven production AWS accounts and a different identity provider. We merged the smaller org into the larger one, unified identity through Identity Center, and didn’t cause a single customer-visible incident.
M&A · ACCOUNT MOVE · ORGANIZATIONS · INTEGRATION
0
CUSTOMER-VISIBLE INCIDENTS
DURING 16-WEEK MERGE
READ
122
CUMULUS · 2024 · SECURITY
On-prem to AWS, with two paths that both work.
A healthcare payer ran Direct Connect from their on-prem data centre to AWS as a single physical path. A maintenance window from the carrier had caused a six-hour business outage. We added PrivateLink-over-internet as a hot standby and rehearsed the failover.
PRIVATELINK · DIRECT CONNECT · FAILOVER · HYBRID
2
PATHS AVAILABLE
DX + PRIVATELINK
READ
123
ZEPHYR · 2024 · MIGRATION
Twilio for SMS, SES for email, both for less.
An on-demand services platform was using Twilio for both SMS and transactional email at $24,400/month. Half of that was email — a use case Twilio happens to support but isn’t especially cheap at. We kept Twilio for SMS (where they’re strong) and moved email to SES.
TWILIO → SES · SNS · TRANSACTIONAL MESSAGING · COST
−42%
MESSAGING BILL
$24.4K → $14.2K / MONTH
READ
124
JOLLY · 2024 · MIGRATION
Forty Jenkins pipelines, off two on-prem servers in twelve weeks.
An aerospace contractor ran 40 Jenkins pipelines on two on-prem servers that had become a single point of failure (and a single point of CVE management). We migrated to AWS CodeBuild with a careful pipeline translation, and retired the Jenkins boxes.
JENKINS · CODEBUILD · ON-PREM → AWS · CI
AWS-MANAGED
CI INFRASTRUCTURE
NO ON-PREM SERVERS
READ
125
STOIC · 2024 · PLATFORM
Old services that retire on schedule, not on incident.
An enterprise SaaS company had 23 deprecated services still running because no one had a clean process for decommissioning. Two of them caused incidents in a year. We built a service-deprecation framework with timelines, callers-inventory tracking, and an automated decommission flow.
DEPRECATION · API LIFECYCLE · GOVERNANCE · BACKSTAGE
17
SERVICES DECOMMISSIONED
ON SCHEDULE
READ
126
DELTA · 2024 · SECURITY
Vulnerability management that scales to two hundred accounts.
A healthcare ISV ran 200 production AWS accounts across customer-isolated environments. Their vuln management was a quarterly export from Inspector, manually triaged in a spreadsheet by one person, with a P0-to-patch median of fourteen days. We rebuilt it as a continuous workflow with a sub-72-hour P0 SLA.
INSPECTOR · PATCH MGMT · SSM · EKS SECURITY
14d → 56h
P0 TIME-TO-PATCH
MEDIAN
READ
127
HAWTHORN · 2024 · LANDING ZONE
CloudWatch metrics, streamed centrally, queryable everywhere.
A B2B platform had observability per account — each team kept its own CloudWatch dashboards in its own account. Cross-account incident correlation took an engineer half a day per incident. We turned on CloudWatch Metric Streams across the org and landed everything in a central Prometheus-compatible store.
METRIC STREAMS · CLOUDWATCH · OBSERVABILITY · ORG
< 2m
CROSS-ACCOUNT CORRELATION
WAS HALF A DAY
READ
128
WISP · 2024 · RELIABILITY
Aurora maintenance windows that the team rehearses.
A fintech had been treating Aurora minor-version upgrades and maintenance windows as a "fingers crossed" event — sometimes they were fine, sometimes a workload broke. We instituted quarterly rehearsals against a clone of production using Aurora’s blue/green deployment feature.
AURORA · REBOOT · MAINTENANCE · REHEARSAL
0
UNPLANNED INCIDENTS
POST-INSTITUTING REHEARSALS
READ
129
VELLUM · 2024 · PLATFORM
A service catalog the platform team didn’t have to nag people to update.
An enterprise SaaS company had a Backstage installation with 12 services registered out of an actual 73 in production. Engineers had been asked to register; nobody had. We rebuilt the registration as a build-time emission, so services registered themselves on first push.
BACKSTAGE · SERVICE CATALOG · TECHDOCS · METADATA
73 / 73
CATALOG COVERAGE
ALL SERVICES
READ
130
ALMANAC · 2024 · RELIABILITY
SNS deliveries that don’t silently vanish.
A notification platform fanned messages out through SNS to dozens of downstream subscribers. When a subscriber endpoint failed, SNS would retry briefly and then drop. The platform’s customers were quietly losing notifications. We added DLQs and proper delivery monitoring across every topic.
SNS · DLQ · DELIVERY RETRY · OBSERVABILITY
−100%
SILENT DROPS
DLQ-CAUGHT NOW
READ
131
IVY · 2024 · PLATFORM
DORA metrics, computed from systems engineering already uses.
A fintech CTO wanted DORA metrics — deploy frequency, lead time, MTTR, change failure rate — without standing up a separate observability vendor. We built the dashboard from GitHub Actions, CloudWatch, and PagerDuty data the team was already producing.
DORA · METRICS · OBSERVABILITY · ENGINEERING PRODUCTIVITY
4 DORA + 6 CUSTOM
METRICS PUBLISHED
CTO DASHBOARD
READ
132
PLENUM · 2024 · LANDING ZONE
"What do we own and where?", answered by a search box.
A healthcare platform had grown to 40 accounts and nobody could answer simple inventory questions ("how many Lambdas across the org," "which accounts have RDS in eu-west-1") without spending half a day on aggregated CLI scripts. We deployed AWS Resource Explorer with a cross-account aggregator index.
RESOURCE EXPLORER · INVENTORY · TAGGING · DISCOVERY
< 10s
INVENTORY-QUESTION TIME
WAS 2–4 HOURS
READ
133
QUENCH · 2024 · COST
Step Functions Express, where Standard was overkill.
A marketing automation company ran 14 Standard Step Functions workflows for short, high-volume orchestration tasks. They were paying for state-transition cost they didn’t need. We migrated the right workflows to Express and dropped Step Functions cost 62%.
STEP FUNCTIONS · EXPRESS · WORKFLOWS · COST
−62%
STEP FUNCTIONS BILL
$8.4K → $3.2K / MONTH
READ
134
GRAVEL · 2024 · RELIABILITY
SLOs for batch jobs, not just synchronous APIs.
A data analytics company had SLOs for their synchronous API endpoints but nothing equivalent for their 22 batch pipelines. Pipeline freshness, completeness, and latency were operationally important but not measured. We introduced batch SLOs against freshness and completeness, with burn-rate alerting.
SLO · BATCH · PIPELINES · FRESHNESS
22 / 22
BATCH SLO COVERAGE
ALL PIPELINES
READ
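For readers unfamiliar with burn-rate alerting as mentioned in the GRAVEL brief, here is a minimal arithmetic sketch. The function name, the 99% freshness target, and the run counts are illustrative assumptions, not figures from the engagement.

```python
def burn_rate(bad_runs: int, total_runs: int, slo_target: float) -> float:
    """Ratio of the observed failure rate to the failure rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly on schedule;
    values above 1.0 mean it will run out before the SLO window ends.
    """
    if total_runs == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_runs / total_runs) / error_budget


# Hypothetical pipeline: 48 runs in the window, 2 of them delivered stale data,
# measured against an assumed 99% freshness target.
example_rate = burn_rate(bad_runs=2, total_runs=48, slo_target=0.99)
```

An alert would typically fire when this ratio stays above a chosen threshold over a lookback window; the threshold itself is a policy decision.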
135
LOON · 2024 · MIGRATION
Eight years of Confluence, ported to Notion without breaking links.
A media company had eight years of organisational knowledge in self-hosted Confluence, with a search experience the team had given up on. We migrated 14,400 pages to Notion with link integrity preserved and an S3-backed archive of the original Confluence export.
CONFLUENCE · NOTION · KNOWLEDGE BASE · MIGRATION
14,400
PAGES MIGRATED
WITH HIERARCHY
READ
136
TIMBER · 2024 · RELIABILITY
Chaos engineering that the on-call team actually wanted.
A streaming platform had monthly post-incident reviews that were starting to repeat themselves. The same three failure modes kept resurfacing. We introduced a chaos engineering practice that the on-call team welcomed — because the experiments were aimed at the things they were already worried about, not arbitrary fault injection.
CHAOS ENGINEERING · FIS · GAME DAYS · SLO
3 → 0
RECURRING FAILURE MODES
IN 90 DAYS
READ
137
QUILL · 2024 · LANDING ZONE
A private CA hierarchy that engineers can actually use.
An industrial IoT company had a homegrown CA running on a single EC2 instance, with a 4096-bit private key on an EBS volume nobody had rotated in three years. Every new device type required a manual signing ceremony. We replaced it with an ACM Private CA hierarchy and a self-service signing API.
ACM PRIVATE CA · mTLS · CERTIFICATES · IOT
0
CA INCIDENTS
180 DAYS POST-CUTOVER
READ
138
NORTHWIND · 2024 · MIGRATION
Twenty-eight workloads off on-prem, in fourteen weeks.
A logistics SaaS with a single data centre lease ending in five months. Twenty-eight production workloads. Half of them critical, half of them undocumented. We assessed every one with the 6R framework and shipped the migration in four phased waves.
MIGRATION · 6R · VMWARE · DMS
28
WORKLOADS
PRODUCTION, MIGRATED
READ
139
ZENITH · 2024 · RELIABILITY
SLOs that survive contact with quarterly planning.
A B2B logistics platform had monitoring, dashboards, and a "99.9% uptime" promise on their marketing site. They had no SLOs, no error budgets, and no way to make engineering trade-offs against reliability. We rolled out an SLO framework that survived its first quarterly planning cycle.
SLO · SLI · ERROR BUDGETS · OBSERVABILITY
17 → 28
SERVICES WITH SLOs
CRITICAL PATH COVERED
READ
140
PIVOT · 2024 · LANDING ZONE
Cost allocation that finance trusts because tags actually exist.
A retail e-commerce company had been promising the finance team a per-team cost report for two years. The blocker was always the same: tag coverage hovered around 60% and what existed wasn’t consistent. We rolled out org-level Tag Policies plus SCP enforcement, with a six-week amnesty for backfill.
TAG POLICIES · COST ALLOCATION · ORGANIZATIONS · SCP
99.6%
TAG COVERAGE
WAS 58%
READ
141
ECHO · 2023 · PLATFORM
Day one to first PR, in under an hour.
A B2B SaaS company had a multi-day developer onboarding: provision a laptop, request AWS access, get added to 14 GitHub repos, install three different CLI tools, set up the dev environment. New hires routinely took a week to ship their first PR. We automated the path from "first day" to "first PR" and cut it to under an hour.
ONBOARDING · AUTOMATION · BACKSTAGE · IAM IDENTITY CENTER
47m
TIME TO FIRST PR
MEDIAN, NEW HIRES
READ
142
VISTA · 2023 · RELIABILITY
Cascading failures that stop at the boundary.
A travel booking platform had an architecture where any third-party API slowdown cascaded into a full-platform incident. Hotel-search outages caused car-rental outages caused payment outages. We rolled out circuit breakers, bulkheads, and timeouts across the boundary calls.
CIRCUIT BREAKER · RESILIENCE4J · BULKHEAD · TIMEOUTS
0
CASCADING INCIDENTS
120 DAYS POST-ROLLOUT
READ
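The circuit-breaker idea can be illustrated with a minimal sketch. This is not the Resilience4j configuration used on the engagement; the class, its thresholds, and the injectable clock are all hypothetical, written in Python only to show the state transitions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: fail fast instead of waiting on a slow dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each third-party call in its own breaker is what keeps a slow hotel-search provider from tying up threads that car rental and payments also need.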
143
TRUSS · 2023 · COST
DynamoDB reads cached in front, capacity dialled down behind.
A gaming backend ran a read-heavy DynamoDB table at provisioned 80k RCU. Most of the reads were repeated within a few seconds — game session state polled every 500ms. We put DynamoDB Accelerator (DAX) in front and dropped the RCU floor to 12k.
DYNAMODB · DAX · CACHE · READ-HEAVY
−71%
DYNAMO BILL
$31K → $9K / MONTH
READ
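The capacity arithmetic can be sketched roughly as below. One read capacity unit covers one strongly consistent read per second of an item up to 4 KB, or two eventually consistent reads. The request rate, item size, and cache hit ratio in the usage note are hypothetical inputs chosen for illustration, and the function name is ours.

```python
import math


def required_rcu(reads_per_second: float, item_size_bytes: int,
                 cache_hit_ratio: float = 0.0,
                 strongly_consistent: bool = False) -> int:
    """Provisioned RCU needed for the reads that still reach the table."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    # Each read costs one unit per 4 KB of item size, rounded up.
    units_per_read = float(math.ceil(item_size_bytes / 4096))
    if not strongly_consistent:
        units_per_read /= 2  # eventually consistent reads cost half
    table_reads = reads_per_second * (1.0 - cache_hit_ratio)
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(table_reads * units_per_read, 6))
```

As an example under these assumptions, `required_rcu(160_000, 1024)` comes to 80,000, and the same load with `cache_hit_ratio=0.85` comes to 12,000. Going from 80k to 12k RCU corresponds to the cache absorbing about 85% of reads.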
144
ADDER · 2023 · MIGRATION
Marketing email that survives a million-recipient send.
A D2C retail brand sent monthly newsletters and weekly promotional campaigns to 1.2M subscribers via Mailchimp. The contract had inflated to $84k/year. We migrated to Pinpoint with SES on the send side, preserving the segmentation logic the marketing team relied on.
MAILCHIMP → PINPOINT · MARKETING · TARGETING · COST
−72%
ANNUAL MARKETING-EMAIL COST
$84K → $23K
READ
145
BRIO · 2023 · COST
CloudFront cache hit rate, doubled.
A news publisher served 9 PB/month from CloudFront with a 42% hit rate. The cache wasn’t broken; the cache keys were. Query strings, cookies, and User-Agent variations fragmented the cache so badly that the same article was being cached as 40+ distinct objects.
CLOUDFRONT · CACHE · EGRESS · COST
89%
CACHE HIT RATE
WAS 42%
READ
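Cache-key fragmentation is easier to see in code. The sketch below is a plain-Python illustration of the principle, not the CloudFront cache policy itself: only an allow-list of query parameters contributes to the key, and cookies and headers are left out altogether. The parameter names in `ALLOWED_PARAMS` are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical allow-list: the only parameters that change the response body.
ALLOWED_PARAMS = frozenset({"page", "edition"})


def cache_key(url: str, allowed: frozenset = ALLOWED_PARAMS) -> str:
    """Build a cache key from the path plus allow-listed query parameters only."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in allowed
    )
    query = urlencode(kept)  # sorted, so parameter order no longer matters
    return parts.path + ("?" + query if query else "")
```

With this normalisation, `/news/a?utm_source=x&page=2` and `/news/a?page=2&fbclid=abc` collapse to the same key, `/news/a?page=2`, so tracking parameters stop creating extra copies of the same article.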
146
LORICA · 2023 · LANDING ZONE
Untagged resources that retag themselves.
An adtech company had tag coverage of just 31% and a finance team that had given up asking for cost-by-team reports. We deployed a Config-rule-and-Lambda-remediator combination that auto-tagged resources from CloudTrail data on creation events.
CONFIG · TAG POLICIES · LAMBDA · AUTO-REMEDIATION
31% → 97%
TAG COVERAGE
IN 8 WEEKS
READ
147
JUNCO · 2023 · SECURITY
School logins that just work, on every district’s SSO.
An EdTech company sold to school districts, each with its own identity provider (Google Workspace, Microsoft Entra, ClassLink, a handful of district-specific SAML implementations). Their auth had been a fragile collection of district-specific code paths. We consolidated on Cognito federated identity providers.
COGNITO · FEDERATED IDPs · SAML · GOOGLE WORKSPACE
94%
DISTRICT SSO COVERAGE
WAS 41%
READ
148
CROWN · 2023 · MIGRATION
Email migration with deliverability that actually improved.
A B2B SaaS company had moved off SendGrid twice before — both times deliverability had degraded and they’d moved back. We did it a third time, with proper IP warming and deliverability monitoring, and got deliverability that matched SendGrid by week three and beat it by month two.
SENDGRID → SES · EMAIL · TRANSACTIONAL · DELIVERABILITY
−68%
EMAIL BILL
$11K → $3.5K / MONTH
READ
149
INLET · 2023 · MIGRATION
Heroku Postgres to Aurora, before the contract auto-renewed.
A booking platform’s application had already moved off Heroku Dynos, but Heroku Postgres remained: a Standard 0 plan with auto-renewal three months out, at $14k/month. We migrated the database to Aurora Postgres with DMS continuous replication and finished the cutover in ten weeks.
HEROKU POSTGRES → AURORA · DMS · CUTOVER · COST
−71%
DATABASE BILL
$14K → $4K / MONTH
READ
150
SPIRE · 2023 · MIGRATION
On-prem Active Directory, gracefully retired.
A healthcare IT company had on-prem Active Directory serving 1,400 employees, with two domain controllers that had been "good enough" for a decade. The hardware refresh was due, the cost was rising, and the team had no appetite to renew. We migrated identity to IAM Identity Center + Microsoft Entra ID with a clean SCIM sync.
ACTIVE DIRECTORY · IAM IDENTITY CENTER · SCIM · SSO
2 → 0
DOMAIN CONTROLLERS
RETIRED
READ
151
VELLICHOR · 2023 · SECURITY
Paywalled content that doesn’t leak through scraping.
A digital publishing platform was losing measurable subscription value to scraping. Their CloudFront distribution served paywalled PDFs over public URLs; the subscriber check happened on the page, not on the asset. We retrofitted CloudFront Signed URLs across the asset surface without breaking legitimate flows.
CLOUDFRONT · SIGNED URLS · PAYWALL · S3
−98%
SCRAPED ASSETS
YEAR-OVER-YEAR
READ
152
OPAL · 2023 · MIGRATION
Self-hosted Redis, retired without anyone noticing.
A gaming SaaS company ran self-hosted Redis on EC2 — a six-node cluster with the operational responsibility quietly resting on one engineer. We migrated to ElastiCache for Redis with no application code changes and no observable downtime.
REDIS · ELASTICACHE · CLUSTER · OPERATIONAL
−92%
OPERATIONAL HOURS
WEEKLY, ON-CALL ENGINEER
READ
153
RAPID · 2023 · RELIABILITY
A DNS failover that the customer never sees.
An e-commerce platform had a hot-standby second region but had never tested a failover under real traffic. Their previous attempt at DNS failover had taken 4 minutes to converge and had pointed half the traffic at a stale endpoint. We rebuilt it around Route 53 health checks and tight TTLs.
ROUTE 53 · HEALTH CHECKS · DNS FAILOVER · MULTI-REGION
< 60s
FAILOVER TIME
CUSTOMER-MEASURED
READ
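Why health-check settings and TTLs dominate failover time comes down to a simple upper bound. The sketch below is deliberately simplified: it ignores health-check propagation delay and resolvers that disregard TTLs, and the function name and example values are assumptions for illustration, not the settings used on the engagement.

```python
def worst_case_failover_seconds(check_interval_s: int, failure_threshold: int,
                                record_ttl_s: int) -> int:
    """Rough upper bound on how long clients may keep reaching a failed endpoint."""
    # Time for enough consecutive failed checks to mark the endpoint unhealthy.
    detection = check_interval_s * failure_threshold
    # Resolvers may keep serving the old answer until the record's TTL expires.
    return detection + record_ttl_s
```

With a 10-second check interval, a threshold of three failures, and a 20-second TTL, the bound is 50 seconds. A 30-second interval with a 60-second TTL gives 150 seconds, which shows how quickly default-looking values add up to minutes.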
154
QUIVER · 2023 · PLATFORM
Local dev that runs what production runs.
A logistics platform had a local dev environment that used SQLite where production used Aurora, an in-memory queue where production used SQS, and no Lambda runtime at all. "Worked locally, broke in staging" was a weekly occurrence. We brought local dev to production-parity using LocalStack.
LOCAL DEV · PROD PARITY · DOCKER COMPOSE · LOCALSTACK
−84%
"WORKED LOCALLY" INCIDENTS
YEAR-OVER-YEAR
READ
155
PACER · 2023 · COST
gp2 to gp3, with the IOPS provisioned for what the workload actually needs.
An engineering analytics company had 240 TB of gp2 EBS volumes, most of them oversized because gp2 couples IOPS to capacity. We migrated to gp3 with IOPS sized to observed peak, and dropped the EBS bill 38%.
EBS · gp3 · STORAGE · COST
−38%
EBS BILL
$28K → $17K / MONTH
READ
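The coupling between gp2 size and IOPS can be made concrete with a short sketch. It uses the published gp2 baseline (3 IOPS per GiB, floor of 100, ceiling of 16,000) and the 3,000 IOPS included with every gp3 volume. Burst credits are ignored, and the function names are ours.

```python
import math

GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000
GP3_INCLUDED_IOPS = 3_000


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS a gp2 volume gets purely from its size."""
    return max(GP2_MIN_IOPS, min(GP2_MAX_IOPS, GP2_IOPS_PER_GIB * size_gib))


def gp2_size_for_iops(target_iops: int) -> int:
    """Smallest gp2 volume (GiB) whose baseline reaches target_iops."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("gp2 baseline cannot exceed 16,000 IOPS")
    if target_iops <= GP2_MIN_IOPS:
        return 1  # every gp2 volume gets at least 100 IOPS
    return math.ceil(target_iops / GP2_IOPS_PER_GIB)


def gp3_extra_iops(target_iops: int) -> int:
    """IOPS to provision on gp3 beyond the included baseline."""
    return max(0, target_iops - GP3_INCLUDED_IOPS)
```

A workload that needs 6,000 IOPS but stores only 200 GiB would have to sit on a 2,000 GiB gp2 volume to get that baseline. On gp3 it keeps the 200 GiB and provisions 3,000 extra IOPS.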
156
UMBER · 2023 · RELIABILITY
SQS messages processed once, not three times.
An identity verification platform had an SQS queue feeding a Lambda consumer, with the queue’s visibility timeout set equal to the Lambda’s 30-second timeout. About 3% of messages were being processed two or three times because the visibility timeout could lapse while a slow or retried invocation was still working on the message. We tuned the visibility timeout and idempotency together.
SQS · VISIBILITY TIMEOUT · DLQ · TUNING
−99%
DUPLICATE PROCESSING
IDEMPOTENT NOW
READ
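A minimal sketch of the two halves of that fix. The first function applies AWS's guidance that a queue feeding Lambda should have a visibility timeout of at least six times the function timeout. The second makes processing idempotent by keying on the SQS `messageId`, which stays the same across redeliveries. The in-memory dictionary stands in for a durable store and is an assumption of this example, as are the function names.

```python
from typing import Callable, Dict, Iterable, List

VISIBILITY_MULTIPLIER = 6  # AWS guidance for SQS queues consumed by Lambda


def recommended_visibility_timeout(function_timeout_s: int) -> int:
    """Visibility timeout (seconds) that leaves room for retries and batching."""
    return VISIBILITY_MULTIPLIER * function_timeout_s


def process_batch(records: Iterable[dict], handle: Callable[[str], str],
                  seen: Dict[str, str]) -> List[str]:
    """Run handle() at most once per messageId, reusing earlier results."""
    results = []
    for record in records:
        key = record["messageId"]
        if key not in seen:
            # First delivery: do the work and remember the outcome.
            seen[key] = handle(record["body"])
        # Redelivery: return the stored outcome without repeating side effects.
        results.append(seen[key])
    return results
```

In a real consumer the `seen` store would need to be shared across invocations, for example a table with a conditional write and an expiry. The dictionary here only demonstrates the control flow.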
157
FATHOM · 2023 · RELIABILITY
API throttling that protects the backend without punishing the user.
A mobile fitness app had API endpoints that occasionally saw runaway client behaviour: a sync bug retrying every 100ms, an off-the-shelf scraper, a botnet learning the auth pattern. Each event hammered the backend. We added API Gateway usage plans with tiered throttling.
API GATEWAY · THROTTLING · USAGE PLANS · RATE-LIMITING
−93%
BACKEND PRESSURE EVENTS
YEAR-OVER-YEAR
READ
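API Gateway throttling is based on the token-bucket algorithm, with a steady rate and a burst allowance. The sketch below is a standalone illustration of that algorithm rather than anything API Gateway exposes; the class name, parameters, and injectable clock are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate_per_s` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate_per_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s
        self.capacity = float(burst)
        self._tokens = float(burst)  # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond the burst capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False  # caller should reject, e.g. with HTTP 429
```

Giving each client tier its own bucket means a client stuck in a 100ms retry loop exhausts its own tokens and gets rejected at the edge, while well-behaved clients never notice.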
158
LICHEN · 2023 · PLATFORM
Style and secret violations that fail at commit, not at PR review.
A devtools company had a CI pipeline that caught linting violations, formatting issues, and committed secrets — but only after the developer had pushed and waited five minutes. We rolled out pre-commit hooks with the same checks running locally in under a second.
PRE-COMMIT · LINTING · SECRETS DETECTION · CI
−96%
CI FAILURES (LINT/FORMAT)
CAUGHT LOCALLY
READ
159
VERNAL · 2023 · PLATFORM
The platform team’s SLA, made measurable.
A B2B SaaS platform team had an "internal SLA" with its application-team customers — uptime for shared services like the CI cluster, the artifact registry, the secrets store. The SLA was claimed; it was never measured. We built the measurement and a public-internal dashboard.
MONITORING · SLA · PLATFORM TEAM · INTERNAL METRICS
14
SHARED SERVICES MEASURED
PLATFORM-OWNED
READ
160
NAUTILUS · 2023 · PLATFORM
PR review meta-work, off the engineers’ plate.
An adtech company’s PR review process was 30% mechanical — checking that the right reviewers were assigned, that the CI passed, that the description mentioned a JIRA ticket. The other 70% was the actual review. We automated the mechanical part.
PR REVIEW · AUTOMATION · CODEOWNERS · GITHUB ACTIONS
AUTOMATED
MECHANICAL REVIEW STEPS
14 CHECKS
READ
161
FOUNDRY · 2023 · COST
Three observability stacks, one bill, one source of truth.
A travel platform had Datadog ($28k/mo), New Relic ($14k/mo), and a self-hosted Prometheus/Grafana stack on EKS ($6k/mo of compute). Three teams, three vendors, three on-call experiences. We consolidated to a single stack and saved $36k a month, without losing any monitoring capability.
OBSERVABILITY · CONSOLIDATION · COST · CLOUDWATCH
−75%
TOOLING BILL
$48K → $12K / MONTH
READ
162
SPOOL · 2023 · SECURITY
Network captures from before the incident started.
A cryptocurrency exchange had suffered a near-miss that they could not fully reconstruct because they had no packet-level capture of the attacker traffic. We turned on VPC Traffic Mirroring against the customer-facing API tier with a 72-hour rolling retention, so the next investigation would have ground truth.
VPC TRAFFIC MIRRORING · FORENSICS · PCAP · INCIDENT RESPONSE
72h
PCAP RETENTION
ROLLING, EVERY API REQ
READ
163
KINDLING · 2023 · MIGRATION
Self-hosted GitLab, retired without losing a commit.
A fintech ran self-hosted GitLab on a 24-core EC2 instance with a Postgres backend, paying for the licence plus operating the infrastructure. The team’s opinion had quietly shifted to "let GitHub Enterprise host it." We migrated 312 repos, 40k issues, and 14 CI pipelines, and lost nothing.
GITLAB → GITHUB · GHA · SELF-HOSTED · OPERATIONAL
0
GITLAB OPERATIONAL HOURS
WAS ~8h / WEEK
READ
164
XENITH · 2023 · RELIABILITY
RDS minor versions that update themselves.
A healthcare claims company had 28 RDS instances across the org, running anywhere from 2 to 14 minor versions behind. Audit had flagged it. We rolled out automated minor-version upgrades with a tiered cadence and prerequisite rehearsals.
RDS · MINOR VERSION · AUTO-UPGRADE · MAINTENANCE
28 / 28
INSTANCES ON LATEST MINOR
IN-WINDOW MAINTENANCE
READ
165
DRIFT · 2023 · MIGRATION
MongoDB Atlas to DocumentDB, with the apps unchanged.
A mobile app backend with 14 services talking to MongoDB Atlas had a $9,400/mo cluster bill, plus egress charges that grew as the application moved more workload to AWS. We migrated to Amazon DocumentDB (with MongoDB compatibility), preserving the MongoDB driver code in the apps.
MONGODB ATLAS · DOCUMENTDB · DMS · COST
−56%
DATABASE COST
$9.4K → $4.1K / MONTH
READ
166
APEX · 2023 · LANDING ZONE
Three AWS organisations merged, one finance report.
A private equity acquisition brought together three AWS organisations from three distinct portfolio companies. Each had its own payer account, its own EDP commitment, and its own tagging conventions. We consolidated them under a single payer while preserving each company’s budget identity.
M&A · BILLING · ORGANIZATIONS · COST
$2.8M
EDP COMMITMENT BLENDED
CONSOLIDATED COMMITMENT
READ
167
TALLY · 2023 · LANDING ZONE
A surprise on the AWS bill, every month, that nobody minds.
A software vendor’s finance team got the AWS bill on the third of every month and the engineering team got the angry email on the fourth. We deployed Cost Anomaly Detection at the org level with detectors scoped per team, and the angry email stopped arriving.
COST ANOMALY DETECTION · BUDGETS · CHARGEBACK · FINOPS
−95%
SURPRISE BILL EVENTS
YEAR-OVER-YEAR
READ
168
BISTRO · 2023 · RELIABILITY
A primary origin that can fail, with a secondary already ready.
A restaurant booking platform was meant to serve static fallback content for their app when the dynamic API was down: a "we’re experiencing issues, check back soon" page. Then the API actually went down, and it turned out the fallback wasn’t wired to anything. We added CloudFront Origin Failover.
CLOUDFRONT · ORIGIN FAILOVER · S3 · ALB
GRACEFUL
OUTAGE EXPERIENCE
PAGE INSTEAD OF 503
READ
169
JUNIPER · 2023 · PLATFORM
On-call tooling, migrated without a missed page.
A healthcare SaaS company had Opsgenie for on-call routing with an expiring contract and a vendor-direction shift the team didn’t want to follow. We migrated to PagerDuty over six weeks, with both systems live in shadow during the cutover.
OPSGENIE → PAGERDUTY · ON-CALL · INCIDENT · MIGRATION
0
MISSED PAGES (MIGRATION WINDOW)
DUAL-ROUTING
READ
170
BRAMBLE · 2023 · LANDING ZONE
Audit logs the regulator can’t accidentally edit.
A regional bank had CloudTrail enabled, logs landing in S3, and a regulator who had started asking how the team could prove the logs hadn’t been tampered with. The honest answer was "trust." We rebuilt the audit log archive with S3 Object Lock in compliance mode and a clean chain of custody.
S3 OBJECT LOCK · CLOUDTRAIL · COMPLIANCE · WORM
7y
TAMPER-PROOF RETENTION
REGULATOR REQUIREMENT
READ
171
FALCON · 2023 · COST
Field-level encryption that the audit team likes and finance can afford.
An insurance broker encrypted PII fields client-side, sending them as separately-encrypted blobs to a back-end decryption service. The architecture worked but the operational cost of the decryption service was high. We replaced it with CloudFront Field-Level Encryption.
CLOUDFRONT FIELD-LEVEL ENCRYPTION · PII · CERTIFICATE COST
RETIRED
DECRYPTION SERVICE BILL
$8K/mo SAVED
READ
172
QUARRY · 2023 · RELIABILITY
On-call runbooks that the next person on rotation can actually use.
A B2B SaaS platform had eighteen services, eighteen different on-call rotations, and eighteen different runbook formats — most of them outdated or missing. New rotation members spent their first quarter in survival mode. We standardised the runbook format and the on-call onboarding.
RUNBOOKS · ON-CALL · DOCUMENTATION · INCIDENT RESPONSE
< 2w
NEW ROTATION RAMP-UP
WAS 8–12 WEEKS
READ
173
MERCER · 2023 · LANDING ZONE
BYOL licences tracked back to the agreements that bought them.
An engineering consulting firm ran a mix of BYOL Windows workloads, SQL Server instances, and Oracle databases. They were under-utilising their Microsoft enterprise agreement and over-buying spot Windows on AWS. We rolled out License Manager with managed entitlements and tied every BYOL workload back to a tracked agreement.
LICENSE MANAGER · BYOL · WINDOWS · COST
+58%
BYOL UTILISATION
PRE-PAID ENTITLEMENTS
READ
174
DRIFTWOOD · 2023 · COST
NAT Gateway, replaced by an instance where the math says so.
An internal IT team ran low-throughput VPCs (single-AZ test environments, internal dev clusters) where the NAT Gateway hourly cost dominated the real network usage. We replaced 14 of them with EC2-based NAT instances on t4g.nano with appropriate guardrails.
NAT INSTANCE · NAT GATEWAY · EGRESS · COST
−83%
PER-VPC NETWORK COST
TEST/DEV ENVIRONMENTS
READ
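"Where the math says so" can be sketched as a small comparison. The prices below are illustrative placeholders based on typical us-east-1 on-demand list prices and should be replaced with current figures for your region. Internet data-transfer charges apply to both options and are left out, as is the engineering time a NAT instance needs. All names are hypothetical.

```python
HOURS_PER_MONTH = 730  # common billing approximation


def nat_gateway_monthly(gb_processed: float, hourly: float = 0.045,
                        per_gb: float = 0.045) -> float:
    """NAT Gateway cost: fixed hourly charge plus a per-GB processing charge."""
    return hourly * HOURS_PER_MONTH + per_gb * gb_processed


def nat_instance_monthly(hourly: float = 0.0042) -> float:
    """NAT instance cost: just the instance hours (placeholder t4g.nano price)."""
    return hourly * HOURS_PER_MONTH


def monthly_saving_fraction(gb_processed: float) -> float:
    """Share of the NAT Gateway bill avoided by switching to an instance."""
    gateway = nat_gateway_monthly(gb_processed)
    return (gateway - nat_instance_monthly()) / gateway
```

For a low-traffic VPC the fixed hourly charge is nearly the whole gateway bill, so the saving is large. The comparison deliberately says nothing about availability or throughput, which is why the guardrails matter and why this trade only suits test and dev environments.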
175
KAPPA · 2023 · PLATFORM
A platform that scales past the founding team.
A seed-stage developer tools company with three engineers, shipping to ten beta customers, and a clear "Series A in six months" deadline. They needed an AWS foundation that wouldn’t embarrass them at diligence — without spending the whole runway on it.
PLATFORM · TERRAFORM · EKS · OBSERVABILITY
0
ENGINEERS NEEDED
TO ONBOARD A NEW SERVICE
READ
176
BUOY · 2022 · COST
A petabyte of imagery, moved without paying for egress at internet speeds.
A geospatial archive had 4.8 petabytes of historical imagery in on-premises tape storage that the regulator wanted off-site by year-end. Over their existing 10 Gbps internet link, at the throughput they could actually sustain, the transfer would have taken 41 months. We used AWS Snowmobile (the literal truck-with-a-shipping-container) and finished in eleven weeks.
SNOWMOBILE · DATA TRANSFER · S3 · ARCHIVE
4.8 PB
DATA MOVED
ON A LITERAL TRUCK
READ

READY WHEN YOU ARE

Let's get your AWS bill (and architecture) in order.

The discovery call is free. You walk away with at least one concrete idea — even if we never work together.

Or email directly →
