WORK · SELECTED CASE STUDIES
A few engagements, in detail.
Names anonymised in public copy where the client asked; numbers are real. Filter by domain or stack on the left, sort however helps, then open any brief for the full architecture and narrative.
- 01
A $400k monthly bill, with the right commitments for once.
A consumer SaaS company had a $400k/month AWS bill and a Savings Plans portfolio that someone had purchased in 2022 and never revisited. Coverage was 41%; the effective savings rate was around 18%. We rebalanced the portfolio, moved a chunk of the commitment to Compute Savings Plans, and moved 35% of compute to Graviton.
37%
- 02
Retired eighty IAM users in three weeks.
An engineering SaaS company had eighty IAM users with long-lived access keys, half of them belonging to people who no longer worked there. We rolled out IAM Identity Center federated to Okta, migrated every human and machine identity, and deleted the IAM users without breaking a single workflow.
80 → 0
- 03
An active-active payments API that runs everywhere, all the time.
A payments processor needed an active-active multi-region architecture for their core authorisation API — not warm-standby, not failover, but real concurrent serving from both regions with sub-second cross-region writes. We rebuilt the data layer on DynamoDB Global Tables and the routing on Route 53 latency records.
< 30s
- 04
Model training on Spot, checkpoint-resumed.
A computer vision team trained models on SageMaker with on-demand p3.16xlarge instances at roughly $24/hour each. A full training run took 96 hours. Their monthly training bill ran $42k. We moved training to managed Spot with checkpoint-resume and brought it under $11k.
−74%
- 05
An infrastructure portal the application teams actually use.
An AI startup’s platform team had been the bottleneck for every new piece of infrastructure — every S3 bucket, every Aurora instance, every SQS queue. We built a self-service portal on Backstage that engineers used to provision infra against Crossplane-managed AWS resources.
−88%
- 06
Bots stopped, humans didn’t notice.
A ticketing platform was losing high-demand event ticket releases to scalper bots. Blocking outright was too aggressive (false positives killed legitimate buyers); CAPTCHA was too rude (drop-off was measurable). AWS WAF’s Challenge action — a silent client-side cryptographic puzzle — let us stop the bots without showing a CAPTCHA to humans.
−94%
- 07
Marketplace payouts, run by Stripe, not by our compliance team.
A multi-vendor marketplace had been handling vendor payouts via a homegrown ACH pipeline with manual KYC review. As the marketplace grew, the compliance overhead grew faster. We migrated to Stripe Connect Custom — Stripe handles KYC, payout rails, and 1099 reporting; we handle marketplace logic.
< 2d
- 08
A bad deploy that only one customer notices.
A SaaS platform with 800 tenants on shared infrastructure had suffered two "every customer affected" outages in twelve months. We adopted a cell-based architecture with shuffle-sharded tenant placement; the next bad deploy affected one cell of 32 tenants, not all 800.
32 / 800
- 09
Developers who can create IAM roles, safely.
A developer tools company had a "no, we won’t give you iam:CreateRole permissions" stance from the security team and a "we file two tickets a week for new IAM roles" stance from the engineering team. We resolved it with IAM permission boundaries: engineers can create roles, but only roles bounded by a security-approved policy.
−92%
- 10
Service-to-service mTLS without a service mesh.
A healthcare platform had been told they needed Istio to do mutual TLS between services. The team had tried Istio twice and walked away both times. We delivered the same security property with VPC Lattice and IAM auth in three weeks.
100%
- 11
Twelve EKS clusters, one application surface.
An AI infrastructure company ran 12 EKS clusters across regions and GPU classes. Deploying anything across the whole estate was 12 separate kubectl applies plus a spreadsheet to track them. We adopted Karmada for cluster federation and turned cross-cluster operations into a single declaration.
−83%
- 12
Build provenance that the supply chain can actually verify.
An open-source vendor shipping binaries and container images to thousands of users had seen two near-misses with dependency-substitution attacks. Customers had started asking, in writing, for SLSA-level provenance. We rebuilt the build pipeline to emit SLSA L3-compliant attestations.
L3
- 13
Guardrails that fail the deploy, not the audit.
A B2B fintech enforced guardrails detectively — Config rules caught misconfigurations after they shipped. Audit kept finding ten-minute-old non-compliance. We moved the most-violated controls to CloudFormation Hooks so they fail the deploy synchronously.
−96%
- 14
Training commitments that match the model release cadence.
A computer vision company shipped a new model version every six weeks, with each release requiring 80+ hours of GPU training. The training was on-demand; the bill was lumpy and expensive. We added SageMaker-aware Savings Plans matched to the release cadence and brought average effective rate to 41% off list.
−41%
- 15
Engineering metrics the CTO can show the board.
The CTO of a public SaaS company needed engineering health metrics for board reporting, beyond DORA. We assembled a comprehensive site-reliability metric set covering golden signals, error-budget burn, deploy quality, on-call load, and platform reliability into a single quarterly report.
14
- 16
Container images you can prove the lineage of.
A devtools company shipping container images to enterprise customers had to answer "what’s in this image" for every customer security questionnaire. The answers were assembled by hand each time. We built a supply-chain pipeline that ships SBOMs and signatures with every image.
100%
- 17
AI model access, scoped to who is allowed to use which.
An AI products company let any engineer call any Bedrock model in any region. Compliance was uncomfortable; finance was alarmed at the spend per individual experiment. We rolled out SCPs that scoped model access per OU, with named-model approvals and region restrictions.
−72%
- 18
Secrets that rotate themselves, even the third-party ones.
A travel tech platform had Secrets Manager but used it as "Parameter Store with a fancier name" — no rotation, no audit. The first audit finding said so. We turned on rotation for every secret in scope, including the third-party API keys that everyone had assumed couldn’t be rotated.
100%
- 19
Guardrails that stop drift before HIPAA notices.
A digital health company expanded from one clinical product to three, ran into HIPAA audit findings on configuration drift, and asked for guardrails strict enough to stop drift but loose enough to ship. We built a tiered Service Control Policy framework, tested it against three months of historical CloudTrail, and rolled it to ninety accounts.
−100%
- 20
Backstage, but only the parts that earned their keep.
An enterprise SaaS company had eighty services, fifty engineers, and three different ways to provision new infrastructure depending on which platform engineer you asked. We built a Backstage IDP with three golden paths — each one ten minutes from idea to running service — and resisted the temptation to add more.
10m
- 21
A data lake that the legal team actually signed off on.
A streaming media company had four analytics teams, three of them writing to the same S3 bucket with overlapping prefixes, and a legal team that had quietly stopped reading the architecture diagrams. We rebuilt the analytics estate as a producer-consumer data lake with fifteen accounts, Lake Formation governance, and one row-level access policy per consumer.
−96%
- 22
A network that doesn’t need a senior engineer to debug.
A cross-border payments company had grown from one region to four with ad-hoc VPC peering between every pair of VPCs. The mesh had 38 connections and one person who understood it. We collapsed it to a Transit Gateway hub-and-spoke per region, with inter-region peering and a route table per traffic class.
38 → 11
- 23
Cut the AWS bill in half, kept the SLA.
A Series-B fintech on a single AWS account, growing 30% MoM, with auditors three weeks out. We rebuilt the landing zone on Control Tower and shipped the SOC 2 Type II remediation work in nine weeks — without a single migration weekend.
−44%
- 24
Self-hosted observability that actually saves money.
A climate analytics company had been on a fully-managed observability vendor at $52k/mo. They had the engineering capacity to operate their own but had assumed it would be more expensive once you priced in the time. We built a self-hosted Grafana + Tempo + Loki stack on EKS and brought the all-in cost (including operational time) under $14k/mo.
−73%
- 25
Four petabytes of S3, half the bill.
A geospatial analytics company had four petabytes of raster imagery in S3, growing 50TB a month. Everything was Standard. The bill was $94k a month and rising. We mapped actual access patterns, ran a lifecycle migration, and dropped storage cost 51% without touching application code.
−51%
- 26
Self-hosted Kafka, retired without losing a partition.
A real-time bidding platform ran 28 self-hosted Kafka brokers across three AZs, with a team that had become reluctant ZooKeeper experts. We migrated to MSK Serverless for the variable workloads and MSK Provisioned for the steady ones, retired the EC2 cluster, and reclaimed the team’s attention.
−92%
- 27
Two million users migrated, no password resets.
A consumer marketplace had two million users on a custom-rolled auth service that the team didn’t want to maintain anymore. We migrated to Cognito User Pools without forcing any user to reset their password — using Cognito’s lazy-migration trigger to verify credentials against the legacy store on first sign-in.
0
- 28
Customer keys, in the customer’s custody.
A B2B SaaS company had been losing enterprise deals over a recurring objection: "we want our data encrypted with keys we control, not keys you control." We added BYOK support via AWS KMS External Key Store and unblocked $3.6M in pipeline.
$3.6M
- 29
Incident playbooks that run themselves while you’re still putting on your laptop.
A fintech’s incident response playbooks were a Notion page the on-call read while diagnosing. Steps like "restart the unhealthy ECS service" or "drain the affected AZ" were manual, with the on-call’s attention split between reading and doing. We automated the high-confidence remediation steps with SSM Automation runbooks.
< 3m
- 30
A real-time pipeline that paid for itself in six weeks.
A healthcare data company was running a near-real-time pipeline on six always-on EC2 instances behind an Application Load Balancer, processing roughly 40 events a second at peak. The bill was $8,200 a month. We rebuilt it serverless and brought it to $640.
−92%
- 31
SOC 2 Type II evidence that gathers itself.
A B2B SaaS company had passed SOC 2 Type I in 2023 and was due for Type II in 2025. The Type I prep had consumed two senior engineers for three months. We automated 78% of the evidence collection so Type II prep took two engineer-weeks total.
78%
- 32
Deploys that the customer doesn’t notice. Not even the canary one.
A consumer mobile app had a deployment process that took the API into degraded mode for 20-30 seconds during each release, three times a day. Players noticed; reviews complained. We replaced it with ALB-weighted canaries and blue/green deploys that don’t touch the served traffic shape.
0s
- 33
CI builds that finish before the engineer remembers they pushed.
A developer infrastructure company had CI pipelines averaging 14 minutes per build, dominated by dependency installation and container layer rebuilds. We added an S3-backed build cache for the dependency steps and ECR layer caching for the container builds. Median build time landed at 92 seconds.
92s
- 34
Key custody that the regulator certifies, not promises.
A financial market infrastructure firm needed FIPS 140-2 Level 3 key custody for signing trade settlement messages. KMS Level 3 hadn’t quite landed for this customer’s region. We deployed a CloudHSM cluster with a custom integration layer and got regulator sign-off in twelve weeks.
L3
- 35
Images optimised at request, not stored in twelve sizes.
A fashion e-commerce site stored each product image in twelve pre-sized variants (thumbnail, mobile, tablet, desktop, retina × 2-3 use cases). The S3 bill was dominated by image storage. We moved to on-the-fly resizing at the edge using Lambda@Edge plus CloudFront, deleting the pre-sized variants.
−82%
- 36
Half a million open connections, no surprises.
A social platform’s chat feature ran on a self-managed WebSocket gateway that fell over once it crossed 80k concurrent connections. The team had been scaling vertically (bigger instances) and praying. We rebuilt on API Gateway WebSocket API with DynamoDB-backed connection state.
520K
- 37
Dev environments that match production, on the engineer’s first day.
A B2B SaaS company onboarded new engineers with a multi-day "set up your dev environment" experience that produced different states on different laptops. We standardised on devcontainers, ran them through GitHub Codespaces for new hires, and got everyone to parity.
< 30m
- 38
Monorepo builds that finish in seconds, not minutes.
A developer tools company had a 40-package TypeScript monorepo with full-rebuild times of 14 minutes on CI and roughly 6 minutes locally. Most package changes touched two or three packages, not 40. We adopted Turborepo with a remote cache on S3 and got incremental builds to the seconds they should be.
< 8s
- 39
Karpenter, Spot, and the EKS bill that fell sixty percent.
An AdTech company ran 14 EKS clusters with Cluster Autoscaler, a fixed set of node groups, and an on-demand-only policy. The compute bill was $180k/month and the autoscaler routinely ran clusters at 38% utilisation. We replaced Cluster Autoscaler with Karpenter, opened the clusters to Spot, and pulled utilisation past 70%.
−60%
- 40
Compute that runs the same workload for thirty percent less.
A consumer publishing platform ran their entire fleet on Amazon Linux 2 with x86 instances, despite roughly two-thirds of the workloads being ARM-compatible. We migrated to Amazon Linux 2023 on Graviton-based instances, one workload class at a time.
−31%
- 41
Malware caught at the volume, not at the customer.
A manufacturing tech company had a customer-uploaded file scanning gap — uploads landed in S3 and were processed by EKS pods without antivirus inspection. After a near-miss with an infected upload, we deployed GuardDuty Malware Protection for both EBS and EKS, with an automated quarantine flow.
100%
- 42
S3 endpoints scoped to the buckets they’re allowed to talk to.
A B2B SaaS company’s S3 Gateway Endpoints were configured with the default open policy — any S3 bucket, any operation. A compromised IAM principal would have had a clean exfiltration path. We tightened the endpoint policies to a per-environment allowlist of buckets and operations.
−98%
- 43
Origin Shield, sized to actually pay for itself.
A software distribution company served 14PB/month from CloudFront with origins in S3 across two regions. Cache hit rate at the edge was good (84%); origin fetches were still expensive because they hit S3 from every edge location independently. We added Origin Shield and brought origin fetches down 73%.
−73%
- 44
Search relevance the team can tune themselves.
A job board was paying Algolia $7,600/month for search that served 18M monthly queries against 800k job listings. The pricing tier was tight; the relevance tuning was opaque. We migrated to OpenSearch Service with a careful relevance reconciliation.
−63%
- 45
Critical-path Lambdas that throttle other workloads, not the customer.
A B2B platform had 80+ Lambda functions sharing the regional concurrency limit. A spike in batch-processing Lambdas had once throttled the customer-facing checkout flow. We added reserved concurrency for the critical path and provisioned concurrency for the latency-sensitive piece.
0
- 46
Indexes that match the queries, not the wishes.
An IoT telemetry platform’s DynamoDB tables had been designed for the queries the team imagined they would need. A year in, the actual access patterns had drifted. Read latency was up, costs were up, and one GSI was scanning 80% of the table on every query. We redesigned the indexes around the real access patterns.
−63%
- 47
A design system the front-end team actually uses.
A B2C marketplace had a Figma design system that didn’t map cleanly to the React component library, and a React component library that didn’t cleanly match what was shipping. We rebuilt around Storybook with Chromatic visual regression tests, making the design system the single source of truth.
0
- 48
Eight regions, one key strategy, no manual rotations.
A biotech research platform encrypted everything — S3, RDS, EBS, Secrets Manager — but had grown organically to 340 KMS keys across eight regions with no consistent rotation policy. Some keys were untagged. Two were orphaned. We rebuilt the key strategy from scratch.
340 → 47
- 49
Observability cost, brought back to earth.
A mobile gaming studio had a $61k/month CloudWatch Logs bill, primarily driven by Lambda execution logs at the most verbose tier, retained for 90 days "just in case." We applied a tiered retention policy, sampled the noisy log streams, and got the bill to $14k without losing investigative power.
−77%
- 50
Off Cloudflare, onto the cloud the rest of the stack already lives on.
A SaaS analytics company had Cloudflare in front of their AWS-hosted application — paying a Cloudflare bill, terminating TLS twice, and operating two CDN configurations. We migrated edge, WAF, and DNS to CloudFront + AWS WAF + Route 53 with zero customer-visible disruption.
−$8.2K/mo
- 51
WAF in front of the application, instead of in front of the firewall.
A government services provider had Imperva Cloud WAF on a separate edge-stack contract with a five-figure monthly bill. AWS WAF on CloudFront, with managed rule groups, did most of what Imperva did at a quarter of the cost. We migrated the WAF and added Shield Advanced for the DDoS protection.
−74%
- 52
Targets that fail the right way, not the wrong way.
A streaming media company had an ALB target group where unhealthy instances took 90 seconds to drain and healthy-but-slow instances kept serving traffic. The combination caused a measurable customer-impact event during deploys. We tuned the health-check parameters carefully.
−92%
- 53
No public ingress, no VPN, no internet egress from prod.
A fintech with strict regulator expectations had a VPC topology that "worked but felt wrong" — a public ALB, a third-party WAF appliance, an open egress NAT. Their CISO wanted defensible boundaries. We redesigned the production VPC for zero-trust: no public ingress, no internet egress, every service-to-service call authenticated.
0
- 54
Per-tenant cost, calculated from the CUR.
A B2B SaaS company knew their gross margin but not their per-customer margin. The largest customer was suspected to be unprofitable, the smallest was suspected to be highly profitable, and nobody could prove either. We built a cost-allocation pipeline that produces per-tenant cost reports nightly.
100%
- 55
Off VMware, with an Outposts cushion for the workloads that needed it.
A broadcast media company had 60 VMs on VMware running everything from front-office IT to broadcast control systems. Most could move to EC2 outright; a handful needed sub-millisecond latency to studio hardware. We rehosted the bulk to EC2 and ran the broadcast-adjacent workloads on AWS Outposts in their facility.
60
- 56
Aurora I/O-Optimized for the workload, Standard for the rest.
A trading systems company ran a heavy Aurora Postgres cluster — high IOPS, predictable I/O cost. The team had stayed on Aurora Standard because they didn’t know what Aurora I/O-Optimized was. We modelled the workloads and migrated the right clusters to I/O-Optimized.
−27%
- 57
HashiCorp Vault, retired in favour of the AWS-native equivalent.
A real-estate tech company ran self-hosted HashiCorp Vault as the secrets backbone for AWS-hosted workloads. Vault worked well but operating it had become a steady tax. We migrated secrets to AWS Secrets Manager and SSM Parameter Store, on a path that retired Vault entirely.
GONE
- 58
Events you can replay, even from a Tuesday last quarter.
An e-commerce platform had a Lambda consumer behind EventBridge that had silently been throwing on a malformed event for six hours during a deploy. The downstream order-update emails for that window had never been sent. We turned on EventBridge Archive and Replay so the next such incident would be a recoverable one.
90d
- 59
Internal docs that the new engineer can actually find.
An engineering org had documentation across Notion, Backstage, GitHub READMEs, and a dozen old Confluence pages nobody had migrated. Searching for an answer meant guessing where it might live. We deployed Amazon Kendra against all four sources with a unified search interface.
−68%
- 60
GitOps for infrastructure, not just for Kubernetes manifests.
A climate-tech company had GitOps for their EKS workloads via ArgoCD but managed their AWS infrastructure through a mix of Terraform, the AWS Console, and one rogue Pulumi project. We unified everything under GitOps with Crossplane — including the AWS infrastructure.
0
- 61
Forty terabytes of Spark, off GCP in nine weeks.
A marketing analytics company ran a 40TB nightly Spark pipeline on GCP Dataproc with BigQuery storage. Their largest customer’s preferred-cloud clause triggered a forced migration. We rebuilt the pipeline on EMR + Redshift Spectrum + S3 without rewriting a single transformation.
9w
- 62
One network, many accounts — without VPC peering hairballs.
A robotics infrastructure company had 19 accounts, each with its own VPC, and a peering mesh that took 90 minutes to draw on a whiteboard. We collapsed the architecture with shared VPCs via Resource Access Manager and a single Transit Gateway hub.
24 → 0
- 63
Encrypted replication that doesn’t need cross-region role gymnastics.
A digital health platform replicated PHI buckets across regions with a cross-region trust dance that nobody fully trusted. Auditors had questions. We rebuilt the encryption on KMS multi-region keys so the same key material exists in both regions, eliminating the trust path.
−100%
- 64
New AWS account in twelve minutes. No tickets.
A logistics tech company had grown to forty engineering teams. Spinning up a new AWS account took six business days and three tickets across IT, Finance, and Platform. The bottleneck was the Platform team. We replaced the process with a self-service account vending pipeline.
12m
- 65
Configuration drift, caught before audit found it.
An insurance carrier had AWS Config running in 47 accounts but the dashboard hadn’t been opened in months — and the auditors had started flagging the drift Config could have caught. We wired up an aggregator, defined a baseline of 23 conformance rules, and turned on auto-remediation for the safe ones.
−94%
- 66
PII in S3, before it lands in the wrong bucket.
An education platform discovered student PII in three S3 buckets that hadn’t been intended to hold it — uncovered by a junior engineer running an ad-hoc Athena query for an unrelated reason. We rolled out Macie across the org and built a DLP pipeline that catches new PII drops before they’re queryable.
441
- 67
Account baselines that stay applied.
An insurance carrier with 64 accounts had baselines that worked at provision time and drifted within a quarter. We rebuilt the governance layer with CloudFormation StackSets driven from the management account, with auto-update on baseline changes.
0
- 68
Third-party SaaS, pulled inside the VPC boundary.
A B2B finance company sent customer data to four third-party SaaS vendors over the public internet — analytics, observability, error tracking, fraud signals. Their security team had been quietly uncomfortable. We moved every integration where the vendor supported it to PrivateLink, with audit trails per direction.
−81%
- 69
The four-hour Postgres outage that came back as a one-page runbook.
A real-estate platform had a four-hour customer-visible Postgres outage on a Tuesday afternoon — a runaway query, a connection storm, an auto-failover that didn’t complete cleanly. We ran the postmortem, shipped the four highest-impact remediations, and turned the lessons into a one-page operations runbook.
0
- 70
A region-expansion playbook that runs in four weeks, not four months.
A consumer marketplace expanded from US-East to four additional regions over eighteen months. The first expansion took four months. We worked on the second and wrote a playbook; the third and fourth took four weeks each, with the same engineering team.
4w
- 71
WAF rules that drop the right traffic, not the rest.
A B2C subscription business had AWS WAF in front of CloudFront with three managed rule groups and 14% of legitimate signup traffic getting falsely blocked. We rebuilt the WAF configuration with custom rules, AWS Bot Control, and a careful tuning loop using sampled requests.
−93%
- 72
AWS Marketplace, with procurement in the loop again.
A defence systems integrator had 47 active AWS Marketplace subscriptions, most of them purchased by engineers with the corporate card permissions Marketplace grants. Procurement found out about each one only on the invoice. We built a governance layer with private offers, subscription approvals, and an org-level marketplace block.
0
- 73
Sharing without IAM acrobatics.
A genomics research org shared a Transit Gateway, a Glue Catalog, and three private hosted zones across nineteen accounts. The setup worked but every share required custom cross-account IAM and the platform team owned every change. We replaced the bespoke sharing with AWS Resource Access Manager.
−93%
- 74
Azure to AWS, twenty-two services, no rewrite.
A healthcare ISV with twenty-two production services on Azure (AKS, Azure SQL, App Configuration, Service Bus) needed to leave Azure inside seven months — driven by their largest customer’s "AWS-only" mandate. We migrated everything to equivalent AWS services without rewriting application code.
6m
- 75
PCI DSS on a multi-tenant platform, without forking the cluster.
A B2B payments platform needed PCI DSS Level 1 for their largest customer — but their architecture team had been told it would require a separate cluster and six months of work. We delivered it in eleven weeks on the existing EKS estate.
−72%
- 76
An internal API gateway that engineers prefer over service URLs.
An enterprise SaaS company had 40 internal services with a wall of internal load balancer URLs that nobody could remember. We built an internal API gateway with custom domain names, IAM auth, and a self-service registration flow.
40
- 77
One observability plane across thirty accounts.
A logistics company had thirty production accounts and an on-call rotation that toggled between four separate Grafana instances depending on the alert. We unified observability with CloudWatch cross-account sharing and a single Grafana fronted by IAM Identity Center.
−100%
- 78
Cross-region DR that survives a production hour, not just a drill.
A publicly-traded SaaS company had a documented DR plan and had never actually tested it under real traffic. Their auditors had stopped accepting "we have a runbook" as evidence. We rebuilt the DR posture around Aurora Global Database and ran four real cutovers with paying customers on the line.
4m
- 79
Feature flags, on AWS, at a tenth of the price.
An edtech platform had been on LaunchDarkly for three years, paying $9,400/mo for a usage tier that mostly covered seats they didn’t need. The flag features they actually used were a subset. We migrated to AWS AppConfig with a thin SDK and got the bill to $920/mo without losing capability.
−90%
- 80
CI throughput that scales with the team, not against it.
A fintech with 60 engineers had GitHub Actions throughput problems. Peak hour saw 40-minute queue times before a job even started running. The bill on GitHub-hosted runners was $14k/month. We migrated to self-hosted runners on EC2 Spot with intelligent autoscaling.
< 30s
- 81
Auth0 to Cognito, with social logins and password resets intact.
A consumer marketplace was paying Auth0 $11,200/month for an authentication service that did mostly what Cognito does for less than $400/month. We migrated 480k active users without forcing a single password reset.
−96%
- 82
Redshift, sized for what the team actually queries.
An analytics platform had been on Redshift DC2 nodes since 2019. The cluster was provisioned for the peak Monday-morning load and ran at 11% utilisation the rest of the week. We migrated to RA3 with managed storage, added Reserved capacity at the right tier, and let Concurrency Scaling absorb the peaks.
−55%
- 83
Splunk to OpenSearch, without losing the SOC team.
A fintech company had a $1.4M annual Splunk contract up for renewal at a 22% price increase. The security operations team depended on Splunk’s search ergonomics. We migrated to OpenSearch + OpenSearch Ingestion, preserving the SPL → DSL translation work for the queries that mattered.
−83%
- 84
Logs we retain for a year, queried for the price of S3.
A B2B SaaS company had been paying Datadog for full-fidelity log indexing across the entire fleet, with 90-day retention. The logs cost $42k/month. We split the path: keep the recent 14 days in Datadog for incident response, archive everything to S3 with Athena for the long-tail forensics.
−74%
- 85
CI on Spot, with the wallet to prove it.
An open-source vendor ran GitHub Actions on a fleet of EC2 self-hosted runners — entirely on-demand, sized for peak. Off-peak utilisation was 12%. We rebuilt the runner fleet on Spot with Karpenter-driven scaling, and brought CI compute spend down 82%.
−82%
- 86
Customers in Asia who get the Asia region, automatically.
A B2C mobile app served customers across five continents from a single US-East-1 deployment. Asian customers had been quietly complaining about latency for a year. We added a multi-region active-active deployment with Route 53 latency-based routing.
−78%
- 87
Dependency upgrades that just appear, ready to merge.
A B2B SaaS company had 90 repos with dependencies that drifted out of date monotonically. The annual "we need to upgrade everything" project was a known horror. We rolled out Renovate with sensible defaults and let upgrades flow continuously.
< 14d
- 88
Off Heroku, onto a bill that scales.
A B2B marketplace had outgrown Heroku — both the bill ($38k/month for Performance dynos and Heroku Postgres) and the operational ceiling. We moved a six-year-old Rails monolith plus three microservices to ECS Fargate and Aurora Postgres without rewriting the deploy pipeline.
−66%
- 89
Sandbox accounts that clean themselves.
An engineering org had 140 sandbox AWS accounts and a $42k/month sandbox bill. Most of the spend was in twelve accounts whose owners had left the company. We built a lifecycle pipeline that watches for ownership, sets budgets, and decommissions silently.
−68%
- 90
EU customer data that stays in the EU, demonstrably.
A B2B SaaS expanding into Germany hit a Data Processing Addendum requirement: EU customer data must be stored, processed, and backed up exclusively in EU regions, with cryptographic enforcement and verifiable evidence. We re-architected the data plane for verifiable residency.
100%
- 91
DynamoDB on-demand was cheaper. Until it wasn’t.
A gaming backend ran 140 DynamoDB tables, all on on-demand capacity, because the team had read "start with on-demand" three years ago and never revisited. Half the tables had stable, predictable traffic. The DynamoDB bill was $48k a month. We rebalanced and brought it to $19k.
−60%
- 92
Off Magento, onto something the team can actually maintain.
A speciality e-commerce site ran on Magento 2 for seven years, accumulating 140 third-party modules and a deployment process nobody trusted. We migrated to a headless commerce architecture on AWS with Next.js on the front and commercetools handling catalog + checkout.
−61%
- 93
A GovCloud footprint, established before the contract started.
A federal subcontractor had been awarded a contract requiring all workload data to reside in GovCloud (US) by Q1. The team had never operated in GovCloud and the AWS account-vetting process was already in flight. We delivered the GovCloud landing zone, the ITAR controls baseline, and a working pilot workload in eight weeks.
ON TIME
- 94
Egress inspection that actually stops the data exfil chains.
A crypto exchange had egress inspection from a legacy third-party appliance that handled traffic-by-IP but did not understand TLS-encrypted command-and-control patterns. After a near-miss with a compromised dependency, we deployed AWS Network Firewall with managed rule groups and got coverage that matched the modern threat model.
1,143
- 95
Threat-hunting that the SOC can actually finish before their shift ends.
An online gaming company’s SOC was investigating GuardDuty findings by hand — pulling CloudTrail, VPC Flow Logs, and DNS data into Athena queries and assembling the picture manually. A medium-severity investigation took two hours. We rolled out Amazon Detective with its prebuilt investigation graphs.
2h → 18m
- 96
Batch compute on Spot, with interruption you don’t notice.
A drug discovery company ran molecular dynamics simulations on AWS Batch — 6,000 vCPU-hours a day, all on-demand because earlier Spot attempts had been "too unstable." We rebuilt the Spot Fleet with proper diversification, capacity-optimised allocation, and a tolerant job runner.
−78%
- 97
Image hosting, brought back inside the AWS account.
An online publishing company paid Cloudinary $14,800/month for image hosting and on-the-fly transformations. The application was AWS-hosted; every image request went out of the AWS network and back. We replicated the transformation capability on CloudFront + Lambda@Edge and moved storage to S3.
−78%
- 98
Bad deploys that roll themselves back.
A SaaS commerce platform had a deploy procedure where a human watched dashboards for ten minutes after each deploy to decide if it had gone well. About once a month they missed something subtle and customers noticed. We enabled the ECS deployment circuit breaker with appropriate health alarms.
< 2m
- 99
Database migrations the team trusts, even at 2am.
A B2B SaaS company ran database migrations through a homegrown shell-script orchestration that occasionally failed in surprising ways. Production migrations had become a five-engineer ceremony. We migrated to Atlas + Flyway with a proper schema-change workflow.
5 → 1
- 100
A preview environment per pull request, without bankrupting the team.
An education tech company wanted preview environments for every pull request, but their first attempt had spun up a full-stack copy per PR and blown the budget in two weeks. We rebuilt it as namespace-scoped previews on a single shared cluster, with on-demand databases and aggressive teardown.
$0.18/hr
- 101
The NAT gateway bill that nobody had been looking at.
An adtech platform’s monthly network charges had quietly grown to $34k — most of it NAT Gateway egress for traffic that should never have been leaving the VPC. We mapped the egress flows, added VPC Endpoints for the AWS-bound traffic, and used PrivateLink for the third-party flows.
−71%
- 102
Forty workloads out of one data centre, before the lease ended.
A logistics tech company had a single colo with 40 production workloads, six months left on the lease, and no extension option. We ran the assessment, built the landing zone, and migrated everything with thirteen days of slack at the end.
40
- 103
Marketing send-storms that don’t blow up the downstream.
A marketing automation product would tip its downstream over once a quarter when a customer triggered a 200,000-recipient campaign. The downstream — third-party email providers and the platform’s own webhook receivers — couldn’t absorb the burst. We introduced SQS-based load levelling with bounded concurrency.
−100%
- 104
A lockdown SCP rehearsed before the incident.
A crypto custody firm’s incident response runbook said "in case of confirmed breach, lock down the affected accounts." Nobody had ever tested how. We built an emergency-lockdown SCP, rehearsed it in a tabletop exercise, and added it to the response playbook with a documented activation path.
< 90s
- 105
Lambda memory tuned for cost, not for vibe.
A logistics tech company had 320 Lambda functions, all sized at 1024 MB because that was the company default someone had set in 2019. Some functions were under-utilised; some were starved. We ran AWS Lambda Power Tuning across the fleet and right-sized everything.
−41%
- 106
A 14-year-old LDAP server, gracefully retired.
A university network ran a 14-year-old OpenLDAP server as the authentication backbone for 30+ internal applications. It worked. It was also a single point of failure with no maintainer. We migrated identity to Cognito and let LDAP retire with dignity.
YES
- 107
Engineering cost ownership, by team, by day.
A B2B platform had an aggregate AWS bill the CFO saw monthly and zero per-team visibility into who spent what. Cost decisions were centralised in a small platform team. We built per-team cost dashboards and gave the teams the data.
DAILY
- 108
One account, six years of debt, refactored without an outage.
A profitable e-commerce platform had been running everything in a single AWS account since 2018. Production, staging, dev, marketing experiments, the founder’s side project — all in one IAM blast radius. We split it across an organisation without a single hour of downtime.
−87%
- 109
Privileged sessions on the record, by default.
A defence contractor had to demonstrate to a government auditor that every privileged shell session on a production host was logged with full transcript. They had been doing this through screen-recording on WorkSpaces. We replaced it with SSM Session Manager logging and a tamper-evident archive.
100%
- 110
Outages we hear about from monitoring, not customers.
A B2B logistics platform had had three outages in twelve months where customers reported the problem before monitoring did. The platform had monitoring — Prometheus, CloudWatch alarms, the works — but it monitored the components, not the customer-perceptible behaviour. We added CloudWatch Synthetics canaries running the critical user journeys every minute.
< 2m
- 111
A self-service catalog the central team isn’t a bottleneck for.
A manufacturing company’s cloud team was running infrastructure as a ticket queue. Every new database, every new VPC, every new IAM role was a ticket. We launched AWS Service Catalog with curated portfolios and let the engineering teams self-serve from approved patterns.
−81%
- 112
A "lose an AZ" playbook the team has actually run.
A B2B SaaS platform claimed multi-AZ on every architecture diagram but had never tested losing one. The first chaos drill — drain a production AZ at 14:00 on a Wednesday — surfaced six different failure modes. We worked through each and turned the drill into a quarterly exercise.
< 90s
- 113
Lambda secrets that never landed in an environment variable.
An accounting SaaS company had 340 Lambda functions, most of them with credentials in environment variables. The credentials were Parameter Store references, sure — but the resolved values were sitting in `aws lambda get-function-configuration` output, readable by anyone with `lambda:GetFunctionConfiguration`. We moved every secret out of env vars and into the Lambda extension pattern.
0
- 114
Transactional email that lands, and that nobody can spoof.
A B2B SaaS company sent transactional email through SES with DKIM half-configured and no DMARC policy. Phishing emails impersonating their domain had hit two enterprise customers. We rolled out full DKIM + DMARC enforcement at p=reject, with a careful warm-up.
p=reject
- 115
A Singapore region, opened in eight weeks.
A B2B SaaS company expanding into APAC had two enterprise customers conditioning their renewal on Singapore data residency. The team had never operated outside us-east-1 and us-west-2. We built the ap-southeast-1 deployment, retrofitted the application for region-aware routing, and certified it in time for the renewal.
8w
- 116
Logs we keep forever, on storage that knows we’re lying.
A B2B platform had three S3 buckets holding CloudTrail logs, VPC Flow Logs, and ALB access logs. All three were on Standard storage, indefinitely. The buckets had grown to 380TB and were costing $9k/month. We applied lifecycle policies sized to the actual access pattern.
−86%
- 117
Workflows that fail well, not just fail.
A document workflow company had 30 Step Functions workflows where the error-handling pattern was "if anything fails, the workflow fails." Failures landed in CloudWatch and waited for a human. We refactored each workflow with proper retry, catch, and DLQ patterns.
−84%
- 118
Feature flags that retire themselves when they should.
A B2C streaming company had 480 active feature flags in their feature-flag service. About 60% had been at 100% rollout for over a year — flags that should have been removed but were technical debt nobody scheduled. We built a governance layer that flagged stale flags and automated the cleanup.
247
- 119
Oracle to Aurora Postgres, no leftover PL/SQL.
An order management platform had a 2.4TB Oracle database with 380 stored procedures and a five-figure monthly licence. We migrated to Aurora Postgres with Schema Conversion Tool plus a careful refactor of the PL/SQL — and finished the engagement with zero remaining Oracle dependencies.
$340K/yr
- 120
Read replicas the application actually uses.
A B2B analytics platform had four Aurora read replicas that the application sent zero traffic to. Every query went to the writer. The writer had been scaled up four times in two years. We introduced RDS Proxy with read-only endpoints and the application started using the replicas the next day.
−63%
- 121
Two engineering orgs, one AWS organisation, zero customer surprises.
A health insurance tech company acquired a smaller competitor with eleven production AWS accounts and a different identity provider. We merged the smaller org into the larger one, unified identity through Identity Center, and didn’t cause a single customer-visible incident.
0
- 122
On-prem to AWS, with two paths that both work.
A healthcare payer ran Direct Connect from their on-prem data centre to AWS as a single physical path. A maintenance window from the carrier had caused a six-hour business outage. We added PrivateLink-over-internet as a hot standby and rehearsed the failover.
2
- 123
Twilio for SMS, SES for email, both for less.
An on-demand services platform was using Twilio for both SMS and transactional email at $24,400/month. Half of that was email, a use case Twilio supports but does not price especially well. We kept Twilio for SMS (where they’re strong) and moved email to SES.
−42%
- 124
Forty Jenkins pipelines, off two on-prem servers in twelve weeks.
An aerospace contractor ran 40 Jenkins pipelines on two on-prem servers that had become a single point of failure (and a single point of CVE management). We migrated to AWS CodeBuild with a careful pipeline translation, and retired the Jenkins boxes.
AWS-MANAGED
- 125
Old services that retire on schedule, not on incident.
An enterprise SaaS company had 23 deprecated services still running because no one had a clean process for decommissioning. Two of them caused incidents in a year. We built a service-deprecation framework with timelines, callers-inventory tracking, and an automated decommission flow.
17
- 126
Vulnerability management that scales to two hundred accounts.
A healthcare ISV ran 200 production AWS accounts across customer-isolated environments. Their vuln management was a quarterly export from Inspector, manually triaged in a spreadsheet by one person, with a P0-to-patch median of fourteen days. We rebuilt it as a continuous workflow with a sub-72-hour P0 SLA.
14d → 56h
- 127
CloudWatch metrics, streamed centrally, queryable everywhere.
A B2B platform had observability per account — each team kept its own CloudWatch dashboards in its own account. Cross-account incident correlation took an engineer half a day per incident. We turned on CloudWatch Metric Streams across the org and landed everything in a central Prometheus-compatible store.
< 2m
- 128
Aurora maintenance windows that the team rehearses.
A fintech had been treating Aurora minor-version upgrades and maintenance windows as a "fingers crossed" event — sometimes they were fine, sometimes a workload broke. We instituted quarterly rehearsals against a clone of production using Aurora’s blue/green deployment feature.
0
- 129
A service catalog the platform team didn’t have to nag people to update.
An enterprise SaaS company had a Backstage installation with 12 services registered out of an actual 73 in production. Engineers had been asked to register; nobody had. We rebuilt the registration as a build-time emission, so services registered themselves on first push.
73 / 73
- 130
SNS deliveries that don’t silently vanish.
A notification platform fanned messages out through SNS to dozens of downstream subscribers. When a subscriber endpoint failed, SNS would retry briefly and then drop. The platform’s customers were quietly losing notifications. We added DLQs and proper delivery monitoring across every topic.
−100%
- 131
DORA metrics, computed from systems engineering already uses.
A fintech CTO wanted DORA metrics — deploy frequency, lead time, MTTR, change failure rate — without standing up a separate observability vendor. We built the dashboard from GitHub Actions, CloudWatch, and PagerDuty data the team was already producing.
4 DORA + 6 CUSTOM
- 132
"What do we own and where?", answered by a search box.
A healthcare platform had grown to 40 accounts and nobody could answer simple inventory questions ("how many Lambdas across the org," "which accounts have RDS in eu-west-1") without spending half a day on aggregated CLI scripts. We deployed AWS Resource Explorer with a cross-account aggregator index.
< 10s
- 133
Step Functions Express, where Standard was overkill.
A marketing automation company ran 14 Standard Step Functions workflows for short, high-volume orchestration tasks. They were paying for state-transition cost they didn’t need. We migrated the right workflows to Express and dropped Step Functions cost 62%.
−62%
- 134
SLOs for batch jobs, not just synchronous APIs.
A data analytics company had SLOs for their synchronous API endpoints but nothing equivalent for their 22 batch pipelines. Pipeline freshness, completeness, and latency were operationally important but not measured. We introduced batch SLOs against freshness and completeness, with burn-rate alerting.
22 / 22
- 135
Eight years of Confluence, ported to Notion without breaking links.
A media company had eight years of organisational knowledge in self-hosted Confluence, with a search experience the team had given up on. We migrated 14,400 pages to Notion with link integrity preserved and an S3-backed archive of the original Confluence export.
14,400
- 136
Chaos engineering that the on-call team actually wanted.
A streaming platform had monthly post-incident reviews that were starting to repeat themselves. The same three failure modes kept resurfacing. We introduced a chaos engineering practice that the on-call team welcomed — because the experiments were aimed at the things they were already worried about, not arbitrary fault injection.
3 → 0
- 137
A private CA hierarchy that engineers can actually use.
An industrial IoT company had a homegrown CA running on a single EC2 instance, with a 4096-bit private key on an EBS volume nobody had rotated in three years. Every new device type required a manual signing ceremony. We replaced it with an ACM Private CA hierarchy and a self-service signing API.
0
- 138
Twenty-eight workloads off on-prem, in fourteen weeks.
A logistics SaaS with a single data centre lease ending in five months. Twenty-eight production workloads. Half of them critical, half of them undocumented. We assessed every one with the 6R framework and shipped the migration in four phased waves.
28
- 139
SLOs that survive contact with quarterly planning.
A B2B logistics platform had monitoring, dashboards, and a "99.9% uptime" promise on their marketing site. They had no SLOs, no error budgets, and no way to make engineering trade-offs against reliability. We rolled out an SLO framework that survived its first quarterly planning cycle.
17 → 28
- 140
Cost allocation that finance trusts because tags actually exist.
A retail e-commerce company had been promising the finance team a per-team cost report for two years. The blocker was always the same: tag coverage hovered around 60% and what existed wasn’t consistent. We rolled out org-level Tag Policies plus SCP enforcement, with a six-week amnesty for backfill.
99.6%
- 141
Day one to first PR, in under an hour.
A B2B SaaS company had a multi-day developer onboarding: provision laptop, request AWS access, get added to 14 GitHub repos, install three different CLI tools, set up the dev environment. New hires routinely took a week to ship their first PR. We automated the path from "first day" to "first PR" and cut it to under an hour.
47m
- 142
Cascading failures that stop at the boundary.
A travel booking platform had an architecture where any third-party API slowdown cascaded into a full-platform incident. Hotel-search outages caused car-rental outages caused payment outages. We rolled out circuit breakers, bulkheads, and timeouts across the boundary calls.
0
- 143
DynamoDB reads cached in front, capacity dialled down behind.
A gaming backend ran a read-heavy DynamoDB table at provisioned 80k RCU. Most of the reads were repeated within a few seconds — game session state polled every 500ms. We put DynamoDB Accelerator (DAX) in front and dropped the RCU floor to 12k.
−71%
- 144
Marketing email that survives a million-recipient send.
A D2C retail brand sent monthly newsletters and weekly promotional campaigns to 1.2M subscribers via Mailchimp. The contract had inflated to $84k/year. We migrated to Pinpoint with SES on the send side, preserving the segmentation logic the marketing team relied on.
−72%
- 145
CloudFront cache hit rate, doubled.
A news publisher served 9 PB/month from CloudFront with a 42% hit rate. The cache wasn’t broken; the cache keys were. Query strings, cookies, and User-Agent variations fragmented the cache so badly that the same article was being cached as 40+ distinct objects.
89%
- 146
Untagged resources that retag themselves.
An adtech company had a 31% tag coverage problem and a finance team that had given up asking for cost-by-team reports. We deployed a Config-rule-and-Lambda-remediator combination that auto-tagged resources from CloudTrail data on creation events.
31% → 97%
- 147
School logins that just work, on every district’s SSO.
An EdTech company sold to school districts, each with their own identity provider (Google Workspace, Microsoft Entra, ClassLink, a handful of district-specific SAML implementations). Their auth had been a fragile collection of district-specific code paths. We consolidated on Cognito federated identity providers.
94%
- 148
Email migration with deliverability that actually improved.
A B2B SaaS company had moved off SendGrid twice before — both times deliverability had degraded and they’d moved back. We did it a third time, with proper IP warming and deliverability monitoring, and got deliverability that matched SendGrid by week three and beat it by month two.
−68%
- 149
Heroku Postgres to Aurora, before the contract auto-renewed.
A booking platform’s application had already moved off Heroku Dynos but Heroku Postgres remained: a Standard-0 plan with auto-renewal three months out, at $14k/month. We migrated the database to Aurora Postgres with DMS continuous replication and finished the cutover in ten weeks.
−71%
- 150
On-prem Active Directory, gracefully retired.
A healthcare IT company had on-prem Active Directory serving 1,400 employees, with two domain controllers that had been "good enough" for a decade. The hardware refresh was due, the cost was rising, and the team had no appetite to renew. We migrated identity to IAM Identity Center + Microsoft Entra ID with a clean SCIM sync.
2 → 0
- 151
Paywalled content that doesn’t leak through scraping.
A digital publishing platform was losing measurable subscription value to scraping. Their CloudFront distribution served paywalled PDFs over public URLs; the subscriber check happened on the page, not on the asset. We retrofitted CloudFront Signed URLs across the asset surface without breaking legitimate flows.
−98%
- 152
Self-hosted Redis, retired without anyone noticing.
A gaming SaaS company ran self-hosted Redis on EC2 — a six-node cluster with the operational responsibility quietly resting on one engineer. We migrated to ElastiCache for Redis with no application code changes and no observable downtime.
−92%
- 153
A DNS failover that the customer never sees.
An e-commerce platform had a hot-standby second region but had never tested a failover under real traffic. Their previous attempt at DNS failover had taken 4 minutes to converge and had pointed half the traffic at a stale endpoint. We rebuilt it around Route 53 health checks and tight TTLs.
< 60s
- 154
Local dev that runs what production runs.
A logistics platform had a local dev environment that used SQLite where production used Aurora, an in-memory queue where production used SQS, and no Lambda runtime at all. "Worked locally, broke in staging" was a weekly occurrence. We brought local dev to production-parity using LocalStack.
−84%
- 155
gp2 to gp3, with the IOPS provisioned for what the workload actually needs.
An engineering analytics company had 240TB of gp2 EBS volumes — most of them oversized because gp2 couples IOPS to capacity. We migrated to gp3 with IOPS sized to observed peak, and dropped the EBS bill 38%.
−38%
- 156
SQS messages processed once, not three times.
An identity verification platform had an SQS queue feeding a Lambda consumer with the visibility timeout set to the Lambda’s 30-second timeout. About 3% of messages were being processed two or three times because the Lambda occasionally ran longer than 30 seconds. We tuned the visibility timeout and idempotency together.
−99%
- 157
API throttling that protects the backend without punishing the user.
A mobile fitness app had API endpoints that occasionally saw runaway client behaviour: a sync bug retrying every 100ms, an off-the-shelf scraper, a botnet learning the auth pattern. Each event hammered the backend. We added API Gateway usage plans with tiered throttling.
−93%
- 158
Style and secret violations that fail at commit, not at PR review.
A devtools company had a CI pipeline that caught linting violations, formatting issues, and committed secrets — but only after the developer had pushed and waited five minutes. We rolled out pre-commit hooks with the same checks running locally in under a second.
−96%
- 159
The platform team’s SLA, made measurable.
A B2B SaaS platform team had an "internal SLA" with its application-team customers — uptime for shared services like the CI cluster, the artifact registry, the secrets store. The SLA was claimed; it was never measured. We built the measurement and a public-internal dashboard.
14
- 160
PR review meta-work, off the engineers’ plate.
An adtech company’s PR review process was 30% mechanical — checking that the right reviewers were assigned, that the CI passed, that the description mentioned a JIRA ticket. The other 70% was the actual review. We automated the mechanical part.
AUTOMATED
- 161
Three observability stacks, one bill, one source of truth.
A travel platform had Datadog ($28k/mo), New Relic ($14k/mo), and a self-hosted Prometheus/Grafana stack on EKS ($6k/mo of compute). Three teams, three vendors, three on-call experiences. We consolidated to a single stack and saved $36k a month, without losing any monitoring capability.
−75%
- 162
Network captures from before the incident started.
A cryptocurrency exchange had had a near-miss that they could not fully reconstruct because they had no packet-level capture of the attacker traffic. We turned on VPC Traffic Mirroring against the customer-facing API tier with a 72-hour rolling retention, so the next investigation would have ground truth.
72h
- 163
Self-hosted GitLab, retired without losing a commit.
A fintech ran self-hosted GitLab on a 24-core EC2 instance with a Postgres backend, paying for the licence plus operating the infrastructure. The team’s opinion had quietly shifted to "let GitHub Enterprise host it." We migrated 312 repos, 40k issues, and 14 CI pipelines, and lost nothing.
0
- 164
RDS minor versions that update themselves.
A healthcare claims company had 28 RDS instances across the org, with minor versions ranging from 2 to 14 versions behind. Audit had flagged it. We rolled out automated minor-version upgrades with a tiered cadence and prerequisite rehearsals.
28 / 28
- 165
MongoDB Atlas to DocumentDB, with the apps unchanged.
A mobile app backend with 14 services talking to MongoDB Atlas had a $9,400/mo cluster bill, plus egress charges as the application moved more workload to AWS. We migrated to Amazon DocumentDB (with MongoDB compatibility), preserving the MongoDB driver code in the apps.
−56%
- 166
Three AWS organisations merged, one finance report.
A private equity acquisition brought together three AWS organisations from three distinct portfolio companies. Each had its own payer account, its own EDP commitment, and its own tagging conventions. We consolidated them under a single payer while preserving each company’s budget identity.
$2.8M
- 167
A surprise on the AWS bill, every month, that nobody minds.
A software vendor’s finance team got the AWS bill on the third of every month and the engineering team got the angry email on the fourth. We deployed Cost Anomaly Detection at the org level with detectors scoped per team, and the angry email stopped arriving.
−95%
- 168
A primary origin that can fail, with a secondary already ready.
A restaurant booking platform served static fallback content for their app when the dynamic API was down — a "we’re experiencing issues, check back soon" page. Until the API actually went down, when it turned out the fallback wasn’t wired to anything. We added CloudFront Origin Failover.
GRACEFUL
- 169
On-call tooling, migrated without a missed page.
A healthcare SaaS company had Opsgenie for on-call routing with an expiring contract and a vendor-direction shift the team didn’t want to follow. We migrated to PagerDuty over six weeks, with both systems live in shadow during the cutover.
0
- 170
Audit logs the regulator can’t accidentally edit.
A regional bank had CloudTrail enabled, logs landing in S3, and a regulator who had started asking how the team could prove the logs hadn’t been tampered with. The honest answer was "trust." We rebuilt the audit log archive with S3 Object Lock in compliance mode and a clean chain of custody.
7y
- 171
Field-level encryption that the audit team likes and finance can afford.
An insurance broker encrypted PII fields client-side, sending them as separately encrypted blobs to a back-end decryption service. The architecture worked but the operational cost of the decryption service was high. We replaced it with CloudFront Field-Level Encryption.
RETIRED
- 172
On-call runbooks that the next person on rotation can actually use.
A B2B SaaS platform had eighteen services, eighteen different on-call rotations, and eighteen different runbook formats — most of them outdated or missing. New rotation members spent their first quarter in survival mode. We standardised the runbook format and the on-call onboarding.
< 2w
- 173
BYOL licences tracked back to the agreements that bought them.
An engineering consulting firm ran a mix of BYOL Windows workloads, SQL Server instances, and Oracle databases. They were under-utilising their Microsoft enterprise agreement and over-buying spot Windows on AWS. We rolled out License Manager with managed entitlements and tied every BYOL workload back to a tracked agreement.
+58%
- 174
NAT Gateway, replaced by an instance where the math says.
An internal IT team ran low-throughput VPCs (single-AZ test environments, internal dev clusters) where the NAT Gateway hourly cost dominated the real network usage. We replaced 14 of them with EC2-based NAT instances on t4g.nano with appropriate guardrails.
−83%
- 175
A platform that scales past the founding team.
A seed-stage developer tools company with three engineers, shipping to ten beta customers, and a clear "Series A in six months" deadline. They needed an AWS foundation that wouldn’t embarrass them at diligence — without spending the whole runway on it.
0
- 176
Petabytes of imagery, moved without paying for egress at internet speeds.
A geospatial archive had 4.8 petabytes of historical imagery in on-premises tape storage that the regulator wanted off-site by year-end. Over their existing 10Gbps internet link, the transfer would have taken 41 months. We used AWS Snowmobile (the literal truck-with-a-shipping-container) and finished in eleven weeks.
4.8 PB
READY WHEN YOU ARE
Let's get your AWS bill (and architecture) in order.
The discovery call is free. You walk away with at least one concrete idea — even if we never work together.