CASE 52 · HALYARD · 2025
Forty terabytes of Spark, off GCP in nine weeks.
A marketing analytics company ran a 40TB nightly Spark pipeline on GCP Dataproc with BigQuery storage. Their largest customer’s preferred-cloud clause triggered a forced migration. We rebuilt the pipeline on EMR + Redshift Spectrum + S3 without rewriting a single transformation.
Marketing analytics
MIGRATION
2025
RESULTS
What changed, by the numbers.
TIMELINE
9w
CODE CHANGES
< 5%
PIPELINE RUNTIME
−12%
STORAGE COST
−34%
HOW IT WENT
The migration brief was unusual: the data had to move, the schedules had to keep running, and the SQL had to stay nearly unchanged because the analyst team didn’t have the bandwidth to rewrite it. The goal was a migration that changed only configuration (paths and credentials), not logic.
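To make the "paths, not logic" idea concrete, here is a minimal sketch of the kind of path translation such a migration relies on. It is illustrative only: the bucket names and the `to_s3_uri` helper are hypothetical and do not come from the engagement.

```python
from urllib.parse import urlparse

# Hypothetical bucket mapping; the real bucket names are not part of this case study.
BUCKET_MAP = {"acme-analytics-gcs": "acme-analytics-s3"}


def to_s3_uri(uri: str, bucket_map: dict[str, str] = BUCKET_MAP) -> str:
    """Translate a gs:// URI into the matching s3:// URI, keeping the object key as-is."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs":
        return uri  # not a GCS path (or already migrated): leave untouched
    if parsed.netloc not in bucket_map:
        raise ValueError(f"no S3 bucket mapped for GCS bucket {parsed.netloc!r}")
    return f"s3://{bucket_map[parsed.netloc]}{parsed.path}"
```

With a mapping like this applied to job configuration, a transformation that reads `gs://acme-analytics-gcs/nightly/events.parquet` would read `s3://acme-analytics-s3/nightly/events.parquet` instead, with no change to the transformation itself.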
We staged the migration in three phases. First, parallel infra in AWS (EMR cluster, Redshift, S3 buckets, Glue Catalog seeded from BigQuery schemas). Second, dual-write of the nightly outputs so analysts could compare. Third, cutover after two weeks of identical outputs.
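One way to check that dual-written outputs are identical is an order-independent fingerprint of each table. The sketch below is an assumption about how such a check could look, not the tooling used on this engagement; `table_fingerprint` and `outputs_match` are hypothetical names, and rows are plain Python tuples rather than Spark DataFrames.

```python
import hashlib
from typing import Iterable, Sequence


def table_fingerprint(rows: Iterable[Sequence[object]]) -> tuple[int, str]:
    """Return (row_count, digest) for a table, independent of row order."""
    count = 0
    combined = 0
    for row in rows:
        # repr() of a tuple is a simple, unambiguous encoding for basic values.
        row_bytes = repr(tuple(row)).encode("utf-8")
        row_hash = int.from_bytes(hashlib.sha256(row_bytes).digest(), "big")
        # Modular addition ignores row order but still counts duplicate rows.
        combined = (combined + row_hash) % (1 << 256)
        count += 1
    return count, f"{combined:064x}"


def outputs_match(gcp_rows, aws_rows) -> bool:
    """True when both outputs contain the same multiset of rows."""
    return table_fingerprint(gcp_rows) == table_fingerprint(aws_rows)
```

Because the comparison is exact, floating-point columns would likely need rounding to an agreed precision before hashing; two engines can legitimately differ in the last digits.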
DMS handled the historical BigQuery export. EMR with Spot capacity ran the nightly Spark jobs in 12% less wall-clock time. Redshift Spectrum served the long-tail historical queries directly against S3. The analyst team noticed nothing.