Zhivko Todorov

CASE 52 · HALYARD · 2025

GCP → AWS · DATAPROC → EMR · BIGQUERY → REDSHIFT · SPARK

Forty terabytes of Spark, off GCP in nine weeks.

A marketing analytics company ran a 40TB nightly Spark pipeline on GCP Dataproc with BigQuery storage. Their largest customer’s preferred-cloud clause triggered a forced migration. We rebuilt the pipeline on EMR + Redshift Spectrum + S3 without rewriting a single transformation.

INDUSTRY

Marketing analytics

DOMAIN

MIGRATION

DELIVERED

2025

STACK

EMR·REDSHIFT·REDSHIFT SPECTRUM·S3·GLUE CATALOG·STEP FUNCTIONS·DMS

RESULTS

What changed, by the numbers.

TIMELINE

9w

KICKOFF → CUTOVER

CODE CHANGES

< 5%

CONFIG-DRIVEN ABSTRACTION

PIPELINE RUNTIME

−12%

EMR SPOT FLEET

STORAGE COST

−34%

S3 + ICEBERG

HOW IT WENT

The migration brief was unusual: the data had to move, the schedules had to keep running, and the SQL had to stay nearly unchanged because the analyst team didn’t have the bandwidth to rewrite it. The goal was a migration that touched only configuration: paths and credentials.
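To make the idea of a configuration-only change concrete, here is a minimal sketch of a path-and-credentials abstraction. It is an illustration under stated assumptions, not the project's actual code: the `StorageProfile` class, the `table_uri` helper, the bucket names, and the secret names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageProfile:
    """Everything that differs between clouds for one deployment target."""
    scheme: str           # URI scheme the Spark job reads from, e.g. "gs" or "s3"
    bucket: str           # hypothetical bucket name
    credentials_ref: str  # name of a secret or role, never the secret itself


# Hypothetical profiles; bucket and credential names are placeholders.
PROFILES = {
    "gcp": StorageProfile("gs", "example-analytics-gcp", "gcp-service-account"),
    "aws": StorageProfile("s3", "example-analytics-aws", "aws-emr-instance-role"),
}


def table_uri(profile_name: str, dataset: str, table: str) -> str:
    """Resolve a logical dataset/table pair to a physical URI for one cloud."""
    profile = PROFILES[profile_name]
    return f"{profile.scheme}://{profile.bucket}/{dataset}/{table}/"


if __name__ == "__main__":
    # The transformation code asks for a logical name; only the profile changes.
    print(table_uri("gcp", "events", "daily_sessions"))
    print(table_uri("aws", "events", "daily_sessions"))
```

With a layer like this, a transformation refers to `("events", "daily_sessions")` and never to a bucket, so switching clouds means selecting a different profile rather than editing job code.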

We staged the migration in three phases. First, parallel infra in AWS (EMR cluster, Redshift, S3 buckets, Glue Catalog seeded from BigQuery schemas). Second, dual-write of the nightly outputs so analysts could compare. Third, cutover after two weeks of identical outputs.
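The dual-write comparison and the cutover gate can be sketched roughly as follows. This is an assumed approach for illustration, not a description of the tooling used on the project: `table_fingerprint` and `ready_for_cutover` are hypothetical helpers, and the default of 14 nights assumes one comparison per nightly run over the two-week window.

```python
import hashlib
from typing import Iterable, Sequence, Tuple


def table_fingerprint(rows: Iterable[Sequence[object]]) -> Tuple[int, str]:
    """Return (row_count, digest) for a set of rows, ignoring row order."""
    count = 0
    combined = 0
    for row in rows:
        # Serialize each row deterministically; repr() keeps 1 and "1" distinct.
        encoded = "\x1f".join(repr(value) for value in row).encode("utf-8")
        digest = hashlib.sha256(encoded).digest()
        # Adding per-row hashes modulo 2**256 is order-independent
        # and still reflects duplicate rows.
        combined = (combined + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, f"{combined:064x}"


def ready_for_cutover(nightly_matches: Sequence[bool], required_nights: int = 14) -> bool:
    """True once the most recent `required_nights` comparisons all matched."""
    recent = nightly_matches[-required_nights:]
    return len(recent) == required_nights and all(recent)


if __name__ == "__main__":
    gcp_output = [(1, "a"), (2, "b")]
    aws_output = [(2, "b"), (1, "a")]  # same rows, different order
    print(table_fingerprint(gcp_output) == table_fingerprint(aws_output))
    print(ready_for_cutover([True] * 14))
```

In practice, floating-point columns would likely need rounding to an agreed precision before hashing, since two engines can produce results that differ in the last digits without being meaningfully different.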

DMS handled the historical BigQuery export. EMR with Spot capacity ran the nightly Spark jobs in 12% less wall-clock time. Redshift Spectrum served the long-tail historical queries against S3 directly. The analyst team noticed nothing.
