Zhivko Todorov
CASE 122 · CREST · 2024

EC2 SPOT · SPOT FLEET · BATCH · DIVERSIFICATION

Batch compute on Spot, with interruption you don’t notice.

A drug discovery company ran molecular dynamics simulations on AWS Batch — 6,000 vCPU-hours a day, all on-demand because earlier Spot attempts had been "too unstable." We rebuilt the Spot Fleet with proper diversification, capacity-optimised allocation, and a tolerant job runner.

INDUSTRY

Drug discovery

DOMAIN

COST

DELIVERED

2024

STACK

AWS BATCH·EC2 SPOT FLEET·CAPACITY-OPTIMIZED ALLOCATION·CHECKPOINTING·CLOUDWATCH METRICS

RESULTS

What changed, by the numbers.

COMPUTE COST

−78%

VS ON-DEMAND

JOB FAILURE RATE

< 0.3%

INCL. SPOT RETRIES

INSTANCE FAMILIES DIVERSIFIED

14

BROAD CAPACITY POOL

CHECKPOINT INTERVAL

15m

TUNED FOR INTERRUPT TOLERANCE

HOW IT WENT

The previous Spot attempt had used a single instance family (the most cost-effective on paper at the time). When that family came under regional Spot pressure, the whole fleet evaporated and jobs failed in batches. The team concluded that Spot was unreliable. The real problem was the lack of diversification, not Spot itself.

We rebuilt the Spot Fleet with capacity-optimised allocation across 14 instance families covering a broad range of vCPU/memory ratios. The Batch job runner gained 15-minute checkpoints written to S3 — interruption no longer cost more than 15 minutes of progress per job.
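The checkpointing idea can be sketched in a few lines of Python. This is a minimal illustration, not the client's job runner: the function name `run_with_checkpoints` and the `do_step`, `save`, `load`, and `clock` callbacks are all hypothetical. In a real job, `save` and `load` would presumably write to and read from S3; here they are left pluggable so the sketch stays self-contained.

```python
import time
from typing import Callable, Optional

CHECKPOINT_INTERVAL_S = 15 * 60  # 15-minute interval from the case study


def run_with_checkpoints(
    total_steps: int,
    do_step: Callable[[int], None],        # hypothetical: runs one unit of simulation work
    save: Callable[[dict], None],          # hypothetical: persists state (e.g. to S3)
    load: Callable[[], Optional[dict]],    # hypothetical: returns last saved state or None
    interval_s: float = CHECKPOINT_INTERVAL_S,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run steps up to total_steps, resuming from the last checkpoint if one exists."""
    state = load()
    step = state["step"] if state else 0   # resume point after an interruption
    last_save = clock()
    while step < total_steps:
        do_step(step)
        step += 1
        # Persist progress once the interval has elapsed, so an interruption
        # loses at most roughly one interval of work.
        if clock() - last_save >= interval_s:
            save({"step": step})
            last_save = clock()
    save({"step": step})  # final checkpoint marks the job as complete
    return step
```

Because the state lives outside the instance, a retried job on a fresh Spot instance calls `load` and continues from the last saved step rather than from zero. The interval is a trade-off: shorter intervals lose less work per interruption but spend more time writing checkpoints.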

Compute cost dropped 78% vs on-demand. Job failure rate including Spot-triggered retries is under 0.3%. The team now uses Spot for everything in Batch except the occasional emergency rerun that runs on-demand for speed.
