CASE 122 · CREST · 2024
Batch compute on Spot, with interruption you don’t notice.
A drug discovery company ran molecular dynamics simulations on AWS Batch — 6,000 vCPU-hours a day, all on-demand because earlier Spot attempts had been "too unstable." We rebuilt the Spot Fleet with proper diversification, capacity-optimised allocation, and a tolerant job runner.
Drug discovery
COST
2024
RESULTS
What changed, by the numbers.
COMPUTE COST
−78%
JOB FAILURE RATE
< 0.3%
INSTANCE FAMILIES DIVERSIFIED
14
CHECKPOINT INTERVAL
15m
HOW IT WENT
The previous Spot attempt had used a single instance family, the most cost-effective on paper at the time. When that family came under regional Spot capacity pressure, the whole fleet was reclaimed at once and jobs failed in batches. The team concluded that Spot was unreliable. The real weakness was the lack of diversification.
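To make the contrast concrete, here is a minimal sketch of what a diversified Spot compute environment request for AWS Batch could look like. The case study does not name the 14 families, so the list below is purely illustrative, and the function name and parameters are hypothetical. The field names follow the AWS Batch `CreateComputeEnvironment` request shape.

```python
from typing import Sequence

# Illustrative only: the engagement used 14 families, but which ones is not stated.
INSTANCE_FAMILIES = [
    "c5", "c5a", "c5n", "c6i", "c6a",
    "m5", "m5a", "m5n", "m6i", "m6a",
    "r5", "r5a", "r6i", "r6a",
]


def spot_compute_environment(
    name: str,
    subnets: Sequence[str],
    security_group_ids: Sequence[str],
    instance_role: str,
    max_vcpus: int,
) -> dict:
    """Build a request dict for a managed Spot compute environment.

    Hypothetical helper: it only assembles the request, it does not call AWS.
    """
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "state": "ENABLED",
        "computeResources": {
            "type": "SPOT",
            # Prefer Spot pools with the most spare capacity, not the lowest price.
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,  # scale to zero when the queue is empty
            "maxvCpus": max_vcpus,
            # Many families means many capacity pools to draw from.
            "instanceTypes": list(INSTANCE_FAMILIES),
            "subnets": list(subnets),
            "securityGroupIds": list(security_group_ids),
            "instanceRole": instance_role,
        },
    }
```

A request built this way could be passed to `boto3.client("batch").create_compute_environment(**request)`. Listing subnets in several Availability Zones widens the set of capacity pools further.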
We rebuilt the Spot Fleet with capacity-optimised allocation across 14 instance families covering a broad range of vCPU/memory ratios. The Batch job runner gained checkpoints written to S3 every 15 minutes, so an interruption now costs at most 15 minutes of progress per job.
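The checkpointing idea can be sketched as a resume-aware loop. This is not the client's runner: `run_with_checkpoints`, the key layout, and the step-based job model are assumptions made for illustration, and a plain dict stands in for S3 so the example runs without AWS access. In a real runner the two store operations would become S3 object reads and writes.

```python
import json
import time
from typing import Callable, MutableMapping

CHECKPOINT_INTERVAL_S = 15 * 60  # 15 minutes, as in the case study


def run_with_checkpoints(
    job_id: str,
    total_steps: int,
    do_step: Callable[[int], None],
    store: MutableMapping[str, str],  # stand-in for an S3 bucket
    clock: Callable[[], float] = time.monotonic,
    interval_s: float = CHECKPOINT_INTERVAL_S,
) -> int:
    """Run steps 0..total_steps-1, resuming from the last saved checkpoint."""
    key = f"checkpoints/{job_id}.json"  # hypothetical key layout

    # Resume: a retried job picks up where the interrupted attempt last saved.
    saved = store.get(key)
    step = json.loads(saved)["next_step"] if saved else 0

    last_save = clock()
    while step < total_steps:
        do_step(step)
        step += 1
        # Save when the interval has elapsed, and always after the final step.
        if clock() - last_save >= interval_s or step == total_steps:
            store[key] = json.dumps({"next_step": step})
            last_save = clock()
    return step
```

Checkpointing on elapsed time rather than every step keeps write volume low while bounding the rework after an interruption to roughly one interval plus one step.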
Compute cost dropped 78% vs on-demand. Job failure rate including Spot-triggered retries is under 0.3%. The team now uses Spot for everything in Batch except the occasional emergency rerun that runs on-demand for speed.
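One way to get Spot-triggered retries without retrying genuine application failures is an AWS Batch retry strategy that matches on the status reason Batch reports when the host is reclaimed. The sketch below shows that pattern as an assumption about how such retries could be configured; the case study does not say which settings were used, and `spot_retry_strategy` is a hypothetical helper.

```python
def spot_retry_strategy(attempts: int = 5) -> dict:
    """Build a Batch retryStrategy that retries only host-level terminations."""
    # AWS Batch accepts between 1 and 10 attempts per job.
    if not 1 <= attempts <= 10:
        raise ValueError("attempts must be between 1 and 10")
    return {
        "attempts": attempts,
        "evaluateOnExit": [
            # Status reasons starting with "Host EC2" indicate the instance
            # went away (for example a Spot reclaim), so retry the job.
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            # Anything else is treated as a real failure: stop retrying.
            {"onReason": "*", "action": "EXIT"},
        ],
    }
```

The returned dict fits the `retryStrategy` field of a job definition or a `submit_job` call. Combined with checkpointing, each retry resumes from the last saved state instead of starting over.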