CASE 48 · LUMEN · 2026
Model training on Spot, checkpoint-resumed.
A computer vision team trained models on SageMaker with on-demand p3.16xlarge instances at roughly $24/hour each. A full training run took 96 hours. Their monthly training bill ran $42k. We moved training to managed Spot with checkpoint-resume and brought it under $11k.
Computer vision
COST
2026
RESULTS
What changed, by the numbers.
TRAINING COST: −74%
WALL-CLOCK TIME: +12%
CHECKPOINT FREQ: EVERY 30m
MODEL QUALITY: 0% Δ
HOW IT WENT
The team had assumed Spot was unusable for training because a single interruption mid-run would lose 30 hours of progress. That had been true with their original training code, which kept all state in instance memory.
We refactored the training loop to use PyTorch Lightning’s built-in checkpointing, writing a checkpoint to S3 every 30 minutes. SageMaker’s managed Spot mode handled interruption and resume automatically: after each interruption, training restarted from the most recent checkpoint in S3. Step Functions orchestrated the multi-stage training pipeline.
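A minimal sketch of how the two halves of checkpoint-resume can fit together, not the team's actual code. It assumes the SageMaker Python SDK's managed Spot parameters (`use_spot_instances`, `max_run`, `max_wait`, `checkpoint_s3_uri`, `checkpoint_local_path`) and Lightning's `ModelCheckpoint` callback. The helper names, the `ml.` instance prefix, and the 120-hour wait ceiling are hypothetical choices made for illustration.

```python
import os
from datetime import timedelta
from pathlib import Path
from typing import Optional

# SageMaker's default local checkpoint path; files written here are synced to
# checkpoint_s3_uri and copied back when a Spot job restarts.
CHECKPOINT_DIR = "/opt/ml/checkpoints"
CHECKPOINT_EVERY = timedelta(minutes=30)  # interval from the case study


def spot_estimator_kwargs(checkpoint_s3_uri: str,
                          max_run_hours: int = 96,
                          max_wait_hours: int = 120) -> dict:
    """Build keyword arguments for a SageMaker estimator using managed Spot.

    max_wait covers run time plus time spent waiting for Spot capacity,
    so it must be at least max_run. The 120-hour default is an assumption.
    """
    if max_wait_hours < max_run_hours:
        raise ValueError("max_wait must be at least max_run")
    return {
        "instance_type": "ml.p3.16xlarge",
        "use_spot_instances": True,
        "max_run": max_run_hours * 3600,    # seconds
        "max_wait": max_wait_hours * 3600,  # seconds
        "checkpoint_s3_uri": checkpoint_s3_uri,
        "checkpoint_local_path": CHECKPOINT_DIR,
    }


def latest_checkpoint(directory: str = CHECKPOINT_DIR) -> Optional[str]:
    """Return the checkpoint to resume from, or None on a fresh start.

    Prefers last.ckpt (written by ModelCheckpoint(save_last=True)); otherwise
    falls back to the most recently modified .ckpt file.
    """
    if not os.path.isdir(directory):
        return None
    last = Path(directory) / "last.ckpt"
    if last.is_file():
        return str(last)
    candidates = list(Path(directory).glob("*.ckpt"))
    if not candidates:
        return None
    return str(max(candidates, key=lambda p: p.stat().st_mtime))


# Inside the training script (requires Lightning; shown as comments only):
#   from lightning.pytorch import Trainer
#   from lightning.pytorch.callbacks import ModelCheckpoint
#   ckpt = ModelCheckpoint(dirpath=CHECKPOINT_DIR,
#                          train_time_interval=CHECKPOINT_EVERY,
#                          save_last=True)
#   Trainer(callbacks=[ckpt]).fit(model, ckpt_path=latest_checkpoint())
```

The dictionary would be passed to an estimator such as `sagemaker.pytorch.PyTorch(**spot_estimator_kwargs("s3://<bucket>/<prefix>"), ...)` together with the usual role, entry point, and framework version. Because `latest_checkpoint()` returns `None` on the first launch, the same script can serve both a fresh start and a resume after interruption.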
Interruptions averaged about one per training run, each costing 6 to 12 minutes of recovery time. The 12% wall-clock overhead was small compared to the 74% cost reduction. The team now experiments more freely: three concurrent training runs cost less than one used to.
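The trade-off can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in this case study and assumes that the cost of a single run falls by the same 74% as the monthly bill. The function and constant names are illustrative.

```python
ON_DEMAND_MONTHLY_USD = 42_000  # monthly bill before the change
COST_REDUCTION = 0.74           # reported training-cost reduction
WALL_CLOCK_OVERHEAD = 0.12      # reported wall-clock increase
RUN_HOURS = 96                  # on-demand duration of a full run


def spot_monthly_cost(on_demand: float = ON_DEMAND_MONTHLY_USD) -> float:
    """Monthly bill after applying the reported cost reduction."""
    return on_demand * (1 - COST_REDUCTION)


def spot_run_hours(hours: float = RUN_HOURS) -> float:
    """Wall-clock duration of a run after the reported overhead."""
    return hours * (1 + WALL_CLOCK_OVERHEAD)


def concurrent_runs_relative_cost(runs: int) -> float:
    """Cost of `runs` Spot runs as a fraction of one on-demand run."""
    return runs * (1 - COST_REDUCTION)


if __name__ == "__main__":
    print(f"Monthly cost on Spot: ${spot_monthly_cost():,.0f}")
    print(f"Run duration on Spot: {spot_run_hours():.1f} hours")
    print(f"Three Spot runs vs one on-demand run: {concurrent_runs_relative_cost(3):.0%}")
```

Under these assumptions the monthly bill lands just under $11k, a 96-hour run stretches to roughly 107.5 hours, and three Spot runs together cost about 78% of one on-demand run.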