Zhivko Todorov

CASE 48 · LUMEN · 2026

SAGEMAKER · SPOT TRAINING · CHECKPOINTING

Model training on Spot, checkpoint-resumed.

A computer vision team trained models on SageMaker with on-demand p3.16xlarge instances at roughly $24/hour each. A full training run took 96 hours. Their monthly training bill ran $42k. We moved training to managed Spot with checkpoint-resume and brought it under $11k.

INDUSTRY

Computer vision

DOMAIN

COST

DELIVERED

2026

STACK

SAGEMAKER·MANAGED SPOT·EFS·S3 CHECKPOINTING·PYTORCH LIGHTNING·STEP FUNCTIONS

RESULTS

What changed, by the numbers.

TRAINING COST

−74%

$42K → $11K / MONTH

WALL-CLOCK TIME

+12%

INTERRUPTION RECOVERY

CHECKPOINT FREQ

EVERY 30m

DURABLE TO S3

MODEL QUALITY

0% Δ

NO REGRESSION

HOW IT WENT

The team had assumed Spot was unusable for training because a single interruption mid-run would lose 30 hours of progress. That was true of their original training code, which kept all state in instance memory, so nothing survived when an instance was reclaimed.

We refactored the training loop to use PyTorch Lightning’s built-in checkpointing, with checkpoints written to S3 every 30 minutes. SageMaker’s managed Spot mode handled interruption and restart automatically, and each restarted job resumed from the most recent checkpoint in S3 rather than from the beginning. Step Functions orchestrated the multi-stage training pipeline.
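The checkpoint-resume pattern described above can be sketched in a few lines. This is a minimal illustration, not the team's code: it uses only the Python standard library in place of PyTorch Lightning, a local directory in place of S3, and a step count in place of the 30-minute timer. The names `train`, `save_checkpoint`, `latest_checkpoint`, and `CHECKPOINT_EVERY` are hypothetical. On SageMaker, the local directory would typically be `/opt/ml/checkpoints`, which the service syncs to the bucket given as `checkpoint_s3_uri`.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

# Assumption for the sketch: checkpoint every 30 steps.
# The case study checkpointed every 30 minutes of wall-clock time.
CHECKPOINT_EVERY = 30


def latest_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the newest checkpoint file, or None on a fresh start."""
    # Zero-padded step numbers make lexicographic order match numeric order.
    candidates = sorted(ckpt_dir.glob("step-*.json"))
    return candidates[-1] if candidates else None


def save_checkpoint(ckpt_dir: Path, state: dict) -> None:
    """Write to a temp file, then rename, so an interruption
    cannot leave a half-written checkpoint behind."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    target = ckpt_dir / f"step-{state['step']:08d}.json"
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, target)


def train(ckpt_dir: Path, total_steps: int,
          interrupt_at: Optional[int] = None) -> dict:
    """Run or resume a toy training loop.

    `interrupt_at` raises RuntimeError at that step to simulate
    a Spot interruption.
    """
    ckpt = latest_checkpoint(ckpt_dir)
    if ckpt is not None:
        state = json.loads(ckpt.read_text())  # resume from saved state
    else:
        state = {"step": 0, "weight": 0.0}    # fresh start

    while state["step"] < total_steps:
        if interrupt_at is not None and state["step"] == interrupt_at:
            raise RuntimeError("simulated Spot interruption")
        state["weight"] += 0.5  # stand-in for one optimizer step
        state["step"] += 1
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(ckpt_dir, state)

    save_checkpoint(ckpt_dir, state)  # final checkpoint at end of run
    return state
```

Calling `train(ckpt_dir, 100, interrupt_at=75)` and then `train(ckpt_dir, 100)` on the same directory restarts from step 60 rather than step 0, so only the 15 steps since the last checkpoint are repeated. The same reasoning bounds the work lost per interruption in the real pipeline to at most one checkpoint interval.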

Interruptions averaged about one per training run, each costing 6 to 12 minutes of recovery time. The 12% wall-clock overhead was small compared to the 74% cost reduction. The team now experiments more freely: three concurrent training runs cost less than one used to.
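The figures above can be checked with a little arithmetic. The sketch below uses only numbers quoted in this case study. The function and constant names are hypothetical, and the three-run comparison assumes that per-run cost fell in the same proportion as the monthly bill.

```python
# Figures quoted in the case study (USD and hours).
ON_DEMAND_RATE = 24.0        # approximate on-demand price per instance-hour
RUN_HOURS = 96               # duration of one full training run
MONTHLY_BEFORE = 42_000      # monthly training bill on on-demand
MONTHLY_AFTER = 11_000       # monthly training bill on managed Spot
WALL_CLOCK_OVERHEAD = 0.12   # extra wall-clock time after the move


def cost_reduction(before: float, after: float) -> float:
    """Fractional saving relative to the original cost."""
    return (before - after) / before


def relative_cost_of_runs(n_runs: int, before: float, after: float) -> float:
    """Cost of n Spot runs as a fraction of one on-demand run.

    Assumption: per-run cost scales by the same ratio as the monthly bill.
    """
    return n_runs * (after / before)


reduction = cost_reduction(MONTHLY_BEFORE, MONTHLY_AFTER)
three_runs = relative_cost_of_runs(3, MONTHLY_BEFORE, MONTHLY_AFTER)
spot_run_hours = RUN_HOURS * (1 + WALL_CLOCK_OVERHEAD)
on_demand_cost_per_instance_run = ON_DEMAND_RATE * RUN_HOURS

print(f"cost reduction: {reduction:.0%}")
print(f"three Spot runs vs one on-demand run: {three_runs:.0%}")
print(f"wall-clock hours per run on Spot: {spot_run_hours:.1f}")
print(f"on-demand cost per instance per run: ${on_demand_cost_per_instance_run:,.0f}")
```

Under that assumption, three Spot runs come to roughly 79% of the cost of one on-demand run, which is consistent with the claim that three concurrent runs cost less than one used to.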
