Spot Instances for Production? A Risk/Reward Analysis
Key takeaways
- Spot instances provide 70-90% savings over On-Demand but can be terminated with two minutes' notice when AWS needs the capacity back
- Production viability depends on workload characteristics: stateless batch jobs are ideal, stateful databases are unsuitable
- Interruption rates vary by instance type and region (1-5% hourly in most cases, up to 20% during capacity crunches)
- Combining spot with on-demand in auto-scaling groups, using capacity-optimized allocation, and implementing graceful shutdown handlers enables production use
- Spot is production-ready for containerized web services, data processing, CI/CD, and rendering—but requires architectural investment in resilience
The Spot Instance Promise (and the Fine Print)
Your CFO just forwarded you the monthly AWS bill with one word: "Why?" Your ECS cluster is running 50 r6i.2xlarge instances 24/7 at $0.504/hour each, which works out to roughly $18,144/month.
A junior engineer suggests: "Why don't we use Spot Instances? They're $0.0756/hour. We'd save about $15,400/month."
Your first instinct is: "Spot Instances are for batch jobs, not production." But is that still true in 2025?
The reality: Spot Instances are production-viable for many workloads—if you architect for interruptions.
Understanding Spot Instance Economics
The Basic Model
AWS has excess capacity that varies by:
- Instance type
- Availability Zone
- Time of day
- Overall demand
Spot pricing:
- Up to 90% cheaper than On-Demand
- Price fluctuates based on supply/demand
- AWS can reclaim instances with two minutes' notice
Real-World Spot Pricing (us-east-1, January 2025)
| Instance Type | On-Demand | Typical Spot | Savings | Interruption Rate |
|---|---|---|---|---|
| t3.medium | $0.0416/hr | $0.0125/hr | 70% | <1%/hour |
| m5.large | $0.096/hr | $0.0288/hr | 70% | 2-3%/hour |
| c5.xlarge | $0.17/hr | $0.051/hr | 70% | 2-4%/hour |
| r6i.2xlarge | $0.504/hr | $0.0756/hr | 85% | 3-5%/hour |
| g4dn.xlarge (GPU) | $0.526/hr | $0.158/hr | 70% | 5-10%/hour |
Key insight: Interruption rates are hourly, so they compound. A 3% hourly interruption rate gives a single instance only about a 48% chance (0.97^24) of running a full 24 hours without interruption.
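To see how hourly rates compound, here is a quick back-of-the-envelope check (plain Python; it assumes a constant, independent hourly interruption probability, which is a simplification of real Spot behavior):
# Probability that a single Spot instance survives N hours, assuming a
# constant, independent hourly interruption rate (a simplification of
# real behavior, which varies by pool, time of day, and capacity).
def survival_probability(hourly_rate: float, hours: int) -> float:
    return (1.0 - hourly_rate) ** hours

for rate in (0.01, 0.03, 0.05):
    p = survival_probability(rate, 24)
    print(f"{rate:.0%}/hour -> {p:.0%} chance of running 24h uninterrupted")
# 1%/hour -> 79%, 3%/hour -> 48%, 5%/hour -> 29%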
The Hidden Costs
Spot isn't free savings—you pay with:
1. Engineering time:
- Graceful shutdown handlers
- State persistence logic
- Monitoring and alerting
- Runbook development
2. Infrastructure complexity:
- Mixed instance types
- Multiple AZs
- Fallback to On-Demand
- Capacity management
3. Operational overhead:
- Responding to interruptions
- Debugging spot-related failures
- Capacity planning across instance families
ROI calculation:
Gross savings: ~$15,400/month (50 × ($0.504 - $0.0756) × 720 hours)
Engineering cost: 40 hours × $150/hour = $6,000 (one-time)
Ongoing ops: 5 hours/month × $150 = $750/month
First month net: ~$15,400 - $6,000 - $750 ≈ $8,700
Ongoing monthly net: ~$15,400 - $750 ≈ $14,700
Break-even: Month 1
First-year net: (~$14,700 × 12) - $6,000 ≈ $170,000
Verdict: If you're spending over $5K/month on compute, Spot is worth investigating.
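To run the same math against your own fleet, here is a minimal sketch. The fleet size, prices, hours, and rates below are just the assumptions from this example; substitute your own numbers.
# Rough Spot ROI sketch using the figures from the example above.
# Every input is an assumption; substitute your own fleet size, prices,
# and engineering rates.
INSTANCES = 50
ON_DEMAND_RATE = 0.504        # $/hour, r6i.2xlarge On-Demand
SPOT_RATE = 0.0756            # $/hour, typical recent Spot price
HOURS_PER_MONTH = 720
SETUP_HOURS, HOURLY_RATE = 40, 150   # one-time implementation effort
OPS_HOURS_PER_MONTH = 5              # ongoing operational overhead

gross_monthly = INSTANCES * (ON_DEMAND_RATE - SPOT_RATE) * HOURS_PER_MONTH
setup_cost = SETUP_HOURS * HOURLY_RATE
ops_monthly = OPS_HOURS_PER_MONTH * HOURLY_RATE

print(f"Gross monthly savings: ${gross_monthly:,.0f}")
print(f"First-month net:       ${gross_monthly - setup_cost - ops_monthly:,.0f}")
print(f"Ongoing monthly net:   ${gross_monthly - ops_monthly:,.0f}")
print(f"First-year net:        ${12 * (gross_monthly - ops_monthly) - setup_cost:,.0f}")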
Workload Suitability Analysis
✅ Excellent Fit for Spot
1. Stateless Web Services (with graceful shutdown)
Characteristics:
- No local state
- Load balanced across many instances
- Can handle individual instance loss
- Connection draining available
Example: API server behind ALB
- 20 instances in Auto Scaling Group
- 70% Spot, 30% On-Demand minimum
- Instance termination = ALB drains connections over 120 seconds
- ECS handles task rescheduling automatically
Annual savings: $130K+ for medium-sized API
2. Batch Processing / Data Pipelines
Characteristics:
- Job-based workload
- Idempotent operations
- Checkpoint/resume capability
- Failure retry logic
Example: Video transcoding pipeline
- SQS queue with 10,000 jobs
- Spot Fleet scales 0-100 instances
- Job failure = message returns to queue
- Another instance picks it up
Annual savings: $200K+ for high-volume processing
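A minimal sketch of that worker loop (Python/boto3; the queue URL and the transcode function are placeholders). The key property is that a message is deleted only after the job succeeds, so when an instance is interrupted mid-job the visibility timeout expires and another worker retries it:
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/transcode-jobs"  # placeholder

def transcode(body: str) -> None:
    """Placeholder for the actual job logic. Must be idempotent, because
    an interrupted instance's message will be processed again elsewhere."""
    ...

def worker_loop() -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,        # long polling
            VisibilityTimeout=900,     # must exceed the longest job runtime
        )
        for msg in resp.get("Messages", []):
            transcode(msg["Body"])
            # Delete only after success; if this instance is reclaimed
            # mid-job, the message becomes visible again and another
            # worker picks it up.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])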
3. Containerized Microservices
Characteristics:
- Container orchestration (ECS/EKS)
- Service mesh or discovery
- Multiple replicas
- Health checks and auto-restart
Example: E-commerce checkout service
- 8 ECS tasks across 4 instances
- Mix of Spot (75%) and On-Demand (25%)
- Task draining on interruption
- New tasks start on remaining capacity
Annual savings: $90K+ for typical microservices cluster
4. CI/CD Build Agents
Characteristics:
- Ephemeral workloads
- Build can restart on different host
- No external SLA impact
- High parallelism
Example: GitHub Actions self-hosted runners
- 50 Spot instances for parallel builds
- Interruption = job re-queued
- Build time increases slightly, cost drops 80%
Annual savings: $150K+ for active engineering org
5. Dev/Test/Staging Environments
Characteristics:
- Non-production workloads
- Tolerance for occasional downtime
- Easy to recreate state
- Lower availability requirements
Example: Staging environment
- Mirrors production architecture
- 100% Spot instances
- Interruption = redeploy (5 minutes)
Annual savings: $200K+ if staging mirrors production
⚠️ Possible with Careful Architecture
1. Kubernetes Worker Nodes
Challenges:
- Pod rescheduling overhead
- Potential for cascading failures
- StatefulSets require special handling
Solution:
# Mix of node groups
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: production
region: us-east-1
nodeGroups:
# On-Demand for critical stateful workloads
- name: on-demand-critical
instanceType: m5.large
desiredCapacity: 3
minSize: 3
maxSize: 10
labels:
workload-type: stateful
taints:
- key: workload-type
value: stateful
effect: NoSchedule
# Spot for stateless services
- name: spot-stateless
instancesDistribution:
instanceTypes:
- m5.large
- m5a.large
- m5n.large
- m5d.large
onDemandBaseCapacity: 2
onDemandPercentageAboveBaseCapacity: 0
spotAllocationStrategy: capacity-optimized
desiredCapacity: 10
minSize: 3
maxSize: 50
labels:
workload-type: stateless
Workload deployment:
# Stateless API - tolerates spot interruptions
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-server
spec:
replicas: 10
template:
spec:
# Allow scheduling on spot nodes
tolerations:
- key: workload-type
value: stateless
effect: NoSchedule
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: workload-type
operator: In
values:
- stateless
---
# Database - requires stable on-demand nodes
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
spec:
replicas: 3
template:
spec:
# Require on-demand nodes
nodeSelector:
workload-type: stateful
tolerations:
- key: workload-type
value: stateful
effect: NoSchedule
2. Long-Running ML Training
Challenges:
- Training can take hours/days
- Interruption = lost progress
- GPU instances have higher interruption rates
Solution: Checkpointing
import torch
import boto3
import os
from datetime import datetime
class SpotCheckpointer:
"""Handles checkpointing for Spot instance interruptions"""
def __init__(self, model, optimizer, s3_bucket, checkpoint_prefix):
self.model = model
self.optimizer = optimizer
self.s3_bucket = s3_bucket
self.checkpoint_prefix = checkpoint_prefix
self.s3 = boto3.client('s3')
# Monitor spot interruption notices
self.interruption_url = 'http://169.254.169.254/latest/meta-data/spot/instance-action'
def save_checkpoint(self, epoch, loss, metrics):
"""Save checkpoint locally and to S3"""
checkpoint = {
'epoch': epoch,
'model_state_dict': self.model.state_dict(),
'optimizer_state_dict': self.optimizer.state_dict(),
'loss': loss,
'metrics': metrics,
'timestamp': datetime.utcnow().isoformat()
}
# Save locally first (fast)
local_path = f'/tmp/checkpoint_epoch_{epoch}.pt'
torch.save(checkpoint, local_path)
# Upload to S3 (durable)
s3_key = f'{self.checkpoint_prefix}/checkpoint_epoch_{epoch}.pt'
self.s3.upload_file(local_path, self.s3_bucket, s3_key)
print(f"Checkpoint saved: s3://{self.s3_bucket}/{s3_key}")
def load_latest_checkpoint(self):
"""Load most recent checkpoint from S3"""
# List all checkpoints
response = self.s3.list_objects_v2(
Bucket=self.s3_bucket,
Prefix=self.checkpoint_prefix
)
if 'Contents' not in response:
return None
# Find latest checkpoint
checkpoints = sorted(response['Contents'], key=lambda x: x['LastModified'], reverse=True)
latest = checkpoints[0]['Key']
# Download and load
local_path = '/tmp/checkpoint_latest.pt'
self.s3.download_file(self.s3_bucket, latest, local_path)
checkpoint = torch.load(local_path)
self.model.load_state_dict(checkpoint['model_state_dict'])
self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
print(f"Resumed from epoch {checkpoint['epoch']}")
return checkpoint
def check_interruption_notice(self):
"""Check if instance is marked for termination"""
import requests
try:
response = requests.get(self.interruption_url, timeout=1)
if response.status_code == 200:
# Interruption notice received
return True
except requests.exceptions.RequestException:
# Metadata endpoint unreachable or timed out; treat as no notice (a 404 simply returns a non-200 status above)
pass
return False
# Training loop with checkpointing
checkpointer = SpotCheckpointer(
model=model,
optimizer=optimizer,
s3_bucket='ml-training-checkpoints',
checkpoint_prefix='experiment-123'
)
# Resume from checkpoint if exists
checkpoint = checkpointer.load_latest_checkpoint()
start_epoch = checkpoint['epoch'] + 1 if checkpoint else 0
for epoch in range(start_epoch, num_epochs):
for batch_idx, (data, target) in enumerate(train_loader):
# Training step
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
# Check for interruption every 100 batches
if batch_idx % 100 == 0:
if checkpointer.check_interruption_notice():
print("Spot interruption notice received! Saving checkpoint...")
checkpointer.save_checkpoint(epoch, loss.item(), metrics={})
print("Checkpoint saved. Exiting gracefully.")
exit(0)
# Save checkpoint every epoch
checkpointer.save_checkpoint(epoch, loss.item(), metrics={})
Result: Training survives interruptions with minimal loss (only current batch)
❌ Bad Fit for Spot
1. Stateful Databases (Primary)
Why not:
- 2-minute notice insufficient for clean shutdown
- Risk of data corruption
- Replication lag on failover
- Customer-facing downtime
Exception: Read replicas can use Spot if:
- Application handles replica unavailability
- Multiple replicas available
- Automated replica replacement
2. Single Points of Failure
Why not:
- No redundancy = outage on interruption
- Examples: Single NAT gateway, single Jenkins master, single Redis cache
Solution: Don't use Spot, or add redundancy first
3. Tight SLA Requirements
Why not:
- P99 latency SLAs impacted by interruptions
- Customer-facing impact
- No fallback capacity
Example: Payment processing with <500ms SLA
- Spot interruption = 2-minute degradation
- Use On-Demand with Savings Plans instead
Implementation Patterns
Pattern 1: ECS with Mixed Capacity
Terraform configuration:
resource "aws_ecs_cluster" "main" {
name = "production"
setting {
name = "containerInsights"
value = "enabled"
}
}
# Capacity provider: On-Demand base
resource "aws_ecs_capacity_provider" "on_demand" {
name = "on-demand-provider"
auto_scaling_group_provider {
auto_scaling_group_arn = aws_autoscaling_group.on_demand.arn
managed_scaling {
status = "ENABLED"
target_capacity = 100
minimum_scaling_step_size = 1
maximum_scaling_step_size = 10
}
}
}
resource "aws_autoscaling_group" "on_demand" {
name = "ecs-on-demand"
vpc_zone_identifier = var.private_subnet_ids
min_size = 2
max_size = 20
desired_capacity = 2
launch_template {
id = aws_launch_template.on_demand.id
version = "$Latest"
}
tag {
key = "Name"
value = "ecs-on-demand"
propagate_at_launch = true
}
}
resource "aws_launch_template" "on_demand" {
name_prefix = "ecs-on-demand-"
image_id = data.aws_ami.ecs_optimized.id
instance_type = "m5.large"
iam_instance_profile {
name = aws_iam_instance_profile.ecs.name
}
user_data = base64encode(<<-EOF
#!/bin/bash
echo ECS_CLUSTER=${aws_ecs_cluster.main.name} >> /etc/ecs/ecs.config
echo ECS_ENABLE_SPOT_INSTANCE_DRAINING=true >> /etc/ecs/ecs.config
EOF
)
lifecycle {
create_before_destroy = true
}
}
# Capacity provider: Spot instances (multiple instance types)
resource "aws_ecs_capacity_provider" "spot" {
name = "spot-provider"
auto_scaling_group_provider {
auto_scaling_group_arn = aws_autoscaling_group.spot.arn
managed_scaling {
status = "ENABLED"
target_capacity = 100
minimum_scaling_step_size = 1
maximum_scaling_step_size = 10
}
}
}
resource "aws_autoscaling_group" "spot" {
name = "ecs-spot"
vpc_zone_identifier = var.private_subnet_ids
min_size = 0
max_size = 50
desired_capacity = 8
mixed_instances_policy {
instances_distribution {
on_demand_base_capacity = 0
on_demand_percentage_above_base_capacity = 0
spot_allocation_strategy = "capacity-optimized"
spot_instance_pools = 0
}
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.spot.id
version = "$Latest"
}
# Multiple instance types for diversification
override {
instance_type = "m5.large"
}
override {
instance_type = "m5a.large"
}
override {
instance_type = "m5n.large"
}
override {
instance_type = "m5d.large"
}
}
}
tag {
key = "Name"
value = "ecs-spot"
propagate_at_launch = true
}
}
resource "aws_launch_template" "spot" {
name_prefix = "ecs-spot-"
image_id = data.aws_ami.ecs_optimized.id
instance_type = "m5.large" # Default, overridden by mixed_instances_policy
iam_instance_profile {
name = aws_iam_instance_profile.ecs.name
}
user_data = base64encode(<<-EOF
#!/bin/bash
echo ECS_CLUSTER=${aws_ecs_cluster.main.name} >> /etc/ecs/ecs.config
echo ECS_ENABLE_SPOT_INSTANCE_DRAINING=true >> /etc/ecs/ecs.config
EOF
)
lifecycle {
create_before_destroy = true
}
}
# Cluster capacity provider strategy
resource "aws_ecs_cluster_capacity_providers" "main" {
cluster_name = aws_ecs_cluster.main.name
capacity_providers = [
aws_ecs_capacity_provider.on_demand.name,
aws_ecs_capacity_provider.spot.name
]
default_capacity_provider_strategy {
capacity_provider = aws_ecs_capacity_provider.on_demand.name
weight = 1
base = 2 # Minimum 2 on-demand instances
}
default_capacity_provider_strategy {
capacity_provider = aws_ecs_capacity_provider.spot.name
weight = 4 # 80% of scaling goes to spot
base = 0
}
}
# ECS Service configuration
resource "aws_ecs_service" "api" {
name = "api-service"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.api.arn
desired_count = 10
# Use cluster's default capacity provider strategy
capacity_provider_strategy {
capacity_provider = aws_ecs_capacity_provider.on_demand.name
weight = 1
base = 2
}
capacity_provider_strategy {
capacity_provider = aws_ecs_capacity_provider.spot.name
weight = 4
base = 0
}
load_balancer {
target_group_arn = aws_lb_target_group.api.arn
container_name = "api"
container_port = 8080
}
# Enable connection draining
deployment_configuration {
minimum_healthy_percent = 50
maximum_percent = 200
}
}
Key features:
- 2 on-demand instances (base capacity)
- Spot instances handle 80% of scale-out
- Multiple instance types for availability
- capacity-optimized allocation strategy
- Automatic spot draining enabled
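The base/weight split can be non-obvious. Roughly, ECS places the first base tasks on the on-demand provider and divides the rest by weight; actual placement also depends on available capacity and rounding. A toy illustration (pure Python, not an AWS API call):
# Rough illustration of how the base/weight strategy above splits a
# service's tasks (base=2 on-demand, then a 1:4 weight ratio). Real ECS
# placement also depends on available capacity and rounding, so treat
# this as an approximation, not the scheduler's exact algorithm.
def split_tasks(desired: int, od_base: int, od_weight: int, spot_weight: int) -> dict:
    base = min(od_base, desired)
    remaining = desired - base
    od_extra = round(remaining * od_weight / (od_weight + spot_weight))
    return {"on_demand": base + od_extra, "spot": remaining - od_extra}

print(split_tasks(desired=10, od_base=2, od_weight=1, spot_weight=4))
# {'on_demand': 4, 'spot': 6}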
Pattern 2: Graceful Shutdown Handler
Node.js application with spot handling:
import express from 'express';
import http from 'http';
import axios from 'axios';
const app = express();
const server = http.createServer(app);
let isShuttingDown = false;
// Health check endpoint
app.get('/health', (req, res) => {
if (isShuttingDown) {
// Return 503 to fail ALB health checks
// This removes instance from load balancer
res.status(503).json({ status: 'shutting_down' });
} else {
res.json({ status: 'healthy' });
}
});
// Application routes
app.get('/api/users', async (req, res) => {
// Your business logic
res.json({ users: [] });
});
// Graceful shutdown handler
async function gracefulShutdown(signal: string) {
console.log(`${signal} received. Starting graceful shutdown...`);
isShuttingDown = true;
// 1. Stop accepting new connections
server.close(() => {
console.log('HTTP server closed');
});
// 2. Wait for ALB to drain connections (typically 30-120 seconds)
console.log('Waiting 30 seconds for connection draining...');
await new Promise(resolve => setTimeout(resolve, 30000));
// 3. Close database connections (assumes `db` is the app's database client, initialized elsewhere)
console.log('Closing database connections...');
await db.close();
// 4. Flush metrics (assumes `metrics` is the app's metrics client)
console.log('Flushing metrics...');
await metrics.flush();
console.log('Graceful shutdown complete');
process.exit(0);
}
// Handle termination signals
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));
// Spot interruption monitor (runs every 5 seconds)
async function monitorSpotInterruption() {
try {
const response = await axios.get(
'http://169.254.169.254/latest/meta-data/spot/instance-action',
{ timeout: 1000 }
);
if (response.status === 200) {
const interruptionData = response.data;
console.log('Spot interruption notice received:', interruptionData);
// Trigger graceful shutdown immediately
await gracefulShutdown('SPOT_INTERRUPTION');
}
} catch (error) {
// 404 means no interruption notice (normal)
// Other errors are connection issues (ignore)
}
}
// Check for interruption every 5 seconds
setInterval(monitorSpotInterruption, 5000);
const PORT = process.env.PORT || 8080;
server.listen(PORT, () => {
console.log(`Server running on port ${PORT}`);
});
What this does:
- Monitors instance metadata for spot interruption notices
- Fails health checks when interruption detected
- Waits 30 seconds for ALB to drain connections
- Closes resources cleanly (DB, metrics)
- Exits gracefully allowing ECS to reschedule task
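One caveat: the interruption checks in this article query instance metadata the IMDSv1 way. If your launch templates enforce IMDSv2 (token-required metadata), the same check needs a session token first; a Python sketch of the equivalent poll:
import requests

METADATA = "http://169.254.169.254/latest"

def spot_interruption_pending(timeout: float = 1.0) -> bool:
    """Check for a spot interruption notice via IMDSv2 (token-required
    instance metadata). Returns True once AWS has scheduled a reclaim."""
    try:
        token = requests.put(
            f"{METADATA}/api/token",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
            timeout=timeout,
        ).text
        resp = requests.get(
            f"{METADATA}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
            timeout=timeout,
        )
        return resp.status_code == 200  # 404 means no notice yet
    except requests.exceptions.RequestException:
        return False  # metadata unreachable; assume no notice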
Pattern 3: Spot Fleet for Batch Processing
Lambda function triggering Spot Fleet:
import { EC2, SQS } from 'aws-sdk';
const ec2 = new EC2();
const sqs = new SQS();
interface SpotFleetConfig {
queueUrl: string;
targetQueueDepth: number;
spotFleetRequestId: string;
minCapacity: number;
maxCapacity: number;
}
export async function handler(event: any) {
const config: SpotFleetConfig = {
queueUrl: process.env.QUEUE_URL!,
targetQueueDepth: 100, // Target 100 messages per instance
spotFleetRequestId: process.env.SPOT_FLEET_ID!,
minCapacity: 0,
maxCapacity: 50
};
// 1. Get current queue depth
const queueAttrs = await sqs.getQueueAttributes({
QueueUrl: config.queueUrl,
AttributeNames: ['ApproximateNumberOfMessages']
}).promise();
const queueDepth = parseInt(
queueAttrs.Attributes?.ApproximateNumberOfMessages || '0'
);
// 2. Calculate desired capacity
const desiredCapacity = Math.min(
Math.max(
Math.ceil(queueDepth / config.targetQueueDepth),
config.minCapacity
),
config.maxCapacity
);
console.log(`Queue depth: ${queueDepth}, Desired capacity: ${desiredCapacity}`);
// 3. Update Spot Fleet target capacity
await ec2.modifySpotFleetRequest({
SpotFleetRequestId: config.spotFleetRequestId,
TargetCapacity: desiredCapacity
}).promise();
return {
statusCode: 200,
body: JSON.stringify({
queueDepth,
desiredCapacity,
timestamp: new Date().toISOString()
})
};
}
Spot Fleet configuration:
resource "aws_spot_fleet_request" "batch_processing" {
iam_fleet_role = aws_iam_role.spot_fleet.arn
target_capacity = 0 # Controlled by Lambda
allocation_strategy = "capacityOptimized"
fleet_type = "maintain"
# Terminate instances when the fleet request expires
terminate_instances_with_expiration = true
launch_specification {
ami = data.aws_ami.ecs_optimized.id
instance_type = "c5.2xlarge"
spot_price = "0.20" # Max price (current spot ~$0.10)
vpc_security_group_ids = [aws_security_group.batch.id]
subnet_id = var.private_subnet_ids[0]
iam_instance_profile_arn = aws_iam_instance_profile.batch.arn
user_data = base64encode(<<-EOF
#!/bin/bash
# Start batch processor
docker run -d \
-e QUEUE_URL=${aws_sqs_queue.jobs.url} \
-e AWS_REGION=${var.aws_region} \
batch-processor:latest
EOF
)
}
# Add multiple instance types for availability
launch_specification {
ami = data.aws_ami.ecs_optimized.id
instance_type = "c5a.2xlarge"
spot_price = "0.20"
vpc_security_group_ids = [aws_security_group.batch.id]
subnet_id = var.private_subnet_ids[0]
iam_instance_profile_arn = aws_iam_instance_profile.batch.arn
user_data = base64encode(<<-EOF
#!/bin/bash
docker run -d \
-e QUEUE_URL=${aws_sqs_queue.jobs.url} \
-e AWS_REGION=${var.aws_region} \
batch-processor:latest
EOF
)
}
launch_specification {
ami = data.aws_ami.ecs_optimized.id
instance_type = "c5n.2xlarge"
spot_price = "0.20"
vpc_security_group_ids = [aws_security_group.batch.id]
subnet_id = var.private_subnet_ids[0]
iam_instance_profile_arn = aws_iam_instance_profile.batch.arn
user_data = base64encode(<<-EOF
#!/bin/bash
docker run -d \
-e QUEUE_URL=${aws_sqs_queue.jobs.url} \
-e AWS_REGION=${var.aws_region} \
batch-processor:latest
EOF
)
}
}
# Lambda to scale Spot Fleet based on queue depth
resource "aws_lambda_function" "spot_fleet_scaler" {
filename = "spot_fleet_scaler.zip"
function_name = "spot-fleet-autoscaler"
role = aws_iam_role.lambda.arn
handler = "index.handler"
runtime = "nodejs18.x"
timeout = 60
environment {
variables = {
QUEUE_URL = aws_sqs_queue.jobs.url
SPOT_FLEET_ID = aws_spot_fleet_request.batch_processing.id
}
}
}
# Trigger Lambda every minute
resource "aws_cloudwatch_event_rule" "spot_fleet_scaling" {
name = "spot-fleet-scaling"
schedule_expression = "rate(1 minute)"
}
resource "aws_cloudwatch_event_target" "spot_fleet_scaling" {
rule = aws_cloudwatch_event_rule.spot_fleet_scaling.name
target_id = "lambda"
arn = aws_lambda_function.spot_fleet_scaler.arn
}
Monitoring and Alerting
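One note before the monitoring config: EC2 does not publish a spot-interruption count as a built-in CloudWatch metric. A common approach is an EventBridge rule matching "EC2 Spot Instance Interruption Warning" events that invokes a small Lambda to publish a custom metric. The sketch below assumes a Custom/Spot namespace (an arbitrary choice) that the dashboard and alarm that follow reference; the EventBridge rule, Lambda resource, and IAM permissions are omitted.
import boto3

cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    """Invoked by an EventBridge rule matching source "aws.ec2" and
    detail-type "EC2 Spot Instance Interruption Warning". Publishes a
    count so the dashboard and alarm below have data to work with."""
    instance_id = event.get("detail", {}).get("instance-id", "unknown")
    cloudwatch.put_metric_data(
        Namespace="Custom/Spot",  # arbitrary custom namespace, matched by the alarm below
        MetricData=[{
            "MetricName": "SpotInstanceInterruptions",
            "Value": 1.0,
            "Unit": "Count",
        }],
    )
    return {"recorded": instance_id}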
CloudWatch dashboard for spot interruptions:
resource "aws_cloudwatch_dashboard" "spot_monitoring" {
dashboard_name = "spot-instance-monitoring"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
properties = {
metrics = [
["AWS/EC2", "StatusCheckFailed", { stat = "Sum" }],
["Custom/Spot", "SpotInstanceInterruptions", { stat = "Sum" }]
]
period = 300
stat = "Sum"
region = var.aws_region
title = "Spot Instance Interruptions"
}
},
{
type = "metric"
properties = {
metrics = [
["AWS/ECS", "CPUUtilization", { stat = "Average" }],
[".", "MemoryUtilization", { stat = "Average" }]
]
period = 300
stat = "Average"
region = var.aws_region
title = "ECS Cluster Resource Utilization"
}
},
{
type = "log"
properties = {
query = <<-EOQ
SOURCE '/aws/ecs/cluster/production'
| filter @message like /spot.*interrupt/
| stats count() by bin(5m)
EOQ
region = var.aws_region
title = "Spot Interruption Log Events"
}
}
]
})
}
# Alert on high interruption rate
resource "aws_cloudwatch_metric_alarm" "high_spot_interruption_rate" {
alarm_name = "high-spot-interruption-rate"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
# Custom metric published by the EventBridge-triggered Lambda shown above (not a built-in EC2 metric)
metric_name = "SpotInstanceInterruptions"
namespace = "Custom/Spot"
period = 300
statistic = "Sum"
threshold = 5
alarm_description = "Alert when spot interruptions exceed 5 in 5 minutes"
alarm_actions = [aws_sns_topic.alerts.arn]
}
# Alert on insufficient capacity (requested capacity not being fulfilled)
resource "aws_cloudwatch_metric_alarm" "spot_capacity_shortage" {
alarm_name = "spot-capacity-shortage"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "PendingCapacity"
namespace = "AWS/EC2Spot"
period = 300
statistic = "Average"
threshold = 5 # Alert if 5+ requested instances remain unfulfilled for 15 minutes
alarm_description = "Alert when the spot fleet cannot fulfill its target capacity"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
FleetRequestId = aws_spot_fleet_request.batch_processing.id
}
}
Decision Framework
Use this checklist to decide if Spot is right for your workload:
✅ Use Spot If:
- Workload is stateless or has checkpoint/resume capability
- Multiple instances/replicas provide redundancy
- Graceful shutdown can complete in under 2 minutes
- Application handles instance loss without user impact
- Monthly compute spend exceeds $5K
- Engineering team can invest 40+ hours in implementation
- You have monitoring and alerting infrastructure
❌ Don't Use Spot If:
- Workload is stateful (databases, cache, queues) without failover
- Single instance (no redundancy)
- Critical SLA requirements (<500ms P99 latency)
- Team lacks capacity for implementation and maintenance
- Compute spend is under $2K/month (ROI too low)
- Graceful shutdown impossible (e.g., long-lived WebSockets)
⚠️ Requires Extra Diligence:
- GPU instances (higher interruption rates, 5-10%)
- Kubernetes StatefulSets (requires pod disruption budgets)
- ML training (must implement checkpointing)
- Windows instances (slower boot times)
Real-World Case Study: API Migration to Spot
Company: B2B SaaS platform, $2M ARR
Workload: RESTful API (Node.js), 40 ECS tasks, 10 m5.large instances
Traffic: 50M requests/month, variable (2x peak during business hours)
Before (100% On-Demand)
10 × m5.large × $0.096/hour × 730 hours = $700.80/month
Annual: $8,410
After (70% Spot, 30% On-Demand)
On-Demand base: 3 × m5.large × $0.096/hour × 730 hours = $210.24/month
Spot capacity: 7 × m5.large × $0.0288/hour × 730 hours = $147.17/month
Total: $357.41/month
Annual: $4,289
Savings: $4,121/year (49%)
Implementation Details
Week 1: Architecture Review
- Confirmed API was stateless
- Verified connection draining configuration
- Updated health check to support graceful shutdown
Week 2: Spot Implementation
- Deployed spot interruption handler
- Created mixed ASG (3 on-demand, 7 spot capacity)
- Configured capacity-optimized allocation
Week 3: Testing
- Simulated spot interruptions in staging
- Verified task rescheduling
- Measured P99 latency impact (no regression)
Week 4: Production Rollout
- Gradual rollout: 25% → 50% → 70% spot
- Monitored interruption rates: 2-3% hourly
- No customer-reported issues
Results
Cost savings:
- $4,121/year (49% reduction)
- Engineering time: 60 hours
Operational impact:
- 15-20 spot interruptions per month
- Average recovery time: 45 seconds (automatic)
- Zero customer-facing incidents
- P99 latency unchanged: 180ms
Lessons learned:
- Diversify instance types: Using m5, m5a, m5n reduced interruption rate
- Capacity-optimized allocation works: Much better than lowest-price
- Start conservative: 30% spot, then increase confidence
- Monitor actively: Set up dashboards before migration
Conclusion: Spot for Production is Ready (If You Are)
In 2025, Spot Instances are production-viable for many workloads—but not all.
The key question isn't "Can I use Spot?" but "Should I use Spot?"
Use Spot if:
- Your workload is stateless or fault-tolerant
- You have redundancy built in
- You can implement graceful shutdown
- Your compute spend justifies the engineering effort
Skip Spot if:
- Your workload is stateful without failover
- You have tight SLA requirements
- Your team lacks bandwidth for implementation
- Your compute spend is too low (under $5K/month)
The middle path: Start with non-critical workloads (dev/staging, batch jobs) to build confidence, then migrate production workloads incrementally.
Spot Instances aren't magic—they're a trade-off. You're exchanging engineering time and operational complexity for significant cost savings. If you're spending over $10K/month on compute, that trade-off is usually worth it.
Action Items
- Audit your current compute spend by instance type and workload
- Identify stateless workloads suitable for spot (batch jobs, APIs, workers)
- Calculate potential savings using spot pricing history (see the sketch after this list)
- Implement graceful shutdown handlers for your applications
- Start with dev/staging environments to validate the approach
- Deploy spot for one production workload and monitor for 30 days
- Expand gradually as confidence builds
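For the pricing-history step, a minimal boto3 sketch (the instance types and region are examples, not recommendations):
import boto3
from datetime import datetime, timedelta, timezone

# Pull a week of Linux Spot price history for a few candidate instance
# types to estimate realistic savings before migrating anything.
ec2 = boto3.client("ec2", region_name="us-east-1")
paginator = ec2.get_paginator("describe_spot_price_history")
pages = paginator.paginate(
    InstanceTypes=["m5.large", "m5a.large", "r6i.2xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc) - timedelta(days=7),
)

prices = {}
for page in pages:
    for item in page["SpotPriceHistory"]:
        prices.setdefault(item["InstanceType"], []).append(float(item["SpotPrice"]))

for instance_type, samples in sorted(prices.items()):
    print(f"{instance_type}: avg ${sum(samples) / len(samples):.4f}/hr over {len(samples)} samples")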
If you need help designing a spot migration strategy for your infrastructure, schedule a consultation. We'll analyze your workloads, calculate ROI, and provide a phased implementation plan that minimizes risk while maximizing savings.