Spot Instances for Production? A Risk/Reward Analysis

By Zak Kann

Key takeaways

  • Spot instances provide 70-90% savings over On-Demand but can be terminated with two minutes' notice when AWS needs the capacity back
  • Production viability depends on workload characteristics: stateless batch jobs are ideal, stateful databases are unsuitable
  • Interruption rates vary by instance type and region (1-5% hourly in most cases, up to 20% during capacity crunches)
  • Combining spot with on-demand in auto-scaling groups, using capacity-optimized allocation, and implementing graceful shutdown handlers enables production use
  • Spot is production-ready for containerized web services, data processing, CI/CD, and rendering—but requires architectural investment in resilience

The Spot Instance Promise (and the Fine Print)

Your CFO just forwarded you the monthly AWS bill with one word: "Why?" Your ECS cluster runs 50 r6i.2xlarge instances 24/7: 50 × $0.504/hour × 720 hours = $18,144/month.

A junior engineer suggests: "Why don't we use Spot Instances? They're $0.0756/hour; we'd save about $15,400/month."

Your first instinct is: "Spot Instances are for batch jobs, not production." But is that still true in 2025?

The reality: Spot Instances are production-viable for many workloads—if you architect for interruptions.

Understanding Spot Instance Economics

The Basic Model

AWS has excess capacity that varies by:

  • Instance type
  • Availability Zone
  • Time of day
  • Overall demand

Spot pricing:

  • Up to 90% cheaper than On-Demand
  • Price fluctuates based on supply/demand
  • AWS can reclaim instances with two minutes' notice
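
Before committing, you can pull real spot price history for your instance types with the AWS CLI. The figures in the next table are a snapshot and will drift:

aws ec2 describe-spot-price-history \
  --instance-types r6i.2xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice,Timestamp]' \
  --output table
# (GNU date shown; on macOS use: date -u -v-7d +%Y-%m-%dT%H:%M:%S)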

Real-World Spot Pricing (us-east-1, January 2025)

Instance Type       On-Demand    Typical Spot   Savings   Interruption Rate
t3.medium           $0.0416/hr   $0.0125/hr     70%       <1%/hour
m5.large            $0.096/hr    $0.0288/hr     70%       2-3%/hour
c5.xlarge           $0.17/hr     $0.051/hr      70%       2-4%/hour
r6i.2xlarge         $0.504/hr    $0.0756/hr     85%       3-5%/hour
g4dn.xlarge (GPU)   $0.526/hr    $0.158/hr      70%       5-10%/hour

Key insight: Interruption rates are hourly, and they compound. A 3% hourly rate gives a single instance only about a 48% chance (0.97^24) of running a full 24 hours without interruption, which is why diversification and automated recovery matter more than any single instance's uptime.
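
To make the compounding concrete, a quick sketch:

def survival_probability(hourly_rate: float, hours: int) -> float:
    """Chance a single instance survives `hours` hours at a constant hourly interruption rate."""
    return (1 - hourly_rate) ** hours

print(f"{survival_probability(0.03, 24):.1%}")  # ~48.1% for one instance over a day
# Across 20 instances, expect roughly 20 * (1 - 0.481) ≈ 10 instances
# interrupted at some point each day; the architecture, not the instance,
# has to absorb them.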

The Hidden Costs

Spot isn't free savings—you pay with:

1. Engineering time:

  • Graceful shutdown handlers
  • State persistence logic
  • Monitoring and alerting
  • Runbook development

2. Infrastructure complexity:

  • Mixed instance types
  • Multiple AZs
  • Fallback to On-Demand
  • Capacity management

3. Operational overhead:

  • Responding to interruptions
  • Debugging spot-related failures
  • Capacity planning across instance families

ROI calculation:

Gross savings: ~$15,400/month
Engineering cost: 40 hours × $150/hour = $6,000 (one-time)
Ongoing ops: 5 hours/month × $150 = $750/month

First month net: $15,400 - $6,000 - $750 = $8,650
Ongoing monthly net: $15,400 - $750 = $14,650

Break-even: Month 1
First-year net: ($15,400 × 12) - $6,000 - ($750 × 12) = $169,800
Verdict: If you're spending over $5K/month on compute, Spot is worth investigating.
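
The same arithmetic as a reusable sketch (the figures are the assumptions above, not measurements):

def spot_roi(gross_monthly, eng_hours_once, ops_hours_monthly, rate=150):
    """Net savings for a spot migration: (first month, ongoing monthly, first year)."""
    one_time = eng_hours_once * rate
    ops = ops_hours_monthly * rate
    return (
        gross_monthly - one_time - ops,             # first month
        gross_monthly - ops,                        # ongoing monthly
        gross_monthly * 12 - one_time - ops * 12,   # first year
    )

print(spot_roi(15_400, 40, 5))  # (8650, 14650, 169800)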

Workload Suitability Analysis

✅ Excellent Fit for Spot

1. Stateless Web Services (with graceful shutdown)

Characteristics:

  • No local state
  • Load balanced across many instances
  • Can handle individual instance loss
  • Connection draining available

Example: API server behind ALB

  • 20 instances in Auto Scaling Group
  • 70% Spot, 30% On-Demand minimum
  • Instance termination = ALB drains connections over 120 seconds
  • ECS handles task rescheduling automatically

Annual savings: $130K+ for medium-sized API

2. Batch Processing / Data Pipelines

Characteristics:

  • Job-based workload
  • Idempotent operations
  • Checkpoint/resume capability
  • Failure retry logic

Example: Video transcoding pipeline

  • SQS queue with 10,000 jobs
  • Spot Fleet scales 0-100 instances
  • Job failure = message returns to queue
  • Another instance picks it up

Annual savings: $200K+ for high-volume processing
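
A minimal worker loop for this pattern might look like the sketch below (the queue URL and process_job are placeholders). The key property: a message is deleted only after the job succeeds, so an interrupted instance simply stops, the message's visibility timeout expires, and another worker picks it up.

import boto3
import requests

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/jobs'  # placeholder

def interruption_pending():
    """True once a spot interruption notice has been issued (assumes IMDSv1)."""
    try:
        r = requests.get(
            'http://169.254.169.254/latest/meta-data/spot/instance-action',
            timeout=1
        )
        return r.status_code == 200
    except requests.exceptions.RequestException:
        return False

while not interruption_pending():
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=10
    )
    for msg in resp.get('Messages', []):
        process_job(msg['Body'])  # placeholder; must be idempotent
        # Delete only on success; if we're interrupted mid-job, the message
        # becomes visible again after its visibility timeout and is retried
        sqs.delete_message(
            QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle']
        )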

3. Containerized Microservices

Characteristics:

  • Container orchestration (ECS/EKS)
  • Service mesh or discovery
  • Multiple replicas
  • Health checks and auto-restart

Example: E-commerce checkout service

  • 8 ECS tasks across 4 instances
  • Mix of Spot (75%) and On-Demand (25%)
  • Task draining on interruption
  • New tasks start on remaining capacity

Annual savings: $90K+ for typical microservices cluster

4. CI/CD Build Agents

Characteristics:

  • Ephemeral workloads
  • Build can restart on different host
  • No external SLA impact
  • High parallelism

Example: GitHub Actions self-hosted runners

  • 50 Spot instances for parallel builds
  • Interruption = job re-queued
  • Build time increases slightly, cost drops 80%

Annual savings: $150K+ for active engineering org

5. Dev/Test/Staging Environments

Characteristics:

  • Non-production workloads
  • Tolerance for occasional downtime
  • Easy to recreate state
  • Lower availability requirements

Example: Staging environment

  • Mirrors production architecture
  • 100% Spot instances
  • Interruption = redeploy (5 minutes)

Annual savings: $200K+ if staging mirrors production

⚠️ Possible with Careful Architecture

1. Kubernetes Worker Nodes

Challenges:

  • Pod rescheduling overhead
  • Potential for cascading failures
  • StatefulSets require special handling

Solution:

# Mix of node groups
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
 
nodeGroups:
  # On-Demand for critical stateful workloads
  - name: on-demand-critical
    instanceType: m5.large
    desiredCapacity: 3
    minSize: 3
    maxSize: 10
    labels:
      workload-type: stateful
    taints:
      - key: workload-type
        value: stateful
        effect: NoSchedule
 
  # Spot for stateless services
  - name: spot-stateless
    instancesDistribution:
      instanceTypes:
        - m5.large
        - m5a.large
        - m5n.large
        - m5d.large
      onDemandBaseCapacity: 2
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: capacity-optimized
    desiredCapacity: 10
    minSize: 3
    maxSize: 50
    labels:
      workload-type: stateless

Workload deployment:

# Stateless API - tolerates spot interruptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 10
  template:
    spec:
      # Allow scheduling on spot nodes
      tolerations:
        - key: workload-type
          value: stateless
          effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: workload-type
                    operator: In
                    values:
                      - stateless
 
---
# Database - requires stable on-demand nodes
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  replicas: 3
  template:
    spec:
      # Require on-demand nodes
      nodeSelector:
        workload-type: stateful
      tolerations:
        - key: workload-type
          value: stateful
          effect: NoSchedule
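
On top of the node groups, a PodDisruptionBudget bounds how many replicas a drain can evict at once; spot interruptions are typically handled by the AWS Node Termination Handler, which cordons and drains the node, and the PDB keeps that drain from taking out too much of a service. A sketch (assumes the Deployment's pods are labeled app: api-server):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 7  # With 10 replicas, never drain below 7
  selector:
    matchLabels:
      app: api-server  # Assumed pod label; match your Deployment's labels
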

2. Long-Running ML Training

Challenges:

  • Training can take hours/days
  • Interruption = lost progress
  • GPU instances have higher interruption rates

Solution: Checkpointing

import torch
import boto3
import requests
from datetime import datetime
 
class SpotCheckpointer:
    """Handles checkpointing for Spot instance interruptions"""
 
    def __init__(self, model, optimizer, s3_bucket, checkpoint_prefix):
        self.model = model
        self.optimizer = optimizer
        self.s3_bucket = s3_bucket
        self.checkpoint_prefix = checkpoint_prefix
        self.s3 = boto3.client('s3')
 
        # Monitor spot interruption notices
        self.interruption_url = 'http://169.254.169.254/latest/meta-data/spot/instance-action'
 
    def save_checkpoint(self, epoch, loss, metrics):
        """Save checkpoint locally and to S3"""
        checkpoint = {
            'epoch': epoch,
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'loss': loss,
            'metrics': metrics,
            'timestamp': datetime.utcnow().isoformat()
        }
 
        # Save locally first (fast)
        local_path = f'/tmp/checkpoint_epoch_{epoch}.pt'
        torch.save(checkpoint, local_path)
 
        # Upload to S3 (durable)
        s3_key = f'{self.checkpoint_prefix}/checkpoint_epoch_{epoch}.pt'
        self.s3.upload_file(local_path, self.s3_bucket, s3_key)
 
        print(f"Checkpoint saved: s3://{self.s3_bucket}/{s3_key}")
 
    def load_latest_checkpoint(self):
        """Load most recent checkpoint from S3"""
        # List all checkpoints
        response = self.s3.list_objects_v2(
            Bucket=self.s3_bucket,
            Prefix=self.checkpoint_prefix
        )
 
        if 'Contents' not in response:
            return None
 
        # Find latest checkpoint
        checkpoints = sorted(response['Contents'], key=lambda x: x['LastModified'], reverse=True)
        latest = checkpoints[0]['Key']
 
        # Download and load
        local_path = '/tmp/checkpoint_latest.pt'
        self.s3.download_file(self.s3_bucket, latest, local_path)
 
        checkpoint = torch.load(local_path)
        self.model.load_state_dict(checkpoint['model_state_dict'])
        self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
 
        print(f"Resumed from epoch {checkpoint['epoch']}")
        return checkpoint
 
    def check_interruption_notice(self):
        """Check whether a spot interruption notice has been issued"""
        try:
            # 200 means a notice was issued; 404 means none (the normal case).
            # Assumes IMDSv1 is enabled; IMDSv2 requires a session token first.
            response = requests.get(self.interruption_url, timeout=1)
            return response.status_code == 200
        except requests.exceptions.RequestException:
            # Metadata service unreachable; treat as no notice
            return False
 
# Training loop with checkpointing
checkpointer = SpotCheckpointer(
    model=model,
    optimizer=optimizer,
    s3_bucket='ml-training-checkpoints',
    checkpoint_prefix='experiment-123'
)
 
# Resume from checkpoint if exists
checkpoint = checkpointer.load_latest_checkpoint()
start_epoch = checkpoint['epoch'] + 1 if checkpoint else 0
 
for epoch in range(start_epoch, num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        # Training step
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
 
        # Check for interruption every 100 batches
        if batch_idx % 100 == 0:
            if checkpointer.check_interruption_notice():
                print("Spot interruption notice received! Saving checkpoint...")
                checkpointer.save_checkpoint(epoch, loss.item(), metrics={})
                print("Checkpoint saved. Exiting gracefully.")
                exit(0)
 
    # Save checkpoint every epoch
    checkpointer.save_checkpoint(epoch, loss.item(), metrics={})

Result: Training survives interruptions, losing at most the work done since the last checkpoint.

❌ Bad Fit for Spot

1. Stateful Databases (Primary)

Why not:

  • 2-minute notice insufficient for clean shutdown
  • Risk of data corruption
  • Replication lag on failover
  • Customer-facing downtime

Exception: Read replicas can use Spot if:

  • Application handles replica unavailability
  • Multiple replicas available
  • Automated replica replacement

2. Single Points of Failure

Why not:

  • No redundancy = outage on interruption
  • Examples: Single NAT gateway, single Jenkins master, single Redis cache

Solution: Don't use Spot, or add redundancy first

3. Tight SLA Requirements

Why not:

  • P99 latency SLAs impacted by interruptions
  • Customer-facing impact
  • No fallback capacity

Example: Payment processing with <500ms SLA

  • Spot interruption = 2-minute degradation
  • Use On-Demand with Savings Plans instead

Implementation Patterns

Pattern 1: ECS with Mixed Capacity

Terraform configuration:

resource "aws_ecs_cluster" "main" {
  name = "production"
 
  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}
 
# Capacity provider: On-Demand base
resource "aws_ecs_capacity_provider" "on_demand" {
  name = "on-demand-provider"
 
  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.on_demand.arn
 
    managed_scaling {
      status                    = "ENABLED"
      target_capacity           = 100
      minimum_scaling_step_size = 1
      maximum_scaling_step_size = 10
    }
  }
}
 
resource "aws_autoscaling_group" "on_demand" {
  name                = "ecs-on-demand"
  vpc_zone_identifier = var.private_subnet_ids
  min_size            = 2
  max_size            = 20
  desired_capacity    = 2
 
  launch_template {
    id      = aws_launch_template.on_demand.id
    version = "$Latest"
  }
 
  tag {
    key                 = "Name"
    value               = "ecs-on-demand"
    propagate_at_launch = true
  }
}
 
resource "aws_launch_template" "on_demand" {
  name_prefix   = "ecs-on-demand-"
  image_id      = data.aws_ami.ecs_optimized.id
  instance_type = "m5.large"
 
  iam_instance_profile {
    name = aws_iam_instance_profile.ecs.name
  }
 
  user_data = base64encode(<<-EOF
    #!/bin/bash
    echo ECS_CLUSTER=${aws_ecs_cluster.main.name} >> /etc/ecs/ecs.config
    echo ECS_ENABLE_SPOT_INSTANCE_DRAINING=true >> /etc/ecs/ecs.config
  EOF
  )
 
  lifecycle {
    create_before_destroy = true
  }
}
 
# Capacity provider: Spot instances (multiple instance types)
resource "aws_ecs_capacity_provider" "spot" {
  name = "spot-provider"
 
  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.spot.arn
 
    managed_scaling {
      status                    = "ENABLED"
      target_capacity           = 100
      minimum_scaling_step_size = 1
      maximum_scaling_step_size = 10
    }
  }
}
 
resource "aws_autoscaling_group" "spot" {
  name                = "ecs-spot"
  vpc_zone_identifier = var.private_subnet_ids
  min_size            = 0
  max_size            = 50
  desired_capacity    = 8
 
  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 0
      spot_allocation_strategy                 = "capacity-optimized"
      spot_instance_pools                      = 0
    }
 
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.spot.id
        version            = "$Latest"
      }
 
      # Multiple instance types for diversification
      override {
        instance_type = "m5.large"
      }
      override {
        instance_type = "m5a.large"
      }
      override {
        instance_type = "m5n.large"
      }
      override {
        instance_type = "m5d.large"
      }
    }
  }
 
  tag {
    key                 = "Name"
    value               = "ecs-spot"
    propagate_at_launch = true
  }
}
 
resource "aws_launch_template" "spot" {
  name_prefix   = "ecs-spot-"
  image_id      = data.aws_ami.ecs_optimized.id
  instance_type = "m5.large"  # Default, overridden by mixed_instances_policy
 
  iam_instance_profile {
    name = aws_iam_instance_profile.ecs.name
  }
 
  user_data = base64encode(<<-EOF
    #!/bin/bash
    echo ECS_CLUSTER=${aws_ecs_cluster.main.name} >> /etc/ecs/ecs.config
    echo ECS_ENABLE_SPOT_INSTANCE_DRAINING=true >> /etc/ecs/ecs.config
  EOF
  )
 
  lifecycle {
    create_before_destroy = true
  }
}
 
# Cluster capacity provider strategy
resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name = aws_ecs_cluster.main.name
 
  capacity_providers = [
    aws_ecs_capacity_provider.on_demand.name,
    aws_ecs_capacity_provider.spot.name
  ]
 
  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.on_demand.name
    weight            = 1
    base              = 2  # Minimum 2 on-demand instances
  }
 
  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.spot.name
    weight            = 4  # 80% of scaling goes to spot
    base              = 0
  }
}
 
# ECS Service configuration
resource "aws_ecs_service" "api" {
  name            = "api-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 10
 
  # Mirror the cluster's default capacity provider strategy explicitly at the service level
  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.on_demand.name
    weight            = 1
    base              = 2
  }
 
  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.spot.name
    weight            = 4
    base              = 0
  }
 
  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 8080
  }
 
  # Rolling-deploy safety margins. Connection draining itself is
  # controlled by the target group's deregistration delay.
  deployment_minimum_healthy_percent = 50
  deployment_maximum_percent         = 200
}

Key features:

  • 2 on-demand instances (base capacity)
  • Spot instances handle 80% of scale-out
  • Multiple instance types for availability
  • capacity-optimized allocation strategy
  • Automatic spot draining enabled
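
The service above references aws_ecs_task_definition.api; a minimal sketch of that task definition shows the one knob that matters for spot draining, stopTimeout, which sets how long ECS waits between SIGTERM and SIGKILL:

resource "aws_ecs_task_definition" "api" {
  family                   = "api"
  requires_compatibilities = ["EC2"]

  container_definitions = jsonencode([
    {
      name         = "api"
      image        = "api:latest"  # placeholder image
      essential    = true
      memory       = 512
      portMappings = [{ containerPort = 8080 }]
      # Time ECS waits between SIGTERM and SIGKILL when draining a task;
      # give the app enough room to run its graceful shutdown (Pattern 2)
      stopTimeout = 120
    }
  ])
}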

Pattern 2: Graceful Shutdown Handler

Node.js application with spot handling:

import express from 'express';
import http from 'http';
import axios from 'axios';
 
const app = express();
const server = http.createServer(app);
 
let isShuttingDown = false;
 
// Health check endpoint
app.get('/health', (req, res) => {
  if (isShuttingDown) {
    // Return 503 to fail ALB health checks
    // This removes instance from load balancer
    res.status(503).json({ status: 'shutting_down' });
  } else {
    res.json({ status: 'healthy' });
  }
});
 
// Application routes
app.get('/api/users', async (req, res) => {
  // Your business logic
  res.json({ users: [] });
});
 
// Graceful shutdown handler
async function gracefulShutdown(signal: string) {
  // Guard against re-entry (repeated signals, or the interruption
  // monitor firing again while shutdown is already underway)
  if (isShuttingDown) return;
  isShuttingDown = true;

  console.log(`${signal} received. Starting graceful shutdown...`);
 
  // 1. Stop accepting new connections
  server.close(() => {
    console.log('HTTP server closed');
  });
 
  // 2. Wait for ALB to drain connections (typically 30-120 seconds)
  console.log('Waiting 30 seconds for connection draining...');
  await new Promise(resolve => setTimeout(resolve, 30000));
 
  // 3. Close database connections
  // (db and metrics are application-specific handles, assumed initialized elsewhere)
  console.log('Closing database connections...');
  await db.close();
 
  // 4. Flush metrics
  console.log('Flushing metrics...');
  await metrics.flush();
 
  console.log('Graceful shutdown complete');
  process.exit(0);
}
 
// Handle termination signals
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));
 
// Spot interruption monitor (runs every 5 seconds)
async function monitorSpotInterruption() {
  try {
    const response = await axios.get(
      'http://169.254.169.254/latest/meta-data/spot/instance-action',
      { timeout: 1000 }
    );
 
    if (response.status === 200) {
      const interruptionData = response.data;
      console.log('Spot interruption notice received:', interruptionData);
 
      // Trigger graceful shutdown immediately
      await gracefulShutdown('SPOT_INTERRUPTION');
    }
  } catch (error) {
    // 404 means no interruption notice (normal)
    // Other errors are connection issues (ignore)
  }
}
 
// Check for interruption every 5 seconds
setInterval(monitorSpotInterruption, 5000);
 
const PORT = process.env.PORT || 8080;
server.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
});

What this does:

  1. Monitors instance metadata for spot interruption notices
  2. Fails health checks when interruption detected
  3. Waits 30 seconds for ALB to drain connections
  4. Closes resources cleanly (DB, metrics)
  5. Exits gracefully allowing ECS to reschedule task

Pattern 3: Spot Fleet for Batch Processing

Lambda function triggering Spot Fleet:

// AWS SDK v3 (bundled with the nodejs18.x Lambda runtime; v2 is not)
import { EC2Client, ModifySpotFleetRequestCommand } from '@aws-sdk/client-ec2';
import { SQSClient, GetQueueAttributesCommand } from '@aws-sdk/client-sqs';

const ec2 = new EC2Client({});
const sqs = new SQSClient({});
 
interface SpotFleetConfig {
  queueUrl: string;
  targetQueueDepth: number;
  spotFleetRequestId: string;
  minCapacity: number;
  maxCapacity: number;
}
 
export async function handler(event: any) {
  const config: SpotFleetConfig = {
    queueUrl: process.env.QUEUE_URL!,
    targetQueueDepth: 100, // Target 100 messages per instance
    spotFleetRequestId: process.env.SPOT_FLEET_ID!,
    minCapacity: 0,
    maxCapacity: 50
  };
 
  // 1. Get current queue depth
  const queueAttrs = await sqs.send(new GetQueueAttributesCommand({
    QueueUrl: config.queueUrl,
    AttributeNames: ['ApproximateNumberOfMessages']
  }));

  const queueDepth = parseInt(
    queueAttrs.Attributes?.ApproximateNumberOfMessages || '0', 10
  );
 
  // 2. Calculate desired capacity
  const desiredCapacity = Math.min(
    Math.max(
      Math.ceil(queueDepth / config.targetQueueDepth),
      config.minCapacity
    ),
    config.maxCapacity
  );
 
  console.log(`Queue depth: ${queueDepth}, Desired capacity: ${desiredCapacity}`);
 
  // 3. Update Spot Fleet target capacity
  await ec2.send(new ModifySpotFleetRequestCommand({
    SpotFleetRequestId: config.spotFleetRequestId,
    TargetCapacity: desiredCapacity
  }));
 
  return {
    statusCode: 200,
    body: JSON.stringify({
      queueDepth,
      desiredCapacity,
      timestamp: new Date().toISOString()
    })
  };
}

Spot Fleet configuration:

resource "aws_spot_fleet_request" "batch_processing" {
  iam_fleet_role      = aws_iam_role.spot_fleet.arn
  target_capacity     = 0  # Controlled by Lambda
  allocation_strategy = "capacityOptimized"
  fleet_type          = "maintain"
 
  # Terminate instances when the fleet request expires
  terminate_instances_with_expiration = true
 
  launch_specification {
    ami                    = data.aws_ami.ecs_optimized.id
    instance_type          = "c5.2xlarge"
    spot_price             = "0.20"  # Max price (current spot ~$0.10)
    vpc_security_group_ids = [aws_security_group.batch.id]
    subnet_id              = var.private_subnet_ids[0]
 
    iam_instance_profile_arn = aws_iam_instance_profile.batch.arn
 
    user_data = base64encode(<<-EOF
      #!/bin/bash
      # Start batch processor
      docker run -d \
        -e QUEUE_URL=${aws_sqs_queue.jobs.url} \
        -e AWS_REGION=${var.aws_region} \
        batch-processor:latest
    EOF
    )
  }
 
  # Add multiple instance types for availability
  launch_specification {
    ami                    = data.aws_ami.ecs_optimized.id
    instance_type          = "c5a.2xlarge"
    spot_price             = "0.20"
    vpc_security_group_ids = [aws_security_group.batch.id]
    subnet_id              = var.private_subnet_ids[0]
    iam_instance_profile_arn = aws_iam_instance_profile.batch.arn
    user_data = base64encode(<<-EOF
      #!/bin/bash
      docker run -d \
        -e QUEUE_URL=${aws_sqs_queue.jobs.url} \
        -e AWS_REGION=${var.aws_region} \
        batch-processor:latest
    EOF
    )
  }
 
  launch_specification {
    ami                    = data.aws_ami.ecs_optimized.id
    instance_type          = "c5n.2xlarge"
    spot_price             = "0.20"
    vpc_security_group_ids = [aws_security_group.batch.id]
    subnet_id              = var.private_subnet_ids[0]
    iam_instance_profile_arn = aws_iam_instance_profile.batch.arn
    user_data = base64encode(<<-EOF
      #!/bin/bash
      docker run -d \
        -e QUEUE_URL=${aws_sqs_queue.jobs.url} \
        -e AWS_REGION=${var.aws_region} \
        batch-processor:latest
    EOF
    )
  }
}
 
# Lambda to scale Spot Fleet based on queue depth
resource "aws_lambda_function" "spot_fleet_scaler" {
  filename      = "spot_fleet_scaler.zip"
  function_name = "spot-fleet-autoscaler"
  role          = aws_iam_role.lambda.arn
  handler       = "index.handler"
  runtime       = "nodejs18.x"
  timeout       = 60
 
  environment {
    variables = {
      QUEUE_URL      = aws_sqs_queue.jobs.url
      SPOT_FLEET_ID  = aws_spot_fleet_request.batch_processing.id
    }
  }
}
 
# Trigger Lambda every minute
resource "aws_cloudwatch_event_rule" "spot_fleet_scaling" {
  name                = "spot-fleet-scaling"
  schedule_expression = "rate(1 minute)"
}
 
resource "aws_cloudwatch_event_target" "spot_fleet_scaling" {
  rule      = aws_cloudwatch_event_rule.spot_fleet_scaling.name
  target_id = "lambda"
  arn       = aws_lambda_function.spot_fleet_scaler.arn
}

Monitoring and Alerting

CloudWatch dashboard for spot interruptions:

resource "aws_cloudwatch_dashboard" "spot_monitoring" {
  dashboard_name = "spot-instance-monitoring"
 
  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric"
        properties = {
          metrics = [
            ["AWS/EC2", "StatusCheckFailed", { stat = "Sum" }],
            # Custom metric published by your interruption handler; AWS does
            # not publish a built-in per-instance interruption metric
            ["Custom/Spot", "SpotInterruptionNotices", { stat = "Sum" }]
          ]
          period = 300
          stat   = "Sum"
          region = var.aws_region
          title  = "Spot Instance Interruptions"
        }
      },
      {
        type = "metric"
        properties = {
          metrics = [
            ["AWS/ECS", "CPUUtilization", { stat = "Average" }],
            [".", "MemoryUtilization", { stat = "Average" }]
          ]
          period = 300
          stat   = "Average"
          region = var.aws_region
          title  = "ECS Cluster Resource Utilization"
        }
      },
      {
        type = "log"
        properties = {
          query = <<-EOQ
            SOURCE '/aws/ecs/cluster/production'
            | filter @message like /spot.*interrupt/
            | stats count() by bin(5m)
          EOQ
          region = var.aws_region
          title  = "Spot Interruption Log Events"
        }
      }
    ]
  })
}
 
# Alert on high interruption rate
resource "aws_cloudwatch_metric_alarm" "high_spot_interruption_rate" {
  alarm_name          = "high-spot-interruption-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "SpotInstanceInterruptions"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Sum"
  threshold           = 5
  alarm_description   = "Alert when spot interruptions exceed 5 in 5 minutes"
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
 
# Alert on insufficient capacity
resource "aws_cloudwatch_metric_alarm" "spot_capacity_shortage" {
  alarm_name          = "spot-capacity-shortage"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  # PendingCapacity = TargetCapacity - FulfilledCapacity, published by Spot Fleet
  metric_name         = "PendingCapacity"
  namespace           = "AWS/EC2Spot"
  period              = 300
  statistic           = "Average"
  threshold           = 0  # Any sustained pending capacity means the fleet is under-fulfilled
  alarm_description   = "Alert when the spot fleet cannot fulfill its target capacity"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    FleetRequestId = aws_spot_fleet_request.batch_processing.id
  }
}
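
EC2 also emits an EventBridge event for every interruption warning ("EC2 Spot Instance Interruption Warning"). Routing those to SNS, or to a small Lambda that publishes the Custom/Spot metric used above, gives you a reliable interruption signal. A sketch:

resource "aws_cloudwatch_event_rule" "spot_interruption_warning" {
  name = "spot-interruption-warning"

  event_pattern = jsonencode({
    source        = ["aws.ec2"]
    "detail-type" = ["EC2 Spot Instance Interruption Warning"]
  })
}

resource "aws_cloudwatch_event_target" "spot_interruption_to_sns" {
  rule      = aws_cloudwatch_event_rule.spot_interruption_warning.name
  target_id = "sns"
  # The SNS topic's policy must allow events.amazonaws.com to Publish
  arn       = aws_sns_topic.alerts.arn
}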

Decision Framework

Use this checklist to decide if Spot is right for your workload:

✅ Use Spot If:

  • Workload is stateless or has checkpoint/resume capability
  • Multiple instances/replicas provide redundancy
  • Graceful shutdown can complete in under 2 minutes
  • Application handles instance loss without user impact
  • Monthly compute spend exceeds $5K
  • Engineering team can invest 40+ hours in implementation
  • You have monitoring and alerting infrastructure

❌ Don't Use Spot If:

  • Workload is stateful (databases, cache, queues) without failover
  • Single instance (no redundancy)
  • Critical SLA requirements (<500ms P99 latency)
  • Team lacks capacity for implementation and maintenance
  • Compute spend is under $2K/month (ROI too low)
  • Graceful shutdown impossible (e.g., long-lived WebSockets)

⚠️ Requires Extra Diligence:

  • GPU instances (higher interruption rates, 5-10%)
  • Kubernetes StatefulSets (requires pod disruption budgets)
  • ML training (must implement checkpointing)
  • Windows instances (slower boot times)

Real-World Case Study: API Migration to Spot

Company: B2B SaaS platform, $2M ARR
Workload: RESTful API (Node.js), 40 ECS tasks, 10 m5.large instances
Traffic: 50M requests/month, variable (2x peak during business hours)

Before (100% On-Demand)

10 × m5.large × $0.096/hour × 730 hours = $700.80/month
Annual: $8,410

After (70% Spot, 30% On-Demand)

On-Demand base: 3 × m5.large × $0.096/hour × 730 hours = $210.24/month
Spot capacity: 7 × m5.large × $0.0288/hour × 730 hours = $147.17/month

Total: $357.41/month
Annual: $4,289
Savings: $4,121/year (49%)

Implementation Details

Week 1: Architecture Review

  • Confirmed API was stateless
  • Verified connection draining configuration
  • Updated health check to support graceful shutdown

Week 2: Spot Implementation

  • Deployed spot interruption handler
  • Created mixed ASG (3 on-demand, 7 spot capacity)
  • Configured capacity-optimized allocation

Week 3: Testing

  • Simulated spot interruptions in staging
  • Verified task rescheduling
  • Measured P99 latency impact (no regression)
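
A simple way to rehearse this in staging is to kill a spot instance by hand and watch the draining path (this exercises draining, not the 2-minute notice itself; AWS Fault Injection Service can send real interruption notices):

aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --no-should-decrement-desired-capacity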

Week 4: Production Rollout

  • Gradual rollout: 25% → 50% → 70% spot
  • Monitored interruption rates: 2-3% hourly
  • No customer-reported issues

Results

Cost savings:

  • $4,121/year (49% reduction)
  • Engineering time: 60 hours

Operational impact:

  • 15-20 spot interruptions per month
  • Average recovery time: 45 seconds (automatic)
  • Zero customer-facing incidents
  • P99 latency unchanged: 180ms

Lessons learned:

  1. Diversify instance types: Using m5, m5a, and m5n pools reduced the effective interruption rate
  2. Capacity-optimized allocation works: Much better than lowest-price
  3. Start conservative: Begin at 30% spot, then increase as confidence grows
  4. Monitor actively: Set up dashboards before the migration, not after

Conclusion: Spot for Production is Ready (If You Are)

In 2025, Spot Instances are production-viable for many workloads—but not all.

The key question isn't "Can I use Spot?" but "Should I use Spot?"

Use Spot if:

  • Your workload is stateless or fault-tolerant
  • You have redundancy built in
  • You can implement graceful shutdown
  • Your compute spend justifies the engineering effort

Skip Spot if:

  • Your workload is stateful without failover
  • You have tight SLA requirements
  • Your team lacks bandwidth for implementation
  • Your compute spend is too low (under $5K/month)

The middle path: Start with non-critical workloads (dev/staging, batch jobs) to build confidence, then migrate production workloads incrementally.

Spot Instances aren't magic—they're a trade-off. You're exchanging engineering time and operational complexity for significant cost savings. If you're spending over $10K/month on compute, that trade-off is usually worth it.

Action Items

  1. Audit your current compute spend by instance type and workload
  2. Identify stateless workloads suitable for spot (batch jobs, APIs, workers)
  3. Calculate potential savings using spot pricing history
  4. Implement graceful shutdown handlers for your applications
  5. Start with dev/staging environments to validate the approach
  6. Deploy spot for one production workload and monitor for 30 days
  7. Expand gradually as confidence builds

If you need help designing a spot migration strategy for your infrastructure, schedule a consultation. We'll analyze your workloads, calculate ROI, and provide a phased implementation plan that minimizes risk while maximizing savings.
