Spot Instances for Production? A Risk/Reward Analysis

By Zak Kann

Key takeaways

  • Spot instances provide 70-90% savings over On-Demand but can be terminated with two minutes' notice when AWS needs the capacity back
  • Production viability depends on workload characteristics: stateless batch jobs are ideal, stateful databases are unsuitable
  • Interruption rates vary by instance type and region (1-5% hourly in most cases, up to 20% during capacity crunches)
  • Combining spot with on-demand in auto-scaling groups, using capacity-optimized allocation, and implementing graceful shutdown handlers enables production use
  • Spot is production-ready for containerized web services, data processing, CI/CD, and rendering—but requires architectural investment in resilience

The Spot Instance Promise (and the Fine Print)

Your CFO just forwarded you the monthly AWS bill with one word: "Why?" Your ECS cluster runs 50 r6i.2xlarge instances 24/7: 50 × $0.504/hour × 720 hours = $18,144/month.

A junior engineer suggests: "Why don't we use Spot Instances? They're $0.0756/hour; we'd save about $15,400/month."

Your first instinct is: "Spot Instances are for batch jobs, not production." But is that still true in 2025?

The reality: Spot Instances are production-viable for many workloads—if you architect for interruptions.

Understanding Spot Instance Economics

The Basic Model

AWS has excess capacity that varies by:

  • Instance type
  • Availability Zone
  • Time of day
  • Overall demand

Spot pricing:

  • Up to 90% cheaper than On-Demand
  • Price fluctuates based on supply/demand
  • AWS can reclaim instances with two minutes' notice
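
Before committing, you can pull real spot price history for your instance types with the AWS CLI. The figures in the next table are a snapshot and will drift:

aws ec2 describe-spot-price-history \
  --instance-types r6i.2xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice,Timestamp]' \
  --output table
# (GNU date shown; on macOS use: date -u -v-7d +%Y-%m-%dT%H:%M:%S)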

Real-World Spot Pricing (us-east-1, January 2025)

Instance Type       On-Demand    Typical Spot   Savings   Interruption Rate
t3.medium           $0.0416/hr   $0.0125/hr     70%       <1%/hour
m5.large            $0.096/hr    $0.0288/hr     70%       2-3%/hour
c5.xlarge           $0.17/hr     $0.051/hr      70%       2-4%/hour
r6i.2xlarge         $0.504/hr    $0.0756/hr     85%       3-5%/hour
g4dn.xlarge (GPU)   $0.526/hr    $0.158/hr      70%       5-10%/hour

Key insight: Interruption rates are hourly, and they compound. A 3% hourly rate gives a single instance only about a 48% chance (0.97^24) of running a full 24 hours without interruption, which is why diversification and automated recovery matter more than any single instance's uptime.
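
To make the compounding concrete, a quick sketch:

def survival_probability(hourly_rate: float, hours: int) -> float:
    """Chance a single instance survives `hours` hours at a constant hourly interruption rate."""
    return (1 - hourly_rate) ** hours

print(f"{survival_probability(0.03, 24):.1%}")  # ~48.1% for one instance over a day
# Across 20 instances, expect roughly 20 * (1 - 0.481) ≈ 10 instances
# interrupted at some point each day; the architecture, not the instance,
# has to absorb them.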

The Hidden Costs

Spot isn't free savings—you pay with:

1. Engineering time:

  • Graceful shutdown handlers
  • State persistence logic
  • Monitoring and alerting
  • Runbook development

2. Infrastructure complexity:

  • Mixed instance types
  • Multiple AZs
  • Fallback to On-Demand
  • Capacity management

3. Operational overhead:

  • Responding to interruptions
  • Debugging spot-related failures
  • Capacity planning across instance families

ROI calculation:

Gross savings: ~$15,400/month
Engineering cost: 40 hours × $150/hour = $6,000 (one-time)
Ongoing ops: 5 hours/month × $150 = $750/month

First month net: $15,400 - $6,000 - $750 = $8,650
Ongoing monthly net: $15,400 - $750 = $14,650

Break-even: Month 1
First-year net: ($15,400 × 12) - $6,000 - ($750 × 12) = $169,800
Verdict: If you're spending over $5K/month on compute, Spot is worth investigating.
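
The same arithmetic as a reusable sketch (the figures are the assumptions above, not measurements):

def spot_roi(gross_monthly, eng_hours_once, ops_hours_monthly, rate=150):
    """Net savings for a spot migration: (first month, ongoing monthly, first year)."""
    one_time = eng_hours_once * rate
    ops = ops_hours_monthly * rate
    return (
        gross_monthly - one_time - ops,             # first month
        gross_monthly - ops,                        # ongoing monthly
        gross_monthly * 12 - one_time - ops * 12,   # first year
    )

print(spot_roi(15_400, 40, 5))  # (8650, 14650, 169800)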

Workload Suitability Analysis

✅ Excellent Fit for Spot

1. Stateless Web Services (with graceful shutdown)

Characteristics:

  • No local state
  • Load balanced across many instances
  • Can handle individual instance loss
  • Connection draining available

Example: API server behind ALB

  • 20 instances in Auto Scaling Group
  • 70% Spot, 30% On-Demand minimum
  • Instance termination = ALB drains connections over 120 seconds
  • ECS handles task rescheduling automatically

Annual savings: $130K+ for medium-sized API

2. Batch Processing / Data Pipelines

Characteristics:

  • Job-based workload
  • Idempotent operations
  • Checkpoint/resume capability
  • Failure retry logic

Example: Video transcoding pipeline

  • SQS queue with 10,000 jobs
  • Spot Fleet scales 0-100 instances
  • Job failure = message returns to queue
  • Another instance picks it up

Annual savings: $200K+ for high-volume processing
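
A minimal worker loop for this pattern might look like the sketch below (the queue URL and process_job are placeholders). The key property: a message is deleted only after the job succeeds, so an interrupted instance simply stops, the message's visibility timeout expires, and another worker picks it up.

import boto3
import requests

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/jobs'  # placeholder

def interruption_pending():
    """True once a spot interruption notice has been issued (assumes IMDSv1)."""
    try:
        r = requests.get(
            'http://169.254.169.254/latest/meta-data/spot/instance-action',
            timeout=1
        )
        return r.status_code == 200
    except requests.exceptions.RequestException:
        return False

while not interruption_pending():
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=10
    )
    for msg in resp.get('Messages', []):
        process_job(msg['Body'])  # placeholder; must be idempotent
        # Delete only on success; if we're interrupted mid-job, the message
        # becomes visible again after its visibility timeout and is retried
        sqs.delete_message(
            QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle']
        )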

3. Containerized Microservices

Characteristics:

  • Container orchestration (ECS/EKS)
  • Service mesh or discovery
  • Multiple replicas
  • Health checks and auto-restart

Example: E-commerce checkout service

  • 8 ECS tasks across 4 instances
  • Mix of Spot (75%) and On-Demand (25%)
  • Task draining on interruption
  • New tasks start on remaining capacity

Annual savings: $90K+ for typical microservices cluster

4. CI/CD Build Agents

Characteristics:

  • Ephemeral workloads
  • Build can restart on different host
  • No external SLA impact
  • High parallelism

Example: GitHub Actions self-hosted runners

  • 50 Spot instances for parallel builds
  • Interruption = job re-queued
  • Build time increases slightly, cost drops 80%

Annual savings: $150K+ for active engineering org

5. Dev/Test/Staging Environments

Characteristics:

  • Non-production workloads
  • Tolerance for occasional downtime
  • Easy to recreate state
  • Lower availability requirements

Example: Staging environment

  • Mirrors production architecture
  • 100% Spot instances
  • Interruption = redeploy (5 minutes)

Annual savings: $200K+ if staging mirrors production

⚠️ Possible with Careful Architecture

1. Kubernetes Worker Nodes

Challenges:

  • Pod rescheduling overhead
  • Potential for cascading failures
  • StatefulSets require special handling

Solution:

# Mix of node groups
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
 
nodeGroups:
  # On-Demand for critical stateful workloads
  - name: on-demand-critical
    instanceType: m5.large
    desiredCapacity: 3
    minSize: 3
    maxSize: 10
    labels:
      workload-type: stateful
    taints:
      - key: workload-type
        value: stateful
        effect: NoSchedule
 
  # Spot for stateless services
  - name: spot-stateless
    instancesDistribution:
      instanceTypes:
        - m5.large
        - m5a.large
        - m5n.large
        - m5d.large
      onDemandBaseCapacity: 2
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: capacity-optimized
    desiredCapacity: 10
    minSize: 3
    maxSize: 50
    labels:
      workload-type: stateless

Workload deployment:

# Stateless API - tolerates spot interruptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 10
  template:
    spec:
      # Allow scheduling on spot nodes
      tolerations:
        - key: workload-type
          value: stateless
          effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: workload-type
                    operator: In
                    values:
                      - stateless
 
---
# Database - requires stable on-demand nodes
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  replicas: 3
  template:
    spec:
      # Require on-demand nodes
      nodeSelector:
        workload-type: stateful
      tolerations:
        - key: workload-type
          value: stateful
          effect: NoSchedule
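
On top of the node groups, a PodDisruptionBudget bounds how many replicas a drain can evict at once; spot interruptions are typically handled by the AWS Node Termination Handler, which cordons and drains the node, and the PDB keeps that drain from taking out too much of a service. A sketch (assumes the Deployment's pods are labeled app: api-server):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 7  # With 10 replicas, never drain below 7
  selector:
    matchLabels:
      app: api-server  # Assumed pod label; match your Deployment's labels
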

2. Long-Running ML Training

Challenges:

  • Training can take hours/days
  • Interruption = lost progress
  • GPU instances have higher interruption rates

Solution: Checkpointing

import torch
import boto3
import requests
from datetime import datetime
 
class SpotCheckpointer:
    """Handles checkpointing for Spot instance interruptions"""
 
    def __init__(self, model, optimizer, s3_bucket, checkpoint_prefix):
        self.model = model
        self.optimizer = optimizer
        self.s3_bucket = s3_bucket
        self.checkpoint_prefix = checkpoint_prefix
        self.s3 = boto3.client('s3')
 
        # Monitor spot interruption notices
        self.interruption_url = 'http://169.254.169.254/latest/meta-data/spot/instance-action'
 
    def save_checkpoint(self, epoch, loss, metrics):
        """Save checkpoint locally and to S3"""
        checkpoint = {
            'epoch': epoch,
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'loss': loss,
            'metrics': metrics,
            'timestamp': datetime.utcnow().isoformat()
        }
 
        # Save locally first (fast)
        local_path = f'/tmp/checkpoint_epoch_{epoch}.pt'
        torch.save(checkpoint, local_path)
 
        # Upload to S3 (durable)
        s3_key = f'{self.checkpoint_prefix}/checkpoint_epoch_{epoch}.pt'
        self.s3.upload_file(local_path, self.s3_bucket, s3_key)
 
        print(f"Checkpoint saved: s3://{self.s3_bucket}/{s3_key}")
 
    def load_latest_checkpoint(self):
        """Load most recent checkpoint from S3"""
        # List all checkpoints
        response = self.s3.list_objects_v2(
            Bucket=self.s3_bucket,
            Prefix=self.checkpoint_prefix
        )
 
        if 'Contents' not in response:
            return None
 
        # Find latest checkpoint
        checkpoints = sorted(response['Contents'], key=lambda x: x['LastModified'], reverse=True)
        latest = checkpoints[0]['Key']
 
        # Download and load
        local_path = '/tmp/checkpoint_latest.pt'
        self.s3.download_file(self.s3_bucket, latest, local_path)
 
        checkpoint = torch.load(local_path)
        self.model.load_state_dict(checkpoint['model_state_dict'])
        self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
 
        print(f"Resumed from epoch {checkpoint['epoch']}")
        return checkpoint
 
    def check_interruption_notice(self):
        """Check whether a spot interruption notice has been issued"""
        try:
            # 200 means a notice was issued; 404 means none (the normal case).
            # Assumes IMDSv1 is enabled; IMDSv2 requires a session token first.
            response = requests.get(self.interruption_url, timeout=1)
            return response.status_code == 200
        except requests.exceptions.RequestException:
            # Metadata service unreachable; treat as no notice
            return False
 
# Training loop with checkpointing
checkpointer = SpotCheckpointer(
    model=model,
    optimizer=optimizer,
    s3_bucket='ml-training-checkpoints',
    checkpoint_prefix='experiment-123'
)
 
# Resume from checkpoint if exists
checkpoint = checkpointer.load_latest_checkpoint()
start_epoch = checkpoint['epoch'] + 1 if checkpoint else 0
 
for epoch in range(start_epoch, num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        # Training step
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
 
        # Check for interruption every 100 batches
        if batch_idx % 100 == 0:
            if checkpointer.check_interruption_notice():
                print("Spot interruption notice received! Saving checkpoint...")
                checkpointer.save_checkpoint(epoch, loss.item(), metrics={})
                print("Checkpoint saved. Exiting gracefully.")
                exit(0)
 
    # Save checkpoint every epoch
    checkpointer.save_checkpoint(epoch, loss.item(), metrics={})

Result: Training survives interruptions, losing at most the work done since the last checkpoint.

❌ Bad Fit for Spot

1. Stateful Databases (Primary)

Why not:

  • 2-minute notice insufficient for clean shutdown
  • Risk of data corruption
  • Replication lag on failover
  • Customer-facing downtime

Exception: Read replicas can use Spot if:

  • Application handles replica unavailability
  • Multiple replicas available
  • Automated replica replacement

2. Single Points of Failure

Why not:

  • No redundancy = outage on interruption
  • Examples: Single NAT gateway, single Jenkins master, single Redis cache

Solution: Don't use Spot, or add redundancy first

3. Tight SLA Requirements

Why not:

  • P99 latency SLAs impacted by interruptions
  • Customer-facing impact
  • No fallback capacity

Example: Payment processing with <500ms SLA

  • Spot interruption = 2-minute degradation
  • Use On-Demand with Savings Plans instead

Implementation Patterns

Pattern 1: ECS with Mixed Capacity

Terraform configuration:

resource "aws_ecs_cluster" "main" {
  name = "production"
 
  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}
 
# Capacity provider: On-Demand base
resource "aws_ecs_capacity_provider" "on_demand" {
  name = "on-demand-provider"
 
  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.on_demand.arn
 
    managed_scaling {
      status                    = "ENABLED"
      target_capacity           = 100
      minimum_scaling_step_size = 1
      maximum_scaling_step_size = 10
    }
  }
}
 
resource "aws_autoscaling_group" "on_demand" {
  name                = "ecs-on-demand"
  vpc_zone_identifier = var.private_subnet_ids
  min_size            = 2
  max_size            = 20
  desired_capacity    = 2
 
  launch_template {
    id      = aws_launch_template.on_demand.id
    version = "$Latest"
  }
 
  tag {
    key                 = "Name"
    value               = "ecs-on-demand"
    propagate_at_launch = true
  }
}
 
resource "aws_launch_template" "on_demand" {
  name_prefix   = "ecs-on-demand-"
  image_id      = data.aws_ami.ecs_optimized.id
  instance_type = "m5.large"
 
  iam_instance_profile {
    name = aws_iam_instance_profile.ecs.name
  }
 
  user_data = base64encode(<<-EOF
    #!/bin/bash
    echo ECS_CLUSTER=${aws_ecs_cluster.main.name} >> /etc/ecs/ecs.config
    echo ECS_ENABLE_SPOT_INSTANCE_DRAINING=true >> /etc/ecs/ecs.config
  EOF
  )
 
  lifecycle {
    create_before_destroy = true
  }
}
 
# Capacity provider: Spot instances (multiple instance types)
resource "aws_ecs_capacity_provider" "spot" {
  name = "spot-provider"
 
  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.spot.arn
 
    managed_scaling {
      status                    = "ENABLED"
      target_capacity           = 100
      minimum_scaling_step_size = 1
      maximum_scaling_step_size = 10
    }
  }
}
 
resource "aws_autoscaling_group" "spot" {
  name                = "ecs-spot"
  vpc_zone_identifier = var.private_subnet_ids
  min_size            = 0
  max_size            = 50
  desired_capacity    = 8
 
  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 0
      spot_allocation_strategy                 = "capacity-optimized"
      spot_instance_pools                      = 0
    }
 
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.spot.id
        version            = "$Latest"
      }
 
      # Multiple instance types for diversification
      override {
        instance_type = "m5.large"
      }
      override {
        instance_type = "m5a.large"
      }
      override {
        instance_type = "m5n.large"
      }
      override {
        instance_type = "m5d.large"
      }
    }
  }
 
  tag {
    key                 = "Name"
    value               = "ecs-spot"
    propagate_at_launch = true
  }
}
 
resource "aws_launch_template" "spot" {
  name_prefix   = "ecs-spot-"
  image_id      = data.aws_ami.ecs_optimized.id
  instance_type = "m5.large"  # Default, overridden by mixed_instances_policy
 
  iam_instance_profile {
    name = aws_iam_instance_profile.ecs.name
  }
 
  user_data = base64encode(<<-EOF
    #!/bin/bash
    echo ECS_CLUSTER=${aws_ecs_cluster.main.name} >> /etc/ecs/ecs.config
    echo ECS_ENABLE_SPOT_INSTANCE_DRAINING=true >> /etc/ecs/ecs.config
  EOF
  )
 
  lifecycle {
    create_before_destroy = true
  }
}
 
# Cluster capacity provider strategy
resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name = aws_ecs_cluster.main.name
 
  capacity_providers = [
    aws_ecs_capacity_provider.on_demand.name,
    aws_ecs_capacity_provider.spot.name
  ]
 
  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.on_demand.name
    weight            = 1
    base              = 2  # Minimum 2 on-demand instances
  }
 
  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.spot.name
    weight            = 4  # 80% of scaling goes to spot
    base              = 0
  }
}
 
# ECS Service configuration
resource "aws_ecs_service" "api" {
  name            = "api-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 10
 
  # Mirror the cluster's default capacity provider strategy explicitly at the service level
  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.on_demand.name
    weight            = 1
    base              = 2
  }
 
  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.spot.name
    weight            = 4
    base              = 0
  }
 
  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 8080
  }
 
  # Rolling-deploy safety margins. Connection draining itself is
  # controlled by the target group's deregistration delay.
  deployment_minimum_healthy_percent = 50
  deployment_maximum_percent         = 200
}

Key features:

  • 2 on-demand instances (base capacity)
  • Spot instances handle 80% of scale-out
  • Multiple instance types for availability
  • capacity-optimized allocation strategy
  • Automatic spot draining enabled
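
The service above references aws_ecs_task_definition.api; a minimal sketch of that task definition shows the one knob that matters for spot draining, stopTimeout, which sets how long ECS waits between SIGTERM and SIGKILL:

resource "aws_ecs_task_definition" "api" {
  family                   = "api"
  requires_compatibilities = ["EC2"]

  container_definitions = jsonencode([
    {
      name         = "api"
      image        = "api:latest"  # placeholder image
      essential    = true
      memory       = 512
      portMappings = [{ containerPort = 8080 }]
      # Time ECS waits between SIGTERM and SIGKILL when draining a task;
      # give the app enough room to run its graceful shutdown (Pattern 2)
      stopTimeout = 120
    }
  ])
}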

Pattern 2: Graceful Shutdown Handler

Node.js application with spot handling:

import express from 'express';
import http from 'http';
import axios from 'axios';
 
const app = express();
const server = http.createServer(app);
 
let isShuttingDown = false;
 
// Health check endpoint
app.get('/health', (req, res) => {
  if (isShuttingDown) {
    // Return 503 to fail ALB health checks
    // This removes instance from load balancer
    res.status(503).json({ status: 'shutting_down' });
  } else {
    res.json({ status: 'healthy' });
  }
});
 
// Application routes
app.get('/api/users', async (req, res) => {
  // Your business logic
  res.json({ users: [] });
});
 
// Graceful shutdown handler
async function gracefulShutdown(signal: string) {
  // Guard against re-entry (repeated signals, or the interruption
  // monitor firing again while shutdown is already underway)
  if (isShuttingDown) return;
  isShuttingDown = true;

  console.log(`${signal} received. Starting graceful shutdown...`);
 
  // 1. Stop accepting new connections
  server.close(() => {
    console.log('HTTP server closed');
  });
 
  // 2. Wait for ALB to drain connections (typically 30-120 seconds)
  console.log('Waiting 30 seconds for connection draining...');
  await new Promise(resolve => setTimeout(resolve, 30000));
 
  // 3. Close database connections
  // (db and metrics are application-specific handles, assumed initialized elsewhere)
  console.log('Closing database connections...');
  await db.close();
 
  // 4. Flush metrics
  console.log('Flushing metrics...');
  await metrics.flush();
 
  console.log('Graceful shutdown complete');
  process.exit(0);
}
 
// Handle termination signals
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));
 
// Spot interruption monitor (runs every 5 seconds)
async function monitorSpotInterruption() {
  try {
    const response = await axios.get(
      'http://169.254.169.254/latest/meta-data/spot/instance-action',
      { timeout: 1000 }
    );
 
    if (response.status === 200) {
      const interruptionData = response.data;
      console.log('Spot interruption notice received:', interruptionData);
 
      // Trigger graceful shutdown immediately
      await gracefulShutdown('SPOT_INTERRUPTION');
    }
  } catch (error) {
    // 404 means no interruption notice (normal)
    // Other errors are connection issues (ignore)
  }
}
 
// Check for interruption every 5 seconds
setInterval(monitorSpotInterruption, 5000);
 
const PORT = process.env.PORT || 8080;
server.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
});

What this does:

  1. Monitors instance metadata for spot interruption notices
  2. Fails health checks when interruption detected
  3. Waits 30 seconds for ALB to drain connections
  4. Closes resources cleanly (DB, metrics)
  5. Exits gracefully allowing ECS to reschedule task

Pattern 3: Spot Fleet for Batch Processing

Lambda function triggering Spot Fleet:

// AWS SDK v3 (bundled with the nodejs18.x Lambda runtime; v2 is not)
import { EC2Client, ModifySpotFleetRequestCommand } from '@aws-sdk/client-ec2';
import { SQSClient, GetQueueAttributesCommand } from '@aws-sdk/client-sqs';

const ec2 = new EC2Client({});
const sqs = new SQSClient({});
 
interface SpotFleetConfig {
  queueUrl: string;
  targetQueueDepth: number;
  spotFleetRequestId: string;
  minCapacity: number;
  maxCapacity: number;
}
 
export async function handler(event: any) {
  const config: SpotFleetConfig = {
    queueUrl: process.env.QUEUE_URL!,
    targetQueueDepth: 100, // Target 100 messages per instance
    spotFleetRequestId: process.env.SPOT_FLEET_ID!,
    minCapacity: 0,
    maxCapacity: 50
  };
 
  // 1. Get current queue depth
  const queueAttrs = await sqs.send(new GetQueueAttributesCommand({
    QueueUrl: config.queueUrl,
    AttributeNames: ['ApproximateNumberOfMessages']
  }));

  const queueDepth = parseInt(
    queueAttrs.Attributes?.ApproximateNumberOfMessages || '0', 10
  );
 
  // 2. Calculate desired capacity
  const desiredCapacity = Math.min(
    Math.max(
      Math.ceil(queueDepth / config.targetQueueDepth),
      config.minCapacity
    ),
    config.maxCapacity
  );
 
  console.log(`Queue depth: ${queueDepth}, Desired capacity: ${desiredCapacity}`);
 
  // 3. Update Spot Fleet target capacity
  await ec2.send(new ModifySpotFleetRequestCommand({
    SpotFleetRequestId: config.spotFleetRequestId,
    TargetCapacity: desiredCapacity
  }));
 
  return {
    statusCode: 200,
    body: JSON.stringify({
      queueDepth,
      desiredCapacity,
      timestamp: new Date().toISOString()
    })
  };
}

Spot Fleet configuration:

resource "aws_spot_fleet_request" "batch_processing" {
  iam_fleet_role      = aws_iam_role.spot_fleet.arn
  target_capacity     = 0  # Controlled by Lambda
  allocation_strategy = "capacityOptimized"
  fleet_type          = "maintain"
 
  # Terminate instances when the fleet request expires
  terminate_instances_with_expiration = true
 
  launch_specification {
    ami                    = data.aws_ami.ecs_optimized.id
    instance_type          = "c5.2xlarge"
    spot_price             = "0.20"  # Max price (current spot ~$0.10)
    vpc_security_group_ids = [aws_security_group.batch.id]
    subnet_id              = var.private_subnet_ids[0]
 
    iam_instance_profile_arn = aws_iam_instance_profile.batch.arn
 
    user_data = base64encode(<<-EOF
      #!/bin/bash
      # Start batch processor
      docker run -d \
        -e QUEUE_URL=${aws_sqs_queue.jobs.url} \
        -e AWS_REGION=${var.aws_region} \
        batch-processor:latest
    EOF
    )
  }
 
  # Add multiple instance types for availability
  launch_specification {
    ami                    = data.aws_ami.ecs_optimized.id
    instance_type          = "c5a.2xlarge"
    spot_price             = "0.20"
    vpc_security_group_ids = [aws_security_group.batch.id]
    subnet_id              = var.private_subnet_ids[0]
    iam_instance_profile_arn = aws_iam_instance_profile.batch.arn
    user_data = base64encode(<<-EOF
      #!/bin/bash
      docker run -d \
        -e QUEUE_URL=${aws_sqs_queue.jobs.url} \
        -e AWS_REGION=${var.aws_region} \
        batch-processor:latest
    EOF
    )
  }
 
  launch_specification {
    ami                    = data.aws_ami.ecs_optimized.id
    instance_type          = "c5n.2xlarge"
    spot_price             = "0.20"
    vpc_security_group_ids = [aws_security_group.batch.id]
    subnet_id              = var.private_subnet_ids[0]
    iam_instance_profile_arn = aws_iam_instance_profile.batch.arn
    user_data = base64encode(<<-EOF
      #!/bin/bash
      docker run -d \
        -e QUEUE_URL=${aws_sqs_queue.jobs.url} \
        -e AWS_REGION=${var.aws_region} \
        batch-processor:latest
    EOF
    )
  }
}
 
# Lambda to scale Spot Fleet based on queue depth
resource "aws_lambda_function" "spot_fleet_scaler" {
  filename      = "spot_fleet_scaler.zip"
  function_name = "spot-fleet-autoscaler"
  role          = aws_iam_role.lambda.arn
  handler       = "index.handler"
  runtime       = "nodejs18.x"
  timeout       = 60
 
  environment {
    variables = {
      QUEUE_URL      = aws_sqs_queue.jobs.url
      SPOT_FLEET_ID  = aws_spot_fleet_request.batch_processing.id
    }
  }
}
 
# Trigger Lambda every minute
resource "aws_cloudwatch_event_rule" "spot_fleet_scaling" {
  name                = "spot-fleet-scaling"
  schedule_expression = "rate(1 minute)"
}
 
resource "aws_cloudwatch_event_target" "spot_fleet_scaling" {
  rule      = aws_cloudwatch_event_rule.spot_fleet_scaling.name
  target_id = "lambda"
  arn       = aws_lambda_function.spot_fleet_scaler.arn
}

Monitoring and Alerting

CloudWatch dashboard for spot interruptions:

resource "aws_cloudwatch_dashboard" "spot_monitoring" {
  dashboard_name = "spot-instance-monitoring"
 
  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric"
        properties = {
          metrics = [
            ["AWS/EC2", "StatusCheckFailed", { stat = "Sum" }],
            # Custom metric published by your interruption handler; AWS does
            # not publish a built-in per-instance interruption metric
            ["Custom/Spot", "SpotInterruptionNotices", { stat = "Sum" }]
          ]
          period = 300
          stat   = "Sum"
          region = var.aws_region
          title  = "Spot Instance Interruptions"
        }
      },
      {
        type = "metric"
        properties = {
          metrics = [
            ["AWS/ECS", "CPUUtilization", { stat = "Average" }],
            [".", "MemoryUtilization", { stat = "Average" }]
          ]
          period = 300
          stat   = "Average"
          region = var.aws_region
          title  = "ECS Cluster Resource Utilization"
        }
      },
      {
        type = "log"
        properties = {
          query = <<-EOQ
            SOURCE '/aws/ecs/cluster/production'
            | filter @message like /spot.*interrupt/
            | stats count() by bin(5m)
          EOQ
          region = var.aws_region
          title  = "Spot Interruption Log Events"
        }
      }
    ]
  })
}
 
# Alert on high interruption rate
resource "aws_cloudwatch_metric_alarm" "high_spot_interruption_rate" {
  alarm_name          = "high-spot-interruption-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "SpotInstanceInterruptions"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Sum"
  threshold           = 5
  alarm_description   = "Alert when spot interruptions exceed 5 in 5 minutes"
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
 
# Alert on insufficient capacity
resource "aws_cloudwatch_metric_alarm" "spot_capacity_shortage" {
  alarm_name          = "spot-capacity-shortage"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  # PendingCapacity = TargetCapacity - FulfilledCapacity, published by Spot Fleet
  metric_name         = "PendingCapacity"
  namespace           = "AWS/EC2Spot"
  period              = 300
  statistic           = "Average"
  threshold           = 0  # Any sustained pending capacity means the fleet is under-fulfilled
  alarm_description   = "Alert when the spot fleet cannot fulfill its target capacity"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    FleetRequestId = aws_spot_fleet_request.batch_processing.id
  }
}
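
EC2 also emits an EventBridge event for every interruption warning ("EC2 Spot Instance Interruption Warning"). Routing those to SNS, or to a small Lambda that publishes the Custom/Spot metric used above, gives you a reliable interruption signal. A sketch:

resource "aws_cloudwatch_event_rule" "spot_interruption_warning" {
  name = "spot-interruption-warning"

  event_pattern = jsonencode({
    source        = ["aws.ec2"]
    "detail-type" = ["EC2 Spot Instance Interruption Warning"]
  })
}

resource "aws_cloudwatch_event_target" "spot_interruption_to_sns" {
  rule      = aws_cloudwatch_event_rule.spot_interruption_warning.name
  target_id = "sns"
  # The SNS topic's policy must allow events.amazonaws.com to Publish
  arn       = aws_sns_topic.alerts.arn
}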

Decision Framework

Use this checklist to decide if Spot is right for your workload:

✅ Use Spot If:

  • Workload is stateless or has checkpoint/resume capability
  • Multiple instances/replicas provide redundancy
  • Graceful shutdown can complete in under 2 minutes
  • Application handles instance loss without user impact
  • Monthly compute spend exceeds $5K
  • Engineering team can invest 40+ hours in implementation
  • You have monitoring and alerting infrastructure

❌ Don't Use Spot If:

  • Workload is stateful (databases, cache, queues) without failover
  • Single instance (no redundancy)
  • Critical SLA requirements (<500ms P99 latency)
  • Team lacks capacity for implementation and maintenance
  • Compute spend is under $2K/month (ROI too low)
  • Graceful shutdown impossible (e.g., long-lived WebSockets)

⚠️ Requires Extra Diligence:

  • GPU instances (higher interruption rates, 5-10%)
  • Kubernetes StatefulSets (requires pod disruption budgets)
  • ML training (must implement checkpointing)
  • Windows instances (slower boot times)

Real-World Case Study: API Migration to Spot

Company: B2B SaaS platform, $2M ARR
Workload: RESTful API (Node.js), 40 ECS tasks, 10 m5.large instances
Traffic: 50M requests/month, variable (2x peak during business hours)

Before (100% On-Demand)

10 × m5.large × $0.096/hour × 730 hours = $700.80/month
Annual: $8,410

After (70% Spot, 30% On-Demand)

On-Demand base: 3 × m5.large × $0.096/hour × 730 hours = $210.24/month
Spot capacity: 7 × m5.large × $0.0288/hour × 730 hours = $147.17/month

Total: $357.41/month
Annual: $4,289
Savings: $4,121/year (49%)

Implementation Details

Week 1: Architecture Review

  • Confirmed API was stateless
  • Verified connection draining configuration
  • Updated health check to support graceful shutdown

Week 2: Spot Implementation

  • Deployed spot interruption handler
  • Created mixed ASG (3 on-demand, 7 spot capacity)
  • Configured capacity-optimized allocation

Week 3: Testing

  • Simulated spot interruptions in staging
  • Verified task rescheduling
  • Measured P99 latency impact (no regression)
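
A simple way to rehearse this in staging is to kill a spot instance by hand and watch the draining path (this exercises draining, not the 2-minute notice itself; AWS Fault Injection Service can send real interruption notices):

aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --no-should-decrement-desired-capacity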

Week 4: Production Rollout

  • Gradual rollout: 25% → 50% → 70% spot
  • Monitored interruption rates: 2-3% hourly
  • No customer-reported issues

Results

Cost savings:

  • $4,121/year (49% reduction)
  • Engineering time: 60 hours

Operational impact:

  • 15-20 spot interruptions per month
  • Average recovery time: 45 seconds (automatic)
  • Zero customer-facing incidents
  • P99 latency unchanged: 180ms

Lessons learned:

  1. Diversify instance types: Using m5, m5a, and m5n pools reduced the effective interruption rate
  2. Capacity-optimized allocation works: Much better than lowest-price
  3. Start conservative: Begin at 30% spot, then increase as confidence grows
  4. Monitor actively: Set up dashboards before the migration, not after

Conclusion: Spot for Production is Ready (If You Are)

In 2025, Spot Instances are production-viable for many workloads—but not all.

The key question isn't "Can I use Spot?" but "Should I use Spot?"

Use Spot if:

  • Your workload is stateless or fault-tolerant
  • You have redundancy built in
  • You can implement graceful shutdown
  • Your compute spend justifies the engineering effort

Skip Spot if:

  • Your workload is stateful without failover
  • You have tight SLA requirements
  • Your team lacks bandwidth for implementation
  • Your compute spend is too low (under $5K/month)

The middle path: Start with non-critical workloads (dev/staging, batch jobs) to build confidence, then migrate production workloads incrementally.

Spot Instances aren't magic—they're a trade-off. You're exchanging engineering time and operational complexity for significant cost savings. If you're spending over $10K/month on compute, that trade-off is usually worth it.

Action Items

  1. Audit your current compute spend by instance type and workload
  2. Identify stateless workloads suitable for spot (batch jobs, APIs, workers)
  3. Calculate potential savings using spot pricing history
  4. Implement graceful shutdown handlers for your applications
  5. Start with dev/staging environments to validate the approach
  6. Deploy spot for one production workload and monitor for 30 days
  7. Expand gradually as confidence builds

If you need help designing a spot migration strategy for your infrastructure, schedule a consultation. We'll analyze your workloads, calculate ROI, and provide a phased implementation plan that minimizes risk while maximizing savings.
