
Disaster Recovery Tiers: RPO/RTO Strategies for the Cloud

Updated by Zak Kann
Tags: AWS, Disaster Recovery, High Availability, RTO, RPO, Multi-Region, Business Continuity

Key takeaways

  • RPO (Recovery Point Objective) defines acceptable data loss; RTO (Recovery Time Objective) defines acceptable downtime
  • Four DR tiers range from backup/restore (RPO: hours, RTO: up to 24 hours, cost: lowest) to multi-site active-active (RPO: near zero, RTO: near zero, cost: 2-3x production)
  • Most applications don't need active-active; pilot light or warm standby can reach 99.9% availability at a small fraction of active-active cost
  • DR testing must be automated and run quarterly to ensure reliability when actually needed
  • Cloud-native services (RDS cross-region replication, DynamoDB global tables, S3 cross-region replication) simplify DR implementation significantly

Your CEO asks a simple question: "If our entire AWS region goes down, how long until we're back online?" You know the answer should be measured in minutes, but you haven't tested failover in 6 months. Your DR "plan" is a runbook document with 47 manual steps. The backup restoration process took 8 hours last time you tried it in staging.

Disaster recovery is the insurance policy no one wants to pay for until they need it. But unlike traditional insurance, cloud DR can be cost-effective when designed appropriately. The key is matching your DR tier to actual business requirements rather than implementing active-active everywhere "just in case."

This guide provides a comprehensive framework for designing, implementing, and testing disaster recovery strategies in AWS, with specific focus on the RPO/RTO tradeoffs and cost models for each tier.

Understanding RPO and RTO

Definitions

Recovery Point Objective (RPO): Maximum acceptable data loss, measured in time.

  • RPO = 1 hour: Can tolerate losing up to 1 hour of data
  • RPO = 0: Zero data loss acceptable (requires synchronous replication)

Recovery Time Objective (RTO): Maximum acceptable downtime, measured in time.

  • RTO = 4 hours: Must restore service within 4 hours
  • RTO = 0: Zero downtime acceptable (requires active-active architecture)

Business Impact Example

E-commerce Platform:

  • Average transaction value: $100
  • Transactions per hour: 1,000
  • Revenue per hour: $100,000

Cost of downtime:

  • RTO = 4 hours: $400,000 lost revenue
  • RTO = 1 hour: $100,000 lost revenue
  • RTO = 5 minutes: $8,333 lost revenue

Cost of data loss:

  • RPO = 1 hour: 1,000 transactions to recreate/recover
  • RPO = 5 minutes: 83 transactions to recreate/recover
  • RPO = 0: No transaction loss
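The arithmetic above is simple enough to script. A minimal sketch using the figures from this example (function names are illustrative):

```python
def downtime_cost(revenue_per_hour: float, rto_hours: float) -> float:
    """Lost revenue while the service is down (driven by RTO)."""
    return revenue_per_hour * rto_hours

def data_loss_exposure(tx_per_hour: float, rpo_hours: float) -> float:
    """Transactions that must be recreated or recovered (driven by RPO)."""
    return tx_per_hour * rpo_hours

# E-commerce example: $100K revenue/hour, 1,000 transactions/hour
print(downtime_cost(100_000, 4))               # 400000
print(round(downtime_cost(100_000, 5 / 60)))   # 8333
print(round(data_loss_exposure(1_000, 5 / 60)))  # 83
```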

The Four Disaster Recovery Tiers

Tier 1: Backup and Restore

Characteristics:

  • RPO: Hours to 24 hours
  • RTO: 4-24 hours (the restore itself takes a few hours; detection, decision-making, and validation add the rest)
  • Cost: ~10-15% of production infrastructure
  • Availability: 95-99%

When to use:

  • Development/staging environments
  • Internal tools with low business impact
  • Batch processing systems
  • Data warehouses (acceptable to restore from previous night's backup)

Architecture:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Production│──────► Automated Backups
β”‚  Region  β”‚        (S3, RDS Snapshots,
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        EBS Snapshots)
                           β”‚
                           β”‚ Copy to DR Region
                           β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚ DR Region β”‚
                    β”‚ (No Infra)β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Implementation:

# RDS automated backups with cross-region copy
resource "aws_db_instance" "production" {
  identifier              = "production-db"
  engine                  = "postgres"
  instance_class          = "db.r6g.xlarge"
  backup_retention_period = 7
  backup_window           = "03:00-04:00"
 
  # Enable automated backups
  skip_final_snapshot = false
  final_snapshot_identifier = "production-db-final-snapshot"
}
 
# Copy the most recent automated snapshot to the DR region
data "aws_db_snapshot" "latest" {
  db_instance_identifier = aws_db_instance.production.identifier
  most_recent            = true
}
 
resource "aws_db_snapshot_copy" "dr_backup" {
  provider                      = aws.dr_region
  source_db_snapshot_identifier = data.aws_db_snapshot.latest.db_snapshot_arn
  target_db_snapshot_identifier = "dr-backup-${formatdate("YYYY-MM-DD", timestamp())}"
  copy_tags                     = true
  kms_key_id                    = aws_kms_key.dr_region.arn
}
 
# S3 bucket replication to DR region (replication requires versioning)
resource "aws_s3_bucket" "production_data" {
  bucket = "production-data"
}
 
resource "aws_s3_bucket_versioning" "production_data" {
  bucket = aws_s3_bucket.production_data.id
  versioning_configuration {
    status = "Enabled"
  }
}
 
resource "aws_s3_bucket_replication_configuration" "dr_replication" {
  bucket = aws_s3_bucket.production_data.id
  role   = aws_iam_role.replication.arn
 
  rule {
    id     = "replicate-to-dr"
    status = "Enabled"
 
    destination {
      bucket        = aws_s3_bucket.dr_data.arn
      storage_class = "STANDARD_IA"  # Cost savings
 
      replication_time {
        status = "Enabled"
        time {
          minutes = 15
        }
      }
 
      metrics {
        status = "Enabled"
      }
    }
  }
}
 
# Lifecycle policy for DR backups
resource "aws_backup_plan" "disaster_recovery" {
  name = "disaster-recovery-plan"
 
  rule {
    rule_name         = "daily_backup"
    target_vault_name = aws_backup_vault.dr_vault.name
    schedule          = "cron(0 5 * * ? *)"  # 5 AM daily
 
    lifecycle {
      delete_after = 30  # Retain for 30 days
    }
 
    copy_action {
      destination_vault_arn = aws_backup_vault.dr_region_vault.arn
 
      lifecycle {
        delete_after = 30
      }
    }
  }
}

Recovery Process:

  1. Restore RDS snapshot in DR region (1-2 hours)
  2. Restore application state from S3 (30 minutes)
  3. Launch EC2/ECS from AMIs/container images (30 minutes)
  4. Update DNS to point to DR region (5 minutes propagation)
  5. Total: ~3-4 hours

Monthly Cost (for production spending $10K/month):

  • Cross-region snapshot storage: $300
  • S3 replication: $200
  • Data transfer for replication: $500
  • Total: $1,000/month (10% of production)

Tier 2: Pilot Light

Characteristics:

  • RPO: Minutes to 1 hour
  • RTO: 1-4 hours
  • Cost: ~30-40% of production infrastructure
  • Availability: 99.5-99.9%

When to use:

  • Business-critical applications
  • Customer-facing services with moderate SLAs
  • Financial services platforms
  • B2B SaaS applications

Architecture:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Production  β”‚
β”‚              β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚    RDS   ││────────►│  DR Region   β”‚
β”‚ β”‚(Primary) β”‚β”‚Async    β”‚              β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚Replica  β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚              β”‚         β”‚ β”‚    RDS   β”‚ β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚         β”‚ β”‚(Replica) β”‚ β”‚
β”‚ β”‚   ECS    β”‚β”‚         β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β”‚(Running) β”‚β”‚         β”‚              β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚         β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β”‚ β”‚   ECS    β”‚ β”‚
                         β”‚ β”‚ (Minimal)β”‚ β”‚
                         β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Core services running in DR region at minimal capacity; scaled up during disaster.

Implementation:

# RDS with cross-region read replica
resource "aws_db_instance" "production" {
  identifier     = "production-db"
  engine         = "postgres"
  instance_class = "db.r6g.2xlarge"
}
 
resource "aws_db_instance" "dr_replica" {
  provider              = aws.dr_region
  identifier            = "dr-replica-db"
  replicate_source_db   = aws_db_instance.production.arn
  instance_class        = "db.r6g.2xlarge"  # Same size for quick promotion
  skip_final_snapshot   = true
  auto_minor_version_upgrade = false
}
 
# ECS service at minimal capacity in DR region
resource "aws_ecs_service" "dr_app" {
  provider        = aws.dr_region
  name            = "app-service-dr"
  cluster         = aws_ecs_cluster.dr.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 1  # Minimal capacity (production might be 10)
 
  deployment_minimum_healthy_percent = 0
  deployment_maximum_percent         = 200
}
 
# Auto Scaling for failover
resource "aws_appautoscaling_target" "dr_app" {
  provider           = aws.dr_region
  max_capacity       = 20  # Scale to production capacity on demand
  min_capacity       = 1
  resource_id        = "service/${aws_ecs_cluster.dr.name}/${aws_ecs_service.dr_app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}
 
# Route 53 health checks and failover
resource "aws_route53_health_check" "production" {
  fqdn              = "api.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}
 
resource "aws_route53_record" "api_primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
 
  set_identifier = "primary"
  health_check_id = aws_route53_health_check.production.id
 
  failover_routing_policy {
    type = "PRIMARY"
  }
 
  alias {
    name                   = aws_lb.production.dns_name
    zone_id                = aws_lb.production.zone_id
    evaluate_target_health = true
  }
}
 
resource "aws_route53_record" "api_secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
 
  set_identifier = "secondary"
 
  failover_routing_policy {
    type = "SECONDARY"
  }
 
  alias {
    name                   = aws_lb.dr.dns_name
    zone_id                = aws_lb.dr.zone_id
    evaluate_target_health = false
  }
}

Recovery Process:

  1. Promote RDS read replica to primary (5-10 minutes)
  2. Scale ECS service from 1 to 10 tasks (2-5 minutes)
  3. Route 53 automatic failover (1 minute)
  4. Total: 10-15 minutes

Monthly Cost (for production spending $10K/month):

  • RDS read replica: $2,000
  • ECS minimal capacity: $150
  • Load balancer: $25
  • Data replication: $500
  • Total: $2,675/month (27% of production)

Tier 3: Warm Standby

Characteristics:

  • RPO: Seconds to minutes
  • RTO: Minutes to 1 hour
  • Cost: ~50-70% of production infrastructure
  • Availability: 99.9-99.95%

When to use:

  • High-value customer-facing applications
  • Trading platforms
  • Payment processing
  • Real-time services with SLAs

Architecture:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Production  β”‚         β”‚  DR Region   β”‚
β”‚              β”‚         β”‚              β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚         β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚    RDS   ││────────►│ β”‚    RDS   β”‚ β”‚
β”‚ β”‚(Primary) β”‚β”‚ Sync/   β”‚ β”‚(Replica) β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚ Async   β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚              β”‚         β”‚              β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚         β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚   ECS    β”‚β”‚         β”‚ β”‚   ECS    β”‚ β”‚
β”‚ β”‚(Full Cap)β”‚β”‚         β”‚ β”‚(50% Cap) β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚         β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚              β”‚         β”‚              β”‚
β”‚ Traffic: 100%β”‚         β”‚ Traffic: 0%  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Full infrastructure running at reduced capacity; can handle traffic immediately.

Implementation:

# DynamoDB Global Tables (automatic multi-region replication)
resource "aws_dynamodb_table" "users" {
  name             = "users"
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "userId"
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"
 
  attribute {
    name = "userId"
    type = "S"
  }
 
  replica {
    region_name = "us-west-2"  # DR region
  }
 
  replica {
    region_name = "eu-west-1"  # Additional DR region
  }
}
 
# Aurora Global Database (1-second replication lag)
resource "aws_rds_global_cluster" "main" {
  global_cluster_identifier = "global-db"
  engine                    = "aurora-postgresql"
  engine_version            = "14.6"
  database_name             = "production"
}
 
resource "aws_rds_cluster" "primary" {
  cluster_identifier        = "primary-cluster"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  engine                    = aws_rds_global_cluster.main.engine
  engine_version            = aws_rds_global_cluster.main.engine_version
  database_name             = "production"
}
 
resource "aws_rds_cluster" "secondary" {
  provider                  = aws.dr_region
  cluster_identifier        = "secondary-cluster"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  engine                    = aws_rds_global_cluster.main.engine
  engine_version            = aws_rds_global_cluster.main.engine_version
 
  # Depends on primary being created first
  depends_on = [aws_rds_cluster.primary]
}
 
# ECS at 50% capacity in DR region
resource "aws_ecs_service" "dr_app" {
  provider        = aws.dr_region
  name            = "app-service-dr"
  cluster         = aws_ecs_cluster.dr.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 5  # 50% of production (10 tasks)
 
  # Can handle traffic immediately
  load_balancer {
    target_group_arn = aws_lb_target_group.dr_app.arn
    container_name   = "app"
    container_port   = 8080
  }
}
 
# Route 53 weighted routing (send small percentage to DR for testing)
resource "aws_route53_record" "api_primary_weighted" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
 
  set_identifier = "primary"
  weight         = 95  # 95% of traffic
 
  alias {
    name                   = aws_lb.production.dns_name
    zone_id                = aws_lb.production.zone_id
    evaluate_target_health = true
  }
}
 
resource "aws_route53_record" "api_secondary_weighted" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
 
  set_identifier = "secondary"
  weight         = 5  # 5% of traffic (continuous testing)
 
  alias {
    name                   = aws_lb.dr.dns_name
    zone_id                = aws_lb.dr.zone_id
    evaluate_target_health = true
  }
}

Recovery Process:

  1. Promote Aurora secondary to primary (1 minute)
  2. Shift Route 53 weights to send 100% of traffic to DR (~1 minute to propagate)
  3. Scale ECS from 50% to 100% capacity (2-3 minutes)
  4. Total: 5 minutes

Monthly Cost (for production spending $10K/month):

  • Aurora global database: $3,000
  • DynamoDB global tables: $1,500
  • ECS at 50% capacity: $2,500
  • Load balancer: $25
  • Total: $7,025/month (70% of production)

Tier 4: Multi-Site Active-Active

Characteristics:

  • RPO: 0 (no data loss)
  • RTO: 0 (no downtime)
  • Cost: 200-300% of single-region production
  • Availability: 99.99-99.999%

When to use:

  • Mission-critical financial systems
  • Trading platforms
  • Healthcare systems
  • Services with contractual 99.99%+ SLAs

Architecture:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Region 1   │◄───────►│   Region 2   β”‚
β”‚              β”‚ Bi-dir  β”‚              β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚ Sync    β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ DynamoDB ││◄───────►│ β”‚ DynamoDB β”‚ β”‚
β”‚ β”‚  Global  β”‚β”‚         β”‚ β”‚  Global  β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚         β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚              β”‚         β”‚              β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚         β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚   ECS    β”‚β”‚         β”‚ β”‚   ECS    β”‚ β”‚
β”‚ β”‚(Full Cap)β”‚β”‚         β”‚ β”‚(Full Cap)β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚         β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚              β”‚         β”‚              β”‚
β”‚ Traffic: 50% β”‚         β”‚ Traffic: 50% β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Both regions handle production traffic; failure of one region is transparent.

Implementation:

# DynamoDB Global Tables (multi-region active-active)
resource "aws_dynamodb_table" "transactions" {
  name             = "transactions"
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "transactionId"
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"
 
  attribute {
    name = "transactionId"
    type = "S"
  }
 
  # Multi-region replication
  replica {
    region_name = "us-east-1"
  }
 
  replica {
    region_name = "us-west-2"
  }
 
  replica {
    region_name = "eu-west-1"
  }
 
  point_in_time_recovery {
    enabled = true
  }
}
 
# Aurora Global Database with write forwarding
resource "aws_rds_cluster" "region1" {
  cluster_identifier              = "region1-cluster"
  global_cluster_identifier       = aws_rds_global_cluster.main.id
  engine                          = "aurora-postgresql"
  engine_mode                     = "provisioned"
  database_name                   = "production"
  enable_global_write_forwarding  = true  # Secondary clusters forward writes to the primary
}
 
# Route 53 latency-based routing
resource "aws_route53_record" "api_region1" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
 
  set_identifier = "us-east-1"
 
  latency_routing_policy {
    region = "us-east-1"
  }
 
  alias {
    name                   = aws_lb.region1.dns_name
    zone_id                = aws_lb.region1.zone_id
    evaluate_target_health = true
  }
}
 
resource "aws_route53_record" "api_region2" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
 
  set_identifier = "us-west-2"
 
  latency_routing_policy {
    region = "us-west-2"
  }
 
  alias {
    name                   = aws_lb.region2.dns_name
    zone_id                = aws_lb.region2.zone_id
    evaluate_target_health = true
  }
}
 
# CloudFront for global distribution
resource "aws_cloudfront_distribution" "api" {
  enabled = true
 
  origin {
    domain_name = "api.example.com"  # Route 53 handles regional routing
    origin_id   = "api-multi-region"
 
    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }
 
  default_cache_behavior {
    allowed_methods        = ["DELETE", "GET", "HEAD", "OPTIONS", "PATCH", "POST", "PUT"]
    cached_methods         = ["GET", "HEAD"]
    target_origin_id       = "api-multi-region"
    viewer_protocol_policy = "redirect-to-https"
    compress               = true
 
    forwarded_values {
      query_string = true
      headers      = ["*"]
      cookies {
        forward = "all"
      }
    }
  }
 
  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }
 
  viewer_certificate {
    acm_certificate_arn      = aws_acm_certificate.cert.arn
    ssl_support_method       = "sni-only"
    minimum_protocol_version = "TLSv1.2_2021"
  }
}

Recovery Process:

  • Automatic: Route 53 health checks detect the failure and shift traffic to the healthy region
  • No manual intervention required
  • User impact: Minimal; failover typically completes within 1-2 minutes (health-check detection plus DNS TTL), and clients already resolved to the healthy region see no interruption

Monthly Cost (for production spending $10K/month):

  • Infrastructure in 2+ regions: $20,000
  • DynamoDB global tables: $3,000
  • Aurora global database: $5,000
  • Cross-region data transfer: $2,000
  • Total: $30,000/month (300% of single-region production)

Cost vs. Availability Matrix

| DR Tier        | RPO | RTO   | Availability  | Monthly Cost | Use Case               |
|----------------|-----|-------|---------------|--------------|------------------------|
| Backup/Restore | 24h | 24h   | 95-99%        | 10-15%       | Dev/test, low-priority |
| Pilot Light    | 1h  | 1-4h  | 99.5-99.9%    | 30-40%       | Business-critical apps |
| Warm Standby   | 1m  | 5m-1h | 99.9-99.95%   | 50-70%       | High-value services    |
| Active-Active  | 0   | 0     | 99.99-99.999% | 200-300%     | Mission-critical only  |

DR Testing Strategy

Automated Testing Framework

# Lambda function for automated DR tests
resource "aws_lambda_function" "dr_test" {
  function_name = "dr-automated-test"
  role          = aws_iam_role.dr_test_lambda.arn  # execution role (defined elsewhere)
  filename      = "dr_test.zip"                    # zipped handler code
  runtime       = "python3.11"
  handler       = "index.handler"
  timeout       = 900  # 15 minutes
 
  environment {
    variables = {
      PRIMARY_REGION = "us-east-1"
      DR_REGION      = "us-west-2"
      RDS_IDENTIFIER = aws_db_instance.dr_replica.identifier
    }
  }
}
 
# EventBridge rule for quarterly DR tests
resource "aws_cloudwatch_event_rule" "dr_test_schedule" {
  name                = "quarterly-dr-test"
  description         = "Run DR test every quarter"
  schedule_expression = "cron(0 2 1 */3 ? *)"  # 2 AM on 1st day of every 3rd month
}
 
resource "aws_cloudwatch_event_target" "dr_test" {
  rule      = aws_cloudwatch_event_rule.dr_test_schedule.name
  target_id = "DRTestLambda"
  arn       = aws_lambda_function.dr_test.arn
}

DR Test Script:

import boto3
import json
import time
from datetime import datetime
 
def handler(event, context):
    """
    Automated DR test:
    1. Promote RDS read replica to standalone
    2. Launch ECS tasks in DR region
    3. Update Route 53 to point to DR
    4. Run smoke tests
    5. Rollback
    """
 
    results = {
        'test_time': datetime.utcnow().isoformat(),
        'steps': []
    }
 
    try:
        # Step 1: Promote RDS replica (simulation - don't actually promote)
        rds = boto3.client('rds', region_name='us-west-2')
        start_time = time.time()
 
        # Measure time to promote (dry run)
        results['steps'].append({
            'step': 'RDS Promotion (simulated)',
            'duration_seconds': time.time() - start_time,
            'status': 'success'
        })
 
        # Step 2: Scale ECS service
        ecs = boto3.client('ecs', region_name='us-west-2')
        start_time = time.time()
 
        ecs.update_service(
            cluster='dr-cluster',
            service='app-service-dr',
            desiredCount=10  # Scale to production capacity
        )
 
        # Wait for tasks to be running
        waiter = ecs.get_waiter('services_stable')
        waiter.wait(cluster='dr-cluster', services=['app-service-dr'])
 
        results['steps'].append({
            'step': 'ECS Scaling',
            'duration_seconds': time.time() - start_time,
            'status': 'success'
        })
 
        # Step 3: Run smoke tests against DR environment
        start_time = time.time()
        smoke_test_results = run_smoke_tests('dr-region-endpoint')
 
        results['steps'].append({
            'step': 'Smoke Tests',
            'duration_seconds': time.time() - start_time,
            'status': 'success' if smoke_test_results['passed'] else 'failed',
            'details': smoke_test_results
        })
 
        # Step 4: Rollback - scale ECS back down
        ecs.update_service(
            cluster='dr-cluster',
            service='app-service-dr',
            desiredCount=1  # Back to pilot light capacity
        )
 
        results['overall_status'] = 'success'
        results['total_rto_measured'] = sum(step['duration_seconds'] for step in results['steps'])
 
    except Exception as e:
        results['overall_status'] = 'failed'
        results['error'] = str(e)
 
    # Publish results to SNS for team notification
    sns = boto3.client('sns')
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:dr-test-results',
        Subject=f"DR Test Results: {results['overall_status']}",
        Message=json.dumps(results, indent=2)
    )
 
    return results
 
def run_smoke_tests(endpoint):
    """Run basic health checks and critical path tests"""
    import requests
 
    tests = {
        'health_check': f'https://{endpoint}/health',
        'auth_flow': f'https://{endpoint}/api/auth/token',
        'critical_api': f'https://{endpoint}/api/transactions'
    }
 
    results = {'passed': True, 'tests': []}
 
    for test_name, url in tests.items():
        try:
            response = requests.get(url, timeout=10)
            passed = response.status_code == 200
            results['tests'].append({
                'name': test_name,
                'status': 'passed' if passed else 'failed',
                'status_code': response.status_code
            })
            if not passed:
                results['passed'] = False
        except Exception as e:
            results['tests'].append({
                'name': test_name,
                'status': 'failed',
                'error': str(e)
            })
            results['passed'] = False
 
    return results

Chaos Engineering for DR

# AWS FIS (Fault Injection Simulator) experiment
resource "aws_fis_experiment_template" "region_failure" {
  description = "Simulate complete region failure"
  role_arn    = aws_iam_role.fis.arn
 
  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.error_rate.arn
  }
 
  action {
    name      = "terminate-production-instances"
    action_id = "aws:ec2:terminate-instances"
 
    target {
      key   = "Instances"
      value = "production-instances"
    }
  }
 
  target {
    name           = "production-instances"
    resource_type  = "aws:ec2:instance"
    selection_mode = "ALL"
 
    resource_tag {
      key   = "Environment"
      value = "production"
    }
 
    resource_tag {
      key   = "Region"
      value = "us-east-1"
    }
  }
}

Decision Framework

Step 1: Calculate Business Impact

Downtime Cost = Revenue per Hour Γ— RTO Hours
Data Loss Cost = Transactions per Hour Γ— RPO Hours Γ— Average Transaction Value

Step 2: Determine Required Availability

| Business Impact | Target Availability | DR Tier        | Monthly Downtime |
|-----------------|---------------------|----------------|------------------|
| <$10K/hour      | 95%                 | Backup/Restore | 36 hours         |
| $10K-$100K/hour | 99.5%               | Pilot Light    | 3.6 hours        |
| $100K-$1M/hour  | 99.9%               | Warm Standby   | 43 minutes       |
| >$1M/hour       | 99.99%              | Active-Active  | 4.3 minutes      |
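The Monthly Downtime column follows directly from the availability target applied to an average 730-hour month:

```python
HOURS_PER_MONTH = 730  # 8,760 hours/year / 12

def monthly_downtime_minutes(availability_pct: float) -> float:
    """Downtime budget per month at a given availability target."""
    return (1 - availability_pct / 100) * HOURS_PER_MONTH * 60

print(round(monthly_downtime_minutes(95) / 60, 1))   # 36.5 hours
print(round(monthly_downtime_minutes(99.9)))         # ~44 minutes
print(round(monthly_downtime_minutes(99.99), 1))     # ~4.4 minutes
```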

Step 3: Compare DR Cost vs. Downtime Cost

Example: E-commerce platform

  • Current revenue: $100K/hour
  • Current DR: None (Tier 0)
  • Expected annual downtime: 10 hours
  • Annual downtime cost: $1M

DR Investment Options:

| Tier          | Monthly Cost | Annual Cost | Expected Downtime | Downtime Cost | Net Benefit |
|---------------|--------------|-------------|-------------------|---------------|-------------|
| Pilot Light   | $3K          | $36K        | 2 hours           | $200K         | +$764K      |
| Warm Standby  | $7K          | $84K        | 30 minutes        | $50K          | +$866K      |
| Active-Active | $30K         | $360K       | 0                 | $0            | +$640K      |

Winner: Warm Standby (best ROI)
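The comparison reduces to one formula: downtime cost avoided minus annual DR spend. A sketch with the numbers from the table:

```python
def net_benefit(annual_dr_cost: int, avoided_downtime_cost: int) -> int:
    """Annual savings: downtime cost avoided minus annual DR spend."""
    return avoided_downtime_cost - annual_dr_cost

BASELINE = 1_000_000  # 10 hours/year at $100K/hour with no DR
options = {
    "Pilot Light":   net_benefit(36_000,  BASELINE - 200_000),
    "Warm Standby":  net_benefit(84_000,  BASELINE - 50_000),
    "Active-Active": net_benefit(360_000, BASELINE - 0),
}
print(max(options, key=options.get))  # Warm Standby
```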

Monitoring DR Readiness

# CloudWatch dashboard for DR metrics
resource "aws_cloudwatch_dashboard" "dr_readiness" {
  dashboard_name = "disaster-recovery-readiness"
 
  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric"
        properties = {
          metrics = [
            ["AWS/RDS", "ReplicaLag", { stat = "Maximum", label = "RDS Replica Lag" }],
            ["AWS/DynamoDB", "ReplicationLatency", { stat = "Average", label = "DynamoDB Replication" }],
          ]
          period = 300
          stat   = "Average"
          region = "us-east-1"
          title  = "Data Replication Lag"
          yAxis = {
            left = {
              label = "Seconds"
              min   = 0
            }
          }
        }
      },
      {
        type = "metric"
        properties = {
          metrics = [
            ["AWS/Route53", "HealthCheckStatus", { stat = "Minimum" }],
          ]
          period = 60
          stat   = "Minimum"
          region = "us-east-1"
          title  = "Route 53 Health Checks"
        }
      }
    ]
  })
}
 
# Alert on high replication lag
resource "aws_cloudwatch_metric_alarm" "replication_lag" {
  alarm_name          = "rds-replication-lag-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "ReplicaLag"
  namespace           = "AWS/RDS"
  period              = 300
  statistic           = "Average"
  threshold           = 30  # 30 seconds
  alarm_description   = "RDS replication lag exceeds 30 seconds"
  alarm_actions       = [aws_sns_topic.ops_alerts.arn]
 
  dimensions = {
    DBInstanceIdentifier = aws_db_instance.dr_replica.id
  }
}

Conclusion: Right-Size Your DR Investment

Most organizations over-invest in active-active architecture for workloads that don't require it, or under-invest with no DR for business-critical systems. The key is matching DR tier to actual business impact:

  1. Calculate downtime cost: Revenue/hour Γ— expected annual downtime
  2. Evaluate DR tiers: Compare annual DR cost vs. downtime cost reduction
  3. Choose tier with best ROI: Usually pilot light or warm standby for most applications
  4. Test quarterly: Automated testing ensures DR works when needed
  5. Monitor continuously: Replication lag and health checks

For a $10M/year revenue business:

  • Downtime cost: ~$5K/hour (assuming revenue is concentrated in roughly 2,000 business hours per year)
  • Expected downtime without DR: 20 hours/year = $100K
  • Pilot light DR cost: $36K/year
  • Net benefit: $64K/year

Don't pay for active-active when warm standby delivers the same business outcome at 1/4 the cost.


Action Items:

  1. Calculate your actual downtime cost (revenue per hour)
  2. Audit current DR posture (what tier are you at today?)
  3. Identify RPO/RTO requirements for each application
  4. Map applications to appropriate DR tiers
  5. Implement automated DR testing (start with pilot light for critical apps)
  6. Run quarterly DR tests and measure actual RTO
  7. Monitor replication lag continuously
  8. Document and practice runbooks for failover procedures
