Disaster Recovery Tiers: RPO/RTO Strategies for the Cloud
Key takeaways
- RPO (Recovery Point Objective) defines acceptable data loss; RTO (Recovery Time Objective) defines acceptable downtime
- Four DR tiers range from backup/restore (RPO: hours, RTO: 10-24 hours, cost: lowest) to multi-site active-active (RPO/RTO: near zero, cost: 2-3x production)
- Most applications don't need active-active; pilot light or warm standby provides 99.9% availability at roughly 10-35% of active-active cost
- DR testing must be automated and run quarterly to ensure reliability when actually needed
- Cloud-native services (RDS cross-region replication, DynamoDB global tables, S3 cross-region replication) simplify DR implementation significantly
Your CEO asks a simple question: "If our entire AWS region goes down, how long until we're back online?" You know the answer should be measured in minutes, but you haven't tested failover in 6 months. Your DR "plan" is a runbook document with 47 manual steps. The backup restoration process took 8 hours last time you tried it in staging.
Disaster recovery is the insurance policy no one wants to pay for until they need it. But unlike traditional insurance, cloud DR can be cost-effective when designed appropriately. The key is matching your DR tier to actual business requirements rather than implementing active-active everywhere "just in case."
This guide provides a comprehensive framework for designing, implementing, and testing disaster recovery strategies in AWS, with specific focus on the RPO/RTO tradeoffs and cost models for each tier.
Understanding RPO and RTO
Definitions
Recovery Point Objective (RPO): Maximum acceptable data loss, measured in time.
- RPO = 1 hour: Can tolerate losing up to 1 hour of data
- RPO = 0: Zero data loss acceptable (requires synchronous replication)
Recovery Time Objective (RTO): Maximum acceptable downtime, measured in time.
- RTO = 4 hours: Must restore service within 4 hours
- RTO = 0: Zero downtime acceptable (requires active-active architecture)
Business Impact Example
E-commerce Platform:
- Average transaction value: $100
- Transactions per hour: 1,000
- Revenue per hour: $100,000
Cost of downtime:
- RTO = 4 hours: $400,000 lost revenue
- RTO = 1 hour: $100,000 lost revenue
- RTO = 5 minutes: $8,333 lost revenue
Cost of data loss:
- RPO = 1 hour: 1,000 transactions to recreate/recover
- RPO = 5 minutes: 83 transactions to recreate/recover
- RPO = 0: No transaction loss
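The arithmetic above can be captured in a small helper for your own numbers; the function names are illustrative, not part of any library:

```python
def downtime_cost(revenue_per_hour, rto_hours):
    """Revenue lost while the service is down."""
    return revenue_per_hour * rto_hours


def transactions_at_risk(tx_per_hour, rpo_hours):
    """Transactions that may need to be recreated or recovered after failover."""
    return int(tx_per_hour * rpo_hours)


# Using the example figures above:
# downtime_cost(100_000, 4)          -> 400000 (RTO = 4 hours)
# transactions_at_risk(1_000, 5/60)  -> 83     (RPO = 5 minutes)
```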
The Four Disaster Recovery Tiers
Tier 1: Backup and Restore
Characteristics:
- RPO: Hours to 24 hours
- RTO: 10+ hours
- Cost: ~10-15% of production infrastructure
- Availability: 95-99%
When to use:
- Development/staging environments
- Internal tools with low business impact
- Batch processing systems
- Data warehouses (acceptable to restore from previous night's backup)
Architecture:
+------------+
| Production |------> Automated Backups
|   Region   |        (S3, RDS Snapshots,
+------------+         EBS Snapshots)
      |
      | Copy to DR Region
      v
+------------+
| DR Region  |
| (No Infra) |
+------------+
Implementation:
# RDS automated backups with cross-region copy
resource "aws_db_instance" "production" {
identifier = "production-db"
engine = "postgres"
instance_class = "db.r6g.xlarge"
backup_retention_period = 7
backup_window = "03:00-04:00"
# Enable automated backups
skip_final_snapshot = false
final_snapshot_identifier = "production-db-final-snapshot"
}
# Copy the latest automated snapshot to the DR region
# (the source must be a snapshot ARN, not a point-in-time timestamp)
data "aws_db_snapshot" "latest" {
  db_instance_identifier = aws_db_instance.production.identifier
  snapshot_type          = "automated"
  most_recent            = true
}
resource "aws_db_snapshot_copy" "dr_backup" {
  provider                      = aws.dr_region
  source_db_snapshot_identifier = data.aws_db_snapshot.latest.db_snapshot_arn
  target_db_snapshot_identifier = "dr-backup-${formatdate("YYYY-MM-DD", timestamp())}"
  copy_tags                     = true
  kms_key_id                    = aws_kms_key.dr_region.arn
}
# S3 bucket replication to DR region (versioning is required on both buckets)
resource "aws_s3_bucket" "production_data" {
  bucket = "production-data"
}
resource "aws_s3_bucket_versioning" "production_data" {
  bucket = aws_s3_bucket.production_data.id
  versioning_configuration {
    status = "Enabled"
  }
}
resource "aws_s3_bucket_replication_configuration" "dr_replication" {
bucket = aws_s3_bucket.production_data.id
role = aws_iam_role.replication.arn
rule {
id = "replicate-to-dr"
status = "Enabled"
destination {
bucket = aws_s3_bucket.dr_data.arn
storage_class = "STANDARD_IA" # Cost savings
replication_time {
status = "Enabled"
time {
minutes = 15
}
}
metrics {
status = "Enabled"
}
}
}
}
# Lifecycle policy for DR backups
resource "aws_backup_plan" "disaster_recovery" {
name = "disaster-recovery-plan"
rule {
rule_name = "daily_backup"
target_vault_name = aws_backup_vault.dr_vault.name
schedule = "cron(0 5 * * ? *)" # 5 AM daily
lifecycle {
delete_after = 30 # Retain for 30 days
}
copy_action {
destination_vault_arn = aws_backup_vault.dr_region_vault.arn
lifecycle {
delete_after = 30
}
}
}
}
Recovery Process:
- Restore RDS snapshot in DR region (1-2 hours)
- Restore application state from S3 (30 minutes)
- Launch EC2/ECS from AMIs/container images (30 minutes)
- Update DNS to point to DR region (5 minutes propagation)
- Total: ~3-4 hours of hands-on recovery time (the 10+ hour RTO leaves room for detection, decision-making, and validation)
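Step 1 can be scripted. A minimal sketch, with the client passed in so the selection logic can be exercised without AWS; in practice you would pass `boto3.client("rds", region_name="us-west-2")`, and the instance names here are illustrative:

```python
def restore_latest_snapshot(rds, source_db, dr_db, instance_class):
    """Find the most recent available snapshot of source_db and restore it as dr_db."""
    snapshots = rds.describe_db_snapshots(DBInstanceIdentifier=source_db)["DBSnapshots"]
    available = [s for s in snapshots if s["Status"] == "available"]
    if not available:
        raise RuntimeError(f"no restorable snapshots for {source_db}")
    # Pick the newest completed snapshot
    latest = max(available, key=lambda s: s["SnapshotCreateTime"])
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=dr_db,
        DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
        DBInstanceClass=instance_class,
    )
    return latest["DBSnapshotIdentifier"]
```

Both boto3 calls (`describe_db_snapshots`, `restore_db_instance_from_db_snapshot`) are real RDS APIs; injecting the client also lets automated DR tests run this logic against a fake.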
Monthly Cost (for production spending $10K/month):
- Cross-region snapshot storage: $300
- S3 replication: $200
- Data transfer for replication: $500
- Total: $1,000/month (10% of production)
Tier 2: Pilot Light
Characteristics:
- RPO: Minutes to 1 hour
- RTO: 1-4 hours
- Cost: ~30-40% of production infrastructure
- Availability: 99.5-99.9%
When to use:
- Business-critical applications
- Customer-facing services with moderate SLAs
- Financial services platforms
- B2B SaaS applications
Architecture:
+----------------+          +----------------+
|   Production   |          |   DR Region    |
|                |          |                |
|  +----------+  |  Async   |  +----------+  |
|  |   RDS    |--|--------->|  |   RDS    |  |
|  | (Primary)|  |  Replica |  | (Replica)|  |
|  +----------+  |          |  +----------+  |
|                |          |                |
|  +----------+  |          |  +----------+  |
|  |   ECS    |  |          |  |   ECS    |  |
|  | (Running)|  |          |  | (Minimal)|  |
|  +----------+  |          |  +----------+  |
+----------------+          +----------------+
Core services running in DR region at minimal capacity; scaled up during disaster.
Implementation:
# RDS with cross-region read replica
resource "aws_db_instance" "production" {
identifier = "production-db"
engine = "postgres"
instance_class = "db.r6g.2xlarge"
}
resource "aws_db_instance" "dr_replica" {
provider = aws.dr_region
identifier = "dr-replica-db"
replicate_source_db = aws_db_instance.production.arn
instance_class = "db.r6g.2xlarge" # Same size for quick promotion
skip_final_snapshot = true
auto_minor_version_upgrade = false
}
# ECS service at minimal capacity in DR region
resource "aws_ecs_service" "dr_app" {
provider = aws.dr_region
name = "app-service-dr"
cluster = aws_ecs_cluster.dr.id
task_definition = aws_ecs_task_definition.app.arn
desired_count = 1 # Minimal capacity (production might be 10)
deployment_configuration {
minimum_healthy_percent = 0
maximum_percent = 200
}
}
# Auto Scaling for failover
resource "aws_appautoscaling_target" "dr_app" {
provider = aws.dr_region
max_capacity = 20 # Scale to production capacity on demand
min_capacity = 1
resource_id = "service/${aws_ecs_cluster.dr.name}/${aws_ecs_service.dr_app.name}"
scalable_dimension = "ecs:service:DesiredCount"
service_namespace = "ecs"
}
# Route 53 health checks and failover
resource "aws_route53_health_check" "production" {
fqdn = "api.example.com"
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = 3
request_interval = 30
}
resource "aws_route53_record" "api_primary" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
set_identifier = "primary"
health_check_id = aws_route53_health_check.production.id
failover_routing_policy {
type = "PRIMARY"
}
alias {
name = aws_lb.production.dns_name
zone_id = aws_lb.production.zone_id
evaluate_target_health = true
}
}
resource "aws_route53_record" "api_secondary" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
set_identifier = "secondary"
failover_routing_policy {
type = "SECONDARY"
}
alias {
name = aws_lb.dr.dns_name
zone_id = aws_lb.dr.zone_id
evaluate_target_health = false
}
}
Recovery Process:
- Promote RDS read replica to primary (5-10 minutes)
- Scale ECS service from 1 to 10 tasks (2-5 minutes)
- Route 53 automatic failover (1 minute)
- Total: 10-15 minutes
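The failover sequence above can be sketched as one function. Clients are injected for testability; the cluster and service names are illustrative. `promote_read_replica` and `update_service` are the real boto3 RDS/ECS calls:

```python
def pilot_light_failover(rds, ecs, replica_id, cluster, service, target_count):
    """Promote the DR replica, then scale the DR service to production capacity."""
    # Step 1: break replication and make the DR database writable
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    # Step 2: scale from pilot-light capacity (1 task) to production capacity
    ecs.update_service(cluster=cluster, service=service, desiredCount=target_count)
    # Step 3 (Route 53 failover) is automatic via the health check configured above
    return {"promoted": replica_id, "desired_count": target_count}
```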
Monthly Cost (for production spending $10K/month):
- RDS read replica: $2,000
- ECS minimal capacity: $150
- Load balancer: $25
- Data replication: $500
- Total: $2,675/month (27% of production)
Tier 3: Warm Standby
Characteristics:
- RPO: Seconds to minutes
- RTO: Minutes to 1 hour
- Cost: ~50-70% of production infrastructure
- Availability: 99.9-99.95%
When to use:
- High-value customer-facing applications
- Trading platforms
- Payment processing
- Real-time services with SLAs
Architecture:
+----------------+          +----------------+
|   Production   |          |   DR Region    |
|                |          |                |
|  +----------+  |  Sync/   |  +----------+  |
|  |   RDS    |--|--------->|  |   RDS    |  |
|  | (Primary)|  |  Async   |  | (Replica)|  |
|  +----------+  |          |  +----------+  |
|                |          |                |
|  +----------+  |          |  +----------+  |
|  |   ECS    |  |          |  |   ECS    |  |
|  |(Full Cap)|  |          |  | (50% Cap)|  |
|  +----------+  |          |  +----------+  |
|                |          |                |
|  Traffic: 100% |          |  Traffic: 0%   |
+----------------+          +----------------+
Full infrastructure running at reduced capacity; can handle traffic immediately.
Implementation:
# DynamoDB Global Tables (automatic multi-region replication)
resource "aws_dynamodb_table" "users" {
name = "users"
billing_mode = "PAY_PER_REQUEST"
hash_key = "userId"
stream_enabled = true
stream_view_type = "NEW_AND_OLD_IMAGES"
attribute {
name = "userId"
type = "S"
}
replica {
region_name = "us-west-2" # DR region
}
replica {
region_name = "eu-west-1" # Additional DR region
}
}
# Aurora Global Database (1-second replication lag)
resource "aws_rds_global_cluster" "main" {
global_cluster_identifier = "global-db"
engine = "aurora-postgresql"
engine_version = "14.6"
database_name = "production"
}
resource "aws_rds_cluster" "primary" {
cluster_identifier = "primary-cluster"
global_cluster_identifier = aws_rds_global_cluster.main.id
engine = aws_rds_global_cluster.main.engine
engine_version = aws_rds_global_cluster.main.engine_version
database_name = "production"
}
resource "aws_rds_cluster" "secondary" {
provider = aws.dr_region
cluster_identifier = "secondary-cluster"
global_cluster_identifier = aws_rds_global_cluster.main.id
engine = aws_rds_global_cluster.main.engine
engine_version = aws_rds_global_cluster.main.engine_version
# Depends on primary being created first
depends_on = [aws_rds_cluster.primary]
}
# ECS at 50% capacity in DR region
resource "aws_ecs_service" "dr_app" {
provider = aws.dr_region
name = "app-service-dr"
cluster = aws_ecs_cluster.dr.id
task_definition = aws_ecs_task_definition.app.arn
desired_count = 5 # 50% of production (10 tasks)
# Can handle traffic immediately
load_balancer {
target_group_arn = aws_lb_target_group.dr_app.arn
container_name = "app"
container_port = 8080
}
}
# Route 53 weighted routing (send small percentage to DR for testing)
resource "aws_route53_record" "api_primary_weighted" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
set_identifier = "primary"
weight = 95 # 95% of traffic
alias {
name = aws_lb.production.dns_name
zone_id = aws_lb.production.zone_id
evaluate_target_health = true
}
}
resource "aws_route53_record" "api_secondary_weighted" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
set_identifier = "secondary"
weight = 5 # 5% of traffic (continuous testing)
alias {
name = aws_lb.dr.dns_name
zone_id = aws_lb.dr.zone_id
evaluate_target_health = true
}
}
Recovery Process:
- Promote Aurora secondary to primary (1 minute)
- Update Route 53 weights (100% to DR) (1 minute propagation)
- Scale ECS from 50% to 100% capacity (2-3 minutes)
- Total: 5 minutes
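Step 2 amounts to building a Route 53 change batch that moves all weight to the DR record. A sketch of that pure logic (record shapes follow the `change_resource_record_sets` API; the set identifiers are illustrative):

```python
def shift_weights(zone_records, dr_set_identifier):
    """Return UPSERT changes giving the DR record weight 100 and all others 0."""
    changes = []
    for record in zone_records:
        updated = dict(record)  # copy so the input list is not mutated
        updated["Weight"] = 100 if record["SetIdentifier"] == dr_set_identifier else 0
        changes.append({"Action": "UPSERT", "ResourceRecordSet": updated})
    return {"Changes": changes}
```

The returned dict would be passed as `ChangeBatch` to `route53.change_resource_record_sets`.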
Monthly Cost (for production spending $10K/month):
- Aurora global database: $3,000
- DynamoDB global tables: $1,500
- ECS at 50% capacity: $2,500
- Load balancer: $25
- Total: $7,025/month (70% of production)
Tier 4: Multi-Site Active-Active
Characteristics:
- RPO: Near zero (bounded by seconds of asynchronous replication lag; true zero requires synchronous replication)
- RTO: Near zero (failover is automatic)
- Cost: 200-300% of single-region production
- Availability: 99.99-99.999%
When to use:
- Mission-critical financial systems
- Trading platforms
- Healthcare systems
- Services with contractual 99.99%+ SLAs
Architecture:
+----------------+          +----------------+
|    Region 1    |  Bi-dir  |    Region 2    |
|                |   Sync   |                |
|  +----------+  |          |  +----------+  |
|  | DynamoDB |<-|--------->|  | DynamoDB |  |
|  |  Global  |  |          |  |  Global  |  |
|  +----------+  |          |  +----------+  |
|                |          |                |
|  +----------+  |          |  +----------+  |
|  |   ECS    |  |          |  |   ECS    |  |
|  |(Full Cap)|  |          |  |(Full Cap)|  |
|  +----------+  |          |  +----------+  |
|                |          |                |
|  Traffic: 50%  |          |  Traffic: 50%  |
+----------------+          +----------------+
Both regions handle production traffic; failure of one region is transparent.
Implementation:
# DynamoDB Global Tables (multi-region active-active)
resource "aws_dynamodb_table" "transactions" {
name = "transactions"
billing_mode = "PAY_PER_REQUEST"
hash_key = "transactionId"
stream_enabled = true
stream_view_type = "NEW_AND_OLD_IMAGES"
attribute {
name = "transactionId"
type = "S"
}
# Multi-region replication
replica {
region_name = "us-east-1"
}
replica {
region_name = "us-west-2"
}
replica {
region_name = "eu-west-1"
}
point_in_time_recovery {
enabled = true
}
}
# Aurora Global Database with write forwarding
resource "aws_rds_cluster" "region1" {
cluster_identifier = "region1-cluster"
global_cluster_identifier = aws_rds_global_cluster.main.id
engine = "aurora-postgresql"
engine_mode = "provisioned"
database_name = "production"
enable_global_write_forwarding = true # Allow writes from any region
}
# Route 53 latency-based routing
resource "aws_route53_record" "api_region1" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
set_identifier = "us-east-1"
latency_routing_policy {
region = "us-east-1"
}
alias {
name = aws_lb.region1.dns_name
zone_id = aws_lb.region1.zone_id
evaluate_target_health = true
}
}
resource "aws_route53_record" "api_region2" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
set_identifier = "us-west-2"
latency_routing_policy {
region = "us-west-2"
}
alias {
name = aws_lb.region2.dns_name
zone_id = aws_lb.region2.zone_id
evaluate_target_health = true
}
}
# CloudFront for global distribution
resource "aws_cloudfront_distribution" "api" {
enabled = true
origin {
domain_name = "api.example.com" # Route 53 handles regional routing
origin_id = "api-multi-region"
custom_origin_config {
http_port = 80
https_port = 443
origin_protocol_policy = "https-only"
origin_ssl_protocols = ["TLSv1.2"]
}
}
default_cache_behavior {
allowed_methods = ["DELETE", "GET", "HEAD", "OPTIONS", "PATCH", "POST", "PUT"]
cached_methods = ["GET", "HEAD"]
target_origin_id = "api-multi-region"
viewer_protocol_policy = "redirect-to-https"
compress = true
forwarded_values {
query_string = true
headers = ["*"]
cookies {
forward = "all"
}
}
}
restrictions {
geo_restriction {
restriction_type = "none"
}
}
viewer_certificate {
acm_certificate_arn = aws_acm_certificate.cert.arn
ssl_support_method = "sni-only"
minimum_protocol_version = "TLSv1.2_2021"
}
}
Recovery Process:
- Automatic: Route 53 health checks detect failure and route traffic to healthy region
- No manual intervention required
- User impact: Minimal; Route 53 health checks detect failure and shift traffic within one to two minutes (failure threshold × check interval, plus DNS TTL), and the surviving region absorbs the load immediately
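"Automatic" is not instantaneous: Route 53 marks an endpoint unhealthy only after `failure_threshold` consecutive failed checks, spaced `request_interval` seconds apart. A rough worst-case detection estimate (ignoring DNS TTL caching at resolvers):

```python
def detection_window_seconds(failure_threshold, request_interval):
    """Approximate time for Route 53 to declare an endpoint unhealthy."""
    return failure_threshold * request_interval

# With the health check settings used earlier (threshold 3, interval 30s):
# detection_window_seconds(3, 30) -> 90 seconds
```

Lowering `request_interval` to 10 seconds (Route 53's fast interval) shrinks this window at a modest extra cost per health check.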
Monthly Cost (for production spending $10K/month):
- Infrastructure in 2+ regions: $20,000
- DynamoDB global tables: $3,000
- Aurora global database: $5,000
- Cross-region data transfer: $2,000
- Total: $30,000/month (300% of single-region production)
Cost vs. Availability Matrix
| DR Tier | RPO | RTO | Availability | Monthly Cost | Use Case |
|---|---|---|---|---|---|
| Backup/Restore | ≤24h | 10-24h | 95-99% | 10-15% | Dev/test, low-priority |
| Pilot Light | 1h | 1-4h | 99.5-99.9% | 30-40% | Business-critical apps |
| Warm Standby | 1m | 5m-1h | 99.9-99.95% | 50-70% | High-value services |
| Active-Active | 0 | 0 | 99.99-99.999% | 200-300% | Mission-critical only |
DR Testing Strategy
Automated Testing Framework
# Lambda function for automated DR tests
resource "aws_lambda_function" "dr_test" {
function_name = "dr-automated-test"
runtime = "python3.11"
handler = "index.handler"
timeout = 900 # 15 minutes
environment {
variables = {
PRIMARY_REGION = "us-east-1"
DR_REGION = "us-west-2"
RDS_IDENTIFIER = aws_db_instance.dr_replica.identifier
}
}
}
# EventBridge rule for quarterly DR tests
resource "aws_cloudwatch_event_rule" "dr_test_schedule" {
name = "quarterly-dr-test"
description = "Run DR test every quarter"
schedule_expression = "cron(0 2 1 */3 ? *)" # 2 AM on 1st day of every 3rd month
}
resource "aws_cloudwatch_event_target" "dr_test" {
rule = aws_cloudwatch_event_rule.dr_test_schedule.name
target_id = "DRTestLambda"
arn = aws_lambda_function.dr_test.arn
}
# Allow EventBridge to invoke the test function
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.dr_test.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.dr_test_schedule.arn
}
DR Test Script:
import json
import time
from datetime import datetime

import boto3


def handler(event, context):
    """
    Automated DR test:
    1. Promote RDS read replica to standalone
    2. Launch ECS tasks in DR region
    3. Update Route 53 to point to DR
    4. Run smoke tests
    5. Rollback
    """
    results = {
        'test_time': datetime.utcnow().isoformat(),
        'steps': []
    }

    try:
        # Step 1: Promote RDS replica (simulation - don't actually promote)
        rds = boto3.client('rds', region_name='us-west-2')
        start_time = time.time()
        # Measure time to promote (dry run only; a real test would call
        # rds.promote_read_replica here)
        results['steps'].append({
            'step': 'RDS Promotion (simulated)',
            'duration_seconds': time.time() - start_time,
            'status': 'success'
        })

        # Step 2: Scale ECS service
        ecs = boto3.client('ecs', region_name='us-west-2')
        start_time = time.time()
        ecs.update_service(
            cluster='dr-cluster',
            service='app-service-dr',
            desiredCount=10  # Scale to production capacity
        )
        # Wait for tasks to be running
        waiter = ecs.get_waiter('services_stable')
        waiter.wait(cluster='dr-cluster', services=['app-service-dr'])
        results['steps'].append({
            'step': 'ECS Scaling',
            'duration_seconds': time.time() - start_time,
            'status': 'success'
        })

        # Step 3: Run smoke tests against DR environment
        start_time = time.time()
        smoke_test_results = run_smoke_tests('dr-region-endpoint')
        results['steps'].append({
            'step': 'Smoke Tests',
            'duration_seconds': time.time() - start_time,
            'status': 'success' if smoke_test_results['passed'] else 'failed',
            'details': smoke_test_results
        })

        # Step 4: Rollback - scale ECS back down
        ecs.update_service(
            cluster='dr-cluster',
            service='app-service-dr',
            desiredCount=1  # Back to pilot light capacity
        )

        results['overall_status'] = 'success'
        results['total_rto_measured'] = sum(
            step['duration_seconds'] for step in results['steps']
        )

    except Exception as e:
        results['overall_status'] = 'failed'
        results['error'] = str(e)

    # Publish results to SNS for team notification
    sns = boto3.client('sns')
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:dr-test-results',
        Subject=f"DR Test Results: {results['overall_status']}",
        Message=json.dumps(results, indent=2)
    )

    return results


def run_smoke_tests(endpoint):
    """Run basic health checks and critical path tests"""
    import requests  # third-party; bundle with the Lambda deployment package

    tests = {
        'health_check': f'https://{endpoint}/health',
        'auth_flow': f'https://{endpoint}/api/auth/token',
        'critical_api': f'https://{endpoint}/api/transactions'
    }

    results = {'passed': True, 'tests': []}
    for test_name, url in tests.items():
        try:
            response = requests.get(url, timeout=10)
            passed = response.status_code == 200
            results['tests'].append({
                'name': test_name,
                'status': 'passed' if passed else 'failed',
                'status_code': response.status_code
            })
            if not passed:
                results['passed'] = False
        except Exception as e:
            results['tests'].append({
                'name': test_name,
                'status': 'failed',
                'error': str(e)
            })
            results['passed'] = False

    return results

Chaos Engineering for DR
# AWS FIS (Fault Injection Simulator) experiment
resource "aws_fis_experiment_template" "region_failure" {
description = "Simulate complete region failure"
role_arn = aws_iam_role.fis.arn
stop_condition {
source = "aws:cloudwatch:alarm"
value = aws_cloudwatch_metric_alarm.error_rate.arn
}
action {
  name      = "terminate-production-instances"
  action_id = "aws:ec2:terminate-instances"
  target {
    key   = "Instances"
    value = "production-instances"
  }
  # This action takes no parameters; the blast radius is controlled by the
  # target's selection_mode below.
}
target {
name = "production-instances"
resource_type = "aws:ec2:instance"
selection_mode = "ALL"
resource_tag {
key = "Environment"
value = "production"
}
resource_tag {
key = "Region"
value = "us-east-1"
}
}
}
Decision Framework
Step 1: Calculate Business Impact
Downtime Cost = Revenue per Hour Γ RTO Hours
Data Loss Cost = Transactions per Hour Γ RPO Hours Γ Average Transaction Value
Step 2: Determine Required Availability
| Business Impact | Target Availability | DR Tier | Monthly Downtime |
|---|---|---|---|
| <$10K/hour | 95% | Backup/Restore | 36 hours |
| $10K-$100K/hour | 99.5% | Pilot Light | 3.6 hours |
| $100K-$1M/hour | 99.9% | Warm Standby | 43 minutes |
| >$1M/hour | 99.99% | Active-Active | 4.3 minutes |
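The table above can be encoded as a simple lookup; the thresholds mirror the table and are a starting point for discussion, not a hard rule:

```python
def recommend_tier(downtime_cost_per_hour):
    """Map hourly downtime cost to a starting DR tier, per the table above."""
    if downtime_cost_per_hour < 10_000:
        return "Backup/Restore"
    if downtime_cost_per_hour < 100_000:
        return "Pilot Light"
    if downtime_cost_per_hour < 1_000_000:
        return "Warm Standby"
    return "Active-Active"
```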
Step 3: Compare DR Cost vs. Downtime Cost
Example: E-commerce platform
- Current revenue: $100K/hour
- Current DR: None (Tier 0)
- Expected annual downtime: 10 hours
- Annual downtime cost: $1M
DR Investment Options:
| Tier | Monthly Cost | Annual Cost | Expected Downtime | Downtime Cost | Net Benefit |
|---|---|---|---|---|---|
| Pilot Light | $3K | $36K | 2 hours | $200K | +$764K |
| Warm Standby | $7K | $84K | 30 minutes | $50K | +$866K |
| Active-Active | $30K | $360K | 0 | $0 | +$640K |
Winner: Warm Standby (best ROI)
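The ROI comparison reduces to one formula: net benefit equals the downtime cost avoided minus the annual DR spend. Reproducing the table's figures:

```python
def net_benefit(baseline_downtime_cost, residual_downtime_cost, annual_dr_cost):
    """Annual benefit of a DR tier vs. doing nothing."""
    return baseline_downtime_cost - residual_downtime_cost - annual_dr_cost


# Article's example: $1M/year expected downtime cost with no DR
options = {
    "Pilot Light":   net_benefit(1_000_000, 200_000, 36_000),   # $764K
    "Warm Standby":  net_benefit(1_000_000, 50_000, 84_000),    # $866K
    "Active-Active": net_benefit(1_000_000, 0, 360_000),        # $640K
}
best = max(options, key=options.get)  # "Warm Standby"
```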
Monitoring DR Readiness
# CloudWatch dashboard for DR metrics
resource "aws_cloudwatch_dashboard" "dr_readiness" {
dashboard_name = "disaster-recovery-readiness"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
properties = {
metrics = [
["AWS/RDS", "ReplicaLag", { stat = "Maximum", label = "RDS Replica Lag" }],
["AWS/DynamoDB", "ReplicationLatency", { stat = "Average", label = "DynamoDB Replication" }],
]
period = 300
stat = "Average"
region = "us-east-1"
title = "Data Replication Lag"
yAxis = {
left = {
label = "Seconds"
min = 0
}
}
}
},
{
type = "metric"
properties = {
metrics = [
["AWS/Route53", "HealthCheckStatus", { stat = "Minimum" }],
]
period = 60
stat = "Minimum"
region = "us-east-1"
title = "Route 53 Health Checks"
}
}
]
})
}
# Alert on high replication lag
resource "aws_cloudwatch_metric_alarm" "replication_lag" {
alarm_name = "rds-replication-lag-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "ReplicaLag"
namespace = "AWS/RDS"
period = 300
statistic = "Average"
threshold = 30 # 30 seconds
alarm_description = "RDS replication lag exceeds 30 seconds"
alarm_actions = [aws_sns_topic.ops_alerts.arn]
dimensions = {
DBInstanceIdentifier = aws_db_instance.dr_replica.id
}
}
Conclusion: Right-Size Your DR Investment
Most organizations over-invest in active-active architecture for workloads that don't require it, or under-invest with no DR for business-critical systems. The key is matching DR tier to actual business impact:
- Calculate downtime cost: Revenue/hour Γ expected annual downtime
- Evaluate DR tiers: Compare annual DR cost vs. downtime cost reduction
- Choose tier with best ROI: Usually pilot light or warm standby for most applications
- Test quarterly: Automated testing ensures DR works when needed
- Monitor continuously: Replication lag and health checks
For a $10M/year revenue business:
- Downtime cost: ~$5K/hour
- Expected downtime without DR: 20 hours/year = $100K
- Pilot light DR cost: $36K/year
- Net benefit: up to $64K/year (assuming pilot light averts most of that downtime)
Don't pay for active-active when warm standby delivers the same business outcome at 1/4 the cost.
Action Items:
- Calculate your actual downtime cost (revenue per hour)
- Audit current DR posture (what tier are you at today?)
- Identify RPO/RTO requirements for each application
- Map applications to appropriate DR tiers
- Implement automated DR testing (start with pilot light for critical apps)
- Run quarterly DR tests and measure actual RTO
- Monitor replication lag continuously
- Document and practice runbooks for failover procedures