Building Multi-Region Active-Active Architectures (Is it Worth It?)
Key takeaways
- Multi-region active-active delivers 99.99%+ uptime but costs 2-3× as much as single-region and requires significant engineering investment
- Data consistency challenges—particularly with write conflicts and replication lag—make active-active unsuitable for many applications
- AWS services like DynamoDB Global Tables, Aurora Global Database, and Route 53 provide building blocks, but orchestration is your responsibility
- Most companies should start with cheaper alternatives (99.9% single-region or warm standby) before committing to active-active complexity
- Success requires comprehensive testing including network partitions, regional failures, and split-brain scenarios with automated runbooks
The Active-Active Promise (and Reality Check)
Your CTO just walked into standup with a mandate: "We need multi-region active-active. Our competitor went down for 2 hours during the AWS us-east-1 outage, and sales is on my back."
Before you spin up a second region and call it done, let's talk about what active-active actually means—and whether you really need it.
Active-Active means multiple regions simultaneously serving production traffic, with automatic failover when one region fails. Sounds simple. In practice, it's one of the most complex architectural patterns you can implement.
The Reality of Multi-Region Costs
Let me show you what you're signing up for:
Single-Region Architecture (99.9% SLA):
- Compute: $12K/month (Multi-AZ ECS/Lambda)
- Database: $8K/month (Multi-AZ RDS)
- Networking: $2K/month
- Total: $22K/month
Multi-Region Active-Active (99.99% SLA):
- Compute: $24K/month (2 regions × 100% capacity)
- Database: $20K/month (Aurora Global + replication)
- Networking: $8K/month (cross-region, Global Accelerator)
- Observability: $4K/month (centralized logging, multi-region traces)
- Data replication: $2K/month
- Total: $58K/month
That's a $432K annual premium for an extra "9" of availability.
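If you want to rerun that comparison with your own numbers, here's a small sketch. The line items are copied from the two lists above; the dictionary keys and variable names are just illustrative.

```python
# Monthly line items (USD) copied from the two cost breakdowns above.
single_region = {"compute": 12_000, "database": 8_000, "networking": 2_000}
active_active = {
    "compute": 24_000,
    "database": 20_000,
    "networking": 8_000,
    "observability": 4_000,
    "data_replication": 2_000,
}

single_total = sum(single_region.values())
active_total = sum(active_active.values())
annual_premium = (active_total - single_total) * 12
cost_multiple = active_total / single_total

print(f"Single-region: ${single_total:,}/month")   # Single-region: $22,000/month
print(f"Active-active: ${active_total:,}/month")   # Active-active: $58,000/month
print(f"Annual premium: ${annual_premium:,}")      # Annual premium: $432,000
print(f"Cost multiple: {cost_multiple:.2f}x")      # Cost multiple: 2.64x
```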
When Active-Active Actually Makes Sense
Let's cut through the hype. Active-active is justified when downtime cost exceeds implementation cost:
Scenario 1: Financial Services Trading Platform
Downtime Cost:
- Revenue impact: $500K/hour
- SLA penalties: $250K per incident
- Regulatory fines: Potential millions
99.9% uptime = 8.76 hours downtime/year
- Annual downtime cost: $4.3M minimum
99.99% uptime = 52 minutes downtime/year
- Annual downtime cost: $433K
ROI Calculation:
Downtime savings: $4.3M - $433K = $3.87M/year
Active-active cost: $432K/year
Net benefit: $3.4M/year
Verdict: ✅ Justified
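Here's a rough sketch of the same ROI arithmetic, in case you want to plug in your own downtime cost. The function names are made up for illustration, and it only counts the hourly revenue impact (no SLA penalties or fines), just like the calculation above.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760

def annual_downtime_hours(availability_pct: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

def annual_downtime_cost(availability_pct: float, cost_per_hour: float) -> float:
    """Downtime hours multiplied by the hourly revenue impact."""
    return annual_downtime_hours(availability_pct) * cost_per_hour

cost_per_hour = 500_000       # Scenario 1 revenue impact
active_active_premium = 432_000

savings = (annual_downtime_cost(99.9, cost_per_hour)
           - annual_downtime_cost(99.99, cost_per_hour))
net_benefit = savings - active_active_premium

print(f"Downtime at 99.9%:  {annual_downtime_hours(99.9):.2f} h/year")
print(f"Downtime at 99.99%: {annual_downtime_hours(99.99) * 60:.0f} min/year")
print(f"Downtime savings: ${savings:,.0f}")
print(f"Net benefit: ${net_benefit:,.0f}")
```

Without the rounding used in the text, the savings come out at roughly $3.94M and the net benefit at roughly $3.51M, so the verdict doesn't change.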
Scenario 2: B2B SaaS (Mid-Market)
Downtime Cost:
- Revenue impact: $5K/hour
- SLA credits: $20K per incident (1-2x/year)
- Churn risk: Minimal for occasional outages
99.9% uptime cost:
- Single-region: $264K/year
- Downtime impact: ~$44K + $40K credits = $84K/year
- Total cost of ownership: $348K/year
99.99% uptime cost:
- Active-active: $696K/year
- Downtime impact: ~$4K + minimal credits
- Total cost of ownership: $700K/year
Verdict: ❌ Not justified (Better to invest in Tier 2 Pilot Light: $396K/year with 99.5% uptime)
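The comparison above boils down to total cost of ownership: infrastructure plus what downtime still costs you. A minimal sketch using the Scenario 2 figures (the function name is illustrative):

```python
def total_cost_of_ownership(infra: int, downtime: int, sla_credits: int = 0) -> int:
    """Annual infrastructure spend plus the annual cost of remaining downtime."""
    return infra + downtime + sla_credits

single_region_tco = total_cost_of_ownership(264_000, 44_000, 40_000)
active_active_tco = total_cost_of_ownership(696_000, 4_000)

extra_infra = 696_000 - 264_000
downtime_avoided = (44_000 + 40_000) - 4_000

print(f"Single-region TCO: ${single_region_tco:,}")   # Single-region TCO: $348,000
print(f"Active-active TCO: ${active_active_tco:,}")   # Active-active TCO: $700,000
print(f"Extra infrastructure: ${extra_infra:,}")      # Extra infrastructure: $432,000
print(f"Downtime avoided: ${downtime_avoided:,}")     # Downtime avoided: $80,000
```

Spending $432K a year to avoid $80K of downtime is why this one doesn't pencil out.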
Scenario 3: High-Traffic E-Commerce
Downtime Cost:
- Revenue: $2M/day = $83K/hour
- Peak season (Q4): $8M/day = $333K/hour
- Brand damage: Significant
Decision:
- Use Tier 3 Warm Standby (99.9%) for 9 months: $34K/month
- Switch to Tier 4 Active-Active for Q4: $58K/month × 3
- Blended annual cost: $480K vs. $696K full-year active-active
Verdict: ✅ Seasonally justified
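The blended figure is a weighted sum over the year. A quick sketch, assuming a three-month peak season as above (the function name is hypothetical):

```python
def blended_annual_cost(base_monthly: int, peak_monthly: int, peak_months: int) -> int:
    """Run the cheaper tier most of the year and the expensive tier in peak months."""
    return base_monthly * (12 - peak_months) + peak_monthly * peak_months

blended = blended_annual_cost(base_monthly=34_000, peak_monthly=58_000, peak_months=3)
full_year_active_active = 58_000 * 12

print(f"Blended: ${blended:,}")                                 # Blended: $480,000
print(f"Full-year active-active: ${full_year_active_active:,}") # Full-year active-active: $696,000
print(f"Saved: ${full_year_active_active - blended:,}")         # Saved: $216,000
```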
The Data Consistency Challenge
This is where active-active gets hard. Really hard.
Problem 1: Write Conflicts
User updates their profile in us-east-1 and eu-west-1 simultaneously:
T0: User changes email in us-east-1: "old@example.com" → "new@example.com"
T0: User changes email in eu-west-1: "old@example.com" → "newer@example.com"
T1: Replication arrives in both regions
Result: Which email wins?
DynamoDB Global Tables solution:
- Last Writer Wins (LWW) based on timestamp
- Application must handle conflicts in business logic
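To make "last writer wins" concrete, here's a toy model of the email example. This is not how DynamoDB implements it internally; the `Write` class, the timestamps, and the tie-break on region name are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Write:
    region: str
    value: str
    timestamp_ms: int

def last_writer_wins(a: Write, b: Write) -> Write:
    """Keep the later write; break ties on region name so every replica
    picks the same winner (the tie-break rule is an assumption)."""
    return max(a, b, key=lambda w: (w.timestamp_ms, w.region))

us = Write("us-east-1", "new@example.com", timestamp_ms=1_000)
eu = Write("eu-west-1", "newer@example.com", timestamp_ms=1_003)

winner = last_writer_wins(us, eu)
# Both replicas apply the same rule to the same pair, so they converge
assert last_writer_wins(eu, us) == winner
# but the us-east-1 update is dropped without any error being raised.
print(winner.value)  # newer@example.com
```

The regions end up consistent, but one user-visible update vanishes silently, which is why the application still has to care.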
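Your application has to guard its own writes:
// Assumes `dynamodb` is an AWS SDK v2 DynamoDB client
// Bad: unconditional write; a concurrent update is silently overwritten
await dynamodb.updateItem({
  TableName: 'users',
  Key: { userId: { S: '123' } },
  UpdateExpression: 'SET email = :email',
  ExpressionAttributeValues: { ':email': { S: 'new@example.com' } }
}).promise();
// Better: optimistic locking with a version attribute.
// The condition is evaluated only against the local replica, so it catches
// concurrent writes within a region; writes that race across regions are
// still resolved by last writer wins.
await dynamodb.updateItem({
  TableName: 'users',
  Key: { userId: { S: '123' } },
  UpdateExpression: 'SET email = :email, version = :newVersion',
  ConditionExpression: 'version = :oldVersion',
  ExpressionAttributeValues: {
    ':email': { S: 'new@example.com' },
    ':newVersion': { N: '15' },
    ':oldVersion': { N: '14' }
  }
}).promise();
Problem 2: Cross-Region Replication Lag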
Aurora Global Database replicates in under 1 second—but that's not zero.
Real-world scenario:
T0: User completes payment in us-east-1
↓ Write to Aurora primary
T0: Payment service returns success, sends email
T0: User immediately clicks "View Receipt"
↓ Route 53 latency routing sends to eu-west-1
T0.8s: Request hits eu-west-1 read replica
⚠️ Payment not yet replicated
→ User sees "Payment not found"
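The timeline above is easy to reproduce with a toy replica. This sketch assumes a fixed 1-second replication lag; the `LaggedReplica` class and the `pay_1` key are made up for illustration, and real lag varies.

```python
from dataclasses import dataclass, field

@dataclass
class LaggedReplica:
    """Toy read replica: a write becomes visible only after `lag_s` seconds."""
    lag_s: float
    _pending: dict = field(default_factory=dict)  # key -> (value, visible_at)

    def replicate(self, key, value, written_at):
        self._pending[key] = (value, written_at + self.lag_s)

    def read(self, key, now):
        entry = self._pending.get(key)
        if entry is None or now < entry[1]:
            return None  # not replicated yet
        return entry[0]

primary = {}
replica = LaggedReplica(lag_s=1.0)

# T0: payment is written in the primary region and replication starts
primary["pay_1"] = "PAID"
replica.replicate("pay_1", "PAID", written_at=0.0)

print(primary.get("pay_1"))            # PAID
print(replica.read("pay_1", now=0.8))  # None  (user sees "Payment not found")
print(replica.read("pay_1", now=1.2))  # PAID
```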
Solutions:
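Option 1: Session Stickiness
Keep each user pinned to one region so that reads follow their writes. Latency-based routing with health checks tends to resolve a given client to the same region, but it is not a guarantee; strict stickiness needs a region hint carried in the session (see Option 2).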
resource "aws_route53_health_check" "regional" {
type = "HTTPS"
resource_path = "/health"
fqdn = "api-${var.region}.example.com"
port = 443
request_interval = 10
failure_threshold = 2
}
resource "aws_route53_record" "latency" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
set_identifier = var.region
# Latency-based routing with health checks
latency_routing_policy {
region = var.region
}
health_check_id = aws_route53_health_check.regional.id
alias {
name = aws_lb.regional.dns_name
zone_id = aws_lb.regional.zone_id
evaluate_target_health = true
}
}
Option 2: Write-Region Hints
// After a write, return the write region and time so the client can send them back on reads
const payment = await createPayment(data);
return {
  ...payment,
  _writeRegion: process.env.AWS_REGION,
  _writeTimestamp: Date.now()
};
// On read, proxy to the write region while the write may still be replicating
const REPLICATION_WINDOW_MS = 2000;
async function getPayment(
  paymentId: string,
  writeRegion?: string,
  writeTimestamp?: number
) {
  const currentRegion = process.env.AWS_REGION;
  // Only proxy when the caller supplied both hints and we are in a different region
  if (writeRegion && writeTimestamp !== undefined && writeRegion !== currentRegion) {
    const timeSinceWrite = Date.now() - writeTimestamp;
    if (timeSinceWrite < REPLICATION_WINDOW_MS) {
      return proxyToRegion(writeRegion, paymentId);
    }
  }
  return await db.getPayment(paymentId);
}
Option 3: Primary-Write Pattern
// Only write to primary region, replicas are read-only
const WRITE_REGION = 'us-east-1';
async function updateUser(userId: string, data: any) {
if (process.env.AWS_REGION !== WRITE_REGION) {
// Forward write to primary region
return await httpClient.post(
`https://api-${WRITE_REGION}.internal.example.com/users/${userId}`,
data
);
}
// Execute write in primary
return await db.updateUser(userId, data);
}
Problem 3: Distributed Transactions
You can't do this in active-active:
// ❌ This pattern breaks in multi-region
await db.transaction(async (trx) => {
await trx('accounts').where({id: 1}).decrement('balance', 100);
await trx('accounts').where({id: 2}).increment('balance', 100);
await trx('ledger').insert({from: 1, to: 2, amount: 100});
});
Why it breaks:
- Each region needs independent transactions
- Cross-region coordination = seconds of latency
- Network partition = split-brain
Solution: Event sourcing + eventual consistency
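Before the DynamoDB version, here's the core idea as an in-memory sketch: each transfer is a command with its own ID, processing is idempotent, and a failed step is undone by a compensating action instead of a database rollback. The balances, account IDs, and return strings are hypothetical.

```python
class InsufficientFunds(Exception):
    pass

balances = {"1": 150, "2": 0}
processed = set()  # transfer IDs already applied (idempotency guard)

def process_transfer(transfer_id: str, src: str, dst: str, amount: int) -> str:
    """Apply a transfer command at most once; undo the debit if the credit fails."""
    if transfer_id in processed:
        return "DUPLICATE"          # replayed command has no effect
    if balances[src] < amount:
        raise InsufficientFunds(src)
    balances[src] -= amount         # step 1: reserve (debit) the funds
    try:
        balances[dst] += amount     # step 2: credit the destination
    except KeyError:
        balances[src] += amount     # compensation: give the reserved funds back
        return "ROLLED_BACK"
    processed.add(transfer_id)      # step 3: mark complete
    return "COMPLETED"

print(process_transfer("t-1", "1", "2", 100))       # COMPLETED
print(process_transfer("t-1", "1", "2", 100))       # DUPLICATE
print(process_transfer("t-2", "1", "missing", 25))  # ROLLED_BACK
print(balances)                                     # {'1': 50, '2': 100}
```

The duplicate check matters because a command replicated to several regions can be picked up more than once.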
// Instead: Record intent, reconcile async
interface TransferCommand {
transferId: string;
fromAccount: string;
toAccount: string;
amount: number;
region: string;
timestamp: number;
}
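// Each region writes to DynamoDB Global Table
// (assumes an AWS SDK v2 DynamoDB client and `import { ulid } from 'ulid'`)
await dynamodb.putItem({
TableName: 'transfer_commands',
Item: {
transferId: { S: ulid() },
fromAccount: { S: '1' },
toAccount: { S: '2' },
amount: { N: '100' },
region: { S: process.env.AWS_REGION },
timestamp: { N: Date.now().toString() },
status: { S: 'PENDING' }
}
}).promise();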
// Background processor reconciles (Saga pattern)
async function processTransfer(command: TransferCommand) {
try {
// 1. Reserve funds
await reserveFunds(command.fromAccount, command.amount);
// 2. Execute transfer (idempotent)
await executeTransfer(command);
// 3. Mark complete
await markTransferComplete(command.transferId);
} catch (error) {
await rollbackTransfer(command.transferId);
}
}
Implementation Pattern: Full Stack Active-Active
Here's the skeleton of a multi-region setup:
Step 1: Data Layer (DynamoDB Global Tables)
# Primary region (us-east-1)
resource "aws_dynamodb_table" "users" {
name = "users"
billing_mode = "PAY_PER_REQUEST"
hash_key = "userId"
stream_enabled = true
stream_view_type = "NEW_AND_OLD_IMAGES"
attribute {
name = "userId"
type = "S"
}
# Enable Point-in-Time Recovery
point_in_time_recovery {
enabled = true
}
# Global table configuration
replica {
region_name = "eu-west-1"
}
replica {
region_name = "ap-southeast-1"
}
}
Step 2: Compute Layer (ECS + Global Accelerator)
# Deploy in each region
module "api_useast1" {
source = "./modules/api"
providers = {
aws = aws.useast1
}
region = "us-east-1"
vpc_id = aws_vpc.useast1.id
desired_count = 6 # Full capacity
image = var.api_image
database_endpoint = aws_rds_cluster.primary.endpoint
}
module "api_euwest1" {
source = "./modules/api"
providers = {
aws = aws.euwest1
}
region = "eu-west-1"
vpc_id = aws_vpc.euwest1.id
desired_count = 6 # Full capacity
image = var.api_image
database_endpoint = aws_rds_cluster.secondary.endpoint
}
# AWS Global Accelerator for anycast routing
resource "aws_globalaccelerator_accelerator" "main" {
name = "api-accelerator"
ip_address_type = "IPV4"
enabled = true
}
resource "aws_globalaccelerator_listener" "https" {
accelerator_arn = aws_globalaccelerator_accelerator.main.id
protocol = "TCP"
port_range {
from_port = 443
to_port = 443
}
}
resource "aws_globalaccelerator_endpoint_group" "useast1" {
listener_arn = aws_globalaccelerator_listener.https.id
endpoint_group_region = "us-east-1"
endpoint_configuration {
endpoint_id = aws_lb.useast1.arn
weight = 100
client_ip_preservation_enabled = true
}
health_check_interval_seconds = 10
health_check_path = "/health"
health_check_port = 443
health_check_protocol = "HTTPS"
threshold_count = 2
traffic_dial_percentage = 100
}
resource "aws_globalaccelerator_endpoint_group" "euwest1" {
listener_arn = aws_globalaccelerator_listener.https.id
endpoint_group_region = "eu-west-1"
endpoint_configuration {
endpoint_id = aws_lb.euwest1.arn
weight = 100
}
health_check_interval_seconds = 10
health_check_path = "/health"
health_check_port = 443
health_check_protocol = "HTTPS"
threshold_count = 2
traffic_dial_percentage = 100
}
Step 3: Database Layer (Aurora Global)
resource "aws_rds_global_cluster" "main" {
global_cluster_identifier = "global-production"
engine = "aurora-postgresql"
engine_version = "14.6"
database_name = "production"
}
# Primary cluster in us-east-1
resource "aws_rds_cluster" "primary" {
provider = aws.useast1
cluster_identifier = "primary-cluster"
global_cluster_identifier = aws_rds_global_cluster.main.id
engine = aws_rds_global_cluster.main.engine
engine_version = aws_rds_global_cluster.main.engine_version
database_name = aws_rds_global_cluster.main.database_name
master_username = var.db_username
master_password = var.db_password
backup_retention_period = 7
preferred_backup_window = "03:00-04:00"
}
resource "aws_rds_cluster_instance" "primary" {
provider = aws.useast1
count = 2
identifier = "primary-instance-${count.index}"
cluster_identifier = aws_rds_cluster.primary.id
instance_class = "db.r6g.2xlarge"
engine = aws_rds_cluster.primary.engine
engine_version = aws_rds_cluster.primary.engine_version
}
# Secondary cluster in eu-west-1
resource "aws_rds_cluster" "secondary" {
provider = aws.euwest1
cluster_identifier = "secondary-cluster"
global_cluster_identifier = aws_rds_global_cluster.main.id
engine = aws_rds_global_cluster.main.engine
engine_version = aws_rds_global_cluster.main.engine_version
# Replica cluster - no master credentials
depends_on = [aws_rds_cluster_instance.primary]
}
resource "aws_rds_cluster_instance" "secondary" {
provider = aws.euwest1
count = 2
identifier = "secondary-instance-${count.index}"
cluster_identifier = aws_rds_cluster.secondary.id
instance_class = "db.r6g.2xlarge"
engine = aws_rds_cluster.secondary.engine
engine_version = aws_rds_cluster.secondary.engine_version
}
Step 4: Application Health Checks
Deep health check (not just HTTP 200):
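import { DynamoDB } from 'aws-sdk';
import { IncomingMessage, ServerResponse } from 'http';
// Regional DynamoDB client; `db` (used in checkDatabase) is assumed to be
// the application's existing SQL client
const dynamodb = new DynamoDB();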
interface HealthStatus {
status: 'healthy' | 'degraded' | 'unhealthy';
timestamp: string;
region: string;
checks: {
database: HealthCheck;
dynamodb: HealthCheck;
crossRegionReplication: HealthCheck;
};
}
interface HealthCheck {
status: 'pass' | 'fail';
latency: number;
details?: string;
}
export async function healthHandler(
req: IncomingMessage,
res: ServerResponse
): Promise<void> {
const startTime = Date.now();
const region = process.env.AWS_REGION || 'unknown';
const checks = await Promise.all([
checkDatabase(),
checkDynamoDB(),
checkCrossRegionReplication()
]);
const [database, dynamodb, crossRegionReplication] = checks;
// Determine overall health. Each check is pass/fail, so 'degraded' is never
// produced here; it is reserved for checks that report warnings.
const anyFailing = checks.some(c => c.status === 'fail');
const health: HealthStatus = {
status: anyFailing ? 'unhealthy' : 'healthy',
timestamp: new Date().toISOString(),
region,
checks: { database, dynamodb, crossRegionReplication }
};
// Return 200 only if healthy, 503 otherwise
// This allows Global Accelerator to fail over
const statusCode = health.status === 'healthy' ? 200 : 503;
res.writeHead(statusCode, { 'Content-Type': 'application/json' });
res.end(JSON.stringify(health));
}
async function checkDatabase(): Promise<HealthCheck> {
const start = Date.now();
try {
const result = await db.raw('SELECT 1 as health');
return {
status: 'pass',
latency: Date.now() - start
};
} catch (error) {
return {
status: 'fail',
latency: Date.now() - start,
details: error.message
};
}
}
async function checkDynamoDB(): Promise<HealthCheck> {
const start = Date.now();
try {
await dynamodb.getItem({
TableName: 'health-checks',
Key: { id: { S: 'health' } }
}).promise();
return {
status: 'pass',
latency: Date.now() - start
};
} catch (error) {
return {
status: 'fail',
latency: Date.now() - start,
details: error.message
};
}
}
async function checkCrossRegionReplication(): Promise<HealthCheck> {
const start = Date.now();
const testKey = `repl-test-${Date.now()}`;
try {
// Write test item in current region
await dynamodb.putItem({
TableName: 'health-checks',
Item: {
id: { S: testKey },
timestamp: { N: Date.now().toString() },
region: { S: process.env.AWS_REGION }
}
}).promise();
// Brief pause before reading the item back
await new Promise(resolve => setTimeout(resolve, 500));
// Strongly consistent read from the LOCAL replica: this validates the local
// write path only. To measure real cross-region lag, read the item from the
// other region or alarm on the CloudWatch ReplicationLatency metric.
const result = await dynamodb.getItem({
TableName: 'health-checks',
Key: { id: { S: testKey } },
ConsistentRead: true
}).promise();
if (!result.Item) {
throw new Error('Test item not found after write');
}
const latency = Date.now() - start;
// Fail above 2s; between 1s and 2s, pass but flag the slowness.
// The 500ms pause above is included in this latency.
return {
status: latency > 2000 ? 'fail' : 'pass',
latency,
details: latency > 1000 ? 'Slow write/read round trip' : undefined
};
} catch (error) {
return {
status: 'fail',
latency: Date.now() - start,
details: error.message
};
}
}
Common Pitfalls (and How to Avoid Them)
Pitfall 1: Forgetting About Asset Storage
You set up Aurora Global and DynamoDB Global Tables, but your S3 bucket is in us-east-1 only.
Problem:
- EU users upload to the us-east-1 bucket
- Cross-region latency: 100-150ms for each PUT/GET
- Data transfer costs: $0.02/GB
Solution: S3 Cross-Region Replication + CloudFront
# Primary bucket in us-east-1
resource "aws_s3_bucket" "primary" {
provider = aws.useast1
bucket = "assets-useast1"
versioning {
enabled = true
}
}
# Replica bucket in eu-west-1
resource "aws_s3_bucket" "replica" {
provider = aws.euwest1
bucket = "assets-euwest1"
versioning {
enabled = true
}
}
# Replication configuration
resource "aws_s3_bucket_replication_configuration" "primary_to_replica" {
provider = aws.useast1
role = aws_iam_role.replication.arn
bucket = aws_s3_bucket.primary.id
rule {
id = "replicate-all"
status = "Enabled"
destination {
bucket = aws_s3_bucket.replica.arn
storage_class = "STANDARD"
replication_time {
status = "Enabled"
time {
minutes = 15
}
}
metrics {
status = "Enabled"
event_threshold {
minutes = 15
}
}
}
}
}
# CloudFront with regional origins
resource "aws_cloudfront_distribution" "assets" {
enabled = true
origin {
domain_name = aws_s3_bucket.primary.bucket_regional_domain_name
origin_id = "S3-useast1"
s3_origin_config {
origin_access_identity = aws_cloudfront_origin_access_identity.main.cloudfront_access_identity_path
}
}
origin {
domain_name = aws_s3_bucket.replica.bucket_regional_domain_name
origin_id = "S3-euwest1"
s3_origin_config {
origin_access_identity = aws_cloudfront_origin_access_identity.main.cloudfront_access_identity_path
}
}
# Origin group for failover
origin_group {
origin_id = "S3-group"
failover_criteria {
status_codes = [403, 404, 500, 502, 503, 504]
}
member {
origin_id = "S3-useast1"
}
member {
origin_id = "S3-euwest1"
}
}
default_cache_behavior {
target_origin_id = "S3-group"
viewer_protocol_policy = "redirect-to-https"
allowed_methods = ["GET", "HEAD", "OPTIONS"]
cached_methods = ["GET", "HEAD"]
forwarded_values {
query_string = false
cookies {
forward = "none"
}
}
min_ttl = 0
default_ttl = 86400
max_ttl = 31536000
}
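}
For uploads: Use presigned URLs with regional S3 endpoints:
import { S3 } from 'aws-sdk';
async function generateUploadUrl(fileName: string): Promise<string> {
  // Route to the bucket in this region
  const region = process.env.AWS_REGION;
  if (!region) {
    throw new Error('AWS_REGION is not set');
  }
  // Bucket names above drop the hyphens ("assets-useast1"), so strip them here
  const bucket = `assets-${region.replace(/-/g, '')}`;
  const s3 = new S3({ region });
  // Replication above is one-way (primary to replica); objects uploaded to the
  // replica bucket need a reverse replication rule to reach us-east-1
  return s3.getSignedUrlPromise('putObject', {
    Bucket: bucket,
    Key: fileName,
    Expires: 300, // 5 minutes
    ContentType: 'image/jpeg'
  });
}
Pitfall 2: Centralized Logging Becomes a Bottleneck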
Each region generates 500GB/day of logs. You ship everything to us-east-1 OpenSearch.
Problem:
- Cross-region data transfer: 500GB × $0.02 = $10/day = $3,650/year per region
- OpenSearch cluster in one region = single point of failure
Solution: Regional logging + centralized dashboards
# Regional OpenSearch cluster in each region
module "opensearch_useast1" {
source = "./modules/opensearch"
providers = { aws = aws.useast1 }
cluster_name = "logs-useast1"
region = "us-east-1"
}
module "opensearch_euwest1" {
source = "./modules/opensearch"
providers = { aws = aws.euwest1 }
cluster_name = "logs-euwest1"
region = "eu-west-1"
}
# Kinesis Firehose in each region (stays local)
resource "aws_kinesis_firehose_delivery_stream" "logs_useast1" {
provider = aws.useast1
name = "logs-stream"
destination = "elasticsearch"
elasticsearch_configuration {
domain_arn = module.opensearch_useast1.domain_arn
role_arn = aws_iam_role.firehose.arn
index_name = "logs"
}
}
# Grafana for cross-region queries
resource "aws_grafana_workspace" "main" {
account_access_type = "CURRENT_ACCOUNT"
authentication_providers = ["AWS_SSO"]
permission_type = "SERVICE_MANAGED"
data_sources = ["AMAZON_OPENSEARCH_SERVICE"]
}
Pitfall 3: Secrets Management Across Regions
Your app needs database passwords, API keys, and encryption keys in both regions.
Problem:
- Secrets Manager doesn't replicate secrets across regions unless you configure replicas
- Manual replication = drift risk
Solution: Multi-region secrets with replication
resource "aws_secretsmanager_secret" "db_password" {
provider = aws.useast1
name = "production/db/password"
replica {
region = "eu-west-1"
}
replica {
region = "ap-southeast-1"
}
}
resource "aws_secretsmanager_secret_version" "db_password" {
provider = aws.useast1
secret_id = aws_secretsmanager_secret.db_password.id
secret_string = random_password.db.result
}
# Rotate secrets with Lambda (runs in primary region only)
resource "aws_secretsmanager_secret_rotation" "db_password" {
provider = aws.useast1
secret_id = aws_secretsmanager_secret.db_password.id
rotation_lambda_arn = aws_lambda_function.rotate_secret.arn
rotation_rules {
automatically_after_days = 30
}
}
Testing Strategy: Chaos Engineering
You haven't built active-active until you've tested regional failure. Here's how:
Test 1: Regional Failover
import boto3
import time

def test_regional_failover():
    """Simulate us-east-1 failure, verify eu-west-1 takeover"""
    # 1. Baseline: Both regions healthy
    assert check_health('us-east-1') == 'healthy'
    assert check_health('eu-west-1') == 'healthy'
    # 2. Simulate us-east-1 failure (block health check)
    block_health_check('us-east-1')
    # 3. Wait for Global Accelerator to detect failure (30 seconds)
    time.sleep(35)
    # 4. Verify traffic shifted to eu-west-1
    for i in range(10):
        region = make_request('https://api.example.com/health')['region']
        assert region == 'eu-west-1', f"Request {i} routed to {region}"
    # 5. Verify data consistency
    user_id = create_test_user()
    time.sleep(2)  # Replication lag
    user = get_user(user_id)
    assert user is not None
    # 6. Restore us-east-1
    unblock_health_check('us-east-1')
    time.sleep(35)
    # 7. Verify traffic rebalances
    regions = set()
    for i in range(100):
        region = make_request('https://api.example.com/health')['region']
        regions.add(region)
    assert 'us-east-1' in regions
    assert 'eu-west-1' in regions

def block_health_check(region: str):
    """Block ALB health check to simulate regional failure"""
    ec2 = boto3.client('ec2', region_name=region)
    # Modify security group to block health check
    response = ec2.describe_security_groups(
        Filters=[{'Name': 'tag:Name', 'Values': ['alb-sg']}]
    )
    sg_id = response['SecurityGroups'][0]['GroupId']
    # Remove health check ingress rule
    ec2.revoke_security_group_ingress(
        GroupId=sg_id,
        IpPermissions=[{
            'IpProtocol': 'tcp',
            'FromPort': 443,
            'ToPort': 443,
            'IpRanges': [{'CidrIp': '0.0.0.0/0'}]
        }]
    )
Test 2: Network Partition (Split-Brain)
def test_network_partition():
    """Verify graceful degradation during network partition"""
    # 1. Create test data in us-east-1
    user_id = create_user_in_region('us-east-1', {
        'email': 'test@example.com',
        'version': 1
    })
    # 2. Wait for replication
    time.sleep(2)
    # 3. Block cross-region traffic (VPC peering)
    # Note: Global Tables replication is service-managed, so blocking VPC
    # peering alone will not pause it; this isolates application traffic.
    block_cross_region_traffic('us-east-1', 'eu-west-1')
    # 4. Attempt conflicting writes in both regions
    update_user_in_region('us-east-1', user_id, {
        'email': 'updated-us@example.com',
        'version': 2
    })
    update_user_in_region('eu-west-1', user_id, {
        'email': 'updated-eu@example.com',
        'version': 2
    })
    # 5. Restore connectivity
    unblock_cross_region_traffic('us-east-1', 'eu-west-1')
    # 6. Wait for conflict resolution (Last Write Wins)
    time.sleep(5)
    # 7. Verify both regions converged to same value
    user_us = get_user_from_region('us-east-1', user_id)
    user_eu = get_user_from_region('eu-west-1', user_id)
    assert user_us['email'] == user_eu['email']
    assert user_us['version'] == user_eu['version']
    # 8. Check the CloudWatch alarm on a conflict counter. This has to be a
    # custom application metric; DynamoDB does not publish a conflict count.
    assert check_metric_alarm('DynamoDB-ConflictCount') == 'ALARM'
Test 3: Database Promotion
def test_aurora_promotion():
    """Verify secondary Aurora cluster can be promoted to primary"""
    # 1. Current state
    primary = get_aurora_cluster('us-east-1', 'primary-cluster')
    secondary = get_aurora_cluster('eu-west-1', 'secondary-cluster')
    assert primary['GlobalWriteForwardingStatus'] == 'disabled'
    assert secondary['GlobalWriteForwardingStatus'] == 'enabled'
    # 2. Simulate primary region failure
    block_aurora_traffic('us-east-1')
    # 3. Promote secondary to primary
    # The API expects the target cluster's ARN; this assumes get_aurora_cluster
    # returns the describe_db_clusters entry. AllowDataLoss=True requests an
    # unplanned failover, since the primary is unreachable here.
    rds = boto3.client('rds', region_name='eu-west-1')
    rds.failover_global_cluster(
        GlobalClusterIdentifier='global-production',
        TargetDbClusterIdentifier=secondary['DBClusterArn'],
        AllowDataLoss=True
    )
    # 4. Wait for promotion (can take 2-5 minutes)
    wait_for_aurora_promotion('eu-west-1', 'secondary-cluster', timeout=300)
    # 5. Verify writes work in new primary
    user_id = create_test_user()  # Should route to eu-west-1
    assert user_id is not None
    # 6. Verify old primary became read-only
    # (or is unreachable due to region failure)
    time.sleep(60)  # Wait for DNS propagation
    # 7. Restore and verify dual-region operation
    unblock_aurora_traffic('us-east-1')
    time.sleep(300)  # Replication catch-up
    # Verify data consistency
    user = get_user(user_id)
    assert user is not None
Cost Optimization Strategies
Even with active-active, you can reduce costs:
1. Right-Size by Traffic Patterns
Not all regions need equal capacity, provided auto scaling can absorb a failed region's traffic:
locals {
# Traffic distribution: 60% US, 30% EU, 10% APAC
capacity_by_region = {
"us-east-1" = 12 # 60% of 20 total instances
"eu-west-1" = 6 # 30%
"ap-southeast-1" = 2 # 10%
}
}
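# Note: Terraform provider references must be static, so `aws[each.key]`
# is illustrative; in practice declare one module block per region alias.
module "api" {
for_each = local.capacity_by_region
source = "./modules/api"
providers = {
aws = aws[each.key]
}
region = each.key
desired_count = each.value
}
Savings: $8K/month (vs. equal capacity)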
2. Use Aurora Global Database Tier 2/3
Secondary clusters are read-only until promoted, so they don't need to match the primary's instance count or size:
# Primary region: Full HA setup
resource "aws_rds_cluster_instance" "primary" {
count = 2 # Multi-AZ writer + reader
...
}
# Secondary regions: Read replicas only (cheaper)
resource "aws_rds_cluster_instance" "secondary" {
count = 1 # Single reader (promote during failover)
instance_class = "db.r6g.large" # Smaller than primary
...
}
Savings: $4.8K/month
3. Regional Lambda Quotas
// Primary region: Full quota
// us-east-1: 1000 concurrent executions
// Secondary regions: Reserved concurrency for critical functions only
const criticalFunctions = [
'process-payment',
'send-notification',
'auth-handler'
];
// Reserve 200 concurrent executions each
for (const fn of criticalFunctions) {
new lambda.Function(this, fn, {
reservedConcurrentExecutions: 200,
...
});
}
Decision Framework
Use this checklist to decide if active-active is right for you:
✅ You Should Build Active-Active If:
- Downtime costs exceed $50K/hour
- You have regulatory requirements for multi-region
- Your application is mostly read-heavy (easier to replicate)
- You have 3+ senior platform engineers
- Your data model supports eventual consistency
- You've tested failover quarterly for the past year
- You have budget for 2-3× infrastructure costs
❌ You Should NOT Build Active-Active If:
- You haven't achieved 99.9% uptime in single-region
- Your application requires strong consistency (ACID transactions)
- Downtime costs under $10K/hour
- You have fewer than 2 platform engineers
- Your database has complex foreign key relationships
- You've never tested a disaster recovery scenario
- You're optimizing for "resume-driven development"
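If you'd rather screen this in code, here's a rough sketch that checks a subset of the "should NOT" list. The `Org` fields and blocker strings are made up for illustration, and it ignores the softer criteria (foreign keys, resume-driven development).

```python
from dataclasses import dataclass

@dataclass
class Org:
    downtime_cost_per_hour: float
    platform_engineers: int
    single_region_uptime_pct: float
    needs_strong_consistency: bool
    has_tested_dr: bool

def active_active_blockers(org: Org) -> list:
    """Return the 'should NOT build' criteria from the checklist that apply."""
    blockers = []
    if org.single_region_uptime_pct < 99.9:
        blockers.append("single-region uptime below 99.9%")
    if org.needs_strong_consistency:
        blockers.append("requires strong consistency")
    if org.downtime_cost_per_hour < 10_000:
        blockers.append("downtime cost under $10K/hour")
    if org.platform_engineers < 2:
        blockers.append("fewer than 2 platform engineers")
    if not org.has_tested_dr:
        blockers.append("no tested disaster recovery")
    return blockers

# Roughly the mid-market SaaS from Scenario 2 (team size and uptime are assumed)
saas = Org(
    downtime_cost_per_hour=5_000,
    platform_engineers=4,
    single_region_uptime_pct=99.95,
    needs_strong_consistency=False,
    has_tested_dr=True,
)
print(active_active_blockers(saas))  # ['downtime cost under $10K/hour']
```

An empty list isn't a green light; it only means none of the hard blockers apply.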
Conclusion: Start Small, Prove Value
Multi-region active-active is one of the most complex architecture patterns you can implement. Most companies that build it discover they didn't need it.
Better approach:
Year 1: Achieve 99.9% in single-region
- Multi-AZ everything
- Automated failover
- Comprehensive monitoring
Year 2: Add Tier 2 Pilot Light in second region
- Added cost: 30-40% on top of single-region
- Recovery time: 1-4 hours
- Proves multi-region tooling works
Year 3: Upgrade to Tier 3 Warm Standby
- Added cost: 50-70% on top of single-region
- Recovery time: 5-60 minutes
- Most companies stop here
Year 4+: Active-active only if downtime costs justify it
- Measure actual downtime impact over 3 years
- Calculate ROI vs. Tier 3 warm standby
- Get executive buy-in for 2-3× cost increase
The hardest part of active-active isn't the infrastructure—it's the operational discipline to test failover quarterly, monitor replication lag in real-time, and handle edge cases gracefully.
Build boring, reliable systems first. Add complexity only when the business case is ironclad.
Action Items
- Calculate your downtime cost: Revenue/hour + SLA penalties + customer churn risk
- Audit your current availability: What's your actual uptime over the past 12 months?
- Estimate active-active cost: 2-3× current infrastructure + engineering time
- Compare alternatives: Can Tier 2 (Pilot Light) or Tier 3 (Warm Standby) meet your needs?
- Build a 6-month proof of concept: Start with warm standby, measure operational overhead
If you need help architecting a multi-region strategy—or deciding whether you actually need one—schedule a consultation. We'll review your architecture, calculate your true downtime cost, and recommend the right DR tier for your business.