Multi-Region High Availability on AWS: Active-Active vs Active-Passive Design

May 27, 2026 • 22 min read Architecture HighAvailability MultiRegion

AWS Series | Part 18 — Building secure, cost-optimised, cloud-native infrastructure on AWS.


TL;DR Comparison

Aspect	Active-Passive	Active-Active
Traffic distribution	All traffic to primary region, standby secondary	Traffic split across both regions simultaneously
RTO	Minutes (DNS failover + warm standby)	Seconds (the surviving region is already serving traffic)
RPO	Seconds to minutes (replication lag)	Near-zero (asynchronous replication, typically sub-second lag)
Cost	~1.3× single-region (standby scaled down to ~30% capacity)	~2× single-region (full capacity in both)
Complexity	Medium	High
Data consistency	Eventual (async replication)	Strong or eventual depending on pattern
Best for	DR, regulated workloads, cost-sensitive HA	Global platforms, zero-tolerance downtime, latency-sensitive
Typical use case	Enterprise applications, financial services DR	Global APIs, gaming, real-time platforms

Introduction — When Single-Region Is No Longer Enough

Every architecture in this series has been built within a single AWS region. Multi-AZ deployment, Karpenter across three availability zones, ALBs distributing traffic across zones — all of this protects against AZ-level failures. But an entire AWS region going down is not theoretical. It has happened. And when it does, a single-region architecture goes with it.

Multi-region is not just about disaster recovery. It is about latency, data residency, compliance, and the architecture decisions that determine whether your platform survives its worst day. A fraud detection platform processing transactions across Europe needs consistent sub-200ms response times regardless of where the transaction originates. A platform that fails for 40 minutes while DNS propagates during a region failover is not meeting its SLAs — even if it eventually recovers.

This post covers both patterns in depth: Active-Passive for controlled failover at lower cost, and Active-Active for platforms that need near-zero RPO and RTO. Each comes with the Terraform, Route 53 routing policies, database replication strategies, and operational considerations it requires. Every pattern connects back to the infrastructure built across this series.

1. The Multi-Region Building Blocks

Before choosing Active-Passive or Active-Active, you need the same set of underlying components for both. These are the building blocks every multi-region architecture shares.

Multi-Region VPC and Networking

# Primary region — eu-west-1
module "vpc_primary" {
  source  = "../../modules/vpc"

  vpc_cidr    = "10.0.0.0/16"
  region      = "eu-west-1"
  environment = var.environment

  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  tags = local.common_tags
}

# Secondary region — eu-central-1 (Frankfurt)
# Non-overlapping CIDR — critical for Transit Gateway peering
module "vpc_secondary" {
  source  = "../../modules/vpc"

  vpc_cidr    = "10.1.0.0/16"
  region      = "eu-central-1"
  environment = var.environment

  private_subnets = ["10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]
  public_subnets  = ["10.1.101.0/24", "10.1.102.0/24", "10.1.103.0/24"]

  tags = local.common_tags
}

Inter-Region Connectivity — TGW Peering

# Transit Gateway in primary region
resource "aws_ec2_transit_gateway" "primary" {
  provider    = aws.eu_west_1
  description = "Primary region TGW — eu-west-1"

  amazon_side_asn                 = 64512
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"

  tags = { Name = "tgw-primary-eu-west-1" }
}

# Transit Gateway in secondary region
resource "aws_ec2_transit_gateway" "secondary" {
  provider    = aws.eu_central_1
  description = "Secondary region TGW — eu-central-1"

  amazon_side_asn                 = 64513
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"

  tags = { Name = "tgw-secondary-eu-central-1" }
}

# TGW peering — cross-region private backbone
resource "aws_ec2_transit_gateway_peering_attachment" "primary_to_secondary" {
  provider = aws.eu_west_1

  transit_gateway_id      = aws_ec2_transit_gateway.primary.id
  peer_transit_gateway_id = aws_ec2_transit_gateway.secondary.id
  peer_region             = "eu-central-1"

  tags = { Name = "tgw-peering-eu-west-1-to-eu-central-1" }
}

# Accept the peering in the secondary region
resource "aws_ec2_transit_gateway_peering_attachment_accepter" "secondary" {
  provider = aws.eu_central_1

  transit_gateway_attachment_id = aws_ec2_transit_gateway_peering_attachment.primary_to_secondary.id

  tags = { Name = "tgw-peering-accepter-eu-central-1" }
}

# Static routes: TGW peering attachments do not support route propagation
resource "aws_ec2_transit_gateway_route" "primary_to_secondary_cidrs" {
  provider = aws.eu_west_1

  destination_cidr_block         = "10.1.0.0/16"
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_peering_attachment.primary_to_secondary.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.primary.id

  # Routes can only be added once the peering has been accepted
  depends_on = [aws_ec2_transit_gateway_peering_attachment_accepter.secondary]
}

resource "aws_ec2_transit_gateway_route" "secondary_to_primary_cidrs" {
  provider = aws.eu_central_1

  destination_cidr_block         = "10.0.0.0/16"
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_peering_attachment.primary_to_secondary.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.secondary.id

  # Routes can only be added once the peering has been accepted
  depends_on = [aws_ec2_transit_gateway_peering_attachment_accepter.secondary]
}

Terraform Provider Configuration — Multi-Region

# providers.tf — multi-region provider setup
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.50"
    }
  }
}

provider "aws" {
  alias  = "eu_west_1"
  region = "eu-west-1"

  assume_role {
    role_arn = "arn:aws:iam::${var.account_id}:role/terraform-apply-role"
  }
}

provider "aws" {
  alias  = "eu_central_1"
  region = "eu-central-1"

  assume_role {
    role_arn = "arn:aws:iam::${var.account_id}:role/terraform-apply-role"
  }
}

# us-east-1: Route 53 health check metrics are published here, and ACM
# certificates for CloudFront must be issued here
provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1"

  assume_role {
    role_arn = "arn:aws:iam::${var.account_id}:role/terraform-apply-role"
  }
}

2. Active-Passive — Controlled Failover with Lower Cost

Architecture Overview

Normal operation:
  Users → Route 53 → Primary ALB (eu-west-1) → EKS Cluster Primary
  Secondary region: warm standby — scaled down, receives no traffic

Failover:
  Route 53 health check detects primary unhealthy
  → DNS fails over to Secondary ALB (eu-central-1)
  → EKS Cluster Secondary scales up (if not already warm)
  → Traffic flows to secondary region

The secondary region is always running — but at reduced capacity. The minimum configuration that allows fast failover (within minutes, not hours) is:

EKS cluster with system node group running (control plane warm)
Minimum 1 On-Demand node per AZ (avoids cold bootstrap time)
Application deployments present but scaled to minimum replicas
Database in standby or read replica mode with promotion capability
Route 53 health checks monitoring the primary ALB

Route 53 Failover Routing

# Health check — monitors the primary region ALB
resource "aws_route53_health_check" "primary" {
  fqdn              = aws_lb.primary.dns_name
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10

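  # Health checker regions: at least three, chosen from Route 53's fixed
  # list of checker regions (eu-central-1 is not one of them)
  regions = [
    "eu-west-1",
    "us-east-1",
    "us-west-2"
  ]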

  tags = { Name = "primary-region-health-check" }
}

# Primary DNS record — eu-west-1
resource "aws_route53_record" "primary" {
  zone_id = aws_route53_zone.public.zone_id
  name    = "api.company.com"
  type    = "A"

  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.primary.id
}

# Secondary DNS record: eu-central-1
# Route 53 is a global service, so the record needs no regional provider
resource "aws_route53_record" "secondary" {
  zone_id = aws_route53_zone.public.zone_id
  name    = "api.company.com"
  type    = "A"

  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }
  # No health_check_id on secondary — always available as fallback
}

Database Replication — RDS Multi-Region

# Primary RDS instance — eu-west-1
resource "aws_db_instance" "primary" {
  provider = aws.eu_west_1

  identifier        = "platform-db-primary"
  engine            = "postgres"
  engine_version    = "16.2"
  instance_class    = "db.r6g.xlarge"

  multi_az               = true
  storage_encrypted      = true
  deletion_protection    = true

  backup_retention_period = 7
  backup_window          = "03:00-04:00"

  monitoring_interval = 30
  monitoring_role_arn = aws_iam_role.rds_monitoring.arn

  tags = local.common_tags
}

# Cross-region read replica — eu-central-1
# Can be promoted to standalone primary during failover
resource "aws_db_instance" "secondary_replica" {
  provider = aws.eu_central_1

  identifier = "platform-db-secondary-replica"

  replicate_source_db = aws_db_instance.primary.arn

  instance_class = "db.r6g.large"

  multi_az            = false
  publicly_accessible = false
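  storage_encrypted   = true
  # An encrypted cross-region replica needs a KMS key from the destination
  # region (the key resource name here is illustrative)
  kms_key_id          = aws_kms_key.rds_eu_central_1.arn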
  deletion_protection = true

  tags = merge(local.common_tags, { Role = "read-replica" })
}

# CloudWatch alarm — replica lag exceeds RPO threshold
resource "aws_cloudwatch_metric_alarm" "replica_lag" {
  provider = aws.eu_central_1

  alarm_name          = "rds-replica-lag-exceeds-rpo"
  alarm_description   = "Replica lag exceeds RPO threshold — failover data loss risk"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "ReplicaLag"
  namespace           = "AWS/RDS"
  period              = 60
  statistic           = "Maximum"
  threshold           = 30

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.secondary_replica.id
  }

  alarm_actions = [aws_sns_topic.platform_alerts_secondary.arn]
}
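
To reason about when this alarm fires, its evaluation rule can be modelled in a few lines of Python. This is a simplified sketch, not CloudWatch's implementation: it assumes one `Maximum` datapoint per 60-second period and ignores missing-data handling, and the function name `alarm_fires` is hypothetical.

```python
def alarm_fires(lag_seconds, threshold=30, evaluation_periods=2):
    """Return True if the most recent datapoints all breach the threshold.

    lag_seconds: per-period maximum ReplicaLag values in seconds, oldest first.
    """
    if len(lag_seconds) < evaluation_periods:
        return False  # not enough data to evaluate yet
    recent = lag_seconds[-evaluation_periods:]
    # GreaterThanThreshold: a value equal to the threshold does not breach
    return all(value > threshold for value in recent)


# A single spike followed by recovery does not fire the alarm;
# two consecutive breaching periods do.
print(alarm_fires([5, 45, 8]))   # False
print(alarm_fires([5, 45, 50]))  # True
```

With `evaluation_periods = 2` and `period = 60`, lag has to stay above 30 seconds for roughly two minutes before anyone is paged, which is worth weighing against the RPO you have committed to.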

Failover Runbook — Automated vs Manual

For Active-Passive, the failover itself can be automated (DNS health checks handle it) but the database promotion is typically manual in regulated environments because it is irreversible without re-establishing replication.

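# Route 53 health check status is published as a CloudWatch metric in
# us-east-1, so the alarm, rule, and SNS topic live there, outside the
# region that is failing. The SNS topic must also be in us-east-1.
resource "aws_cloudwatch_metric_alarm" "primary_unhealthy" {
  provider = aws.us_east_1

  alarm_name          = "primary-region-health-failure"
  namespace           = "AWS/Route53"
  metric_name         = "HealthCheckStatus"
  statistic           = "Minimum"
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  period              = 60
  evaluation_periods  = 1

  dimensions = {
    HealthCheckId = aws_route53_health_check.primary.id
  }
}

# EventBridge rule: forward the alarm state change as a failover notification
resource "aws_cloudwatch_event_rule" "primary_unhealthy" {
  provider = aws.us_east_1
  name     = "primary-region-health-failure"

  event_pattern = jsonencode({
    source      = ["aws.cloudwatch"]
    detail-type = ["CloudWatch Alarm State Change"]
    resources   = [aws_cloudwatch_metric_alarm.primary_unhealthy.arn]
    detail = {
      state = { value = ["ALARM"] }
    }
  })
}

resource "aws_cloudwatch_event_target" "failover_notification" {
  provider = aws.us_east_1
  rule     = aws_cloudwatch_event_rule.primary_unhealthy.name
  arn      = aws_sns_topic.failover_alerts.arn
}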

Why manual database promotion in production: Promoting a read replica is irreversible. Once promoted, it is a standalone instance, and re-establishing replication means creating a new replica from scratch, which can take hours. In a regulated environment, a human should confirm that the primary region is genuinely down and will not recover within the RTO window before committing to promotion.
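
One way to encode that human gate is a small wrapper that refuses to promote without a named approver and an explicit confirmation. This is a minimal sketch: the function name and its arguments are hypothetical, and `rds_client` is assumed to be something like `boto3.client("rds", region_name="eu-central-1")`, whose `promote_read_replica` call takes a `DBInstanceIdentifier`.

```python
def promote_replica(rds_client, replica_id, approver, primary_confirmed_down):
    """Promote a cross-region read replica only after explicit human sign-off."""
    if not approver:
        raise PermissionError("promotion requires a named approver")
    if not primary_confirmed_down:
        raise RuntimeError("refusing to promote: primary outage not confirmed")
    # Irreversible: after this call the replica is a standalone instance.
    return rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
```

Passing the client in, rather than creating it inside the function, keeps the gate testable with a stub and makes the target region an explicit choice of the caller.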

3. Active-Active — Zero-RPO, Zero-RTO Architecture

Architecture Overview

Normal operation:
  EU users   → Route 53 (latency routing) → eu-west-1 ALB  → eu-west-1 EKS
  DACH users → Route 53 (latency routing) → eu-central-1 ALB → eu-central-1 EKS

Both regions:
  - Full production capacity
  - Receiving live traffic simultaneously
  - Data writes synchronised in near-real-time

Region failure:
  Route 53 removes unhealthy region from DNS within ~30 seconds
  Remaining region absorbs all traffic
  Data loss limited to writes still in flight (asynchronous replication, typically sub-second lag)

Route 53 Latency-Based Routing with Health Checks

# Health checks — both regions monitored
resource "aws_route53_health_check" "eu_west_1" {
  fqdn              = aws_lb.eu_west_1.dns_name
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10

  tags = { Name = "health-check-eu-west-1" }
}

resource "aws_route53_health_check" "eu_central_1" {
  fqdn              = aws_lb.eu_central_1.dns_name
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10

  tags = { Name = "health-check-eu-central-1" }
}

# Latency routing — direct users to nearest healthy region
resource "aws_route53_record" "api_eu_west_1" {
  zone_id        = aws_route53_zone.public.zone_id
  name           = "api.company.com"
  type           = "A"
  set_identifier = "eu-west-1"

  latency_routing_policy {
    region = "eu-west-1"
  }

  alias {
    name                   = aws_lb.eu_west_1.dns_name
    zone_id                = aws_lb.eu_west_1.zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.eu_west_1.id
}

resource "aws_route53_record" "api_eu_central_1" {
  zone_id        = aws_route53_zone.public.zone_id
  name           = "api.company.com"
  type           = "A"
  set_identifier = "eu-central-1"

  latency_routing_policy {
    region = "eu-central-1"
  }

  alias {
    name                   = aws_lb.eu_central_1.dns_name
    zone_id                = aws_lb.eu_central_1.zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.eu_central_1.id
}
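
The combined effect of latency routing and health checks can be modelled as "lowest latency among healthy regions". The sketch below is an illustration only: Route 53 uses its own latency measurements rather than values you supply, the latency numbers are made-up inputs, and returning `None` when nothing is healthy is a simplification of Route 53's actual fallback behaviour.

```python
def pick_region(latency_ms, healthy):
    """Return the lowest-latency region that passes its health check, or None."""
    candidates = {
        region: ms for region, ms in latency_ms.items() if healthy.get(region, False)
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


latencies = {"eu-west-1": 18, "eu-central-1": 25}
print(pick_region(latencies, {"eu-west-1": True, "eu-central-1": True}))   # eu-west-1
print(pick_region(latencies, {"eu-west-1": False, "eu-central-1": True}))  # eu-central-1
```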

The Hard Problem — Data in Active-Active

Active-Active compute is straightforward. Active-Active data is the genuinely hard part. If a write happens in eu-west-1 and a read for the same record happens in eu-central-1 100ms later, does the read see the write? The answer depends entirely on your replication strategy. Three patterns:

Pattern 1 — Global Tables (DynamoDB)

Fully managed multi-region replication with eventual consistency. Writes in any region propagate to all regions, typically within 1 second. Best for use cases where eventual consistency is acceptable.

resource "aws_dynamodb_table" "global" {
  name         = "platform-sessions"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "session_id"

  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "session_id"
    type = "S"
  }

  replica {
    region_name = "eu-central-1"
    kms_key_arn = aws_kms_key.dynamodb_eu_central_1.arn
  }

  server_side_encryption {
    enabled     = true
    kms_key_arn = aws_kms_key.dynamodb_eu_west_1.arn
  }

  tags = local.common_tags
}

Pattern 2 — Aurora Global Database

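Typically sub-second asynchronous cross-region replication, with the ability to promote the secondary in about a minute during failover. Only the primary region accepts writes, so there are no write conflicts, while reads served from a secondary region are eventually consistent. It is the best option here for relational data that needs strongly consistent writes.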

# Aurora Global Cluster
resource "aws_rds_global_cluster" "main" {
  global_cluster_identifier = "platform-global-db"
  engine                    = "aurora-postgresql"
  engine_version            = "16.2"
  database_name             = "platform"
  storage_encrypted         = true
}

# Primary cluster — eu-west-1
resource "aws_rds_cluster" "primary" {
  provider = aws.eu_west_1

  cluster_identifier        = "platform-primary-eu-west-1"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  engine                    = "aurora-postgresql"
  engine_version            = "16.2"

  db_subnet_group_name   = aws_db_subnet_group.primary.name
  vpc_security_group_ids = [aws_security_group.rds_primary.id]

  master_username = var.db_username
  master_password = var.db_password

  backup_retention_period = 7

  tags = local.common_tags
}

# Secondary cluster — eu-central-1
# Replicates from primary with typical lag <1 second
resource "aws_rds_cluster" "secondary" {
  provider = aws.eu_central_1

  cluster_identifier        = "platform-secondary-eu-central-1"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  engine                    = "aurora-postgresql"
  engine_version            = "16.2"

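  db_subnet_group_name   = aws_db_subnet_group.secondary.name
  vpc_security_group_ids = [aws_security_group.rds_secondary.id]

  # An encrypted secondary needs a KMS key from its own region
  # (the key resource name here is illustrative)
  kms_key_id = aws_kms_key.aurora_eu_central_1.arn

  # Create the secondary only after the primary has joined the global cluster
  depends_on = [aws_rds_cluster.primary]

  tags = local.common_tags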
}

Pattern 3 — Write to Primary, Read from Local Replica

A hybrid approach where all writes go to one region (eliminating write conflicts) and reads are served from the nearest replica. Simpler consistency model, slightly higher write latency for the non-primary region.

import os

import boto3


class MultiRegionDBClient:
    """Sends writes to the primary region and reads to the local region via the RDS Data API."""

    def __init__(self):
        self.region = os.environ.get('AWS_REGION', 'eu-west-1')
        self.primary_region = 'eu-west-1'

        # Write client: always the primary region.
        # The Data API is called on its regional service endpoint and selects
        # the cluster by resourceArn, so no Aurora cluster endpoint is passed
        # as endpoint_url.
        self.write_client = boto3.client('rds-data', region_name=self.primary_region)

        # Read client: the local region's cluster
        self.read_client = boto3.client('rds-data', region_name=self.region)

    def write(self, sql, parameters=None):
        # All writes go to the primary, so there is no conflict risk
        return self.write_client.execute_statement(
            resourceArn=os.environ['AURORA_CLUSTER_ARN_PRIMARY'],
            secretArn=os.environ['DB_SECRET_ARN'],
            sql=sql,
            parameters=parameters or []
        )

    def read(self, sql, parameters=None):
        # Reads come from the local cluster for lower latency.
        # Accept eventual consistency: replica lag is typically <1s
        return self.read_client.execute_statement(
            resourceArn=os.environ['AURORA_CLUSTER_ARN_LOCAL'],
            secretArn=os.environ['DB_SECRET_ARN'],
            sql=sql,
            parameters=parameters or []
        )
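
Because local reads can lag behind a write, a common mitigation is to pin a caller's reads to the primary for a short window after that caller writes ("read your own writes"). The sketch below shows only the routing decision. The class name, the `pin_seconds` default, and the caller identifier are assumptions for illustration, and the clock is injectable so the behaviour can be tested.

```python
import time


class ReadRouter:
    """Chooses 'primary' or 'replica' for reads, pinning recent writers to the primary."""

    def __init__(self, pin_seconds=2.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds  # should comfortably exceed expected replica lag
        self.clock = clock
        self._last_write = {}           # caller id -> time of that caller's last write

    def record_write(self, caller_id):
        self._last_write[caller_id] = self.clock()

    def target_for_read(self, caller_id):
        last = self._last_write.get(caller_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"  # replica may not have this caller's write yet
        return "replica"
```

In-process state like this only works when a caller's requests land on the same instance; with several pods behind a load balancer, the last-write timestamp would need to travel with the session or a cookie instead.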

4. EKS Multi-Region — Deploying to Both Clusters

Part 15 covered multi-cluster ArgoCD. From a multi-region perspective, the ApplicationSet generator pattern handles deployment to both regional clusters automatically:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-services-multiregion
  namespace: argocd
spec:
  generators:
    - matrix:
        generators:
          - git:
              repoURL:  https://github.com/org/gitops-config
              revision: main
              directories:
                - path: apps/services/*
          - list:
              elements:
                - region:    eu-west-1
                  cluster:   https://eks-primary.eu-west-1.eks.amazonaws.com
                  valuesFile: values-prod-eu-west-1.yaml
                - region:    eu-central-1
                  cluster:   https://eks-secondary.eu-central-1.eks.amazonaws.com
                  valuesFile: values-prod-eu-central-1.yaml

  template:
    metadata:
      name: "{{path.basename}}-{{region}}"
      namespace: argocd
    spec:
      project: services-prod
      source:
        repoURL:        https://github.com/org/gitops-config
        targetRevision: main
        path:           "{{path}}"
        helm:
          valueFiles:
            - values.yaml
            - values-prod.yaml
            - "{{valuesFile}}"
      destination:
        server:    "{{cluster}}"
        namespace: "{{path.basename}}"
      syncPolicy:
        automated:
          prune:    false
          selfHeal: true

Region-specific values provide different endpoints per cluster while sharing the same base configuration:

# values-prod-eu-west-1.yaml
database:
  writeEndpoint: platform-primary.cluster-xxx.eu-west-1.rds.amazonaws.com
  readEndpoint:  platform-primary.cluster-ro-xxx.eu-west-1.rds.amazonaws.com

region: eu-west-1
isPrimary: true

---
# values-prod-eu-central-1.yaml
database:
  writeEndpoint: platform-primary.cluster-xxx.eu-west-1.rds.amazonaws.com   # Still primary!
  readEndpoint:  platform-secondary.cluster-ro-xxx.eu-central-1.rds.amazonaws.com

region: eu-central-1
isPrimary: false

5. Global Accelerator — Improving Active-Active Performance

Standard Route 53 latency routing sends users to the nearest region based on measured network latency. AWS Global Accelerator improves this further: users connect to the nearest AWS edge location (anycast), and traffic travels over the AWS private backbone rather than the public internet for the entire journey to your ALB.

resource "aws_globalaccelerator_accelerator" "main" {
  name            = "platform-global-accelerator"
  ip_address_type = "IPV4"
  enabled         = true

  attributes {
    flow_logs_enabled   = true
    flow_logs_s3_bucket = aws_s3_bucket.ga_logs.bucket
    flow_logs_s3_prefix = "global-accelerator/"
  }
}

resource "aws_globalaccelerator_listener" "https" {
  accelerator_arn = aws_globalaccelerator_accelerator.main.id
  protocol        = "TCP"

  port_range {
    from_port = 443
    to_port   = 443
  }
}

# Endpoint group — eu-west-1
resource "aws_globalaccelerator_endpoint_group" "eu_west_1" {
  listener_arn = aws_globalaccelerator_listener.https.id

  endpoint_group_region         = "eu-west-1"
  # The dial applies per endpoint group and does not sum to 100 across groups.
  # 100 keeps all traffic that is nearest to this region in this region.
  traffic_dial_percentage       = 100

  health_check_path             = "/health"
  health_check_protocol         = "HTTPS"
  health_check_interval_seconds = 10
  threshold_count               = 3

  endpoint_configuration {
    endpoint_id                    = aws_lb.eu_west_1.arn
    weight                         = 100
    client_ip_preservation_enabled = true
  }
}

# Endpoint group: eu-central-1
resource "aws_globalaccelerator_endpoint_group" "eu_central_1" {
  listener_arn = aws_globalaccelerator_listener.https.id

  endpoint_group_region         = "eu-central-1"
  # Same as above: 100 in each group, not a 50/50 split
  traffic_dial_percentage       = 100

  health_check_path             = "/health"
  health_check_protocol         = "HTTPS"
  health_check_interval_seconds = 10
  threshold_count               = 3

  endpoint_configuration {
    endpoint_id                    = aws_lb.eu_central_1.arn
    weight                         = 100
    client_ip_preservation_enabled = true
  }
}

Traffic dial percentage is the traffic shifting mechanism during deployments and incidents. Setting traffic_dial_percentage = 0 on a region removes it from rotation entirely without removing the endpoint group configuration — a safer, faster alternative to DNS TTL-based traffic shifting.
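
During an incident the dial is often changed through the API rather than a Terraform run. The helper below is a hedged sketch of that call. The function name is hypothetical, and `ga_client` is assumed to be `boto3.client("globalaccelerator", region_name="us-west-2")`, since the Global Accelerator API is served from us-west-2. A change made this way will show up as drift on the next Terraform plan.

```python
def set_traffic_dial(ga_client, endpoint_group_arn, percentage):
    """Set the traffic dial for one endpoint group; 0 drains the region."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return ga_client.update_endpoint_group(
        EndpointGroupArn=endpoint_group_arn,
        TrafficDialPercentage=percentage,
    )
```

Calling `set_traffic_dial(client, arn, 0)` takes the region out of rotation, and calling it again with `100` restores it.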

6. Data Residency and Compliance

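For European workloads, data residency is not optional. GDPR does not outright require that personal data of EU residents stays in the EU, but it restricts transfers outside the EEA to destinations covered by an adequacy decision or appropriate safeguards, and many regulated organisations respond by keeping such data in EU regions as a matter of policy. Multi-region architectures that include non-EU regions for DR must carefully separate data by type.

# SCP: block creation of data stores outside EU regions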
resource "aws_organizations_policy" "eu_data_residency" {
  name = "eu-data-residency"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "DenyNonEUDataServices"
        Effect = "Deny"
        Action = [
          "s3:CreateBucket",
          "rds:CreateDBInstance",
          "dynamodb:CreateTable",
          "es:CreateDomain",
          "elasticache:CreateCacheCluster"
        ]
        Resource = "*"
        Condition = {
          StringNotEquals = {
            "aws:RequestedRegion" = [
              "eu-west-1",
              "eu-west-2",
              "eu-west-3",
              "eu-central-1",
              "eu-central-2",
              "eu-north-1",
              "eu-south-1",
              "eu-south-2"
            ]
          }
        }
      }
    ]
  })
}
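
The intent of this statement can be sanity-checked with a small model before the policy is attached. The sketch covers only this one `Deny` statement, not IAM's full evaluation logic, and the function name `is_denied` is hypothetical.

```python
EU_REGIONS = {
    "eu-west-1", "eu-west-2", "eu-west-3", "eu-central-1",
    "eu-central-2", "eu-north-1", "eu-south-1", "eu-south-2",
}

DENIED_ACTIONS = {
    "s3:CreateBucket",
    "rds:CreateDBInstance",
    "dynamodb:CreateTable",
    "es:CreateDomain",
    "elasticache:CreateCacheCluster",
}


def is_denied(action, region):
    """True if the statement would deny this action in this region."""
    return action in DENIED_ACTIONS and region not in EU_REGIONS


print(is_denied("s3:CreateBucket", "us-east-1"))  # True
print(is_denied("s3:CreateBucket", "eu-west-1"))  # False
print(is_denied("s3:PutObject", "us-east-1"))     # False
```

The last case is the useful one: the statement stops new data stores from being created outside the EU, but it does not stop writes to a store that already exists there.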

7. Chaos Engineering — Testing Your HA Before You Need It

An HA architecture that has never been tested is a hypothesis. AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) lets you run controlled failure scenarios in production to validate that failover actually works.

# FIS experiment — simulate AZ failure
resource "aws_fis_experiment_template" "az_failure" {
  description = "Simulate AZ failure — validate multi-AZ pod rescheduling"
  role_arn    = aws_iam_role.fis.arn

  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.error_rate_critical.arn
  }

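  action {
    name      = "terminate-az-a-instances"
    action_id = "aws:ec2:terminate-instances"
    # terminate-instances takes no parameters; startInstancesAfterDuration
    # belongs to aws:ec2:stop-instances. The node group or Karpenter is
    # expected to replace the terminated nodes.

    target {
      key   = "Instances"
      value = "nodegroup-az-a-instances"
    }
  }

  target {
    name           = "nodegroup-az-a-instances"
    resource_type  = "aws:ec2:instance"
    selection_mode = "ALL"

    resource_tag {
      key   = "kubernetes.io/cluster/${var.cluster_name}"
      value = "owned"
    }

    # topology.kubernetes.io/zone is a Kubernetes node label, not an EC2 tag,
    # so the AZ is selected with a filter on the instance placement instead
    filter {
      path   = "Placement.AvailabilityZone"
      values = ["${var.region}a"]
    }
  }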

  tags = { Name = "fis-az-failure-test" }
}

What to test and when:

Test	Frequency	What it validates
AZ failure (FIS)	Quarterly	Pod rescheduling, Karpenter provisioning, no traffic drop
Route 53 failover (FIS or manual)	Quarterly	DNS failover time, secondary region warm-up speed
RDS replica lag under load	Monthly	RPO under realistic write volume
RDS failover (Multi-AZ)	Semi-annually	Failover time, connection pool recovery
Full region failover drill	Annually	Complete DR runbook validation end-to-end

8. Cost — Multi-Region Is Not Free

Single region (Active):                            $X/month

Active-Passive (1 primary + 1 standby at 30% capacity):
  ~$X + $0.3X = 1.3× single-region
  Add: inter-region data transfer ($0.02/GB TGW peering)
  Add: Route 53 health checks ($0.50/check + $1/million queries)
  Add: RDS cross-region replica (instance + storage + transfer)

Active-Active (2 full regions):
  ~2× single-region compute
  + inter-region replication transfer
  + Global Accelerator ($0.025/hour + $0.015/GB)
  + Full redundant networking in both regions

The business case for Active-Passive: If your SLA allows for 10 minutes RTO and 60 seconds RPO, Active-Passive with a warm standby (scaled to 30% capacity) delivers that at ~1.3× cost. For a $5,000/month platform, that is $1,500/month for DR capability — far cheaper than the reputational and contractual cost of an unplanned outage.
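
The arithmetic behind that figure is simple enough to keep in a helper when comparing standby sizes. This is a rough sketch that covers compute scaling only; the function name is hypothetical, and data transfer, health checks, and replica storage from the breakdown above are deliberately left out.

```python
def dr_cost(single_region_monthly, standby_fraction):
    """Return (total monthly cost, added DR cost) for a scaled-down standby."""
    added = single_region_monthly * standby_fraction
    return single_region_monthly + added, added


total, added = dr_cost(5000, 0.30)   # warm standby at 30% capacity
print(round(total), round(added))    # 6500 1500

total, added = dr_cost(5000, 1.0)    # full second region (Active-Active)
print(round(total), round(added))    # 10000 5000
```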

The business case for Active-Active: If your SLA requires <30 second RTO and near-zero RPO — or if you have significant user populations in multiple European regions that benefit from latency routing — Active-Active is the correct answer. The 2× cost is the cost of the SLA.

9. The Decision Framework

What is your RTO requirement?
├── < 1 minute    → Active-Active (DNS failover alone cannot achieve this)
├── 1-15 minutes  → Active-Passive with warm standby
└── 15-60 minutes → Active-Passive with cold standby (lower cost)

What is your RPO requirement?
├── Zero (no data loss)  → Synchronous cross-region writes; Aurora Global Database
│                           and DynamoDB global tables replicate asynchronously
├── < 30 seconds         → Active-Active or Active-Passive with Aurora Global Database
└── Minutes acceptable   → Active-Passive with RDS cross-region replica

Do you have users in multiple regions?
├── YES → Active-Active with latency routing (serving AND DR benefit)
└── NO  → Active-Passive (DR only — pay for standby, not serving)

What is your compliance requirement?
├── Data must stay in EU   → Both regions in EU — SCP to enforce
├── DR region can be non-EU → Consider us-east-1 as standby (lower cost)
└── Air-gapped requirement  → Private Direct Connect + no internet peering

What is your budget?
├── Minimal             → Active-Passive, cold standby, RDS replica
├── Moderate            → Active-Passive, warm standby, Aurora Global
└── Enterprise/SLA-driven → Active-Active, Aurora Global, Global Accelerator
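
The RTO branch of the framework translates directly into code, which can be handy in an architecture review checklist. This sketch encodes only that first branch with the thresholds listed above; the function name and the returned labels are hypothetical.

```python
def recommend_pattern(rto_seconds):
    """Map an RTO requirement to the pattern suggested by the RTO branch."""
    if rto_seconds < 60:
        return "active-active"
    if rto_seconds <= 15 * 60:
        return "active-passive-warm"
    return "active-passive-cold"


print(recommend_pattern(30))    # active-active
print(recommend_pattern(300))   # active-passive-warm
print(recommend_pattern(1800))  # active-passive-cold
```

The other branches (RPO, user geography, compliance, budget) can each push the answer towards a more expensive pattern, so in practice the strictest requirement wins.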

10. Common Mistakes & Anti-Patterns

Mistake 1: Overlapping CIDRs Between Regions

The most common multi-region networking mistake. If eu-west-1 uses 10.0.0.0/16 and eu-central-1 also uses 10.0.0.0/16, TGW peering cannot route between them — identical CIDRs are ambiguous. Plan your IP space org-wide from Day 1. Part 1 of this series covers CIDR planning in depth.
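
A CIDR plan is easy to check mechanically before any VPC exists. The sketch below uses Python's standard `ipaddress` module; the function name `find_overlaps` and the region-to-CIDR mapping format are assumptions for illustration.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs_by_region):
    """Return (region_a, region_b) pairs whose CIDR blocks overlap."""
    nets = {region: ipaddress.ip_network(cidr) for region, cidr in cidrs_by_region.items()}
    return [
        (a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])
    ]


print(find_overlaps({"eu-west-1": "10.0.0.0/16", "eu-central-1": "10.1.0.0/16"}))
# []
print(find_overlaps({"eu-west-1": "10.0.0.0/16", "eu-central-1": "10.0.0.0/16"}))
# [('eu-west-1', 'eu-central-1')]
```

Running a check like this in CI against the planned VPC CIDRs catches partial overlaps as well, such as a `/8` in one account swallowing a `/16` in another.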

Mistake 2: Assuming DNS Failover Is Instantaneous

Route 53 health checks detect failure in about 30 seconds (10-second interval × 3-failure threshold). The DNS TTL then determines how long clients keep using the cached record, typically 60 seconds. Total failover time is therefore around 90 seconds, and longer for clients or resolvers that ignore TTLs. For RTO requirements much under 2 minutes, DNS failover alone leaves no margin. Use Global Accelerator, which reroutes at the anycast layer within seconds.
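
The estimate is just detection time plus cache expiry, which makes it easy to see how each setting moves the total. A minimal sketch with a hypothetical function name; it ignores clients and resolvers that cache beyond the TTL.

```python
def dns_failover_seconds(request_interval, failure_threshold, record_ttl):
    """Approximate DNS failover time: health check detection plus cached-record expiry."""
    detection = request_interval * failure_threshold
    return detection + record_ttl


print(dns_failover_seconds(10, 3, 60))  # 90
print(dns_failover_seconds(30, 3, 60))  # 150
```

Moving from the fast 10-second interval to the standard 30-second interval adds a full minute to failover, which is usually a poor trade for the small saving on the health check.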

Mistake 3: Never Testing the Failover

An untested failover is a hypothesis. The first time you fail over under real pressure is not the time to discover that the secondary database has 10 minutes of replication lag or that the secondary EKS cluster's node group is at min_size = 0 and takes 8 minutes to provision capacity. Test quarterly. Document what broke.

Mistake 4: Active-Active Without Conflict Resolution Strategy

Writing to two regions simultaneously without a conflict resolution strategy produces inconsistent data. Two users updating the same record in different regions within the replication window creates a conflict. Use last-write-wins (DynamoDB default), write to single primary with local reads, or application-level conflict detection depending on your consistency requirements.
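
Last-write-wins is the simplest of these strategies to picture. The sketch below illustrates the idea only and is not DynamoDB's implementation: the record shape (`updated_at`, `region`, `value`) and the tie-break on region name are assumptions chosen so that both regions pick the same winner.

```python
def last_write_wins(local, remote):
    """Merge two versions of one record, keeping the one written later.

    Each version is a dict with 'updated_at', 'region', and 'value'.
    Ties on the timestamp are broken by region name so that every region
    converges on the same result regardless of argument order.
    """
    return max(local, remote, key=lambda v: (v["updated_at"], v["region"]))


west = {"updated_at": 100, "region": "eu-west-1", "value": "limit=500"}
east = {"updated_at": 101, "region": "eu-central-1", "value": "limit=750"}
print(last_write_wins(west, east)["value"])  # limit=750
```

The earlier write is discarded silently, which is acceptable for session state and dangerous for balances or counters. That is the point at which the single-writer or application-level detection options become the safer choice.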

Mistake 5: Forgetting Stateful Components

Kubernetes deployments fail over easily — pods reschedule, Karpenter provisions new nodes. Databases, message queues, and caches do not. Multi-region failover planning that covers EKS but not RDS, Amazon MQ, and ElastiCache is incomplete. Every stateful component needs its own failover strategy.

Mistake 6: No Pre-Warmed Capacity in the Secondary Region

A cold standby EKS cluster with min_size = 0 requires 8-12 minutes to provision EC2 nodes, bootstrap the kubelet, and schedule pods when failover occurs. A warm standby with min_size = 1 per AZ can absorb traffic within 2-3 minutes. The cost difference is typically 2-3 m5.large instances — approximately $130/month for the capability of meeting a 5-minute RTO.

Architecture Decision Matrix

Requirement	Active-Passive Cold	Active-Passive Warm	Active-Active
RTO	15-60 min	2-10 min	< 30 sec
RPO	Minutes	Seconds	Near-zero
Cost vs single-region	~1.1×	~1.3×	~2×
Traffic latency benefit	❌ None	❌ None	✅ Regional routing
Data consistency	Eventual	Eventual	Configurable
Operational complexity	Low	Medium	High
Best for	Cost-constrained DR	Standard enterprise DR	Global platforms, zero-downtime SLAs
Database pattern	RDS read replica	Aurora Global	Aurora Global or DynamoDB Global Tables
DNS pattern	Failover routing	Failover routing	Latency routing + health checks

The Golden Rule

"Multi-region is not a feature you add at the end — it is an architectural constraint you design for from the beginning. Non-overlapping CIDRs, stateless application tiers, externally replicable databases, and Route 53 health checks are all decisions that need to be made before you have traffic, not after your first outage. Choose Active-Passive when your RTO and RPO allow minutes of recovery and your budget does not justify double capacity. Choose Active-Active when your SLA demands near-zero recovery time or when you have real users in multiple regions who benefit from latency routing. And test your failover quarterly — an HA architecture that has never been exercised is a disaster plan nobody has read."

Ankush Panday

Specializing in highly scalable AWS infrastructure and automated quality engineering.
