Overview
This project built a multi-region active-passive architecture on AWS so that production services can survive a full regional outage with a recovery time objective (RTO) under 15 minutes.
Problem
Running in a single region is an existential risk for any service with uptime SLAs: when us-east-1 has an incident, there is no fallback path.
Solution
Route 53 health-check-based failover routes traffic to us-west-2 automatically. Cross-region RDS read replicas are promoted to primary during a failover event. Recovery runbooks are automated via Lambda and validated in monthly game days.
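The two automated pieces described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, record names, and identifiers are hypothetical, and the choice of CNAME records with a 60-second TTL is an assumption. The first function only builds the change batch for a Route 53 PRIMARY/SECONDARY failover pair; the second wraps the RDS `promote_read_replica` call and takes the client as a parameter so it can be exercised without AWS credentials.

```python
from typing import Any, Dict


def failover_change_batch(
    record_name: str,
    primary_target: str,
    secondary_target: str,
    primary_health_check_id: str,
    ttl: int = 60,  # assumed value; a short TTL lets resolvers pick up failover sooner
) -> Dict[str, Any]:
    """Build a Route 53 ChangeBatch containing a PRIMARY/SECONDARY failover pair."""

    def record(role: str, target: str) -> Dict[str, Any]:
        rrset: Dict[str, Any] = {
            "Name": record_name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-{target}",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        }
        if role == "PRIMARY":
            # Route 53 answers with the secondary record once this
            # health check reports the primary as unhealthy.
            rrset["HealthCheckId"] = primary_health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "active-passive failover pair (illustrative)",
        "Changes": [
            record("PRIMARY", primary_target),
            record("SECONDARY", secondary_target),
        ],
    }


def promote_replica(rds_client: Any, replica_id: str) -> str:
    """Promote a cross-region read replica and return its reported status.

    rds_client is expected to behave like boto3.client("rds") created in
    the standby region; replica_id is a hypothetical instance identifier.
    """
    response = rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
    return response["DBInstance"]["DBInstanceStatus"]
```

With real clients, the batch would be passed to `change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)` on a Route 53 client, and `promote_replica` would be called from the failover Lambda with an RDS client for us-west-2. Promotion is asynchronous, so a real runbook would also need to wait for the instance to become available before pointing the application at it.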
