
Maybe we're all living in Bezos world.
On October 20, 2025, Amazon Web Services experienced what would become its largest outage in a decade. A single race condition in its DNS management system cascaded through 113 AWS services, disrupted over 1,000 businesses globally, and left millions of users locked out of their favorite apps for 15+ hours.
If you had trouble loading your favorite app (Reddit, Netflix, Duolingo), chances are it was because of AWS. Let's dig in!
At a glance:
- 🌐 The Internet Broke → 1,000+ businesses offline, 11M outage reports, everything from Snapchat to banking collapsed
- 💥 1 Infrastructure Failure → The DNS race condition that broke AWS
- ⚠️ 2 Critical Mistakes → The Fragile Phonebook and the Retry Floods
- 🧠 3 Lessons Learned → Multi-Region ≠ Multi-Provider, Design for Recovery, and Your Dependencies Have Dependencies
- 💰 The Real Cost → Hundreds of billions lost vs. tiny service credits from AWS
The outage: the us-east-1 chain reaction
On October 20, 2025, the internet fell silent. Your favorite apps, streaming services, and banking tools were stuck on a loading wheel. The problem? AWS's oldest and largest region, us-east-1, was having a very bad day.
A very, very bad day.
Act 1: Where did the phonebook go?
Oct 19, 2025, 11:48 PM PDT
AWS customers began seeing increased Amazon DynamoDB (AWS's database service) API error rates in the N. Virginia (us-east-1) region. Other AWS services that rely on DynamoDB were also unable to establish new connections to it.
The culprit? A tiny, latent bug in the automated system that manages the DNS (the internet's phonebook) for DynamoDB.
This system has two parts:
- The Planner: Decides what the phonebook should say (which IP addresses go to which service).
- The Enactor: Publishes that phonebook entry to the world.
What went wrong:
- A Delay: An Enactor (let's call it Worker A) picked up a phonebook plan for dynamodb.us-east-1.amazonaws.com to publish, but got stuck in a processing delay.
- A Normal Update: While Worker A was stuck, Worker B published the new, correct phonebook for DynamoDB. All was well.
- The Clean Up: Worker B, as part of its job, started a routine clean-up to throw away old, expired phonebook entries.
- The Race: At that exact moment, the delayed Worker A finally finished its work and published its now old, expired phonebook, overwriting the correct one.
- The Deletion: Worker B's clean-up process saw this old, expired phonebook (which was now the active one) and deleted it.
No IPs. No failover. The phonebook entry for DynamoDB was now completely empty.
As far as the internet was concerned, dynamodb.us-east-1.amazonaws.com no longer existed.
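To make the race concrete, here's a toy Python sketch (not AWS's actual code, just a minimal model of the sequence above): a delayed worker's stale write lands after a newer one, and the clean-up then deletes the only record left.

```python
# Toy model of the race: a slow enactor's stale write lands after a newer
# plan, then cleanup deletes the "expired" generation that is now live,
# leaving the record completely empty.
dns_table = {}  # record name -> {"generation": int, "ips": list[str]}

def publish(record, generation, ips):
    """Enactor step: blindly overwrite the record with its plan."""
    dns_table[record] = {"generation": generation, "ips": ips}

def cleanup(record, newest_generation):
    """Clean-up step: delete the record if it holds an expired generation."""
    entry = dns_table.get(record)
    if entry and entry["generation"] < newest_generation:
        del dns_table[record]  # assumes the live record is never stale

record = "dynamodb.us-east-1.amazonaws.com"

publish(record, generation=2, ips=["10.0.0.2"])  # Worker B: fresh plan, all good
publish(record, generation=1, ips=["10.0.0.1"])  # Worker A: delayed, stale plan overwrites it
cleanup(record, newest_generation=2)             # Worker B's clean-up deletes the "expired" entry

print(dns_table.get(record))  # None -> the record is simply gone
```

A check-and-set that refuses to apply a plan older than the one currently published would have stopped Worker A's stale write before the clean-up ever saw it.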
Act 2: The Dominoes Fall
The bigger problem? AWS's own services (EC2, Lambda, S3, etc.) use DynamoDB as their internal database.
When the DynamoDB phonebook vanished, the DropletWorkflow Manager (DWFM), the system responsible for managing the physical servers behind EC2, tried to reach its database and couldn't. It started failing its internal checks, which meant it could no longer hand customers a new virtual server.
The result: All new EC2 instance launches failed with "Insufficient Capacity" errors.
Act 3: The Traffic Gridlock
Oct 20, 2025, 2:24 AM PDT
AWS engineers manually re-added the correct IPs into AWS Route 53 (AWS's DNS service). DynamoDB began to recover. Hooray!
But the real chaos had only just started.
When the engineers put the records back into the phonebook, the EC2 control plane entered a "congestive collapse". This caused the widespread outage people actually saw: social media sites like Reddit were unreachable, and banks like Lloyds were failing.
Even AWS’s own status pages went down, because they ran in us-east-1 too.
What happened:
- The EC2 control plane woke up after the fix and tried to re-check the status of millions of servers, all at the same time. This created a digital traffic jam known as congestive collapse.
- The system was flooded with a giant backlog of work, causing tasks to time out.
- The system automatically retried the timed-out tasks, adding them right back to the jam.
- This retry loop caused total gridlock: the system was so busy churning through the jam that it couldn't make any new progress (a toy model of this is sketched below).
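Here's a rough back-of-the-envelope model of that gridlock. The numbers are made up for illustration, not AWS's real figures; the point is that once the backlog is deep enough for attempts to time out before they run, retries plus new arrivals mean the queue never shrinks.

```python
# Illustrative model: timed-out attempts are re-queued, so despite having
# capacity greater than the arrival rate, no useful work ever completes.
capacity_per_min = 1000   # tasks the control plane can attempt per minute
arrival_per_min = 600     # new work arriving per minute
backlog = 50_000          # checks queued up while DWFM was down
timeout_depth = 5_000     # beyond this queue depth, attempts time out

for minute in range(10):
    attempted = min(backlog, capacity_per_min)
    # If the queue is deeper than the timeout threshold, attempts expire
    # before they finish and go straight back into the queue.
    completed = attempted if backlog <= timeout_depth else 0
    backlog = backlog - completed + arrival_per_min
    print(f"minute {minute}: backlog={backlog}")
# The backlog only grows: that is the jam AWS had to throttle its way out of.
```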
AWS Engineers' Response:
- Manually fix DNS records.
- Disable the Enactor automation globally.
- Slowly throttle operations to allow traffic to drain.
2:20 PM PDT: 🎉 AWS declared "all services returned to normal operations". Roughly 15 hours from initial error to restoration.
2 Big Mistakes
Mistake 1: The Fragile Phonebook (DNS Race Condition)
The system that manages AWS's own DNS records had a flaw: a simple, unlucky combination of timing and delays allowed an automated clean-up process to delete a live, production record.
Mistake 2: The Retry Floods (Congestive Collapse)
When EC2's DWFM came back online, it wasn't prepared for the flood of backlogged work. The retry logic, which was meant to add reliability, instead created a vicious cycle that brought the system to a halt. It effectively DoS'd itself.
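The textbook mitigation is exponential backoff with jitter, so thousands of retrying clients spread out instead of hammering a recovering service in lockstep. A minimal sketch (illustrative, not taken from AWS's internals):

```python
import random
import time

def call_with_backoff(operation, max_attempts=6, base=0.5, cap=30.0):
    """Retry with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the failure
            # Full jitter: sleep a random amount up to the exponential ceiling,
            # so retrying clients don't all wake up at the same instant.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```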
3 Lessons Learned
Lesson 1: Multi-Region ≠ Multi-Provider
The outage exposed a harsh truth: Multi-region deployment within one cloud provider is insufficient protection for large scale production.
Many companies discovered their "redundant" multi-region setup failed simultaneously because:
- us-east-1 hosts global IAM authentication.
- Control plane operations are centralized in one region.
- DynamoDB Global Tables depend on us-east-1 coordination.
Key Takeaways:
Multi-Region Within AWS (Baseline Protection)
- Deploy to multiple AWS regions (e.g., us-east-1 + us-west-2)
- Use Route 53 health checks with latency-based routing
- Implement database replication (DynamoDB Global Tables, Aurora Global)
- This protects against: Regional failures, AZ outages, and regional network issues
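As a concrete example of the first two bullets, here's a hedged boto3 sketch that points one hostname at two regions using Route 53 failover records gated by a health check (failover routing rather than the latency-based routing mentioned above, but the idea is the same). The zone ID, hostnames, and health check ID are placeholders, not real resources.

```python
import boto3

route53 = boto3.client("route53")

def upsert_failover(zone_id, name, role, target, set_id, health_check_id=None):
    """UPSERT one half of a PRIMARY/SECONDARY failover record pair."""
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id  # gate the primary on health
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )

# Placeholder IDs and hostnames for illustration only.
upsert_failover("Z123EXAMPLE", "api.example.com", "PRIMARY",
                "api-use1.example.com", "use1", health_check_id="hc-example")
upsert_failover("Z123EXAMPLE", "api.example.com", "SECONDARY",
                "api-usw2.example.com", "usw2")
```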
Multi-Cloud Architecture (True Independence)
- For critical services, distribute across AWS + Azure/GCP
- Containerize with Kubernetes for cloud-agnostic deployment
- Abstract cloud-specific services behind common interfaces
- This protects against: Provider-level failures, control plane outages, DNS issues
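And a sketch of "abstract cloud-specific services behind common interfaces": the application codes against a small ObjectStore interface, and each provider gets a thin adapter. The class and method names are illustrative, not an established library, and each adapter assumes its SDK is installed and credentials are configured.

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Cloud-agnostic object storage interface the app codes against."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class S3Store(ObjectStore):
    def __init__(self, bucket: str):
        import boto3  # imported lazily so the module loads without the SDK
        self.bucket, self.s3 = bucket, boto3.client("s3")
    def put(self, key, data):
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=data)
    def get(self, key):
        return self.s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()

class GCSStore(ObjectStore):
    def __init__(self, bucket: str):
        from google.cloud import storage
        self.bucket = storage.Client().bucket(bucket)
    def put(self, key, data):
        self.bucket.blob(key).upload_from_string(data)
    def get(self, key):
        return self.bucket.blob(key).download_as_bytes()
```

Swapping providers (or failing over between them) then becomes a configuration choice rather than a rewrite.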
Cost vs. Benefit:
- Multi-region AWS: +50-100% cost
- Multi-cloud: +100-200% cost
- But for critical services: Cost of outage >> Cost of redundancy
Lesson 2: Design for Recovery, Not Just Uptime
AWS fixed the DNS bug in 2.5 hours, but needed 12+ hours for recovery. Why? Because recovery at scale is a separate challenge.
Key Takeaways:
Implement Circuit Breakers Everywhere
Circuit breakers prevent cascading failures by failing fast when dependencies are unhealthy.
Three states:
- Closed (normal): Requests pass through
- Open (failure): Requests fail immediately without trying
- Half-open (testing): Limited requests test if service recovered
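A minimal circuit breaker might look like the sketch below. This is illustrative, not a production library (libraries such as pybreaker implement the pattern more completely).

```python
import time

class CircuitBreaker:
    """Fail fast while a dependency is unhealthy, then probe after a cool-down."""
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: the cool-down has passed, let this request probe.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open (or re-open) the circuit
            raise
        self.failures, self.opened_at = 0, None  # success closes the circuit
        return result
```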
Implement Queue Backpressure
Queues accumulate work during outages. Without limits, recovery overwhelms the system.
Solutions:
- Bounded queues: Set maximum queue size, reject new messages when full
- TTL (Time-To-Live): Drop messages older than X minutes
- Priority queues: Process fresh messages before stale backlog
- Rate limiting: Gradually increase processing rate during recovery
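A sketch combining the first two ideas, a bounded queue with a message TTL (the size and TTL are arbitrary placeholders): reject new work up front instead of letting a recovery-time backlog grow without limit, and drop work too stale to be worth doing.

```python
import time
from collections import deque

class BoundedQueue:
    """Bounded FIFO with per-message TTL to shed stale backlog."""
    def __init__(self, max_size=10_000, ttl_seconds=300):
        self.items = deque()
        self.max_size = max_size
        self.ttl = ttl_seconds

    def put(self, message) -> bool:
        if len(self.items) >= self.max_size:
            return False  # backpressure: reject instead of queueing forever
        self.items.append((time.monotonic(), message))
        return True

    def get(self):
        while self.items:
            enqueued_at, message = self.items.popleft()
            if time.monotonic() - enqueued_at <= self.ttl:
                return message  # still fresh enough to process
            # else: stale message, drop it and keep looking
        return None
```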
Graceful Degradation Beats Total Failure
During recovery, reduce functionality instead of failing completely:
- Serve cached data instead of live data
- Disable analytics/telemetry to reduce load
- Return read-only mode for databases
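For example, a read path can fall back to a possibly-stale cache instead of surfacing an error (a sketch; the cache here is just an in-process dict, and fetch_live stands in for whatever live data source you use):

```python
_cache = {}  # user_id -> last known profile

def get_profile(user_id, fetch_live):
    """Serve live data when possible, degrade to cached or default data otherwise."""
    try:
        profile = fetch_live(user_id)   # normal path: live data
        _cache[user_id] = profile
        return profile, "live"
    except Exception:
        if user_id in _cache:
            return _cache[user_id], "cached (possibly stale)"
        return {"name": "unknown"}, "degraded default"
```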
Lesson 3: Your Dependencies Have Dependencies
When AWS's status dashboard showed "all is well" for 75 minutes during the outage, it wasn't lying; it was broken too. The dashboard depended on services affected by the outage.
Key Takeaways:
Map your entire dependency graph
- What does your app depend on?
- What do those services depend on?
- Keep going 3-4 levels deep
Build with failure in mind
- Circuit breakers (see above) at every layer
- Graceful degradation (see above), which reduces features instead of failing completely
- Bulkheads, which isolate components so one failure doesn't spread
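One simple way to get bulkheads in practice is to give each downstream dependency its own bounded worker pool, so a hung dependency can only exhaust its own threads, not the whole service. A sketch (the pool names and sizes are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# One small, isolated pool per dependency: a stuck dependency can only
# tie up its own workers.
pools = {
    "payments": ThreadPoolExecutor(max_workers=4),
    "analytics": ThreadPoolExecutor(max_workers=2),  # non-critical, kept small
}

def call_isolated(dependency, operation, timeout=2.0):
    future = pools[dependency].submit(operation)
    # A slow call raises TimeoutError here instead of blocking the caller,
    # and any threads it holds belong only to that dependency's pool.
    return future.result(timeout=timeout)
```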
Example:
Your App
↓
AWS Lambda (depends on)
↓
DynamoDB (depends on)
↓
DNS (depends on)
↓
Route 53 Control Plane (depends on)
↓
DynamoDB US-EAST-1 ← [THIS BROKE, EVERYTHING DIED]
Final Thought
One phonebook. One race condition. One region.
That's all it took to knock out a good amount of the internet for half a day.
Cloud service providers like AWS are reliable, but that doesn't mean we should take them for granted. Sometimes all it takes is one overlooked bug, waiting for exactly the wrong day.
Thanks for reading. I hope you learned something! Until next time.