
Too Many Eggs, One Basket: Lessons from the AWS Outage


In the early morning of October 20, 2025, Amazon Web Services, the backbone of much of the modern internet, experienced a widespread outage in its Northern Virginia region. Within hours, popular apps, business platforms, and government services began to slow or fail. By evening, AWS reported that services were operating normally, though some backlogs continued to clear afterward. This was not a minor hiccup: it took much of the day to resolve, and by the time systems steadied, the outage had already reminded everyone how deeply daily life depends on the same shared foundations.


The Impact 

The outage originated in AWS’s US-EAST-1 region, which supports a significant portion of global cloud activity. That single region underpins countless tools and services used every day by businesses, governments, and consumers alike. Well-known platforms such as Zoom, Venmo, and Alexa saw interruptions, but the effects reached much farther than that. 


For many organizations, the disruption was one step removed. Their own systems appeared stable, yet vendors or downstream providers that relied on AWS began to falter. Even companies with no direct contract felt the slowdown through partners and service integrations that quietly depend on the same infrastructure. 


The Cause 

AWS said the incident stemmed from DNS resolution issues affecting DynamoDB service endpoints in US-EAST-1 and began mitigation after identifying the problem (AWS update). In parallel, traffic health checks did not behave as expected, which complicated rerouting and recovery. The combination created a chain of disruptions that took most of the day to unwind.


In short, one lookup broke, one database stalled, and everything built on top of them learned what “shared dependency” really means. 
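
For technically minded readers, the sketch below shows roughly what a dependency-level DNS probe can look like, written in Python. The DynamoDB hostname follows AWS's regional endpoint naming; the vendor hostname and the check itself are illustrative assumptions, not a description of AWS's internal tooling or of any specific monitoring product.

import socket
from datetime import datetime, timezone

# Hypothetical list of endpoints this check watches. The first follows the
# regional DynamoDB endpoint naming pattern; the second stands in for a
# vendor API your business depends on indirectly.
ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",
    "api.example-vendor.com",
]

def resolves(hostname: str) -> bool:
    """Return True if the system resolver can find at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, 443)) > 0
    except socket.gaierror:
        return False

timestamp = datetime.now(timezone.utc).isoformat()
for host in ENDPOINTS:
    status = "OK" if resolves(host) else "DNS RESOLUTION FAILURE"
    print(f"{timestamp}  {host}: {status}")

A probe like this only tells you that a name stopped resolving. Pairing it with vendor status pages and your own application alerts is what turns it into a useful signal.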


The Response 

AWS posted regular updates, isolated the DNS issue, and restored service, with some queues taking longer to clear. By evening, operations were mostly normal. 


AWS confirmed that the outage was not the result of a cyberattack and said a detailed incident analysis would be released. The company’s updates through its status page and social channels provided transparency but were highly technical, which made it difficult for non-technical teams to interpret and share meaningful updates inside their organizations. 


What This Illustrates About Concentration Risk 

This was concentration risk in practice: too much dependency in one place. The AWS US-EAST-1 region is popular because it is large, efficient, and cost-effective. That popularity concentrates demand, which can magnify impact during an incident.


When multiple organizations and their vendors depend on the same region, a single problem can become a multi-industry event. Many companies that believed they were diversified discovered their vendors were sitting on the same underlying infrastructure.


What It Reveals About Fourth- and Nth-Party Risk 

Even companies far removed from AWS saw disruptions. That is extended vendor risk: your vendor's vendor, or their vendor's vendor, fails, and the impact still reaches you.


A payment platform might use AWS directly, while your billing software depends on that platform. Your HR system’s analytics add-on might sit on AWS even if the core platform does not. The farther down the chain the issue occurs, the harder it is to see, yet the business effect is the same. 


The Broader Lesson: Shared Infrastructure Means Shared Consequences 

Cloud computing has made business faster and more connected. It has also made business deeply interdependent. When one provider falters, entire industries can feel the shock.


Technical events become business events quickly. Disruptions affect customer access, transactions, revenue, and regulatory expectations. For TPRM programs, resilience is not about predicting every outage. It is about understanding dependency risk and being ready to respond calmly when it appears. 



What TPRM Practitioners Should Be Doing Now 

The AWS outage was a free stress test. Even if your organization stayed upright, it showed how much depends on a handful of cloud providers. Now it’s time to turn awareness into action. 


1. Revisit your dependency map 

Trace your direct, fourth-party, and nth-party exposure. You do not need to document every sub-vendor, but you should know where critical systems live and who connects them. A simple sketch of one way to capture this follows the list below.

  • Review your direct vendors and note hosting provider and region. 

  • Identify shared dependencies across your portfolio. 

  • Flag any service that leans on a single region. 

  • Share this with cybersecurity and IT partners to align contingency plans. 
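
To make the list above concrete, here is a minimal sketch of a dependency map in Python. The vendor names, providers, regions, and criticality ratings are hypothetical placeholders; the point is that a structured map makes shared dependencies and single-region exposure easy to query.

# Hypothetical dependency map: each entry records a direct vendor, its cloud
# provider, the regions it has disclosed, and whether you consider it critical.
VENDOR_MAP = [
    {"vendor": "BillingCo",   "provider": "AWS",   "regions": ["us-east-1"],        "critical": True},
    {"vendor": "HRSuite",     "provider": "Azure", "regions": ["eastus", "westus"], "critical": False},
    {"vendor": "AnalyticsCo", "provider": "AWS",   "regions": ["us-east-1"],        "critical": True},
]

def shared_dependencies(vendor_map):
    """Group vendors by the provider/region pairs they have in common."""
    groups = {}
    for entry in vendor_map:
        for region in entry["regions"]:
            groups.setdefault((entry["provider"], region), []).append(entry["vendor"])
    return {key: vendors for key, vendors in groups.items() if len(vendors) > 1}

def single_region_critical(vendor_map):
    """Flag critical vendors that disclosed only one hosting region."""
    return [e["vendor"] for e in vendor_map if e["critical"] and len(e["regions"]) < 2]

print("Shared dependencies:", shared_dependencies(VENDOR_MAP))
print("Single-region critical vendors:", single_region_critical(VENDOR_MAP))

Even a spreadsheet with the same columns achieves the goal; the structure matters more than the tooling.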

 

2. Strengthen collaboration between TPRM and Cybersecurity/Information Technology 

When an outage hits, both perspectives are essential. 

  • Cyber professionals (which may include the incident response team) focus on the how: root cause, technical exposure, and data integrity. 

  • TPRM focuses on the so what: business impact, vendor accountability, and continuity of services. 


Confirm with IT which systems can run from more than one location. Confirm with TPRM which vendors must maintain uptime and notify you. If this partnership is informal, formalize a simple workflow that defines who watches vendor status, how alerts move to business leaders, and who decides when to communicate with executives or customers. 


3. Update due diligence and contracting 

Bake resilience into every step of the vendor lifecycle. 


During due diligence 

  • Ask where systems are hosted, including backup regions. 

  • Require disclosure of key sub-vendors such as cloud hosts and data processors. 

  • Confirm that failover is tested and recent. 

  • Check that downtime tolerance matches your business needs (see the sketch after this list). 
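
On that last point, the comparison can be as simple as the sketch below. The figures are hypothetical; the check is whether the recovery time objective (RTO) a vendor states fits inside the downtime your business can actually tolerate.

# Hypothetical figures: your maximum tolerable downtime for the process the
# vendor supports, versus the recovery time objective (RTO) the vendor states.
MAX_TOLERABLE_DOWNTIME_HOURS = 4
VENDOR_STATED_RTO_HOURS = 8

if VENDOR_STATED_RTO_HOURS > MAX_TOLERABLE_DOWNTIME_HOURS:
    print("Gap: vendor RTO exceeds your downtime tolerance; escalate before signing.")
else:
    print("Vendor RTO fits within your downtime tolerance.")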


In contracts 

  • Add notification timelines for incidents that affect your data or operations. 

  • Require vendors to maintain and test continuity and disaster recovery plans on a regular basis (at least annually). 

  • Define how credits or remedies apply during regional incidents. 

  • Include data portability and exit terms so you can migrate if reliability declines. 


For existing contracts, capture this through an addendum or vendor questionnaire. The goal is alignment between your expectations and actual capabilities. 


4. Treat vendor resilience as an ongoing metric 

Do not let resilience live in a one-time questionnaire. 

  • Track uptime and incident response quarterly. 

  • Watch how vendors communicate during industry-wide disruptions. 

  • Follow up with any vendor that takes more than a business day to confirm whether they were affected. 


Transparency and communication matter as much as uptime. 
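
One lightweight way to track both is sketched below. The incident records are hypothetical; the idea is to compute quarterly uptime from logged downtime and to flag any vendor that took longer than a day (a rough proxy for a business day) to confirm whether it was affected.

from datetime import timedelta

# Hypothetical quarterly log: downtime observed for each vendor and how long
# the vendor took to confirm whether it was affected by an industry incident.
QUARTER_HOURS = 91 * 24  # roughly one quarter
INCIDENT_LOG = [
    {"vendor": "BillingCo",   "downtime_hours": 6.5, "confirm_response": timedelta(hours=3)},
    {"vendor": "AnalyticsCo", "downtime_hours": 0.0, "confirm_response": timedelta(days=2)},
]

for record in INCIDENT_LOG:
    uptime_pct = 100 * (QUARTER_HOURS - record["downtime_hours"]) / QUARTER_HOURS
    slow = record["confirm_response"] > timedelta(days=1)
    flag = "  <- follow up: slow to confirm impact" if slow else ""
    print(f"{record['vendor']}: {uptime_pct:.2f}% uptime this quarter{flag}")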


5. Bring the lesson to leadership 

Executives and boards care about continuity, not DNS details. Use this event as a case study. 

Keep it in business terms. 

  • How long could you operate if your main region failed? 

  • Which vendors share that region? 

  • How long does recovery actually take in hours, not in theory? 


Boards and regulators should already be asking about cloud concentration and systemic risk. Showing mapped dependencies and credible plans signals maturity and foresight. 



Not Ready for All That Yet? Try This Instead 

If your program is not ready for the full list above, start smaller. A one-hour tabletop can surface the most important gaps before you redesign your program. 


A One-Hour Tabletop: “When the Cloud Falters” 

Scenario: Your most important customer-facing service is degraded for six hours because your cloud provider’s main region is down. 


Prompts: 

  1. What fails first, and who notices? 

  2. Who owns communication with leadership and customers? 

  3. What do you tell executives in the first 30 minutes? 

  4. What data confirms whether the issue is internal or supplier-related? 

  5. If the outage lasts more than four hours, how do you continue operations? 

  6. When and how do you tell customers you are stable again? 


What good looks like: 

  • Clear ownership of communication and impact analysis. 

  • Named roles for executive updates and recovery coordination. 

  • A realistic recovery time, not a guess. 

  • Two improvement items assigned for follow-up within 30 days. 


Start here. Capture where confusion happens and what slows decisions. The results will show you where to strengthen communication, contracts, and coordination next. 


Conclusion 

The AWS outage was not just about downtime. It was about concentration risk and dependency, and how quietly that risk grows until something forces everyone to see it. What looked like one point of failure was really a network of shared reliance across vendors, industries, and geographies.


For TPRM professionals, the lesson is to stop treating concentration as abstract and start treating it as operational reality. Every vendor, every contract, and every dependency tells part of that story. The work ahead is not to eliminate risk; it is to ensure that when one link breaks, which it inevitably will, the rest of the chain holds.


Additional Resource

Explore our certificate, Securing SaaS Applications: A Comprehensive Approach to Cloud Risk Management, which provides an in-depth look at evaluating and managing risks associated with cloud-based SaaS solutions.


Author Bio


Hilary Jewhurst

Sr. Membership & Education Coordinator at TPRA


Hilary Jewhurst is a seasoned expert in third-party risk and risk operations, with nearly two decades of experience across financial services, fintech, and the nonprofit sector. She has built and scaled third-party risk programs from the ground up, designed enterprise-wide training initiatives, and developed widely respected content that helps organizations navigate regulatory complexity with clarity and confidence.

Known for turning insight into action, Hilary’s thought leadership and educational work have become go-to resources for professionals looking to mature their TPRM programs. She regularly publishes articles, frameworks, and practical guides that break down complicated risk topics into meaningful, accessible strategies.


Hilary recently joined the Third Party Risk Association (TPRA) as a staff member, supporting industry-wide education, peer learning, and advancing best practices. She is also the founder of TPRM Success, a boutique consultancy that helps organizations strengthen their third-party risk management capabilities through targeted training, tools, and strategic guidance.
