12.13.2021

AWS outage: It was our mistake, Amazon admits, albeit vaguely

The widespread AWS outage on December 7 was caused by Amazon's own software, and the response was hampered by ... its own software. What does Amazon's post-mortem really tell us?


Image: Angela Lang / CNET

The AWS outage on December 7, which hampered Amazon's own operations and those of many of its customers, now has an official, albeit vague, explanation: it was our fault.


Specifically, it was AWS' own internal software that caused the error: an automated scaling activity on AWS' main network triggered "unexpected behavior" in a large number of clients on its internal network, which hosts essential services such as monitoring, internal DNS and authorization.

SEE: Hiring Kit: Cloud Engineer (TechRepublic Premium)

"Because of the importance of these services in this internal network, we connected this network to several geographically isolated network devices and scaled the capacity of this network significantly to ensure the high availability of this network connection," said AWS. Unfortunately, one of those scaling services, which AWS said had been in production for many years with no issues, caused a massive spike in link activity that overwhelmed the devices handling communications between AWS 'internal and external networks to the networks. 7:30 am PST.

To make matters worse, the surge in traffic produced a massive spike in latency that affected AWS' internal dashboards and made it impossible to use the systems designed to find the source of the congestion. To locate it, AWS engineers had to comb through log files, which showed elevated internal DNS errors. Their solution was to move DNS traffic off the congested network paths, which fixed the DNS errors and improved availability for some, but not all, services.
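With dashboards unusable, raw logs became the fallback. As a rough illustration only (the log format below is hypothetical, not AWS's), a short Python script can surface a rising internal DNS error rate by bucketing log lines per minute:

    import re
    import sys
    from collections import Counter

    # Hypothetical log line: "2021-12-07T07:31:02Z resolver SERVFAIL query=service.internal"
    MINUTE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")

    errors, totals = Counter(), Counter()
    for line in sys.stdin:
        match = MINUTE.match(line)
        if not match:
            continue
        minute = match.group(1)
        totals[minute] += 1
        if "SERVFAIL" in line or "TIMEOUT" in line:
            errors[minute] += 1

    for minute in sorted(totals):
        print(f"{minute}  dns_error_rate={errors[minute] / totals[minute]:.1%}")

Run as, for example, python dns_error_rate.py < resolver.log; a minute-over-minute jump in the error rate points to when the congestion began.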

Additional strategies attempted to further isolate distressed parts of the network, bring new capacity online, and so on. These were also slow going, AWS said. The latency of its monitoring software made it difficult to track changes, and its own internal deployment systems were affected as well, making changes harder to roll out. To complicate things further, the outage did not affect all AWS customers, so the team "was extremely deliberate while making changes to avoid impacting functional workloads," AWS said. It took a while, but by 2:22 p.m. PST, AWS said all of its network devices had been fully restored.

AWS has disabled the scaling activities that triggered the event and says it will not bring the system back online until all fixes have been implemented, which is expected within the next two weeks.

What to take away from AWS' statement about its failure

As is often the case with statements like this, there's a lot to unpack, especially given how vague AWS has been, said Brent Ellis, senior analyst at Forrester. "The problem I see is that the description isn't specific enough to allow customers to plan for that particular failure. Not everyone hosted on AWS went down. It would be helpful to understand what those companies were doing differently so that others could follow suit. For now, customers have to trust AWS to fix the situation," Ellis said.

Ellis also said that Amazon's statement itself is alarming for reasons beyond the outage occurring at all: it indicates that the interaction between AWS' external and internal networks can be problematic if it can cause such widespread trouble.

SEE: Checklist: How to manage your backups (TechRepublic Premium)

That doesn't mean the cloud is a bad bet, Ellis said: He remains optimistic that it's a "great place for an enterprise to move its technology." However, Ellis returns to a refrain that has come up after previous cloud outages: risk.

"In general, [cloud providers] are always more redundant, more secure, and more reliable than most organizations' internal infrastructure, but they are not without risk," said Ellis. His personal advice to anyone dealing with the cloud is to diversify, tone down, and educate yourself. "When you can scale a service to run on multiple clouds or on the local cloud +; Do it. If that is not possible, negotiate the sharing of business risk, learn about [the cloud provider's] practices, and negotiate to align the practices with your internal resilience needs, "said Ellis.

Ellis compares cloud resilience planning to the way an organization would build a secondary data center outside the disaster radius to ensure continuity. The cloud takes all of those problems off your hands, Ellis said, but in return, a single human or automated error is compounded across much larger portions of that company's infrastructure.

For the cloud to remain successful, Ellis said, cloud providers need to standardize in some ways to make it easier to move data, duplicate workloads and simplify redundancy. The goal, he said, would be a situation similar to international travel: you need an adapter for a different type of plug, but the underlying principles of operation are shared, so you only need a virtual adapter to get from cloud A to cloud B.
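In code, that "virtual adapter" maps onto the classic adapter pattern. A hedged sketch (all class and method names below are invented for illustration, not a standard API): each provider gets a thin adapter behind one shared interface, so moving a workload from cloud A to cloud B only changes which adapter is constructed.

    from abc import ABC, abstractmethod

    class BlobStore(ABC):
        """The shared 'plug shape' every provider adapter must fit."""

        @abstractmethod
        def put(self, key: str, data: bytes) -> None: ...

        @abstractmethod
        def get(self, key: str) -> bytes: ...

    class CloudAStore(BlobStore):
        """Hypothetical adapter for provider A (stubbed in memory; would call A's SDK)."""
        def __init__(self) -> None:
            self._objects: dict[str, bytes] = {}
        def put(self, key: str, data: bytes) -> None:
            self._objects[key] = data
        def get(self, key: str) -> bytes:
            return self._objects[key]

    class CloudBStore(CloudAStore):
        """Hypothetical adapter for provider B (reuses the stub; would call B's SDK)."""

    def run_workload(store: BlobStore) -> bytes:
        store.put("config", b"portable across clouds")
        return store.get("config")

    # Switching providers is a one-line change in what gets constructed:
    print(run_workload(CloudAStore()), run_workload(CloudBStore()))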

SEE: iCloud vs. OneDrive: Which is best for Mac, iPad and iPhone users? (free PDF) (TechRepublic)

Gartner VP of cloud technologies and services Sid Nag agrees with the ideal of interoperability, especially in a world where large providers are becoming "too big to fail."

"Our daily life depends more and more on the cloud industry; Cloud providers have to support each other, "said Nag. Like Ellis' recommendation, the ultimate goal appears to be a cloud marketplace that achieves its essential utility for modern society and seeks to become less competitive and error-prone.

"This is what utility will be like cloud computing. Once that is done, it will be easier to create services to move a workload when a problem occurs with a cloud provider," said Ellis.
