To people all over the world, the word “Amazon” is more synonymous with the ecommerce giant than the actual river the name comes from. In fact, before most people would link “Amazon” to the river, they’d probably link it to words like “success,” “power” and “ingenuity.”
That’s why so many were absolutely shocked when the AWS outage occurred earlier this year. Throughout its unprecedented rise, Amazon has been a technological pioneer with very few real bumps on its road to success. When AWS’s Simple Storage Service (S3) went down, it took significant chunks of the World Wide Web with it. While people all over the world demanded answers, Amazon took more than 24 hours before finally saying something on March 1st.
Human Error? Seriously?
Yes. Seriously. According to Amazon, this gigantic mishap was caused by a single employee who was simply debugging a problem with the billing system. Apparently, this person unintentionally took down numerous servers. This one mistake set off a domino effect that brought down two other server subsystems, with more to follow.
As each system lost a significant amount of its capacity, the only fix was to execute a full restart. While this was happening, S3 couldn’t service requests. With the S3 APIs out of the picture, Amazon reported that other AWS services located in the US-East-1 region were impacted. These included:
- EBS (Amazon Elastic Block Store) volumes (when data was required from an S3 snapshot)
- Amazon Elastic Compute Cloud (EC2) new instance launches
- AWS Lambda
- S3 console
All of this occurred because of one person’s innocent error while taking care of a routine IT problem.
You’d be right to be shocked by this news. Many people felt the exact same way. How could such a technological titan let something so small cause such a huge problem for millions of people all over the world?
While Amazon didn’t go into specifics, its response clearly promised changes to prevent a similar issue in the future. These even included measures to limit how much damage human error can do. After all, no matter how technologically advanced we become, humans always have to play a role.
One specific change they did outline was changing the tool employees use to take server capacity offline so that it won’t allow anyone to remove so much so quickly. This alone would have kept the AWS debacle from occurring.
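To make that kind of safeguard concrete, here is a minimal sketch in Python of a capacity-removal guard. It is not Amazon's actual tool. The `Fleet` class, the `remove_capacity` function, the 5% per-step cap and the `min_required` floor are all hypothetical names and values chosen for illustration. The sketch only shows the general idea of refusing a request that removes too much capacity at once.

```python
from dataclasses import dataclass


class CapacityRemovalError(Exception):
    """Raised when a removal request would violate a safety limit."""


@dataclass
class Fleet:
    # All fields are illustrative assumptions, not a real AWS data model.
    name: str
    active_servers: int
    min_required: int  # assumed floor the subsystem needs to keep serving


def remove_capacity(fleet: Fleet, count: int,
                    max_fraction_per_step: float = 0.05) -> int:
    """Take `count` servers offline if both checks pass.

    Returns the number of servers still active afterwards.
    """
    if count <= 0:
        raise ValueError("count must be a positive number of servers")

    # Check 1: never drop below the assumed minimum capacity.
    remaining = fleet.active_servers - count
    if remaining < fleet.min_required:
        raise CapacityRemovalError(
            f"{fleet.name}: removing {count} would leave {remaining} servers, "
            f"below the minimum of {fleet.min_required}"
        )

    # Check 2: cap how much can be removed in a single step
    # (assumed default: 5% of the current fleet, at least one server).
    max_step = max(1, int(fleet.active_servers * max_fraction_per_step))
    if count > max_step:
        raise CapacityRemovalError(
            f"{fleet.name}: removing {count} exceeds the per-step cap of {max_step}"
        )

    fleet.active_servers = remaining
    return remaining
```

With a guard like this, a mistyped number fails loudly instead of taking a subsystem offline. For example, on a hypothetical fleet of 1,000 servers with a floor of 800, a request to remove 20 succeeds, while a request to remove 500 is rejected before anything is touched.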
Another effort Amazon made has nothing to do with preventing the problem but affects the way it will be reported going forward. Should something like this ever happen again, the AWS Service Health Dashboard will continue working. When the incident occurred back in February, the dashboard went offline too. Because it is the webpage people use to check whether AWS services are functioning normally, its absence during the crisis only served to heighten the alarm.
How Has Amazon Been Affected?
It speaks volumes for the company that a problem that made international headlines doesn’t seem to have done much damage to consumer faith. AWS is still on pace to bring in $14 billion this year. Companies of every size continue to use it, and that trend doesn’t seem to be going away anytime soon.
The more important takeaway from this story may be what you’re going to do in response. Even if you plan to ditch AWS for a competitor, you will still need to depend on humans, as we mentioned earlier. It would be wise to use Amazon’s problem as a learning opportunity and think about what kinds of changes you may need to make to avoid a similar fate for your business.