On Tuesday, a massive Amazon AWS outage took down many popular sites, including Spotify, Netflix, Reddit, and Pinterest, for roughly four hours. While it may seem that only something like a massive DDoS attack or other major incident could cause an outage on this scale, the truth of the matter is a bit more embarrassing.
According to a statement recently issued by Amazon, the AWS outage was actually nothing more than an OSI Reference Model Layer 8 issue, otherwise known as human error. It appears that while removing a small set of servers from Amazon's billing system, the AWS employee in charge of the operation fat-fingered the command, removing more servers than intended.
At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
Although human error will always be a factor in any process that involves a human, Amazon is making a number of changes to ensure this doesn't happen again, or at least that the effects are significantly reduced if it does. These include modifying the tool the AWS employee used so that it removes capacity more slowly and never allows capacity to dip below the minimum necessary to maintain service. Additionally, Amazon is implementing changes to improve recovery time in the event unexpected downtime occurs.
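A safeguard along the lines Amazon describes could be sketched roughly as follows. This is purely illustrative: the names (`MIN_ACTIVE`, `REMOVAL_BATCH`, `safe_remove`) and the logic are assumptions for the sketch, not Amazon's actual internal tooling.

```python
import time

# Hypothetical thresholds, not AWS internals.
MIN_ACTIVE = 50        # minimum servers required to keep the subsystem serving
REMOVAL_BATCH = 2      # remove capacity slowly, a few servers at a time
BATCH_DELAY_SECS = 0   # pause between batches (0 here so the sketch runs instantly)

def safe_remove(active_servers, to_remove):
    """Remove servers in small batches, refusing to dip below MIN_ACTIVE."""
    removed = []
    for i in range(0, len(to_remove), REMOVAL_BATCH):
        batch = to_remove[i:i + REMOVAL_BATCH]
        # Guard: never let active capacity fall below the service minimum,
        # even if the operator's input asked for far more than intended.
        if len(active_servers) - len(batch) < MIN_ACTIVE:
            raise RuntimeError(
                f"refusing removal: would leave "
                f"{len(active_servers) - len(batch)} servers, "
                f"below minimum of {MIN_ACTIVE}"
            )
        for server in batch:
            active_servers.remove(server)
            removed.append(server)
        time.sleep(BATCH_DELAY_SECS)
    return removed
```

With a check like this in place, a fat-fingered request to pull too many servers fails fast at the guard instead of silently draining the fleet below serving capacity.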