What brought down Amazon Web Services this week? Fat fingers…
What catastrophic failure brought down some of Amazon's cloud infrastructure this week? Could it be something terribly simple?
So, did you have problems earlier in the week accessing some of your favourite websites or web services? That was the mighty Amazon Web Services having some serious downtime around its S3 offering.
That must have been pretty hairy inside Amazon. Pretty stressful. Something major must have gone wrong. [Right](https://aws.amazon.com/message/41926/)?
> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests.
Wait a second, what was that again?
> Unfortunately, one of the inputs to the command was entered incorrectly
A good chunk of the web was brought down by a typo. A typing error?
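To see how easily that can happen, here is a minimal, entirely hypothetical sketch (this is not AWS's actual tooling, and every name in it is invented): a removal command that takes a server-name prefix and a count. One overly broad prefix, or one extra digit in the count, and the command quietly sweeps in servers from a completely different subsystem.

```python
# Hypothetical sketch of a capacity-removal command. The fleet names,
# the function, and its parameters are all invented for illustration;
# AWS has not published the actual command involved.

def remove_servers(fleet, prefix, count):
    """Remove up to `count` servers whose name starts with `prefix`."""
    matched = [s for s in fleet if s.startswith(prefix)]
    to_remove = matched[:count]
    remaining = [s for s in fleet if s not in to_remove]
    return remaining, to_remove

# A toy fleet: billing servers and index servers side by side.
fleet = [f"s3-billing-{i:03d}" for i in range(10)] + \
        [f"s3-index-{i:03d}" for i in range(10)]

# Intended: remove two billing servers.
remaining, removed = remove_servers(fleet, "s3-billing", 2)
# removed → ['s3-billing-000', 's3-billing-001']

# Fat-fingered: a too-short prefix plus a too-large count also
# takes out index servers the operator never meant to touch.
oops_remaining, oops_removed = remove_servers(fleet, "s3-", 12)
# oops_removed now includes servers from the index subsystem.
```

The point of the sketch: nothing in the command itself is broken. The input was simply wrong, and the blast radius of a wrong input was enormous.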
How fragile this new world of ours can be…
(You can blame this entire post on my deep and abiding happiness, as a long-term journalist, that another profession now understands the deep horror of typos…)