S3 and Wikidot outage

by michal-frackowiak on 01 Mar 2017 10:54

Some of you noticed that Wikidot was down yesterday for more than two hours. This is not something that happens often: the previous serious outage we had was in November 2014.

The root cause of this outage was an Amazon S3 failure. Wikidot, like millions of other websites all over the internet, relies on S3 for hosting files. S3 is where we keep the files you upload to your wikis, but it is also where we keep the JavaScript and CSS files required to display our web pages in your browsers.

Until now, S3 had never failed to this extent. It was certainly not a small failure, and it took down several other Amazon services. It also affected all services and websites that use Amazon Web Services. It was not just Wikidot: Trello, Travis CI, GitHub and GitLab, Quora, Medium, Signal, Slack, Imgur, Twitch.tv, Razer, Apple's iCloud and several other websites could not function properly (or were not reachable at all). A significant percentage of websites all over the world rely on S3, and only now have we learned what happens when it is down.

The issue was so severe that even Amazon could not update their status board to let us know about the problems. It was probably hosted on S3 as well… It looks like engineers simply assumed that S3 would be available no matter what.

Wikidot's infrastructure design relies on a similar assumption about S3. We assumed that anything could break except S3 itself. Even our backup site (in case databases and servers fail) is hosted on S3.

I guess that today many admins and developers (especially at services affected by the S3 outage) are trying to find a way to loosen their dependencies on S3 and protect their services against similar events. We are going to look at this too, because the data you keep with us is our top priority.

Thank you for your understanding. I am sorry for any trouble our outage may have caused.

Michal and the Wikidot Team
