Wikidot.com has just experienced about two hours of downtime due to database issues. The situation was not critical and no data was lost. We were able to fix the problem and bring the service back online, despite the early Sunday morning hours.
This is the first downtime of this length in quite a long time. We consider 2 hours to be very long in terms of availability. We usually keep issues from taking Wikidot down and achieve a monthly availability rate of over 99.9%, often reaching 100.0% (as reported by the monitoring tools at Pingdom, based on tests of a selected number of sites).
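To put these numbers in perspective, here is a rough calculation. It assumes a 30-day month (the exact month length is an assumption), and the function names are purely illustrative. Under that assumption, a 99.9% target leaves a budget of about 43.2 minutes of downtime per month, while a single 2-hour outage brings the month down to about 99.72%.

```python
# Availability arithmetic for one month.
# Assumption: a 30-day month; function names are illustrative only.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes


def availability_percent(downtime_minutes: float) -> float:
    """Return monthly availability as a percentage."""
    return 100.0 * (1 - downtime_minutes / MINUTES_PER_MONTH)


def downtime_budget_minutes(target_percent: float) -> float:
    """Return the downtime allowed per month at a given availability target."""
    return MINUTES_PER_MONTH * (1 - target_percent / 100.0)


if __name__ == "__main__":
    print(f"2-hour outage: {availability_percent(120):.2f}% availability")
    print(f"99.9% target:  {downtime_budget_minutes(99.9):.1f} minutes allowed")
```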
I am terribly sorry about the unavailability. This of course gives us hints on how to avoid such situations in the future. I also hope that, because of the weekend and the early-morning (UTC) hours, the downtime was not that painful, and perhaps not even noticeable for most of our users.
What happened was that one of our database servers went down. This in itself is nothing unusual, since our data is replicated to other servers. The event triggered a series of alarm states (more than 10), each generating several alert messages.
Database failover is still triggered manually, because an automatic failure test could produce a false positive. When a master database fails (which is what happened), we need to manually pass the "master" role to one of the other servers. The operation itself is automated and requires only a simple "drag&drop" reconfiguration. Until that is done, the database is not accessible.
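The idea can be illustrated with a toy model. This is only a sketch, not our actual tooling: the `Cluster` class, its methods, and the server names are all hypothetical. Monitoring merely records that a server is down, and the database stays inaccessible until an operator explicitly promotes a healthy replica.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of one master with replicas (all names are hypothetical)."""
    master: str
    replicas: list[str]
    down: set[str] = field(default_factory=set)

    def mark_down(self, server: str) -> None:
        # Monitoring only records the failure. It never promotes on its own,
        # because the failure test might be a false positive.
        self.down.add(server)

    def is_accessible(self) -> bool:
        # The database is usable only while the current master is up.
        return self.master not in self.down

    def promote(self, replica: str) -> None:
        # Operator-triggered step: hand the "master" role to a healthy replica.
        if replica not in self.replicas or replica in self.down:
            raise ValueError(f"{replica} is not a healthy replica")
        self.replicas.remove(replica)
        self.master = replica


if __name__ == "__main__":
    cluster = Cluster(master="db1", replicas=["db2", "db3"])
    cluster.mark_down("db1")
    print(cluster.is_accessible())  # master is down, nobody has promoted yet
    cluster.promote("db2")
    print(cluster.is_accessible())  # a healthy replica now holds the role
```

The point of the design is that `mark_down` and `promote` are deliberately separate: a mistaken failure report costs nothing until a human confirms it, at the price of depending on that human being reachable.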
What actually failed was the alert delivery system and the human factor. It was such a rare coincidence that it should never have happened. Three people receive alerts, each through three different channels (Pingdom alerts, Prowl alerts and email), which should reach us on our mobile phones and laptops. When we are all in the office, such an event generates plenty of various sounds.
This setup does work, and we have been solving problems quickly no matter the time: nights, evenings, weekends. Waking up at 2 AM to fix issues is nothing unusual here. In most cases we fix issues before they escalate and take parts of Wikidot down.
Unfortunately, this Sunday none of us got these alerts, due to various circumstances.
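To show why this coincidence is so unlikely, the setup can be modeled as a simple fan-out. This is a sketch only: the person names are placeholders, and the functions are illustrative rather than part of any real alerting tool. With three people and three channels there are nine delivery paths, and an alert goes unnoticed only if every one of them fails to reach someone.

```python
from itertools import product

PEOPLE = ["admin_a", "admin_b", "admin_c"]  # placeholder names
CHANNELS = ["pingdom", "prowl", "email"]    # channels named in the post


def delivery_paths(people, channels):
    """Every (person, channel) pair an alert is sent over."""
    return list(product(people, channels))


def alert_noticed(paths, reached):
    """True if at least one delivery path actually reached a person."""
    return any(path in reached for path in paths)


if __name__ == "__main__":
    paths = delivery_paths(PEOPLE, CHANNELS)
    print(len(paths))                                      # number of paths
    print(alert_noticed(paths, {("admin_b", "email")}))    # one path got through
    print(alert_noticed(paths, set()))                     # nothing got through
```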
As soon as we learned about the problem, we tweeted about it and fixed it within 5 minutes.
It might sound silly, but this is what actually happened. Needless to say, this shows an issue we need to work on.