Wikidot.com has just experienced about two hours of downtime due to database issues. The situation was not critical and no data was lost. We were able to fix the problem and bring the service back online, despite the early-morning Sunday hours.
This is the first downtime of this length in quite a long time. We consider 2 hours to be very long when we think about availability. Usually we keep issues from taking Wikidot down, and we achieve a monthly availability rate of over 99.9%, often reaching 100.0% (as reported by monitoring tools at Pingdom, based on tests on a selected number of sites).
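To put those numbers in perspective, here is a rough calculation. It is only a sketch: it assumes a 30-day month, and the function names are made up for illustration.

```python
def availability_percent(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Return monthly availability as a percentage."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def downtime_budget_minutes(target_percent: float, days_in_month: int = 30) -> float:
    """Return how many minutes of downtime a monthly target allows."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1.0 - target_percent / 100.0)


# A 2-hour outage in a 30-day month:
print(round(availability_percent(120), 2))      # 99.72
# Downtime allowed by a 99.9% monthly target:
print(round(downtime_budget_minutes(99.9), 1))  # 43.2
```

Under that assumption, a single 2-hour outage uses up almost three times the roughly 43 minutes that a 99.9% month allows.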
I am terribly sorry about the unavailability. This of course gives us hints on how to avoid such situations in the future. I also hope that, because of the weekend and early-morning (UTC) hours, the downtime was not that painful, and hopefully not even noticeable for most of our users.
More explanation:
Actually what happened was that one of our database servers went down. This is nothing unusual, since we have data replicated to other servers. The event triggered a series of alarm states (more than 10), each generating several alert messages.
Database failover is still done manually, because an automated failure test could produce a false positive. When a master database fails (this is what happened), we need to manually pass the "master" role to one of the other servers. The switch itself is automated and only requires a simple "drag&drop" reconfiguration, but until someone triggers it the database is not accessible.
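As a rough illustration of the false-positive concern, the sketch below only reports the master as down after several consecutive failed checks, and even then it only alerts people instead of switching roles by itself. This is a hypothetical example, not Wikidot's actual tooling: the function names, the threshold of 3 and the list-of-booleans probe history are all assumptions.

```python
from typing import Sequence


def master_looks_down(probe_results: Sequence[bool], required_failures: int = 3) -> bool:
    """Return True only if the last `required_failures` probes all failed.

    probe_results holds one boolean per health check, oldest first
    (True = the master answered, False = it did not).
    """
    if len(probe_results) < required_failures:
        return False  # not enough evidence yet
    return not any(probe_results[-required_failures:])


def next_step(probe_results: Sequence[bool]) -> str:
    """Suggest an action; the promotion itself is left to a human."""
    if master_looks_down(probe_results):
        return "alert operators: master appears down, manual failover needed"
    return "no action"


print(next_step([True, True, False]))          # a single failed check: no action
print(next_step([True, False, False, False]))  # three in a row: alert operators
```

A single missed check (for example a brief network hiccup) does not trigger anything here, which is the kind of false positive that makes fully automatic promotion risky.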
What actually failed was the alert delivery system and the human factor. It was such a rare coincidence that it should never have happened. Three people receive alerts, each through three different channels (Pingdom alerts, Prowl alerts and email), which should reach us on mobile phones and laptops. When we are all in the office, an event like this sets off plenty of different sounds.
This setup does work, and we have been solving problems quickly no matter the time: night, evening or weekend. Waking up at 2AM to fix issues is nothing unusual here. In most cases we fix issues before they escalate and take parts of Wikidot down.
Unfortunately this Sunday none of us got these alerts due to various circumstances.
As soon as we learned about the problem, we tweeted about it and it was fixed in 5 minutes.
It might sound silly, but this is what actually happened. Needless to say, this shows an issue we need to work on.
It's okay
I was sleeping… :)
As an old database admin, I am curious: what happened to the database?
Service is my success. My webtips:www.blender.org (Open source), Wikidot-Handbook.
You can ask questions and contribute in the German-language » user community for Wikidot users or
in the German » Wikidot Handbook.
It's fine, just seems at times that the Wikidot Team is unreachable. I'd love to see an "Aware of the problem" post on Twitter or something, followed up by a "Problem is now resolved" message later on.
At least that way, we can see if you know about the problem, and be assured that even if there is downtime, it won't be for much longer.
Thank you for your efforts to resolve the problem on a weekend :)
~ Leiger - Wikidot Community Admin - Volunteer
Wikidot: Official Documentation | Wikidot Discord server | NEW: Wikiroo, backup tool (in development)
Whaaaaaat? It's outrageous that just because it's night time some of our admins seem to think it's OK to get some sleep.
Rob Elliott - Strathpeffer, Scotland - Wikidot first line support & community admin team.
I agree with Shane, a quick tweet would provide some re-assurance that although the car has broken down someone is tinkering with the engine, even on a sleepy Sunday morning (European time).
Rob Elliott - Strathpeffer, Scotland - Wikidot first line support & community admin team.
I agree with Shane and Rob too. But I would appreciate a message on FB, as I don't have (and I am not thinking of getting) a Twitter account…
A suggestion: you can connect FB with Twitter, so when you post on Twitter, it will automatically appear on FB, and you won't have to write messages in both places.
If slaughterhouses had glass walls, everyone would be vegan. - Paul McCartney
Isn't the Twitter conversation that includes #wikidot embedded on our community site, in the left side menu?
I always start from this frame in our community to have a look at what is new on Twitter.
(I do not have an FB account and do not want one in the future; I do not need it for business.)
Service is my success. My webtips:www.blender.org (Open source), Wikidot-Handbook.
Yes it is, but if Wikidot is down, then you cannot see it. ;)
It doesn't matter who has which account, Twitter or Facebook. The reality is that some Wikidot users have only a Twitter account or only an FB account, and when there are problems it would be nice of Wikidot to let its users know that they are aware of the problem and working on it. Connecting their two accounts, Twitter and FB, would let them post from one account (the one they prefer) and keep users of both networks equally informed.
Having said that, I must admit this kind of problem is very rare on Wikidot, and they have almost always responded to my help calls on FB…
If slaughterhouses had glass walls, everyone would be vegan. - Paul McCartney
Oha!
Thanks for the clarification!
Service is my success. My webtips:www.blender.org (Open source), Wikidot-Handbook.
This really does not make us happy. I remember at least one event in the "early days" when Wikidot was unavailable for half a night just because the server went down and there was no monitoring set up to notify me. At the time I was the only person responsible. Wikidot has evolved since then: it no longer depends on a single person and has a really well-developed monitoring setup. However, as can be seen, it can still fail :-(
Michał Frąckowiak @ Wikidot Inc.
Visit my blog at michalf.me
Thanks for the detailed update Michal. I really appreciate the time and effort that you guys put into running Wikidot. It's nice to know that you guys have failovers for database failures — even though it still requires manual intervention.
Perhaps it's a good idea to have a status site (running on infrastructure separate from Wikidot) that gives us a definitive status of the service and of the team's awareness. The current error message gives us rather limited and outdated information: it links to this page, created in 2009, which seems to outline the "current problems" of that time.
Kenneth Tsang (@jxeeno)
Agreed, thanks for the detailed information.
~ Leiger - Wikidot Community Admin - Volunteer
Right. Outages do not happen often, but we could work on communication when they do. Perhaps a combination of an error-page and recent tweets?
Michał Frąckowiak @ Wikidot Inc.