144 Seconds Downtime in February, Replacing Servers to Handle Growth

by michal frackowiak on 04 Mar 2010 09:00

We use various techniques to monitor the health of Wikidot servers, ranging from our home-brewed scripts to external services like Pingdom. Pingdom is really great: their automated service alerts us every time there is a low-level problem with Wikidot, e.g. servers are overloaded and cannot handle connections, something crashes, or the datacenter has network problems.

I must say that February was a bit boring. The total downtime (service unavailability) was 2 minutes and 22 seconds, as reported by Pingdom, which gives us a really good 99.994% uptime!
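As a quick check of that figure, here is a minimal Python sketch of the arithmetic. It assumes the reporting period is the whole of February 2010 (28 days), and the function name is only illustrative. Note that the 144 seconds in the title and the 2 minutes 22 seconds (142 seconds) quoted above both round to the same percentage.

```python
# Illustrative arithmetic only; assumes the period is all of February 2010.
SECONDS_IN_FEBRUARY_2010 = 28 * 24 * 60 * 60  # 28 days


def uptime_percent(downtime_seconds, period_seconds=SECONDS_IN_FEBRUARY_2010):
    """Return uptime as a percentage of the given period."""
    return 100.0 * (1 - downtime_seconds / period_seconds)


# 142 s is the "2 minutes and 22 seconds" figure; 144 s is the title's figure.
for downtime in (142, 144):
    print(f"{downtime} s down: {uptime_percent(downtime):.3f}% uptime")
```

Either way the result is 99.994% when rounded to three decimal places.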

Now, this is starting to get boring…

Now I can tell you why. Over the last year we have been working hard to fix most server-related quirks, find workarounds for PHP bugs, and constantly tune the configuration of our servers. We also started replacing old servers with top-notch hardware, e.g. using super-fast SSD drives for our databases. And configuring servers to handle 5,000,000 hits every day is, as we have learned the hard way, no piece of cake.

99.994% uptime makes our Team very happy and confident about our service. I think it is a superb result and I hope we will continue providing Wikidot services non-stop in the future.

I cannot resist adding that our hosting company, SoftLayer, is really superb. The reliability of Wikidot is in large part thanks to their remarkable service.

And by the way: we are working on high availability for all our users and sites, not only those with Pro accounts.

Replacing the server handling uploaded files

In the next few days we will be making a big step forward by replacing the server directly handling file uploads. We are dumping the oldest server we have at SoftLayer, replacing it with a shiny, new, and also much faster one.

Although we have the whole procedure planned, there might be small unexpected problems with uploads, or slower serving of uploaded files, today and tomorrow. We will of course try to avoid them, but you know, just in case…

If you find out that something terribly wrong is happening to your files, please let us know, preferably as a comment to this blog post. We will keep you up-to-date too.

Migration updates

UPDATE: custom themes based on [[code]] extraction were not accessible yesterday for a few hours, starting from roughly 4 PM UTC, when we started the *.wdfiles.com migration.

This was because of a limitation of the PCRE (Perl Compatible Regular Expressions) library, which we use heavily, including when traversing wiki source to get the contents of code blocks.

The problem was reported at 9:40 PM UTC by RobElliott and was tracked down and solved about two hours later by Michał. We needed to recompile the library so that it uses the heap instead of the stack when handling recursive regular expressions (otherwise a stack overflow occurs). We had to do this on every server before, and we simply missed it on the new one.

Since 11:30 PM UTC custom themes have been working correctly, as confirmed by users.
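To give a rough idea of what "traversing wiki source to get the contents of code blocks" involves, here is a minimal sketch in Python. It is not Wikidot's actual PHP/PCRE code: the pattern and the function name are assumptions made for illustration, and it is not meant to reproduce the overflow described above.

```python
import re

# Hypothetical pattern (an assumption, not Wikidot's real one): matches
# [[code]] or [[code ...attributes...]] up to the nearest [[/code]].
CODE_BLOCK = re.compile(
    r"\[\[code(?:\s[^\]]*)?\]\](.*?)\[\[/code\]\]",
    re.DOTALL | re.IGNORECASE,
)


def extract_code_blocks(source):
    """Return the contents of every code block found in the wiki source."""
    return [m.group(1).strip("\n") for m in CODE_BLOCK.finditer(source)]


wiki_source = (
    "Theme page\n"
    '[[code type="css"]]\n'
    "body { color: red; }\n"
    "[[/code]]\n"
)
print(extract_code_blocks(wiki_source))
```

A theme stored this way depends entirely on the extraction succeeding, which is why a failing regular expression made custom themes inaccessible.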

UPDATE2: We have noticed that you can't rename files and that site backups are not being made. This will be fixed when we finish the migration of *.wdfiles.com.

So far we have successfully copied all the user files and have them fully in sync, including the changes made during the migration. That was achieved using unionfs. Now we can "merge" the changes back into the original directory, and renaming (and thus backups) should be working again. Expect these features to be working by tomorrow.
