Performance Problems Update

by pieterh on 28 Aug 2009 14:15

Over the last few days we've seen performance problems as many people came to http://snowleopard.wikidot.com looking for information on Apple's upcoming Mac OS X release. Today I can explain in more detail why these problems happened, and how we're going to fix them.

First, you'll see that Wikidot is now stable and reasonably fast (though it could be faster, and will be once we're done). The Snow Leopard site is not the only high-volume site; there are several other popular ones, like http://fretsonfire.wikidot.com.

We moved the Snow Leopard page to a static HTML page, so that anonymous visitors see that page, while logged-in users see the 'real' page. This cuts traffic going to the Wikidot engine. You can see this page if you log out.

The problem with this particular page is the number of people editing it and requesting it at the same time. After each edit, the cache is hit by hundreds or thousands of refresh requests in parallel. This creates a backlog that slows down the whole engine, so more and more new requests pile up in the queue. It takes longer and longer to process each request, so the queues go from 30-50 (normal) to 200-500, at which point everything starts to fail. This is poor design in the cache, which should not render the new page more than once. But it's not something that can be fixed easily.
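To make "should not render the new page more than once" concrete, here is a minimal Python sketch of that idea. It is not Wikidot's code: the SingleFlightCache class, the render callback, and the page names are all hypothetical, and it assumes a single process with threads. When many requests arrive for a page that has just been invalidated, only the first one renders it and the rest wait for that result.

```python
import threading

class SingleFlightCache:
    """Hypothetical page cache that lets only one thread render a missing page."""

    def __init__(self, render):
        self._render = render            # assumed callback: page name -> HTML
        self._pages = {}                 # page name -> rendered HTML
        self._locks = {}                 # page name -> per-page render lock
        self._guard = threading.Lock()   # protects the two dicts above

    def invalidate(self, name):
        # Called after an edit: drop the cached copy so the next get() re-renders.
        with self._guard:
            self._pages.pop(name, None)

    def get(self, name):
        with self._guard:
            html = self._pages.get(name)
            if html is not None:
                return html
            lock = self._locks.setdefault(name, threading.Lock())
        # Only one thread per page gets past this point at a time.
        with lock:
            # Re-check: another thread may have rendered while we waited.
            with self._guard:
                html = self._pages.get(name)
            if html is None:
                html = self._render(name)
                with self._guard:
                    self._pages[name] = html
            return html
```

With this shape, a burst of parallel requests after an edit costs one render plus some waiting, instead of one render per request.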

Instead of changing the way we cache and render pages, our new architecture adds a new front layer of static HTML that mirrors every single page. It serves these pages to anonymous users, without ever hitting the cache and the Wikidot engine core. These mirrors will refresh asynchronously: when the page is edited, of course, but also at regular intervals so that dynamic content (from modules) stays current. Anonymous users will see a site that is always a minute or so out of date. Logged-in users will see the latest revision, exactly as now.
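Roughly, the front layer behaves like the Python sketch below. This is an illustration under assumptions, not the real design: StaticMirror, mark_edited, refresh_due, the 60-second max_age, and the injectable clock are all invented for the example, and a real mirror would write HTML files rather than keep a dict in memory. It also assumes that a page with no snapshot yet gets one built on the first anonymous request.

```python
import time

class StaticMirror:
    """Hypothetical front layer: static snapshots for anonymous visitors."""

    def __init__(self, render, max_age=60.0, clock=time.monotonic):
        self._render = render      # assumed callback into the engine: name -> HTML
        self._max_age = max_age    # assumed refresh interval, in seconds
        self._clock = clock        # injectable so the behaviour can be tested
        self._snapshots = {}       # page name -> (html, time rendered)
        self._dirty = set()        # pages edited since their last snapshot

    def _store(self, name):
        html = self._render(name)
        self._snapshots[name] = (html, self._clock())
        self._dirty.discard(name)
        return html

    def serve(self, name, logged_in):
        if logged_in:
            return self._render(name)      # logged-in users always hit the engine
        if name not in self._snapshots:
            return self._store(name)       # first anonymous request builds the snapshot
        return self._snapshots[name][0]    # otherwise serve the (possibly stale) copy

    def mark_edited(self, name):
        # Called by the engine after an edit; the refresh itself happens later.
        self._dirty.add(name)

    def refresh_due(self):
        """Meant to run in a background worker: re-render edited or aged pages."""
        now = self._clock()
        due = [name for name, (_, at) in self._snapshots.items()
               if name in self._dirty or now - at >= self._max_age]
        for name in due:
            self._store(name)
        return due
```

The point of the split is that serve() for an anonymous visitor never waits on a render once a snapshot exists; all rendering for them happens in refresh_due(), away from the request path.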

The main benefit of this design is that the bulk of high-volume traffic will be handled with no stress at all to the Wikidot core. Moving that single Snow Leopard page to static HTML already cut the CPU load by 35-40%, and doing this for all sites will produce an even more dramatic change. The result, if we get this right, will be that editing and reloading dynamic content is much faster for logged-in users than it is now.

You can consider the current workaround for the Snow Leopard page as a prototype for this new architecture. We had to construct it in some haste last night, as Wikidot continued to crash and we realized the cache was thrashing.

Two additional benefits. First, we can scale out the static HTML mirrors to any size, meaning that we'll be able to handle millions of hits per day, or per hour, if necessary. Second, we'll have independent static HTML mirrors of every site, so that if the Wikidot engine does die for other reasons, logged-in users can get the static HTML mirror as well. Even if the Wikidot engine just takes too long to respond, the front-end can switch to the static HTML mirror and add a suitable message. Proper failover, and scalability.
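The failover part could look something like this Python sketch. Again, this is only an assumption about the shape of it: serve_with_failover, the NOTICE text, the two-second timeout, and the dict standing in for the mirror are all hypothetical. The front-end asks the engine for the page, and if the engine fails or does not answer in time, it serves the static copy with a notice attached.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical notice added to pages served from the mirror during failover.
NOTICE = "<!-- served from static mirror: engine unavailable -->\n"

# Worker threads that run the (possibly slow) engine calls.
_pool = ThreadPoolExecutor(max_workers=8)

def serve_with_failover(name, engine_render, mirror, timeout=2.0):
    """Try the engine first; fall back to the static snapshot on error or timeout.

    engine_render is an assumed callback (page name -> HTML) and mirror is
    assumed to be a dict-like mapping of page name -> snapshot HTML.
    """
    future = _pool.submit(engine_render, name)
    try:
        return future.result(timeout=timeout)
    except Exception:
        # Covers both an engine error and a timeout. The engine call itself is
        # not interrupted; it keeps running in its worker thread.
        snapshot = mirror.get(name)
        if snapshot is None:
            raise                      # nothing to fall back to
        return NOTICE + snapshot
```

One design caveat: a timed-out engine call still occupies a worker until it finishes, so a real front-end would also need to limit how many such calls it lets pile up.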

It will take us some weeks to completely work through the implications of this. I'll keep you informed of our progress.
