Inevitable changes in our infrastructure

by michal-frackowiak on 22 Jun 2012 19:48

Recent events (yes, I mean the performance and availability issues) made us reconsider our infrastructure: servers, routers, storage, and how these things are set up. "Infrastructure" is not really something our users should be concerned about, but we, the Wikidot Team, definitely should be. The Wikidot promise is: we do the dirty work of managing the Wikidot platform and providing tools, so you can concentrate on the cool stuff. Although that "dirty work" and the improvements we plan should not directly affect the way you use Wikidot, I believe it is good to share our plans in this area too.

The key problems we are addressing are:

  • How can we adapt to long-term growth in content and traffic?
  • How can we adapt to short-term traffic spikes, like the ones we have had in the past from popular sites such as RoaringApps.com or the Snow Leopard compatibility wiki?
  • Are there any single points of failure in our infrastructure?
  • What happens when key elements of our infrastructure fail?
      • Are they redundant? If so, is the failover easy enough?
      • How quickly can we repair or replace them?
      • Is the Wikidot platform still available and usable?
  • Do we have a recovery plan?

In fact, our current setup is quite resilient. Data durability has always been our top priority: we keep multiple copies of the database (redundant disks, redundant database servers, daily backups to a remote datacenter) and of uploaded files (we use Amazon S3 to keep them safe and secure). We rely on quality hardware and quality datacenter services to keep Wikidot up to its growing tasks. It has been working quite well so far, with only a few issues (recently we discovered one of our servers had been up for 2 years without even a reboot!). But even "a few" issues are still too many for us.
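To give an idea of what the remote-backup part of this setup looks like in practice, here is a minimal sketch of a nightly database dump pushed to S3. The bucket and database names are made up, and the use of pg_dump and the boto library are assumptions for illustration only; our actual backup scripts are more involved.

#!/usr/bin/env python
"""Nightly database backup to Amazon S3.

Illustrative sketch only: the bucket name, dump path and the use of
PostgreSQL's pg_dump are assumptions, not Wikidot's actual tooling.
"""
import datetime
import subprocess

import boto  # classic boto S3 client

BUCKET = "example-wikidot-db-backups"   # hypothetical bucket name
DB_NAME = "wikidot"                     # hypothetical database name
DUMP_PATH = "/var/backups/%s.dump" % datetime.date.today().isoformat()

# 1. Dump the database to a local file (custom format, compressed).
subprocess.check_call(["pg_dump", "--format=custom", "--file", DUMP_PATH, DB_NAME])

# 2. Upload the dump to S3, where it is stored redundantly off-site.
conn = boto.connect_s3()                # reads AWS credentials from the environment
bucket = conn.get_bucket(BUCKET)
key = bucket.new_key("daily/" + DUMP_PATH.rsplit("/", 1)[-1])
key.set_contents_from_filename(DUMP_PATH)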

In every scenario we considered, we could fix the problem: replace a broken element, reassign critical tasks to other servers, or even rebuild the whole cluster in a remote datacenter. Still, we have identified some weak points:

  • Provisioning time: it takes 2-4 hours to get a replacement server in our current datacenter
  • Hardware failures: it takes from one hour up to several hours to fix a failed component
  • No 100% setup automation: some servers are still configured "by hand"; even though we have scripts that automate most tasks, they require manual interaction
  • No automatic failover or healing: a few servers play a critical role, and any failover to a backup server must be done "by hand"

We are now dedicating a significant portion of our time to solving these problems. The goal is an automated, highly available and scalable setup. The design is almost ready, with parts of the new infrastructure already working. We are also evaluating alternative datacenters, including Amazon AWS.
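To make the "automatic failover or healing" point above a bit more concrete, the sketch below shows the general shape of a watchdog that notices a dead primary server and promotes a standby without a human in the loop. Every name in it (the health-check URL, the promote-standby.sh script) is hypothetical; the real mechanism will be tied to our actual database and load-balancer setup.

#!/usr/bin/env python
"""Sketch of the kind of automatic failover we want to replace manual work.

Everything here is hypothetical: the host, the health-check URL and the
promote_standby() step stand in for whatever the final design will use.
"""
import subprocess
import time
import urllib2

PRIMARY_HEALTH_URL = "http://db-primary.internal:8008/health"  # hypothetical endpoint
CHECK_INTERVAL = 10            # seconds between checks
FAILURES_BEFORE_FAILOVER = 3   # tolerate brief glitches before acting

def primary_is_healthy():
    """Return True if the primary answers its health check in time."""
    try:
        return urllib2.urlopen(PRIMARY_HEALTH_URL, timeout=5).getcode() == 200
    except Exception:
        return False

def promote_standby():
    """Placeholder for the real promotion step (trigger file, API call, etc.)."""
    subprocess.check_call(["/usr/local/bin/promote-standby.sh"])  # hypothetical script

failures = 0
while True:
    if primary_is_healthy():
        failures = 0
    else:
        failures += 1
        if failures >= FAILURES_BEFORE_FAILOVER:
            promote_standby()
            break
    time.sleep(CHECK_INTERVAL)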

This is not the first time we have considered AWS as a home for Wikidot, but honestly, the last time we did, AWS did not have half of the useful features it has now. We already use AWS for a number of critical things, including email delivery, geo-aware content distribution, backups, file storage and even DNS.

Amazon AWS looks like it could be a big win for Wikidot, but it is still too early for a final word. I would rather not praise AWS before we set up a proof-of-concept stack and it proves to be as efficient as our current configuration.

I will keep you informed about our plans!
