About a year ago, in one of our December blog posts, we set a goal of keeping downtime under 5 minutes per month, and we have finally achieved it. To keep reliability this high, we continue to improve our code, infrastructure and monitoring services.
Over the last year we used many techniques and brought in a few software solutions that together made our service more reliable, available and stable for both anonymous and logged-in users.
As we described just after introducing the solution, we let Varnish cache pages for anonymous users. This means the application servers do less work and the service is less loaded overall.
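As a rough illustration of the idea (written in Python rather than Varnish's own configuration language), the decision a cache in front of the application has to make could look like the sketch below. The cookie name `session_id` and the function `is_cacheable` are assumptions made for this example, not our actual setup.

```python
from typing import Dict

# Hypothetical cookie name marking a logged-in user; the real one may differ.
SESSION_COOKIE = "session_id"


def is_cacheable(method: str, cookies: Dict[str, str]) -> bool:
    """Return True if the response may be served from the shared cache."""
    # Only read-only requests are candidates for caching.
    if method not in ("GET", "HEAD"):
        return False
    # A session cookie means a logged-in user, whose pages are personalised,
    # so those requests always go to the application servers.
    return SESSION_COOKIE not in cookies
```

Anonymous page views then share one cached copy of a page, while logged-in users and any request that changes data still reach the application servers.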
The application server is the heart of our infrastructure. It processes all user requests and responds to them, so it is very important to keep its performance high. This is why we regularly extend our server inventory with new dedicated multi-threaded multiprocessor servers with lots of memory and CPU power, so that the PHP scripts can process more requests and run faster. We can sleep well because the processing power we have is way more than we need.
Because fetching data from the database takes some time (even with the database files kept on super-fast solid-state disks), we do a lot of caching in our code. Recently we reviewed the parts of the code that generate the most database queries and applied robust Memcached-based caching to them. This keeps server response time low.
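The general pattern behind this kind of caching can be sketched as follows. This is a minimal Python illustration, not our production code: `TTLCache` is a tiny in-memory stand-in for a Memcached client, and `cached_query`, the key names and the 60-second lifetime are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory stand-in for a Memcached client (illustration only)."""

    def __init__(self) -> None:
        # Maps key -> (expiry time on the monotonic clock, cached value).
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: Any, ttl: float) -> None:
        self._store[key] = (time.monotonic() + ttl, value)


def cached_query(cache: TTLCache, key: str,
                 load: Callable[[], Any], ttl: float = 60.0) -> Any:
    """Return the cached value for key, calling load() only on a miss.

    Note: a result of None is treated as a miss and is never cached.
    """
    value = cache.get(key)
    if value is None:
        value = load()  # the expensive database query goes here
        cache.set(key, value, ttl)
    return value
```

With this in place, repeated requests for the same data within the lifetime of the entry hit the cache, and the database is queried only once per key per lifetime.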
Usually, before a problem appears, there is some sign of it coming, like increased resource usage, a surge of incoming traffic, or the CPUs on the servers using all their power to generate pages for users. Usually there is time to react before the problem affects users' sites. This is why we improved and extended our monitoring services.
We monitor not only server resources but also Wikidot actions, like the number of accounts created, files uploaded or pages edited. Any of those values dropping to 0 means a serious problem. Values that are way too high usually mean that spammers are trying to abuse Wikidot. In both cases monitoring helps keep the service available for legitimate (and paying) users.
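The rule described above is simple enough to sketch in a few lines of Python. The function name `classify_activity`, the status labels and the threshold values are hypothetical. Real limits would depend on the metric and on the time window being measured.

```python
def classify_activity(count: int, spam_threshold: int) -> str:
    """Classify an activity counter measured over a fixed time window.

    count: number of events (e.g. pages edited) seen in the window.
    spam_threshold: hypothetical upper bound for normal activity.
    """
    if count < 0:
        raise ValueError("count must not be negative")
    if count == 0:
        # Nobody can register, upload or edit: something is broken.
        return "outage"
    if count > spam_threshold:
        # Far more activity than usual: likely automated abuse.
        return "possible_spam"
    return "ok"
```

For example, `classify_activity(0, 500)` gives `"outage"`, while `classify_activity(5000, 500)` gives `"possible_spam"`.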
We are also planning to move the database server to a faster machine around February, with new database software: PostgreSQL 9, which greatly simplifies and improves database replication.