Performance - the never-ending process

by michal-frackowiak on 06 Jun 2012 21:09

Honestly, the last few weeks have been really difficult for our team, mainly because of the performance of our servers and the Wikidot service itself. Performance-related tickets were opened on our Feedback site: some users could not see their own posts, others double-posted their messages, and tags were taking ages to change on some sites. Moreover, activities were not updated in real time for some users, and searching sites threw errors from time to time.

Such problems are not "deadly", but diagnosing and fixing them is time-consuming and requires a lot of effort.

When things are good, they are good. There are weeks or even months when there are literally no issues with our infrastructure. But when things start to break, they break all at once, or at least appear to. Some problems can create chain reactions, impacting other elements of the system.

Our first bottleneck was the database that stores user activities. When its performance dropped, it affected other components of Wikidot: queries were timing out, load on some servers increased, and available resources became limited.

One day we discovered locks in our primary database, most likely caused by concurrent attempts to update the same content. Some pages (or tags on certain sites) could not be edited until we manually released the blocking database queries. This was strange, and we believe it came from our web workers (PHP), which started dying without releasing resources or closing database transactions.

I believe we have resolved most of the performance issues now. Search still fails from time to time — availability is around 99.7% over all the sites right now, which means there might be a few minutes per day when you get errors when searching sites.
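
For the curious, here is a quick back-of-the-envelope check of that "few minutes per day" figure. It is only a sketch: it assumes failures are spread evenly over a 24-hour day, and the function name is made up for this illustration.

```python
MINUTES_PER_DAY = 24 * 60  # 1440 minutes in a day


def expected_downtime_minutes(availability_percent: float) -> float:
    """Average minutes per day a service is unavailable at the given availability.

    Hypothetical helper for illustration; assumes failures are evenly spread.
    """
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_DAY


if __name__ == "__main__":
    # Search availability quoted above: around 99.7%
    print(round(expected_downtime_minutes(99.7), 2))
```

At 99.7% this works out to a little over four minutes per day, although in practice errors tend to come in bursts rather than evenly.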

We have also tuned the low-level server configuration (TCP stack) to improve performance under high load and traffic.

To make things even more bizarre, a medium-size botnet launched a DDoS attack on Wikidot yesterday, opening thousands of connections to our servers in the hope of bringing our websites down. Just when we thought we had all performance issues fixed :-/

We have no idea what the real target was; probably one of the sites we host. Fortunately, this cost us no more than 2 minutes of unavailability on one of our web servers. We were surprised to find out the attack originated in countries like Kazakhstan, India, Qatar, Serbia, Indonesia, the Russian Federation, Belarus, Turkey and Kyrgyzstan.

As you can see, during the last few weeks our whole team has been focused on improving performance and fixing issues so that nothing affects the sites we host or the way you use Wikidot. Now that most of the issues are fixed, we will take a short breather and get back to work on other pending improvements.

BTW: Wikidot T-shirts and other gadgets should be available soon! Stay tuned!
