Last Friday I started drafting a new blog post about Wikidot server uptime. Every month Pingdom sends us a report with average response times and uptime summaries for the various checks we run. The numbers are always stellar: around 99.99% uptime, with only a few minutes of detected outages.
Friday night, a thriller story
Our infrastructure is pretty stable and resilient, so I was hoping to get 100% uptime in November. Then, on Friday evening, we started getting alerts from Pingdom, CloudWatch and other services. Something was wrong. Wikidot was slow: loading a page could take up to a minute, and sometimes pages would not load at all. Pingdom alerts went crazy as various wikis went down and came back up at random.
Any service works, until it doesn't. ELB is no exception.
We immediately jumped to our computers. Friday evening, a perfect time for an emergency. But everything seemed to be OK with our servers. No excessive load, no increased number of connections. Nothing. Even the request rate per second was within normal limits.
It took us hours of experiments, replacing web servers and altering parameters, to discover that the problem was beyond our control. It was introduced by the AWS Elastic Load Balancers (ELBs) that distribute web requests across our farm of servers.
We run 6-10 web servers, and over 20 servers altogether.
ELBs are a great piece of AWS infrastructure, with auto-scaling, failover and so on. But they are black boxes. Sure, you get some metrics and logs, but nothing more. You cannot tune critical parameters, you cannot scale them manually, and you cannot see what exactly is going on inside them when things go wrong.
Client (browser) connections were stuck in the SYN_SENT state, meaning the TCP handshake was never completed. It might have been caused by internal resource starvation at the balancers, a SYN flood attack or some misconfiguration. Whatever the reason, the balancers could not handle incoming connections, and we had no idea why.
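To illustrate the symptom (a hypothetical sketch, not our exact commands): on Linux you can count half-open connections by filtering for SYN_SENT. The canned netstat-style lines below stand in for a live host's output; the addresses are invented.

```shell
# Count TCP connections stuck in SYN_SENT (SYN sent, no SYN-ACK back --
# the three-way handshake never completes).
# Hypothetical sample output; field 6 is the connection state.
sample='tcp 0 1 10.0.0.5:43210 203.0.113.7:443 SYN_SENT
tcp 0 0 10.0.0.5:43211 203.0.113.7:443 ESTABLISHED
tcp 0 1 10.0.0.5:43212 203.0.113.7:443 SYN_SENT'
printf '%s\n' "$sample" | awk '$6 == "SYN_SENT"' | wc -l   # prints 2

# On a real host you would run something like:
#   ss -tan state syn-sent | wc -l
```

A healthy balancer should show almost no connections lingering in this state; a growing count means clients are knocking and nobody answers.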
Yes, we fixed it. Well, kind of.
Eventually we set up a new ELB to handle traffic to Wikidot and, surprisingly, it started working out of the box. Problem solved, we thought. I went to bed at 3 AM believing the issue was behind us. AWS support was notified.
It did not work for long. On Saturday evening the problem returned. We pulled a quick trick: since we still had the old balancer around (we had not deleted it), we switched traffic back to it. That helped… for a few hours. We had to launch a brand-new balancer, but by now we suspected (correctly) that the problem would repeat.
Again and again…
The same happened on Sunday, and we were really tired of the situation. We were doing everything we could to keep Wikidot online and responsive, but it was not easy and we were losing hope. We tried distributing traffic over multiple balancers to mitigate the issue. It helped, but not for long.
The situation continued on Monday. Despite our efforts, some users were getting timeout errors and our alerts kept indicating performance issues. We had ruled out the possibility of a SYN flood attack; there was clearly something wrong with the ELBs.
HAProxy to the rescue!
After an unfruitful conversation with AWS support, we made the decision yesterday to ditch ELBs and use the open-source HAProxy load balancer to handle our traffic. Although we had to configure SSL termination, make HAProxy automatically add and remove web servers and, most importantly, prepare it to handle hundreds of requests per second, we did it all within hours. By the evening we had a working setup and we made the switch.
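Our actual configuration is more involved, but a minimal sketch of the kind of setup described above (SSL terminated at the balancer, plain HTTP forwarded to a health-checked farm) could look like the fragment below. All names, paths and addresses are invented for illustration.

```
# Minimal illustrative haproxy.cfg -- names and addresses are hypothetical.
global
    maxconn 20000

defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

# Terminate SSL at the balancer (native in HAProxy 1.5+),
# then forward plain HTTP to the web farm.
frontend www
    bind :80
    bind :443 ssl crt /etc/haproxy/certs/example.pem
    default_backend web_farm

backend web_farm
    balance roundrobin
    option httpchk GET /
    server web1 10.0.1.11:80 check
    server web2 10.0.1.12:80 check
```

Adding or removing a web server is then a matter of regenerating the `server` lines and reloading HAProxy, which is easy to automate.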
It's cheaper and it works better.
Guess what: HAProxy is amazing. Two c3.large instances are handling traffic to wikidot.com without breaking a sweat. Encouraged by this, we moved traffic from wdfiles.com (which serves user-uploaded files) over as well, ditching one more ELB. Each HAProxy balancer handles up to 200 requests per second at less than 10% CPU load. Not to mention it does not go spontaneously offline or exhibit any strange issues. Rock-solid.
HAProxy gives us much more control over connection handling: connection timeouts, rate limiting, abuse handling, anti-DDoS settings and much, much more. ELB is great because it works out of the box, but as we learned the hard way, it is not a one-size-fits-all tool.
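For example, the kind of per-client rate limiting that ELB does not expose takes only a few lines of HAProxy configuration. This is a hypothetical fragment, not our production settings; the thresholds are invented.

```
# Hypothetical fragment: track each client IP and reject obvious abusers.
frontend www
    bind :80
    # remember per-IP counters, expiring after 60s of inactivity
    stick-table type ip size 200k expire 60s store conn_rate(10s)
    tcp-request connection track-sc0 src
    # drop clients opening more than 100 connections per 10 seconds
    tcp-request connection reject if { sc0_conn_rate gt 100 }
    default_backend web_farm
```

With a black-box ELB, none of these knobs exist; with HAProxy they are a config reload away.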
Actually, this was not the first time we had issues with ELBs. Twice in the last two years we had to call support to perform their voodoo tricks to make ELBs work properly with wdfiles.com. But this time the explanation we got from AWS did not make much sense, nor did it do us any good.
We could not wait any longer for a solution. As I write this, we still have not received any response beyond being told it is a "scaling issue", and the balancers are still broken. I am glad we did not wait; so far, HAProxy rocks!
Be up to date, follow @wikidot on Twitter!
We have been reporting critical milestones of the operation on our Twitter account. If you run a critical wiki on Wikidot and want to be the first to know what is going on, follow @wikidot!
Last but not least
If you have experienced problems with Wikidot in the last few days, I am terribly sorry. Things happen, and some are beyond our control. One thing is for sure: Wikidot's reliability is our priority, and we always do what we can to keep your wikis online, responsive and secure. We do our best!
On Monday I spent a while on the phone with David, who manages the ELB team. He confirmed there was a scaling issue and a connection-saturation problem on our ELB. Indeed, we had been getting extra traffic of non-human origin since 1 November, and this escalated the issue.
David, thanks for spending time explaining the problem. I highly appreciate it!
Since people keep asking me about our HAProxy setup, here is a short post about it.