Last Friday I started drafting a new blog post about Wikidot server uptime. Every month Pingdom sends us a report with average response times and uptime summaries for the various checks we run. The numbers are always stellar: around 99.99% uptime, with only a few minutes of detected outages.
Friday night, a thriller story
Our infrastructure is pretty stable and resilient, so I was hoping to get 100% uptime in November. Then, on Friday evening, we started getting alerts from Pingdom, CloudWatch and other services. Something was wrong. Wikidot was slow: loading a page could take up to a minute, and sometimes pages would not load at all. Pingdom alerts went crazy as various wikis went down and came back up at random.
Any service works, until it doesn't. ELB is no exception.
We immediately jumped to our computers. Friday evening, a perfect time for an emergency. But everything seemed to be OK with our servers. No excessive load, no increased number of connections. Nothing. Even the request rate per second was within normal limits.
It took us hours of experiments, replacing web servers and altering parameters, to discover that the problem was beyond our control. It was introduced by the AWS Elastic Load Balancers (ELBs) that distribute web requests across our farm of servers.
We run 6-10 web servers, and over 20 servers altogether.
ELBs are a great piece of AWS infrastructure, with auto-scaling, failover and so on. But they are black boxes. Sure, you get some metrics and logs, but nothing more. You cannot tune critical parameters, you cannot scale them manually, and you cannot see what exactly is going on inside them when things go wrong.
Client (browser) connections were stuck in the SYN_SENT state, meaning the TCP handshake was never completed. It might have been caused by internal resource starvation at the balancers, a SYN flood attack or some misconfiguration. Whatever the reason, the balancers could not handle incoming connections, and we had no idea why.
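To illustrate the symptom (a hypothetical sketch, not our exact commands): on Linux you can count half-open connections by filtering for SYN_SENT. The canned netstat-style lines below stand in for a live host's output; the addresses are invented.

```shell
# Count TCP connections stuck in SYN_SENT (SYN sent, no SYN-ACK back --
# the three-way handshake never completes).
# Hypothetical sample output; field 6 is the connection state.
sample='tcp 0 1 10.0.0.5:43210 203.0.113.7:443 SYN_SENT
tcp 0 0 10.0.0.5:43211 203.0.113.7:443 ESTABLISHED
tcp 0 1 10.0.0.5:43212 203.0.113.7:443 SYN_SENT'
printf '%s\n' "$sample" | awk '$6 == "SYN_SENT"' | wc -l   # prints 2

# On a real host you would run something like:
#   ss -tan state syn-sent | wc -l
```

A healthy balancer should show almost no connections lingering in this state; a growing count means clients are knocking and nobody answers.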
Yes, we fixed it. Well, kind of.
Eventually we set up a new ELB to handle traffic to Wikidot and, surprisingly, it started working out of the box. Problem solved, we thought. I went to bed at 3 AM believing the issue was behind us. AWS support was notified.
It did not work for long. On Saturday evening the problem returned. We pulled a quick trick: since we still had the old balancer around (we had not deleted it), we switched traffic back to it. That helped… for a few hours. We had to launch a brand-new balancer, but by now we suspected (correctly) that the problem would repeat.
Again and again…
The same happened on Sunday, and we were really tired of the situation. We were doing everything we could to keep Wikidot online and responsive, but it was not easy and we were losing hope. We tried distributing traffic over multiple balancers to mitigate the issue. It helped, but not for long.
The situation continued on Monday. Despite our efforts, some users were getting timeout errors and our alerts kept indicating performance issues. We had ruled out the possibility of a SYN flood attack; there was clearly something wrong with the ELBs.
HAProxy to the rescue!
After an unfruitful conversation with AWS support, we made the decision yesterday to ditch ELBs and use the open-source HAProxy load balancer to handle our traffic. Although we had to configure SSL termination, make HAProxy automatically add and remove web servers and, most importantly, prepare it to handle hundreds of requests per second, we did it all within hours. By the evening we had a working setup and we made the switch.
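Our actual configuration is more involved, but a minimal sketch of the kind of setup described above (SSL terminated at the balancer, plain HTTP forwarded to a health-checked farm) could look like the fragment below. All names, paths and addresses are invented for illustration.

```
# Minimal illustrative haproxy.cfg -- names and addresses are hypothetical.
global
    maxconn 20000

defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

# Terminate SSL at the balancer (native in HAProxy 1.5+),
# then forward plain HTTP to the web farm.
frontend www
    bind :80
    bind :443 ssl crt /etc/haproxy/certs/example.pem
    default_backend web_farm

backend web_farm
    balance roundrobin
    option httpchk GET /
    server web1 10.0.1.11:80 check
    server web2 10.0.1.12:80 check
```

Adding or removing a web server is then a matter of regenerating the `server` lines and reloading HAProxy, which is easy to automate.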
It's cheaper and it works better.
Guess what: HAProxy is amazing. Two c3.large instances are handling traffic to wikidot.com without breaking a sweat. Encouraged by this, we moved traffic from wdfiles.com (which serves user-uploaded files) over as well, ditching one more ELB. Each HAProxy balancer handles up to 200 requests per second at less than 10% CPU load. Not to mention it does not go spontaneously offline or exhibit any strange issues. Rock-solid.
HAProxy gives us much more control over connection handling: connection timeouts, rate limiting, abuse handling, anti-DDoS settings and much, much more. ELB is great because it works out of the box, but as we learned the hard way, it is not a one-size-fits-all tool.
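For example, the kind of per-client rate limiting that ELB does not expose takes only a few lines of HAProxy configuration. This is a hypothetical fragment, not our production settings; the thresholds are invented.

```
# Hypothetical fragment: track each client IP and reject obvious abusers.
frontend www
    bind :80
    # remember per-IP counters, expiring after 60s of inactivity
    stick-table type ip size 200k expire 60s store conn_rate(10s)
    tcp-request connection track-sc0 src
    # drop clients opening more than 100 connections per 10 seconds
    tcp-request connection reject if { sc0_conn_rate gt 100 }
    default_backend web_farm
```

With a black-box ELB, none of these knobs exist; with HAProxy they are a config reload away.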
Actually, this was not the first time we had issues with ELBs. Twice in the last two years we had to call support to perform their voodoo tricks to make ELBs work properly with wdfiles.com. But this time the explanation we got from AWS did not make much sense, nor did it do us any good.
We could not wait any longer for a solution. As I write this, we still have not received any response beyond being told it is a "scaling issue", and the balancers are still broken. I am glad we did not wait; so far, HAProxy rocks!
Be up to date, follow @wikidot on Twitter!
We have been reporting critical milestones of the operation on our Twitter account. If you run a critical wiki on Wikidot and want to be the first to know what is going on, follow @wikidot!
Last but not least
If you have experienced problems with Wikidot in the last few days, I am terribly sorry. Things happen, and some are beyond our control. One thing is for sure: Wikidot's reliability is our priority, and we always do what we can to keep your wikis online, responsive and secure. We do our best!
On Monday I spent a while on the phone with David, who manages the ELB team. He confirmed there was a scaling issue and a connection-saturation problem on our ELB. Indeed, we had been getting extra traffic of non-human origin since 1 November, and this escalated the issue.
David, thanks for spending time explaining the problem. I highly appreciate it!
Since people keep asking me about our HAProxy setup, here is a short post about it.