Wikidot Service Problems

by pieterh on 26 Aug 2009 09:07

26 August, 21:13 CET: for what it's worth, there are two independent live
backups of the database. No matter how badly wikidot.com crashes, you will
not lose your edits or sites. Every single edit is replicated off-site,
twice. Tomorrow we will re-upgrade to the latest software version. This
should happen without any downtime.

26 August, 21:00 CET: everything except email notifications is working now.
We'll have to tackle that tomorrow. Sorry folks, no edit notifications today!
I assume they won't be lost, and will all arrive tomorrow once we get this
sorted.

26 August, 19:37 CET: custom CSS and images are working now. We all cheer
Piotr, our networking guru, who has found a way to get around SoftLayer's
totally broken "portable IP address" (as portable as a rock!)

26 August, 19:20 CET: Remaining issues: custom CSS and images are not all
working. Software version is one week old, so some things like the link to the
previous blog posting here do not work. Email notifications are not being
sent.

26 August, 19:07 CET: "It looks like we're back in business, folks."
Spoke too soon. Still some problems with IP addresses and firewalls that make
custom CSS and attachments not work properly. However the excellent news is
that the server is stable and load is coming down to normal levels. So… it
really was snowleopard.wikidot.com destroying our service. The irony is that
this site did not even do any heavy work. Just a long page being fetched far,
far too often by impatient Mac users. And from the cache! Time to implement
maximum connection limits on free sites.
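
A minimal sketch of what a per-site connection cap could look like. This is
purely illustrative and not how Wikidot.com is implemented: the class name
SiteConnectionLimiter, its methods, and the limit value are all hypothetical.

```python
import threading
from collections import defaultdict


class SiteConnectionLimiter:
    """Hypothetical per-site cap on simultaneous connections."""

    def __init__(self, max_per_site):
        self.max_per_site = max_per_site   # assumed limit, chosen by the operator
        self._active = defaultdict(int)    # site name -> open connection count
        self._lock = threading.Lock()      # guards the counters across threads

    def try_acquire(self, site):
        """Reserve a slot for `site`; return False if the site is at its cap."""
        with self._lock:
            if self._active[site] >= self.max_per_site:
                return False
            self._active[site] += 1
            return True

    def release(self, site):
        """Free a slot when a connection to `site` closes."""
        with self._lock:
            if self._active[site] > 0:
                self._active[site] -= 1
```

A request handler would call try_acquire(site) before serving a page, turn the
request away when it returns False, and call release(site) when the connection
closes. One busy site then runs out of its own slots instead of taking
everything else down with it.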

26 August, 18:58 CET: We were twittered! Search "wikidot" on twitter.com
and there are dozens and dozens of links to the offending site. Oh, and one
nice person who tweets "Fxxk you, wikidot, fxxk you. Seriously. You're going
down, and you're going to LIKE IT." Sigh. It has been a long day for all
of us.

26 August, 18:47 CET: It looks like we're back in business, folks.

26 August, 18:42 CET: Yay! wdfiles works. Some random firewall setting
that was blocking HTTP requests to the new IP address we set that domain to.
The next five minutes will tell whether Wikidot.com is back and stable, or
just about to die again.

26 August, 18:32 CET: Looks like it was an accidental denial-of-service by
a bunch of crazy OS/X fanatics desperately looking for information on a thing
called "Snow Leopard". Still no happiness on wdfiles.com though.

26 August, 18:29 CET: Mutterings of "what the?" and "that's bizarre!" from
Piotr as he debugs the problems with wdfiles.com. Your sites are fast enough
but the browser is timing out on fetching CSS.

26 August, 18:22 CET: Main server is looking much better now. CPU is
stable, load average is also stable, and network traffic is slowly coming back
to normal levels. Now just that problem with wdfiles.com to fix…

26 August, 18:19 CET: The possibly-offending site is offline, DNS back to
normal and now just the wdfiles domain acting strange. This means that CSS
themes are not working and you may see the browser waiting for a long time to
try to resolve the wdfiles.com domain.

26 August, 18:04 CET: It looks like some specific very high-volume web
sites might have been causing the problem. Just too many people trying to hit
the server at once… we're disabling that particular site and reloading the
Matrix.

26 August, 17:48 CET: SoftLayer tells us the portable IP addresses are now
working.

26 August, 17:33 CET: Welcome back. You're getting this page instead of a
useless "500 Internal Server Error" on the Wikidot.com site you're trying to
access.

26 August, 17:25 CET: Configuring the Lighttpd server so that the "500"
error message redirects automatically to this page. Looking at
www.lighthttpd.org and realizing it's actually hosted at Wikidot.com! Oops.
Trying www.lighttpd.org instead but that does not respond. Has the whole
Internet died today? Ah, Google finds lighttpd.net…

26 August, 17:16 CET: "500 - Internal Server Error". The problem is back.
CPU usage at 100% (on all CPUs) and the web service is giving errors. We're
switching the DNS back to show this page. No point in delivering tens of
thousands of error messages.

26 August, 17:03 CET: When the DNS updates are done, all sites in the
wikidot.com domain should work. So should all custom domains that resolve to
"wikidot.com". To test this, try "ping sitename", and if it tells you
"74.86.235.236", then you will be OK (or whatever passes for "OK" right now).
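
If you would rather script the check than read ping output, something like
this should do the same job. It is only a sketch: the function name
points_at_main_server is made up for illustration, and the expected address is
the one quoted above.

```python
import socket

# Address quoted in the update above; change it if the DNS target changes.
EXPECTED_IP = "74.86.235.236"


def points_at_main_server(hostname, resolve=socket.gethostbyname):
    """Return True if `hostname` currently resolves to EXPECTED_IP.

    `resolve` can be swapped out, which makes the check testable
    without touching the network.
    """
    try:
        return resolve(hostname) == EXPECTED_IP
    except OSError:
        # DNS lookup failed (socket.gaierror is a subclass of OSError).
        return False
```

Call it as points_at_main_server("sitename.wikidot.com"), with your own site
name or custom domain in place of "sitename". Keep in mind that your local
resolver may serve the old address until its cache expires.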

26 August, 16:55 CET: To heck with SoftLayer. We're switching the DNS
back to the main server, using a fixed IP address. It should take 15 minutes
for the DNS to refresh. Then we'll all be back on the main server and either
it will work, finally, or still crash. We are, incidentally, running a week-old
version of Wikidot.com so some sites may not work perfectly.

26 August, 16:52 CET: Still no answer from SoftLayer :-/… but we keep
asking.

26 August, 16:27 CET: Waiting for SoftLayer to fix the IP addresses. Custom
domains using the main server will almost work, but won't fetch stylesheets,
and won't load images, nor let you log in.

26 August, 16:15 CET: We're talking to SoftLayer, asking why the portable
IPs don't work. They are looking into it. Hopefully they're faster than us at
solving technical issues. :-/

26 August, 16:10 CET: We have restarted the Wikidot.com service on the main
server and some custom domains are now using it. CPU load looks normal but
accessing web sites is very slow. We're not back to normal yet.

26 August, 16:03 CET: SoftLayer offers "portable IP addresses" but they
really do not seem to work. We may need to move to a different hosting
provider. It is essential to be able to switch to another box but keep the
same IP address.

26 August, 15:50 CET: We have spent the last half day testing different
parts of the service. We can now exclude firewall problems, denial-of-service
attacks, database load, and network traffic. The remaining culprit is the
software that runs Wikidot.com. We are now reverting to older versions to see
if a recent change was responsible. This is a slow process, sorry!

26 August, 15:43 CET: This page is at 74.86.234.147, the backup server.
The main server is at 67.228.37.26. We switched the DNS so that if you use
somename.wikidot.com, you will see this page. If you use a custom domain
and configured your DNS to use the 67.228.37.26 IP address, you may still be
landing on the main server.

26 August, 15:37 CET: Wikidot.com has switched to a backup server but that
is thrashing (out of physical memory). Some people are reporting successful
web site loads but overall it is not working. We are taking down the service
and replacing it with this static HTML page.

26 August, 14:00 CET: It's been a frustrating morning of "500 Internal
Server Error" and failed AJAX requests. Frustrating for us, unable to find
the cause of this, and frustrating for the thousands of people trying to
access their websites.
