26 August, 21:13 CET: for what it's worth, there are two independent live
backups of the database. No matter how badly wikidot.com crashes, you will
not lose your edits or sites. Every single edit is replicated off-site,
twice. Tomorrow we will re-upgrade to the latest software version. This
should happen without any downtime.
26 August, 21:00 CET: everything except email notifications is working now.
We'll have to tackle that tomorrow. Sorry folks, no edit notifications today!
I assume they won't be lost, and will all arrive tomorrow once we get this
sorted.
26 August, 19:37 CET: custom CSS and images are working now. We all cheer
Piotr, our networking guru, who has found a way to get around SoftLayer's
totally broken "portable IP address" (as portable as a rock!)
26 August, 19:20 CET: Remaining issues: custom CSS and images are not all
working. Software version is one week old, so some things like the link to the
previous blog posting here do not work. Email notifications are not being
sent.
26 August, 19:07 CET: "It looks like we're back in business, folks."
Spoke too soon. Still some problems with IP addresses and firewalls that make
custom CSS and attachments not work properly. However the excellent news is
that the server is stable and load is coming down to normal levels. So… it
really was snowleopard.wikidot.com destroying our service. The irony is that
this site did not even do any heavy work. Just a long page being fetched far
far too often by impatient Mac users. And from the cache! Time to implement
maximum connection limits on free sites.
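A minimal sketch of what a per-site connection cap could look like, assuming a Python front end that can count open connections per site. The class name, method names, and the limit value are hypothetical, chosen for illustration; this is not Wikidot's actual code.

```python
import threading
from collections import defaultdict


class SiteConnectionLimiter:
    """Caps concurrent connections per site (hypothetical helper)."""

    def __init__(self, max_per_site):
        self.max_per_site = max_per_site
        self._active = defaultdict(int)   # site name -> open connections
        self._lock = threading.Lock()     # guards _active across threads

    def try_acquire(self, site):
        """Count a new connection and return True if the site is under its cap."""
        with self._lock:
            if self._active[site] >= self.max_per_site:
                return False
            self._active[site] += 1
            return True

    def release(self, site):
        """Mark one connection for the site as closed."""
        with self._lock:
            if self._active[site] > 0:
                self._active[site] -= 1

    def active(self, site):
        """Return the number of connections currently counted for the site."""
        with self._lock:
            return self._active.get(site, 0)
```

A server would call `try_acquire(site)` before handling a request, refuse or queue the request when it returns False, and call `release(site)` when the request finishes. One busy site then hits its own cap without starving the others.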
26 August, 18:58 CET: We were twittered! Search "wikidot" on twitter.com
and there are dozens and dozens of links to the offending site. Oh, and one
nice person who tweets "Fxxk you, wikidot, fxxk you. Seriously. You're going
down, and you're going to LIKE IT." Sigh. It has been a long day for all
of us.
26 August, 18:47 CET: It looks like we're back in business, folks.
26 August, 18:42 CET: Yay! wdfiles works. Some random firewall setting
that was blocking HTTP requests to the new IP address we set that domain to.
The next five minutes will tell whether Wikidot.com is back and stable, or
just about to die again.
26 August, 18:32 CET: Looks like it was an accidental denial-of-service by
a bunch of crazy OS/X fanatics desperately looking for information on a thing
called "Snow Leopard". Still no happiness on wdfiles.com though.
26 August, 18:29 CET: Mutterings of "what the?" and "that's bizarre!" from
Piotr as he debugs the problems with wdfiles.com. Your sites are fast enough
but the browser is timing out on fetching CSS.
26 August, 18:22 CET: Main server is looking much better now. CPU is
stable, load average is also stable, and network traffic is slowly coming back
to normal levels. Now just that problem with wdfiles.com to fix…
26 August, 18:19 CET: The possibly-offending site is offline, DNS back to
normal and now just the wdfiles domain acting strange. This means that CSS
themes are not working and you may see the browser waiting for a long time to
try to resolve the wdfiles.com domain.
26 August, 18:04 CET: It looks like some specific very high-volume web
sites might have been causing the problem. Just too many people trying to hit
the server at once… we're disabling that particular site and reloading the
Matrix.
26 August, 17:48 CET: SoftLayer tells us the portable IP addresses are now
working.
26 August, 17:33 CET: Welcome back. You're getting this page instead of a
useless "500 Internal Server Error" on the Wikidot.com site you're trying to
access.
26 August, 17:25 CET: Configuring the Lighttpd server so that the "500"
error message redirects automatically to this page. Looking at
www.lighthttpd.org and realizing it's actually hosted at Wikidot.com! Oops.
Trying www.lighttpd.org instead but that does not respond. Has the whole
Internet died today? Ah, Google finds lighttpd.net…
26 August, 17:16 CET: "500 - Internal Server Error". The problem is back.
CPU usage at 100% (on all CPUs) and the web service is giving errors. We're
switching the DNS back to show this page. No point in delivering tens of
thousands of error messages.
26 August, 17:03 CET: When the DNS updates are done, all sites in the
wikidot.com domain should work. So should all custom domains that resolve to
"wikidot.com". To test this, try "ping sitename", and if it tells you
"74.86.235.236", then you will be OK (or whatever passes for "OK" right now).
26 August, 16:55 CET: To heck with SoftLayer. We're switching the DNS
back to the main server, using a fixed IP address. It should take 15 minutes
for the DNS to refresh. Then we'll all be back on the main server and either
it will work, finally, or still crash. We are, incidentally, running a week-old
version of Wikidot.com so some sites may not work perfectly.
26 August, 16:52 CET: Still no answer from SoftLayer :-/… but we keep
asking.
26 August, 16:27 CET: Waiting for SoftLayer to fix the IP addresses. Custom
domains using the main server will almost work, but won't fetch stylesheets,
and won't load images, nor let you log in.
26 August, 16:15 CET: We're talking to SoftLayer, asking why the portable
IPs don't work. They are looking into it. Hopefully they're faster than us at
solving technical issues. :-/
26 August, 16:10 CET: We have restarted the Wikidot.com service on the main
server and some custom domains are now using it. CPU load looks normal but
accessing web sites is very slow. We're not back to normal, yet.
26 August, 16:03 CET: SoftLayer offers "portable IP addresses" but they
really do not seem to work. We may need to move to a different hosting
provider. It is essential to be able to switch to another box but keep the
same IP address.
26 August, 15:50 CET: We have spent the last half day testing different
parts of the service. We can now exclude firewall problems, denial-of-service
attacks, database load, and network traffic. The remaining culprit is the
software that runs Wikidot.com. We are now reverting to older versions to see
if a recent change was responsible. This is a slow process, sorry!
26 August, 15:43 CET: This page is at 74.86.234.147, the backup server.
The main server is at 67.228.37.26. We switched the DNS so that if you use
somename.wikidot.com, you will see this page. If you use a custom domain
and configured your DNS to use the 67.228.37.26 IP address, you may still be
landing on the main server.
26 August, 15:37 CET: Wikidot.com has switched to a backup server but that
is thrashing (out of physical memory). Some people are reporting successful
web site loads but overall it is not working. We are taking down the service
and replacing it with this static HTML page.
26 August, 14:00 CET: It's been a frustrating morning of "500 Internal
Server Error" and failed AJAX requests. Frustrating for us, unable to find
the cause of this, and frustrating for the thousands of people trying to
access their websites.
Thanks for your hard work! As of 6:15 pm CET my site is now up, but it's still very slow. Please keep the status updates coming.
Speed good now. Thank you!!
We love you. Thanks for everything. Rock on.
"Long ago, in a place both familiar and strange…"
Thanks for informing us what was wrong and putting up that "no it's not just you" update-page! The sites are still very slow to load, but they're loading. There are also issues with getting images and custom CSS, but it seems normal for now?
Good work on getting the site back up that fast! (you guys are definitely getting a new pro user what with how you didn't just stay quiet and leave us to look for news feeds with explanations.)
And just wanted to ask, were the recurring "nasty error"s (the error message's wording, not mine) that the site gave me when I was trying to update my site linked to this? I can't give you the exact error code, but it happened around nine or eight hours ago and seemed to cancel my last action. Repeating my actions enough got the update through unscathed.
The "nasty errors" are a quaint way of saying "oh noes, we are in trouble".
Images and custom CSS are still fragile. Something with the IP addresses. When we switched over to the backup server (which was kind of stupid because it never had enough RAM to handle normal traffic), we also reorganized the IP addresses and that is taking time to stabilize.
Portfolio
Ooh, I see now! Thanks for clearing it up.
And no pressure on my ultimately inconsequential side for the CSS and images; as long as the text loads up, I'm happy even without the eyecandy.
(I'm sure that you must hear it often, but I'm glad Wikidot is around and isn't going to be taken down with "just a flesh wound". Thank you and the rest of the staff for this. *hearts*)
I'd like to add my name to the list of those who really appreciated the very frequent status updates. There's nothing worse than having a broken service and being left in the dark. I'm glad things are back up and running. Performance seems pretty snappy right now.
Those !@#$% firewalls! I was beating my head against the wall a few weeks ago with a networking issue and it turned out to be the firewall - the last thing I looked at. I can't imagine how fun it is to try and manage the firewall for a large service like Wikidot.
-Ed
Community Admin
Definitely, when I finally realised there was a static page up, it already had quite a few updates on there… and it continued to be updated. A great help when people started bombarding me with questions about why the site was down! Thanks :)
~ Leiger - Wikidot Community Admin - Volunteer
Wikidot: Official Documentation | Wikidot Discord server | NEW: Wikiroo, backup tool (in development)
Yes, it was really good to see the very frequent updates coming through, and not just on wikidot.com but on all sites. A tough day for you all but an example of excellent communication.
Rob
Rob Elliott - Strathpeffer, Scotland - Wikidot first line support & community admin team.
Thanks, but I'm quite dissatisfied by our response. It took far too long to get a static web page up and to analyze the real cause of the problem (which was quite simply way too many people hitting the server).
Of course we're going to be changing the server architecture to be more scalable, but for me it's more important to get a fast communication channel in case of trouble. I think, when we get those "500" errors, they should always link to a static page held somewhere else. That already tells visitors "we know about it" if we do. And second, once we know the service is collapsing, we should just send all visitors to a static page first, whereupon they can click through to get the site they're trying to visit.
We also absolutely need to impose limits on how much individual sites can smash the server. I'm still not sure how a single site managed to do quite such damage. But since we took it down, everything is much faster, as you see.
We learned a lot today, though, and we hugely appreciate everyone's patience with this. I know how unpleasant it is to be without Wikidot. All day, I sat there looking at my email notifications from yesterday thinking, "if I read them, I won't be able to do anything anyhow… gaagh!!!"
Portfolio
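The two ideas in the comment above (a "500" that always links to a static page held elsewhere, and a mode that sends every visitor to that page first) could be sketched roughly as follows. This is only an illustration under assumptions: `STATUS_PAGE_URL`, `respond`, and the `degraded` flag are hypothetical names, and the click-through from the static page back to the requested site is not modelled.

```python
STATUS_PAGE_URL = "http://status.example.org/"  # hypothetical off-site static page


def respond(handler, request, degraded=False):
    """Return (status, headers, body) for one request.

    degraded=True sends every visitor to the static page first.
    Otherwise a failing handler becomes a 500 that links to that page.
    """
    if degraded:
        return 302, {"Location": STATUS_PAGE_URL}, ""
    try:
        return 200, {}, handler(request)
    except Exception:
        # Tell the visitor where to find status updates instead of a bare error.
        body = "500 Internal Server Error. Status updates: " + STATUS_PAGE_URL
        return 500, {}, body
```

Keeping the static page on a different host matters here: the fallback only helps if it stays reachable while the main service is collapsing.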
It might have something to do with the fact that every time I searched for "wikidot" on twitter…. literally 9/10 tweets had snowleopard.wikidot.com linked in them o.O
~ Leiger - Wikidot Community Admin - Volunteer
Agreed.
gerdami - Visit Handbook en Français - Rate this howto:import-simple-excel-tables-into-wikidot up!
Thanks for the info, it really calmed me down… I sent an email to a Google Wikidot mailing group, but it seems you don't visit it any more…
If slaughterhouses had glass walls, everyone would be vegan. - Paul McCartney
You should visit their twitter for updates when the wikidot servers are down :)
http://www.twitter.com/wikidot
~ Leiger - Wikidot Community Admin - Volunteer
Hi Pieter,
Dan here, Master Administrator of the snowleopard.wikidot.com site.
Sorry for the trouble we caused…it was completely unintentional as I'm sure you know. We are just very excited for the release of Snow Leopard.
I'm glad wikidot is back up and running for the most part, but I saw that our wiki is disabled. Since it is such a critical time (the new OS will be released the day after tomorrow), and it has proven to be an invaluable resource for Mac users, I am hoping it will go back up as soon as possible. If necessary, I would be more than happy to upgrade to a paid account. I was thinking we might also split the page up into 2 or 3 sections, to hopefully lighten the load on the servers.
Sorry again for the problems and good luck with the system upgrades.
I hope to hear from you soon.
-Dan
I'm no programmer, but it sounds like something entered a loop that couldn't be stopped. Or some very rare condition was met that triggered a chain reaction.
From what I read, what happened is simple. Too many people tried to access the same thing.
@wikidot my suggestion is to have a small program that redirects traffic if the CPU level is increasing rapidly.
Hi Dan,
You'll have seen the site is back up. We put on extra caching. What we need to add is live traffic analysis so we can catch this kind of thing next time. It's really not your fault: we all dream of making successful sites and you certainly hit the big time!
Thanks for using Wikidot.
-Pieter
Portfolio
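As a rough illustration of the "live traffic analysis" mentioned above, a sliding-window counter per site can flag any site whose request rate jumps past a threshold. This is a sketch under assumptions, not the analysis Wikidot built: `TrafficMonitor`, the window length, and the threshold are all hypothetical.

```python
from collections import defaultdict, deque


class TrafficMonitor:
    """Counts requests per site over a sliding time window (illustrative only)."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._hits = defaultdict(deque)  # site name -> request timestamps

    def record(self, site, now):
        """Record one request for the site at time `now` (in seconds)."""
        hits = self._hits[site]
        hits.append(now)
        self._expire(hits, now)

    def _expire(self, hits, now):
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()

    def hot_sites(self, now):
        """Return the sorted names of sites with more than `threshold` requests in the window."""
        hot = []
        for site, hits in self._hits.items():
            self._expire(hits, now)
            if len(hits) > self.threshold:
                hot.append(site)
        return sorted(hot)
```

Passing the clock in as `now` keeps the sketch easy to test; a real deployment would feed it timestamps from the web server's request log or request hooks.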
Pieter,
Since Google Analytics is installed by default on all websites, can you tell us how many hits this one got yesterday?
gerdami - Visit Handbook en Français - Rate this howto:import-simple-excel-tables-into-wikidot up!
Google Analytics tells me that the peak was on Tuesday, with over 41,000 visits, while on Wednesday, despite all the downtime, there were over 22,000 visits.
Hi,
Thanks for the information, good luck and calm down :-)
Cheers,
Angela
Thanks for busting your buns trying to get it all working again. We all appreciate what you do.
In case you're not aware of this:
Whenever I edit and save a page — or post a comment — it fails to update the page unless I hit reload manually.
~ Leiger - Wikidot Community Admin - Volunteer
That's strange, because I've been editing for a couple of hours now without any problems. Page reloads automatically.
So have I.
@leiger
what OS/browser?