We use various techniques to monitor the health of Wikidot servers, ranging from our home-brewed scripts to external services like Pingdom. Pingdom is really great: their automated service alerts us every time there is a low-level problem with Wikidot, e.g. servers are overloaded and cannot handle connections, something crashes, or the datacenter has network problems.
I must say that February was a bit boring. The total downtime (service unavailability) was 2 minutes and 22 seconds, as reported by Pingdom, which gives us a really good 99.994% uptime!
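For the curious, the percentage can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is ours, and it assumes a 28-day February, which Pingdom's own report may or may not match exactly.

```python
def uptime_percent(downtime_seconds: float, period_seconds: float) -> float:
    """Share of the period during which the service was reachable, in percent."""
    return 100.0 * (1.0 - downtime_seconds / period_seconds)


# Assumption: a 28-day February.
february_seconds = 28 * 24 * 60 * 60
# 2 minutes and 22 seconds of downtime, as reported above.
downtime_seconds = 2 * 60 + 22

print(f"{uptime_percent(downtime_seconds, february_seconds):.3f}%")
```

With these inputs the result rounds to the 99.994% quoted above.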
Now, this starts to be boring…
Now I can tell you why. Over the last year we have been working hard to fix most server-related quirks, find workarounds for PHP bugs, and constantly tune the configuration of our servers. We also started replacing old servers with top-notch hardware, e.g. using super-fast SSD drives for our databases. And for sure, configuring servers to handle 5,000,000 hits every day is, as we have learned the hard way, not a piece of cake.
99.994% uptime makes our Team very happy and confident about our service. I think it is a superb result and I hope we will continue providing Wikidot services non-stop in the future.
I cannot resist adding that our hosting company, SoftLayer, is really superb. The reliability of Wikidot is in large part thanks to their remarkable service.
And by the way: we are working on high-availability for all our users and sites, not only those with Pro accounts.
Replacing the server handling uploaded files
In the next few days we will be making a big step forward by replacing the server directly handling file uploads. We are dumping the oldest server we have at SoftLayer, replacing it with a shiny, new, and also much faster one.
Although we have the whole procedure planned, there might be small unexpected problems with uploads or slower serving of uploaded files today and tomorrow. We will of course try to avoid them, but you know, just in case…
If you find out that something terribly wrong is happening to your files, please let us know, preferably as a comment to this blog post. We will keep you up-to-date too.
Migration updates
UPDATE: custom themes based on [[code]] extraction were not accessible yesterday for a few hours, starting from roughly 4 PM UTC, when we started the *.wdfiles.com migration.
This was because of a limitation of the PCRE (Perl Compatible Regular Expressions) library, which we use heavily, including when traversing wiki source to get the contents of code blocks.
The problem was reported at 9:40 PM UTC by RobElliott and was tracked down and solved about two hours later by Michał. We needed to recompile this library so that it uses the heap instead of the stack when handling recursive regular expressions (otherwise a stack overflow occurs). We had done this on every other server before, and missed only the new one.
Since 11:30 PM UTC custom themes have been working correctly, as confirmed by users.
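To give a feel for why "heap instead of stack" matters, here is a rough analogy in Python. It is not PCRE code and says nothing about how PCRE is built; both functions are hypothetical and only show the general idea that deep recursion exhausts the call stack, while an explicit work list kept on the heap does not.

```python
def depth_recursive(node):
    """Nesting depth of a list, using one call-stack frame per level."""
    if not isinstance(node, list):
        return 0
    return 1 + max((depth_recursive(child) for child in node), default=0)


def depth_iterative(node):
    """Same result, but pending work lives in an ordinary list on the heap."""
    deepest = 0
    stack = [(node, 0)]
    while stack:
        current, depth = stack.pop()
        if isinstance(current, list):
            deepest = max(deepest, depth + 1)
            stack.extend((child, depth + 1) for child in current)
    return deepest


# Build a very deeply nested structure, loosely comparable to a complex page.
deep = []
for _ in range(49_999):
    deep = [deep]

print(depth_iterative(deep))  # works regardless of nesting depth

try:
    depth_recursive(deep)
except RecursionError:
    print("the recursive version ran out of call stack")
```

The recursive version is fine for ordinary input and only fails on unusually deep input, which matches what we saw: only some very complex pages triggered the overflow.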
UPDATE2: We have noticed that you can't rename files and that backups of sites are not being made. This is going to be fixed when we finish the migration of *.wdfiles.com.
So far we have successfully copied all the user files and have them fully in sync. This includes the changes made during the migration. That was achieved using unionfs. Now we can "merge" the changes back into the original directory, and renaming (and thus backups) should be working again. Expect these features to be working by tomorrow.
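The "merge" step can be pictured roughly as copying everything that landed in the writable branch back over the original tree. The sketch below is only an illustration under our own assumptions: the function name and the directory layout are hypothetical, it is not the procedure we actually ran, and it deliberately ignores deletions (which a union file system records with special marker files).

```python
import shutil
from pathlib import Path


def merge_branch(changes_dir: Path, original_dir: Path) -> list[Path]:
    """Copy every file from the writable branch over the original tree.

    Returns the relative paths that were merged. Deleted files are not
    handled here; this only covers new and modified files.
    """
    merged = []
    for source in sorted(changes_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(changes_dir)
        target = original_dir / relative
        # Create missing parent directories before copying.
        target.parent.mkdir(parents=True, exist_ok=True)
        # copy2 keeps timestamps, which matters for files served over HTTP.
        shutil.copy2(source, target)
        merged.append(relative)
    return merged
```

Calling `merge_branch(Path("changes"), Path("original"))` with two hypothetical directories would return the list of files that were brought back into the original tree.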
We have started the migration. Some of you may have already encountered small problems with uploaded files, custom themes, HTML blocks and other things served from *.wdfiles.com due to DNS caching issues. So far, everything is going as expected.
Piotr Gabryjeluk
visit my blog
These stats are very very good, and not easy or cheap to achieve. Well done.
Rob Elliott - Strathpeffer, Scotland - Wikidot first line support & community admin team.
But if I still can't access my sites by tomorrow afternoon when I have a contractual deadline for a site design then I will be very very pissed off.
Rob Elliott - Strathpeffer, Scotland - Wikidot first line support & community admin team.
I am sorry for the problem, but could you tell me what you mean by "I cannot access my sites"?
One fix I have just made: I had to hand-recompile a few libraries because they were causing segmentation faults on the server. As a result, code extraction from wiki pages did not work on some pages. This could cause some themes to not display properly.
Is this what you have been experiencing? If so, it is fixed now.
Apart from that I cannot see any other problems. If there are other problems caused by migration, we will try to fix them ASAP.
Michał Frąckowiak @ Wikidot Inc.
Visit my blog at michalf.me
Uhm, perhaps there's an obvious answer, but why did you do a major change of the servers at the end of the week rather than, for example, Sunday evening when no-one is working? Or, at night or early in the morning? I've lost a couple of hours of access to my sites, during my working day. Next time, it would be better to announce such interruptions before they happen.
Portfolio
There's a proverb that covers this, I think… :-/
Portfolio
pieterh: Rather than rant and advise, you could explain what your problem is. We haven't announced that "you won't be able to work on your sites", because that's not what we expected to happen. You (and everyone) should be perfectly able to work on your sites. If that's not the case, you should report it.
If anyone has problems with their sites, please explain what kind of site it is (private, public, *.wikidot.com or custom domain), whether you are logged in to Wikidot, and what happens when you click edit.
It's better to report problems when they appear (so we can investigate and fix them) and not 7 hours later; otherwise we're not aware of users experiencing problems. For us, things were running nice and fast from the very beginning of the migration.
Piotr Gabryjeluk
I'd used the wrong word. I could of course access them, so dealing with text content was no problem, but there was only the base CSS; no other CSS was being applied and no images were being shown, so I couldn't do any work on themes.
But it's all back up and running this morning, so no harm done. There's never a good time to do these things, is there?
Rob Elliott - Strathpeffer, Scotland - Wikidot first line support & community admin team.
@Piotr, that was not a rant. This is a rant.
When you change IP addresses you know there will be DNS timeout issues. You know that things won't work for one, two, three hours, for random users. This is not mysterious, it is 100% predictable.
Yes, my sites were technically accessible. However I had complaints from users, and collaborators, who asked me why things were not working. I spent some time wondering if there was an SSL issue, caching issues, cookie issues… and then gave up and did other stuff. As far as my online business was concerned, Wikidot was down for a couple of hours.
Yes, it all started working again afterwards.
But there is no excuse for doing this kind of change during working hours. It's unprofessional and lazy. That is what 5am is for, so it all works by the time people wake up and start to use their Wikidot sites.
Now that's a rant. At your service, any time.
Portfolio
Rob, Pieter: Thank you for your explanations. In fact we have been tunneling requests from the old server to the new one, so DNS issues are clearly not the cause. I admit the tunnel only came up two or three minutes after we switched off the web server, but that was intentional: we needed to remount NFS file systems during that time. So technically speaking, "things won't work for one, two, three hours, for random users" is not true if the explanation is DNS issues.
So I assume the problem with custom themes was due to the PCRE library limitation, which Michał has solved by recompiling the library (and raising its limits). For the future, this roughly means we need to add one more test to our test suite for checking server condition.
I am really sorry for custom themes not being accessible for that time. However we would appreciate reports of such behaviors when they happen, so we can diagnose them instantly.
As a side note, we already "feel" the speed of the new *.wdfiles.com servers, even though the migration is not yet over. Are they notably faster for you as well?
Piotr Gabryjeluk
@Piotr, no harm done, and sorry for not reporting the issue immediately. Wikidot does seem to be more responsive, sites loading faster, etc. but that may be perception. Good work on the migration, anyhow. It's never simple… and I appreciate the way you try very hard to handle it without bringing the service down.
Portfolio
Pieter, really out of curiosity: is there a point in the day when there is not much traffic? I mean, you say during working hours… but doesn't night-time overlap with working time in America or Australia… Or IS there a really good moment to do things like this? I would really like to know at what point internet activity is at its lowest.
A - S I M P L E - P L A N by ARTiZEN a startingpoint for simple wikidot solutions.
@Pieter:
In which timezone?
I think that the Wikidot team should estimate the actual peak hours from the web stats and try to avoid them. And a weekend makes sense.
gerdami - Visit Handbook en Français - Rate this howto:import-simple-excel-tables-into-wikidot up!
@Steven, sorry for the slow reply. For some reason I've not been getting notifications from this site anymore, so did not see your comments.
Yes, most traffic comes from the USA (40% or so) and the rest mostly from Europe so it's easy to spot the relative calm zone, which is after 7pm or so Eastern time (1am CET), until 8am or so CET. Michal or Piotr can provide accurate figures. It is also lower on the weekend, so afaics the ideal time for maintenance is early on Sunday morning, giving time for issues to settle down before Monday am.
Portfolio
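The suggestion to read the quiet period off the web stats can be sketched in a few lines. This is only an illustration: the function is hypothetical, the sample numbers are invented and are not Wikidot traffic, and it assumes the stats are available as 24 hourly hit counts in a single time zone.

```python
def quietest_window(hourly_hits: list[int], window_hours: int) -> int:
    """Return the start hour of the contiguous window with the fewest hits.

    The day is treated as circular, so a window may wrap past midnight.
    When several windows tie, the earliest start hour wins.
    """
    hours = len(hourly_hits)
    best_start, best_total = 0, None
    for start in range(hours):
        total = sum(
            hourly_hits[(start + offset) % hours] for offset in range(window_hours)
        )
        if best_total is None or total < best_total:
            best_start, best_total = start, total
    return best_start


# Invented sample data: hits per hour, indexed 0-23 in one time zone.
sample_hits = [
    40, 30, 25, 20, 20, 25, 60, 120, 200, 260, 280, 290,
    300, 310, 320, 330, 320, 300, 260, 200, 150, 100, 70, 50,
]
start = quietest_window(sample_hits, window_hours=4)
print(f"quietest 4-hour window starts at hour {start}")
```

Running the same calculation separately on weekday and weekend data would show whether the weekend really is the calmer choice.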
In fact the whole process takes a few days, so there's no good "time of day" to do this. We could have chosen the weekend as a good "time of week" for this migration, but this would mean we wouldn't get live feedback from users, and the availability of team members would be significantly lower during those "family days". That's why we carefully planned the operation and started the first part of it during our working hours, so we could respond quickly to unexpected errors.
Piotr Gabryjeluk
Uhm, Piotr, not to be cynical or negative here, because you did a great job with this upgrade, but your explanation sounds like:
Which are perhaps the wrong message to be telling people. There should be an explicit test framework that hits Wikidot when there are such changes, pulling out hundreds of known sites from across the globe, and verifying that they work. Alternatively, a group of users who are warned and able to provide feedback before the general Wikidot usership wakes up.
I have no comment on the "family days" aspect, personally my work comes before my family, which my wife hates but which does put food on the table.
Portfolio
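A test pass over known sites, as proposed above, could look roughly like the following. It is a minimal sketch, not an existing Wikidot tool: the function names are hypothetical, the list of sites would have to be supplied by whoever runs it, and a real check would also need to look at page content (for example whether the theme CSS was applied), not just the HTTP status.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str) -> int:
    """Fetch a URL and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions; report the code instead.
        return error.code


def check_sites(urls: Iterable[str], fetch: Callable[[str], int] = http_status) -> dict[str, str]:
    """Return a mapping of URL to problem description for every failing site."""
    failures = {}
    for url in urls:
        try:
            status = fetch(url)
        except OSError as error:
            # Connection errors and timeouts from urllib derive from OSError.
            failures[url] = f"request failed: {error}"
            continue
        if status != 200:
            failures[url] = f"unexpected HTTP status {status}"
    return failures
```

Passing the fetch function as a parameter keeps the check testable without network access, and lets the same loop be reused with a stricter fetcher later.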
Pieter, more or less, you're right.
In fact all the time we rely on users to test changes we make as we can't test everything.
For features we test changes ourselves before pushing them live.
For big migrations we plan and test particular tools separately before using them live.
But when a feature or migration goes live, we rely on our users. They have pages so complex that they cause overflows in the libraries we use. (Note that the problem only appeared on some very specific sites that only power users could create. If we had started the migration at night, probably no one would have reported those errors before morning, so in fact the downtime for custom themes would have been even longer than it was.)
And it's true we don't want to work on weekends. I don't see anything bad in this.
Summing this up, I believe we've handled this very well. Maybe not in the best possible way (we should have remembered about the library recompilation), but after all the whole process has been very smooth.
Piotr Gabryjeluk
Actually I have just listened to an interview with Ryan Carson, founder of Carsonified. What struck me is that his idea of running a company is: we are so smart, we do not have to work 5 days a week. So they have ended up with a 4-day working week, which means each weekend is 3 days long for them.
Although the above sounds a bit crazy, they are quite a successful company :-)
Michał Frąckowiak @ Wikidot Inc.
I want to work for this company!!!! :D
If slaughterhouses had glass walls, everyone would be vegan. - Paul McCartney
It would be great to have a 4-day working week at Wikidot too. Perhaps some day we will master our efficiency skills so that we can make it happen. I bet we will be getting a lot of emails from job candidates then. :-D
Michał Frąckowiak @ Wikidot Inc.
I have some private sites for business documentation, and the users were always murmuring (correct word?) about the slow page download, sometimes 3-4 sec., and this was not the line…
Now it is faster, not only a "feeling"!
Service is my success. My webtips:www.blender.org (Open source), Wikidot-Handbook.
You can ask questions and contribute in the German-speaking » user community for Wikidot users or
in the German » Wikidot Handbook.
Nice to hear this from you. We are also working on refactoring the internals of authorization on private sites, so this will be even faster (and will solve problems for users with strange proxies and infinite redirect loops).
Piotr Gabryjeluk
Everything is unbelievably fast. Amazing job.
Great Job Guys! keep up the good work!
Not sure if this is related, but I can't get to this file anymore.
http://scmapdb.com/local--files/map:classic-arcade-first-person-shooter/0-classic%281%29.jpg
It's listed in the file manager, galleries, etc., but it can't be renamed or anything. It doesn't exist, even though it was never deleted.
Any ideas?
Thanks for reporting. The file was on the server under a slightly different name. I suspect someone tried to rename it, but the operation failed midway, leaving the file renamed but its database record unchanged. Please confirm it works now :-).
Piotr Gabryjeluk
Perfect!
Thanks for the quick fix Gabrys :)
Are you able to add in a few extra checks to make sure that if the file rename process doesn't complete successfully, all changes made are rolled back to their original state before the process began?
I've been taught that any transaction in a database should either result in a commit or a rollback. If a problem occurs half way through, it should always rollback everything that was done.
~ Leiger - Wikidot Community Admin - Volunteer
Wikidot: Official Documentation | Wikidot Discord server | NEW: Wikiroo, backup tool (in development)
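The commit-or-rollback idea from the comment above can be extended to cover the file system step as well. The sketch below is an assumption-laden illustration, not Wikidot's code: it uses SQLite and a hypothetical `files` table with a `name` column, and it undoes the file rename by hand if the database update fails, since a file system rename is not part of the database transaction.

```python
import os
import sqlite3


def rename_file(db: sqlite3.Connection, directory: str, old_name: str, new_name: str) -> None:
    """Rename a file on disk and in the database, or leave both untouched."""
    old_path = os.path.join(directory, old_name)
    new_path = os.path.join(directory, new_name)

    # Step 1: rename on disk. If this fails, nothing has changed yet.
    os.rename(old_path, new_path)
    try:
        # Step 2: update the record. "with db" commits on success and
        # rolls back the transaction if an exception is raised.
        with db:
            cursor = db.execute(
                "UPDATE files SET name = ? WHERE name = ?", (new_name, old_name)
            )
            if cursor.rowcount != 1:
                raise LookupError(f"no unique database record for {old_name!r}")
    except Exception:
        # Compensate: put the file back under its original name.
        os.rename(new_path, old_path)
        raise
```

This still leaves a small window if the process is killed between the two steps, so a periodic consistency check between the file list and the database would remain useful.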