Wikipedia's announcement that it will go dark might encourage others to protest in the same way, but be careful how you do it. There is a right way and a wrong way to black out a website.
First, the latest update on the progress of the proposed strike.
Despite the shelving of SOPA until a consensus is reached, and some presidential opposition to the whole idea, PIPA is still on the table. Most observers think that this is just a temporary truce and that the war will continue when the odds are better for a SOPA-like bill.
As a result, Wikipedia, with an estimated 25 million visitors per day, is going dark on the 18th. Wikipedia is by far the largest site to join in and, more importantly, it serves "normal" users and not just geeks and nerds like us. Interestingly, Twitter probably isn't joining in, as indicated by a tweet from its CEO Dick Costolo:
"That's just silly. Closing a global business in reaction to single-issue national politics is foolish."
Single-issue national politics that has a global effect might be a different matter. Perhaps Twitter, well known for its "fail whale" icon, fears that it might not be able to restart after a blackout, in the style of mainframes of the past.
For more information see the further reading at the end of this news item.
Political and social protest by taking down websites that provide a service to users who might otherwise be unaware of what is going on seems to be catching on. However, if you want to join in, it is important to do it right, otherwise the consequences can last a lot longer than the blackout period.
The problem is all to do with the way bots and spiders scan a site to build an index and keep it up-to-date. Google's Pierre Far thinks that this is so important that he has written a Google+ (where else?) post on the subject.
The advice also applies when you need to take a site down for a day or two for maintenance. If you simply put up a replacement page that is served with a normal 200 status, the site will be reindexed with that page regarded as new content. The best solution is to return a 503 HTTP status code for all URLs that are being blacked out. This signals to the bots that the page is not the real content and that the state is temporary. If you are using PHP this is just a matter of:
header('HTTP/1.1 503 Service Temporarily Unavailable');
header('Status: 503 Service Temporarily Unavailable');
header('Retry-After: 86400'); // delay in seconds (86400 = 24 hours); adjust to the length of your blackout
You can add this to any page that returns the temporary content, i.e. the protest page or the apology for the service being down. The header() calls must come before any output is sent.
If a lot of pages report the 503 status then Googlebot will reduce the frequency with which it scans the site. When you return to normal service the bot will eventually notice that you are back and continue where it left off.
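To make the Retry-After value track the time remaining rather than a fixed number, the headers can be computed from the planned end of the blackout. The following is a minimal sketch, not a definitive implementation: the function name blackoutHeaders and the end date are made up for illustration, and it assumes the script runs before any output is sent.

```php
<?php
// Hypothetical helper: builds the header lines for a blacked-out page.
// $now and $blackoutEnd are Unix timestamps (names are illustrative).
function blackoutHeaders(int $now, int $blackoutEnd): array
{
    // Seconds left until the blackout ends, never negative.
    $retryAfter = max(0, $blackoutEnd - $now);

    return [
        'HTTP/1.1 503 Service Temporarily Unavailable',
        'Status: 503 Service Temporarily Unavailable',
        'Retry-After: ' . $retryAfter,
    ];
}

// Assumption: a made-up end time; replace it with the real end of your blackout.
$blackoutEnd = strtotime('2030-01-19 00:00:00 UTC');

// Only send headers when running under a web server, not from the command line.
if (PHP_SAPI !== 'cli') {
    foreach (blackoutHeaders(time(), $blackoutEnd) as $line) {
        header($line);
    }
    echo 'This site is blacked out in protest. Please come back later.';
}
```

Keeping the header construction in a pure function makes it easy to check the values without a web server. Once the blackout is over the script should be removed so that pages return 200 again.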
Some small suggestions to make things even easier are:
Don't set robots.txt to return 503 unless you really want to stop the crawl bots from accessing your entire site until the robots.txt file returns 200 or 404 again. In other words, if you are only switching off part of the site, don't set robots.txt to return 503.
Don't add Disallow: / to robots.txt in an attempt to stop the crawl bot from accessing the site. It can cause long-term problems in getting the bot to re-crawl the site.
Check with site management tools, e.g. Google Webmasters, to make sure your site is being crawled after the blackout is over. Don't expect things to return to normal at once.
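For a partial blackout, the first suggestion above could be sketched as a small check at the top of a front controller. This is only an illustration under stated assumptions: the function shouldBlackOut, the constant BLACKOUT_PREFIXES and the path prefixes are all hypothetical, and it assumes every request passes through this script.

```php
<?php
// Hypothetical list of path prefixes to black out; adjust to your site.
const BLACKOUT_PREFIXES = ['/articles/', '/forum/'];

// Decide whether a request should get the 503 blackout response.
function shouldBlackOut(string $requestUri, array $prefixes): bool
{
    // Strip the query string so "/forum/?page=2" matches "/forum/".
    $path = parse_url($requestUri, PHP_URL_PATH);

    // robots.txt always keeps its normal status so crawling is not halted.
    if (!is_string($path) || $path === '/robots.txt') {
        return false;
    }

    foreach ($prefixes as $prefix) {
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return true;
        }
    }
    return false;
}

// Only act when running under a web server, not from the command line.
if (PHP_SAPI !== 'cli' && shouldBlackOut($_SERVER['REQUEST_URI'] ?? '/', BLACKOUT_PREFIXES)) {
    header('HTTP/1.1 503 Service Temporarily Unavailable');
    header('Retry-After: 3600'); // assumed delay of one hour
    echo 'This section is temporarily unavailable.';
    exit;
}
```

The explicit robots.txt exception means that even a prefix of "/" would not turn the robots.txt response into a 503.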
The final advice from Pierre Far is to keep it simple: don't change DNS or crawl frequency settings, for example. Also, don't use 302 redirects; 503 or any of the 5xx codes will do the job reliably.