ScraperWiki sets data free - Now with PDF support
Wednesday, 29 December 2010 10:36


 

ScraperWiki is a really great idea. Scraping is the technique of retrieving data from HTML pages. Data embedded in an HTML page is usually formatted for human consumption, which generally means it isn't in the best format for other applications to process. A scraper is a program designed to download the HTML page, extract the data and present it in a format another program can use - usually XML or JSON.


What is surprising is that there is so much data on the web which is only available embedded in HTML. Often a government department will make data available in a web page but then either not have the resources or the inclination to make it available for further processing - but scraping can deliver it in a usable form.

The problem with scraping is that HTML is not easy to process to extract data - it is often not regular enough and it sometimes even changes its form with web site updates. So what you need is an easy way to create a scraper and, after that, why not share the data it has retrieved for everyone to use? This is the idea of ScraperWiki.

It provides a number of online templates in PHP and Ruby to give you a head start on creating a scraper. The approach taken is to construct a DOM tree and then extract the data by navigating and manipulating the DOM. This really is the only sensible way to create a scraper and once you have seen an example it is fairly easy. For non-programmers there is a "request a scraper" facility where members of the Wiki will spend a few minutes building a custom scraper. You can also volunteer to fix a broken scraper or document an existing one. At the time of writing there are 58 suggested datasets needing scrapers.
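The DOM approach described above can be sketched in a few lines of Python. This is only an illustration, not one of ScraperWiki's templates: the page content, the table id `spending` and the function name `scrape_table` are all made up, and the example assumes markup that is well formed enough to parse as XML. Real pages are usually messier and need a more forgiving HTML parser.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical page content; a real scraper would download this over HTTP.
PAGE = """
<html>
  <body>
    <table id="spending">
      <tr><th>Department</th><th>Budget</th></tr>
      <tr><td>Health</td><td>120</td></tr>
      <tr><td>Transport</td><td>45</td></tr>
    </table>
  </body>
</html>
"""


def scrape_table(html_text, table_id):
    """Build a tree from the markup and return the table rows as dicts."""
    # Assumption: the markup is well-formed XML, which real HTML often is not.
    root = ET.fromstring(html_text.strip())

    # Navigate the tree to the table with the requested id.
    table = root.find(f".//table[@id='{table_id}']")
    if table is None:
        return []

    rows = table.findall("tr")
    if not rows:
        return []

    # The first row supplies the field names, the rest supply the values.
    headers = [(cell.text or "").strip() for cell in rows[0].findall("th")]
    records = []
    for row in rows[1:]:
        values = [(cell.text or "").strip() for cell in row.findall("td")]
        records.append(dict(zip(headers, values)))
    return records


records = scrape_table(PAGE, "spending")
print(json.dumps(records, indent=2))
```

The scraper never searches the raw text for patterns. It walks the tree to the table it wants and reads the cells, so cosmetic changes to the page are less likely to break it than they would a string-matching approach.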

The data obtained by ScraperWiki can be downloaded as a CSV file and shared with other users. The whole thing is open source and so are any scrapers you create. The idea is to free up data that is otherwise locked into HTML. Scrapers can be run on a schedule and you get an email if your scraper fails. There is also an API that allows clients to download from the datastore as JSON, YAML, XML, PHP objects or CSV.
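To see why offering several output formats is cheap once the data has been extracted, here is a minimal sketch that serialises the same records as CSV and as JSON. It uses only the Python standard library and does not call the ScraperWiki API; the records and the helper names `to_csv` and `to_json` are hypothetical.

```python
import csv
import io
import json

# Hypothetical records, standing in for rows a scraper has saved.
records = [
    {"Department": "Health", "Budget": "120"},
    {"Department": "Transport", "Budget": "45"},
]


def to_csv(rows):
    """Serialise a list of dicts as CSV text with a header row."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialise the same rows as a JSON array of objects."""
    return json.dumps(rows, indent=2)


csv_text = to_csv(records)
json_text = to_json(records)
print(csv_text)
print(json_text)
```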


The whole system has been up and running for about a year and is now in beta testing - although in common with many open source projects it may well stay in beta for longer than actually needed. It all seems to work perfectly well.

The latest feature is a PDF to HTML converter which opens up the possibility of PDF scraping. To quote the ScraperWiki blog:

Scraping PDFs is a bit like cleaning drains with your teeth. It’s slow, unpleasant, and you can’t help but feel you’re using the wrong tools for the job.

Once converted to HTML the same scraper tools can be used to extract the data from what is often called the largest component of the "dark web", i.e. data hidden from search by being within a PDF.

More information

http://scraperwiki.com/

http://blog.scraperwiki.com/

 


Last Updated ( Wednesday, 29 December 2010 13:15 )
 
 

   
Copyright © 2013 i-programmer.info. All Rights Reserved.