
site_scrapper


Code to scrape the Planning Commission's NGO partnership pages. It uses Nokogiri and SelectorGadget (thanks to Ryan Bates).

Recently we came across the need to generate analytics about the Indian social sector, and to do so we needed an enormous amount of raw data. However, the Indian social sector is fragmented at best, and there was no efficient way to get the data. Without technology it would have taken us months, if not years, to manually collate the raw data and then run analytics on it. Even with technology, the few websites that had the data exposed no APIs or RSS feeds from which we could extract it. The only other viable (even though ugly) option was to extract data from various websites by running automated bots and then manually verify the validity and well-formedness of the data.

In fact, we have all come across websites that put reCAPTCHA on their registration pages to keep such web bots from registering. reCAPTCHA is a simple technology that helps differentiate between humans and bots.

Now, suddenly, we were on the other side of the paradigm, trying to run bots to extract data. The web bot would be given a link, ideally representing page 1 of a certain search result. The bot would traverse that page looking for links. For each link found, it would open the webpage, extract data from it, and store the data in the database. Finally, the bot would find the next pagination page and repeat the task.
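A rough sketch of that loop in Ruby might look like the following. The CSS selectors and the extract_record / save_to_database helpers are hypothetical placeholders for illustration, not part of the actual project.

    require 'open-uri'
    require 'nokogiri'

    # Sketch of the crawl loop described above: open a search-result page,
    # follow each result link, extract data, then move to the next page.
    def crawl(start_url)
      url = start_url
      while url
        page = Nokogiri::HTML(URI.open(url))              # fetch the search-result page
        page.css('a.result-link').each do |link|          # every result link on the page
          detail_url = URI.join(url, link['href']).to_s   # resolve relative links
          detail     = Nokogiri::HTML(URI.open(detail_url))
          save_to_database(extract_record(detail))        # pull the fields we need and persist them
        end
        next_link = page.at_css('a.next-page')            # pagination: find the "next" link
        url = next_link && URI.join(url, next_link['href']).to_s
      end
    end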

First, we got in touch with a friend for advice. He works for a leading financial news and data aggregator. He introduced us to tools like wget to fetch webpages and W3C's Tidy to clean up the HTML. He also spoke about parsing the page using XSLT, XPath, or regular expressions. Finally, he suggested storing a hash of each link in the database instead of the link itself, to optimize search queries. It was a great starting point.
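That last suggestion is easy to follow with Ruby's standard Digest library. Here is a minimal sketch; the URL and the visited_hashes set are made up for illustration.

    require 'digest'

    # Store a fixed-length digest of each URL rather than the raw link, so that
    # duplicate checks compare short, indexable strings instead of long URLs.
    url = 'http://example.com/ngo/1'        # placeholder URL
    key = Digest::SHA1.hexdigest(url)       # 40-character hex string

    # Hypothetical duplicate check, assuming visited_hashes holds the digests
    # already stored in the database:
    # crawl(url) unless visited_hashes.include?(key)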

However, Samhita's entire IT infrastructure is built on Ruby, so we needed Ruby equivalents of these tools.

Step 1 - Instead of wget, we used open-uri, Ruby's de facto standard library for fetching URLs.
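For example, fetching a page is a one-liner (the URL below is only a placeholder):

    require 'open-uri'

    # Fetch the raw HTML of a page.
    html = URI.open('http://example.com/search?page=1').read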

Step 2 - Identify the CSS selectors on the webpage you want to suck data from. Initially we started off using FireQuark (http://www.quarkruby.com/2007/9/5/firequark-quick-html-screen-scraping). FireQuark is built on top of Firebug (1.4.1) and Firefox 3.5, so you might need to downgrade your Firefox. We quickly realized there was a neater option: SelectorGadget (http://www.selectorgadget.com/) is not only browser independent but also gives cleaner, more unique CSS selectors than FireQuark, though playing around with SelectorGadget's UI can be painful.

Step 3 - Finally, to parse the webpages and suck out the data, we needed an HTML/XML parser with CSS selector support. The options were Scrapi and Nokogiri; Nokogiri was way faster and had a cleaner interface.
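Putting the pieces together, a minimal sketch of the fetch-and-parse step might look like this; the URL and the CSS selector are made up for illustration, with the selector standing in for one produced by SelectorGadget.

    require 'open-uri'
    require 'nokogiri'

    # Parse a fetched page and pull out text with a CSS selector.
    doc = Nokogiri::HTML(URI.open('http://example.com/ngo/123'))
    doc.css('table.details td').each do |cell|
      puts cell.text.strip
    end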

Using these technologies, you too can extract data from public webpages. A word of caution, though: read the website's copyright notice and terms and conditions to make sure you are allowed to extract data from it automatically.
