
Publico-Newspaper-Crawler-for-Portuguese-Corpus-Creation

Publico-Newspaper-Crawler-for-Portuguese-Corpus-Creation is a free project written mainly in Ruby.

A crawler that visits the Portuguese newspaper Publico (publico.pt), collects every news link it can find, and follows the links on those news pages to reach more articles, extracting their text along the way to build a Portuguese corpus.

h1. Publico Newspaper Crawler for Portuguese corpus creation

This Ruby script crawls the "Portuguese Publico newspaper":http://publico.pt starting from the front page. It extracts the text of every news article it finds there and follows the links in those articles to reach more news.

h2. Dependencies/Requirements

  • Hpricot
  • Curb
  • Ruby 1.9.2 (the version the script was developed on; it should also work on other versions)

h2. Technical Details

The script uses a queue to hold the news links still to be processed and a set to record the ones already processed.
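The queue-and-set bookkeeping can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the names @pending@, @seen@ and @enqueue@ are hypothetical, and the URLs are placeholders.

```ruby
require 'set'

# Hypothetical names: `pending` holds links still to fetch,
# `seen` holds links that were already queued or processed.
pending = Queue.new
seen    = Set.new

# Queue a link only if it has not been seen before.
# Set#add? returns nil when the element is already in the set.
def enqueue(link, pending, seen)
  return false unless seen.add?(link)

  pending << link
  true
end

enqueue('http://publico.pt/a', pending, seen)
enqueue('http://publico.pt/b', pending, seen)
enqueue('http://publico.pt/a', pending, seen) # duplicate, ignored
```

Note that @Set@ is not thread-safe on its own, so a multi-threaded crawler would need to guard it, for example with a @Mutex@.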

Fifteen threads process the news links. Each thread opens an HTTP connection to a link, extracts the news text (removing any HTML tags inside it), and collects further news links to add to the queue.
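A pool of worker threads draining a shared queue might look like the sketch below. It is an assumption-laden illustration rather than the script's real implementation: @crawl@ and @fetch@ are hypothetical names, and @fetch@ stands in for the HTTP request and HTML parsing so that the example runs offline. The @:done@ sentinel used to stop the workers is also a choice made for this sketch.

```ruby
require 'set'

WORKER_COUNT = 15 # the thread count mentioned in this README

# `fetch` is a hypothetical callable: it takes a link and returns
# [text, links]. A real one would download and parse the page.
def crawl(start_links, fetch, worker_count = WORKER_COUNT)
  pending     = Queue.new
  seen        = Set.new
  texts       = []
  lock        = Mutex.new
  outstanding = 0 # links queued or in progress

  # Queue a link once; call only while holding `lock`.
  enqueue = lambda do |link|
    if seen.add?(link)
      outstanding += 1
      pending << link
    end
  end

  lock.synchronize { start_links.each { |l| enqueue.call(l) } }
  return texts if outstanding.zero?

  workers = Array.new(worker_count) do
    Thread.new do
      while (link = pending.pop) != :done
        # Treat a failed fetch as a page with no text and no links.
        text, links = begin
          fetch.call(link)
        rescue StandardError
          [nil, []]
        end
        lock.synchronize do
          texts << text if text
          links.each { |l| enqueue.call(l) }
          outstanding -= 1
          # Nothing queued and nothing in progress: stop every worker.
          worker_count.times { pending << :done } if outstanding.zero?
        end
      end
    end
  end
  workers.each(&:join)
  texts
end

# Usage with a tiny in-memory "site" instead of real HTTP requests.
SITE = {
  'a' => ['text a', %w[b c]],
  'b' => ['text b', %w[a c]],
  'c' => ['text c', []]
}.freeze
texts = crawl(['a'], ->(link) { SITE.fetch(link) })
```

Each link is fetched at most once because the set is checked under the lock before a link enters the queue. The order of @texts@ depends on thread scheduling.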

If the script ever finishes processing every link in the queue, it serializes the current set of processed links, so that the next run only processes new links.
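Persisting the set between runs could be done along these lines. The README does not say which serialization format the script uses; this sketch assumes Ruby's built-in @Marshal@, and the helper names and the file name @seen_links.bin@ are hypothetical.

```ruby
require 'set'
require 'tmpdir'

# Load the set of processed links, or start empty on the first run.
# Marshal.load should only be used on files you trust.
def load_seen(path)
  return Set.new unless File.exist?(path)

  File.open(path, 'rb') { |f| Marshal.load(f) }
end

# Write the set of processed links to disk.
def save_seen(seen, path)
  File.open(path, 'wb') { |f| Marshal.dump(seen, f) }
end

# Usage: a temporary directory stands in for the script's working directory.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'seen_links.bin') # hypothetical file name
  seen = load_seen(path)                  # empty set on the first run
  seen << 'http://publico.pt/a'           # placeholder URL
  save_seen(seen, path)
  restored = load_seen(path)              # what a later run would see
  puts restored.include?('http://publico.pt/a')
end
```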

h2. Corpus details

News articles are added to the corpus text file exactly as they appear on the news page, minus any HTML tags. The first line of a paragraph may carry extra leading spaces in the corpus file.
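If the stray leading spaces matter for your use of the corpus, a small clean-up pass is one option. The sketch below is not part of the script: @clean_paragraph@ is a hypothetical helper, and the regular expression is a simple stand-in for tag removal, not the parser-based extraction the script performs.

```ruby
# Remove anything that looks like an HTML tag, then trim whitespace
# on every line and drop lines that end up empty.
def clean_paragraph(html)
  html.gsub(/<[^>]+>/, '')
      .lines
      .map(&:strip)
      .reject(&:empty?)
      .join("\n")
end

cleaned = clean_paragraph("   <p>Primeira <b>linha</b></p>\n  Segunda linha")
# cleaned == "Primeira linha\nSegunda linha"
```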

h2. Running

Keep in mind that the script may effectively never finish, since every processed news page can add more links to the queue.

It collects roughly 1 MB of text per minute.


I hope this script is useful. Feel free to fork it.