Object-scraper is a project mainly written in Ruby, it's free.
Recipe like object extraction from HTML sources
= Object Scraper
== Description
Object scraper is a thin wrapper for hpricot to enable receipt-like extraction of ruby objects from various web sites.
== Install
=== Gem
gem install object-scraper --source http://gemcutter.org
=== Rails
config.gem 'object-scraper', :source => 'http://gemcutter.org'
== Example
class Entry < Object attr_accessor :text, :date end
uri = "http://twitter.com/twitter" pattern = ".status"
Scraper.define(:twitter, :class => :entry, :source => uri, :node => pattern) do |s| s.text { |node| node.at(".entry-content").inner_html } s.date { |node| DateTime.parse(node.at(".timestamp")[:data][/'.*'/].delete("'")) } end
@objects = Scraper.parse(:twitter)
If you define multiple scrapers, you can collect all their objects with one simple method
@objects = Scraper.parse_all
== Advanced Example
It is possible to use other existing HTML parsers instead of hpricot. Just overwrite the according proc object.
require 'nokogiri' Scraper.scrape_source_with = Proc.new { |source| Nokogiri::HTML(source) }
Scraper.define(:twitter, :class => :entry, :source => uri, :node => pattern) do |s|
end
== Rails
All scraper definitions sitting in RAILS_ROOT/scrapers will be taken into account automatically when you use object-scraper as a gem in your rails project.
== Author