Charyb: a data crawler written in Ruby
h2. What is Charyb?
Charyb is a tool to help suck down data and shove it into a simple data warehouse.
There are two parts: the web interface and the crawler.
The web interface makes it easier to suck in data from different MIME types. Because data scraping isn't yet fully automated, there needs to be a little bit of human intervention.
The crawler makes use of the data humans entered to figure out how to scrape the data. It will keep checking those data sources for updates.
h2. Installation
If you haven't added the GitHub gem source yet, add it with:
sudo gem sources -a http://gems.github.com
sudo gem sources -a http://gems.rubyforge.org
Then install the following packages:
libopenssl-ruby
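On Debian or Ubuntu that would be (an assumption; use your distro's package manager otherwise):

sudo apt-get install libopenssl-ruby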
Then install the following gems:
sudo gem install -d sinatra -v 0.9.4
sudo gem install -d hpricot -v 0.8.1
sudo gem install -d sqlite3-ruby -v 1.2.5 # only for sqlite3
sudo gem install -d rack-test -v 0.4.2
sudo gem install -d webrat -v 0.5.3
sudo gem install -d thoughtbot-shoulda -v 2.10.2
sudo gem install -d mocha -v 0.9.8
sudo gem install -d notahat-machinist -v 1.0.3
sudo gem install -d faker -v 0.3.1
h2. Setup
You need to pull in the git submodules first:
git submodule init
git submodule update
If it tells you that you can't access it, ping the owner of the repo to add you as a collaborator.
Then you must set up the database.
rake db:schema:load
h2. To run it
To run the web interface:
rake web:run
And then go to http://localhost:4567
To run the crawler:
rake crawler:run
h2. Architecture
There are two major components: the web interface and the crawler. The web interface is there for the human in the loop to more easily say how to suck in data for a datasource. Right now, it only does HTML tables; eventually, it'll do CSV files as well. It stores this information in the SourceTracker, which is a wrapper around an SQLite database.
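Since the SourceTracker isn't finalized yet (see below), here's only a rough in-memory sketch of the idea; every name in it is a guess, and the real thing persists to SQLite:

# Rough sketch only: the real SourceTracker wraps an SQLite database,
# and these field names are guesses.
class SourceTrackerSketch
  Source = Struct.new(:name, :url, :mime_type, :table_selector)

  def initialize
    @sources = []
  end

  # The web interface records where a datasource lives and how to scrape it
  # (e.g. which HTML table on the page).
  def track(name, url, mime_type, table_selector)
    @sources << Source.new(name, url, mime_type, table_selector)
  end

  # The crawler walks these entries to know what to fetch and how.
  def each_source(&block)
    @sources.each(&block)
  end
end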
The crawler is run separately and uses the SourceTracker's information about where a datasource can be found and how to suck it in.
When the crawler has sucked in the data and cleaned it properly, it pushes it to the Redis data warehouse. Then teabag queries the data warehouse when it gets web requests from the outside world.
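To make that hand-off concrete, here's a rough sketch of the crawler's side of the push. It assumes the redis gem; the dataset name, dimensions, and values are all made up, but the "||" key scheme matches the CLI section at the bottom of this file:

require 'redis'

redis = Redis.new

redis.sadd 'datasets', 'unemployment_rate'            # register the dataset
redis.sadd 'unemployment_rate||dimensions', 'year'    # its dimensions...
redis.sadd 'unemployment_rate||dimensions', 'state'
redis.sadd 'unemployment_rate||year',  '2009'         # ...and their values
redis.sadd 'unemployment_rate||state', 'michigan'

# One cleaned cell, keyed by the dataset name and dimension values.
redis.set 'unemployment_rate||2009||michigan', '9.3'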
The web part starts with src/application.rb. It requires 'config/initialize' at the top, which sets all the global constants and paths.
The rest of that code is in Sinatra's DSL. I suggest reading http://www.sinatrarb.com/intro.html first, and then using http://www.sinatrarb.com/book.html as a reference.
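To give you a feel for it, a Sinatra route looks like this (the route and response here are just illustrative, not Charyb's actual routes):

require 'sinatra'

# Sinatra's DSL maps an HTTP verb and URL pattern straight to a block;
# the block's return value becomes the response body.
get '/hello/:name' do
  "Hello, #{params[:name]}!"
end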
The crawler starts with crawler.rb, but I haven't finished it yet. Right now, the source tracker isn't finalized either, so the web interface uses the ActiveRecord objects directly. ActiveRecord is an ORM:
http://api.rubyonrails.org/classes/ActiveRecord/Base.html
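If you haven't used it before, ActiveRecord maps database tables to Ruby classes. A minimal sketch, with a made-up Datasource model and a guessed database path:

require 'active_record'

# Connect to the project's SQLite database (the path here is a guess).
ActiveRecord::Base.establish_connection(
  :adapter  => 'sqlite3',
  :database => 'db/development.sqlite3'
)

# A "datasources" table maps to this class; its columns become attributes.
class Datasource < ActiveRecord::Base
end

ds = Datasource.find(1)                  # look a row up by primary key
ds.url = 'http://example.com/data.html'  # columns act as getters/setters
ds.save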
If you have specific questions, don't hesitate to ask, either by email or by phone.
h2. Import Table Use Case
h2. Edit Table Use Case
h2. CLI commands to look into Redis

In the following commands, all spaces are replaced by underscores. This is only for input; the result can contain spaces.
redis-cli smembers datasets                            # lists datasets by name
redis-cli smembers "[dsname]||dimensions"              # lists the dimensions of dataset dsname
redis-cli smembers "[dsname]||[dimname]"               # lists the values of the dimension dimname
redis-cli smembers "[dsname]||meta"                    # lists metadata for a dataset
redis-cli get "[dsname]||[dimvalue]||[nextdimvalue]"   # returns a value
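For example, for a dataset named "unemployment rate" (a made-up name), you'd type the spaces as underscores:

redis-cli smembers "unemployment_rate||dimensions"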