Home > charyb

charyb

Charyb is a project mainly written in JAVASCRIPT and RUBY, it's free.

a crawler for data in ruby

h2. What is Charyb?

Charyb is a tool to help suck down data and shove them into a simple datawarehouse.

There are two parts, the web interface and the crawler.

The web interface makes it easier to suck in data from different mime types.
Because data scraping isn't yet fully automated, there needs to be a little bit of human intervention.

The crawler makes use of the data humans entered to figure out how to scrape the data. It will keep checking those data sources as updates.

h2. Installation

if you haven't added the github source, do it by doing:

sudo gem source -a http://gems.github.com sudo gem source -a http://gems.rubyforge.org

The install the following packages:

libopenssl-ruby

Then install the following gems:

for web application

sudo gem install -d sinatra -v 0.9.4 sudo gem install -d hpricot -v 0.8.1

sudo gem install -d couchrest -v 0.33

sudo gem install -d sqlite3-ruby -v 1.2.5 # only for sqlite3

for testing

sudo gem install -d rack-test -v 0.4.2 sudo gem install -d webrat -v 0.5.3 sudo gem install -d thoughtbot-shoulda -v 2.10.2

for mocks and stubs

sudo gem install -d mocha -v 0.9.8

for fixtures

sudo gem install -d notahat-machinist -v 1.0.3 sudo gem install -d faker -v 0.3.1

h2. Setup

You need to pull in the git submodules first

git submodule init git submodule update

If it tells you that you can access it, ping the owner of the repo to add you as a collaborator

Then you must set up the database.

rake db:schema:load

h2. To run it

To run the web interface:

rake web:run

And then go to http://localhost:4567

To run the crawler:

rake crawler:run

h2. Architecture

There are two major components: The web interface and the crawler. The web interface is used to for the human in the loop to more easily say how to suck in data for a datasource. Right now, it only does html tables. Eventually, it'll also do CSV files as well. It stores this information in the SourceTracker, which is a wrapper around an SQLite database.

The crawler is run separately and uses the SourceTracker's information about where a datasource can be found and how to suck it in.

When the crawler sucks in the data and cleaned it properly, it then pushes it to the redis datawarehouse. Then teabag queries the datawarehouse when it gets web requests from the outside world.

The web part starts with src/application.rb it requires 'config/initialize' at the top, which sets all the global constants and paths. you can find it in config/initialize.

Then the rest of that code is in sinatra's DSL. I suggest reading http://www.sinatrarb.com/intro.html first, and then using http://www.sinatrarb.com/book.html as a reference.

The crawler starts with crawler.rb, but I haven't finished it yet. Right now, the source tracker isn't finalized yet either, so the web interface directly uses the activerecord objects. activerecord is an ORM object.
http://api.rubyonrails.org/classes/ActiveRecord/Base.html

If you have specific questions, don't hesitate to ask either email or by phone.

h2. Import Table Use Case

  1. user starts the web interface (rake web:run) from a console.
  2. webserver starts, messages scroll by in the console.
  3. user opens a browser to http://localhost:4567
  4. browser opens showing the charyb front page.
  5. user enters the url of a page containing an html table into the textfield and clicks "submit."
  6. browser shows requested page in left div and list of tables imported from that page (none) and a button to add a new table in the right div.
  7. user clicks "Add a table."
  8. a form is displayed in the right div.
  9. user fills out the fields from "Table heading" to "Converter." Some fields have special formatting rules:
    • "col heading" or "row heading" can be "Category" if no word describes the collection of headings. Most of the time "row heading" will be "Year" or "State."
    • "other dims" should be a csv list of "dimension name, dimension value" pairs. For example, "Year, 2006"
    • "default" should match the value of the "row header" "col header" or one of the "other dims."
    • "published at" should be a date, eg. "01/01/2009."
    • "units" can be a single value or can be a csv list for each column.
    • "converter" can be blank, or a single value, or a csv list for each column. values must match a converter name. (not implemented yet)
  10. user clicks a cell at one corner of the column headers in the table on the left div.
  11. the cell is colored red.
  12. user shift-clicks a cell at the opposite corner of the column headers.
  13. the column headers are all colored red.
  14. the user clicks the "set" button next to "Col labels."
  15. the column headers are all colored green, their values are listed in the "Col labels" textarea, one per line.
  16. the user does the same for row headers and data.
  17. row headers are colored brown; data cells are colored purple. if there are n row headers and m col headers there must be nxm data values.
  18. user clicks the "Create" button.
  19. right div removes the form and lists the table as a link.

h2. Edit Table Use Case

  1. user starts the web interface (rake web:run) from a console.
  2. webserver starts, messages scroll by in the console.
  3. user opens a browser to http://localhost:4567
  4. browser opens showing the charyb front page.
  5. user clicks on a datasource title link.
  6. browser shows requested page in left div and list of tables imported from that page with edit buttons, and a button to add a new table in the right div.
  7. user clicks edit.
  8. browser shows partially completed form used to import the table.
  9. user clicks the "Finish Loading" button.
  10. browser selects cells in the table in the right div, coloring the table as it was when it was imported, and fills in the textareas with selected values in the form.
  11. user makes changes and clicks the "Update" button.
  12. right div removes the form and lists the table as a link.

h2. cli commands to look into redis In the following commands, all spaces are replaced by underscores. This is only for input. The result can contain spaces.

redis-cli smembers datasets lists datasets by name

redis-cli smembers "[dsname]||dimensions" lists the dimensions of dataset dsname.

redis-cli smembers "[dsname]||[dimname]" lists the values of the dimension dimname.

redis-cli smembers "[dsname]||meta" lists metadata for a dataset.

redis-cli get "[dsname]||[dimvalue]||[nextdimvalue]" returns a value.