Hivemeta is a project mainly written in Ruby, it's free.
Use the hadoop hive metadb to write map/reduce and easily query table info.
hivemeta
A ruby API for access to a Hive metastore running under MySQL.
Useful for querying columns in Hadoop map/reduce applications. Normally, a developer needs to handle both the splitting of incoming data and the assignment of numerically indexed fields to friendly variables like so:
fields = line.chomp.split /\t/ item_id = fields[0] inv_cnt = fields[7].to_i puts "#{item_id}\t#{inv_cnt}"
This is not overly traumatic, however it's susceptible to errors creeping in from file format changes. Ongoing maintenance can easily become a burden if there are many map/reduce programs reading the same changed data files. Code size increases as the column count increases.
With hivemeta, the process is streamlined. That same task is now:
row = inv_table.process_row line puts "#{row.item_id}\t#{row.inv_cnt.to_i}"
The row object automagically knows its column names and they can be referenced in one of the following ways (in order of best to worst performance and coolness):
row.col_name row[:col_name] row['col_name']
Also included is a demo application, hivemeta_query.rb, to spit out table information from the command-line via table name search or by the table's location in HDFS.
Installation
gem install hivemeta
or (for jruby)
jgem install hivemeta
API Usage
streaming map/reduce code snippet (abstracted processing loop):
require 'hivemeta'
h = HiveMeta::Connection.new # see below for detail
h.table('sample_inventory').process do |row| item_id = row.item_id # can access by method or [:sym] or ['str'] count = row.inv_cnt.to_i puts "#{item_id}\t#{count}" if count >= 1000 end
streaming map/reduce code snippet (normal STDIN processing loop):
require 'hivemeta'
h = HiveMeta::Connection.new # see below for detail inv_table = h.table 'sample_inventory'
STDIN.each_line do |line| begin row = inv_table.process_row line rescue HiveMeta::FieldCountError STDERR.puts "reporter:counter:bad_data:row_size,1" next end item_id = row.item_id # can access by method or [:sym] or ['str'] count = row.inv_cnt.to_i puts "#{item_id}\t#{count}" if count >= 1000 end
establishing a connection (in ruby code):
db_user = 'hive' db_pass = 'hivepasshere' db_host = 'localhost' db_name = 'hivemeta'
h = HiveMeta::Connection.new(db_name, db_host, db_user, db_pass)
establishing a connection (environment variables):
h = HiveMeta::Connection.new
hivemeta_query.rb Usage
$ hivemeta_query.rb join_test_name join_test_name hdfs://namenode/tmp/join_test_name delimiter: "\t" (ASCII 9) 0 userid # userid 1 name # username
$ hivemeta_query.rb join_test% join_test_address hdfs://namenode/tmp/join_test_address delimiter: "," (ASCII 44) 0 userid # uid 1 address 2 city 3 state
join_test_name hdfs://namenode/tmp/join_test_name delimiter: "\t" (ASCII 9) 0 userid # userid 1 name # username
$ hivemeta_query -l /tmp join_test_address join_test_work my_test_table
$ hivemeta_query.rb -h usage: ./hivemeta_query.rb [options] table_name|hdfs_path -h, --help -c, --comments # display comments along with field detail (default) -C, --no-comments # do not display comments with the field detail -l, --list-tables # list matching tables but no detail -f, --list-file-path # list the table HDFS file locations -w, --fit-width # fit the text to the width of the screen (default) -W, --no-fit-width # do not fit the text to the width of the screen -u, --db-user=arg # hive metastore db user (requires read access) -p, --db-pass=arg # hive metastore db password -H, --db-host=arg # host running the hive meta db (default: localhost) -d, --db-name=arg # hive meta db name (default: hivemeta)