Home > hivemeta

hivemeta

Hivemeta is a project mainly written in Ruby, it's free.

Use the hadoop hive metadb to write map/reduce and easily query table info.

hivemeta

A ruby API for access to a Hive metastore running under MySQL.

Useful for querying columns in Hadoop map/reduce applications. Normally, a developer needs to handle both the splitting of incoming data and the assignment of numerically indexed fields to friendly variables like so:

fields = line.chomp.split /\t/ item_id = fields[0] inv_cnt = fields[7].to_i puts "#{item_id}\t#{inv_cnt}"

This is not overly traumatic, however it's susceptible to errors creeping in from file format changes. Ongoing maintenance can easily become a burden if there are many map/reduce programs reading the same changed data files. Code size increases as the column count increases.

With hivemeta, the process is streamlined. That same task is now:

row = inv_table.process_row line puts "#{row.item_id}\t#{row.inv_cnt.to_i}"

The row object automagically knows its column names and they can be referenced in one of the following ways (in order of best to worst performance and coolness):

row.col_name row[:col_name] row['col_name']

Also included is a demo application, hivemeta_query.rb, to spit out table information from the command-line via table name search or by the table's location in HDFS.


Installation

gem install hivemeta

or (for jruby)

jgem install hivemeta


API Usage

streaming map/reduce code snippet (abstracted processing loop):

require 'hivemeta'

h = HiveMeta::Connection.new # see below for detail

h.table('sample_inventory').process do |row| item_id = row.item_id # can access by method or [:sym] or ['str'] count = row.inv_cnt.to_i puts "#{item_id}\t#{count}" if count >= 1000 end

streaming map/reduce code snippet (normal STDIN processing loop):

require 'hivemeta'

h = HiveMeta::Connection.new # see below for detail inv_table = h.table 'sample_inventory'

STDIN.each_line do |line| begin row = inv_table.process_row line rescue HiveMeta::FieldCountError STDERR.puts "reporter:counter:bad_data:row_size,1" next end item_id = row.item_id # can access by method or [:sym] or ['str'] count = row.inv_cnt.to_i puts "#{item_id}\t#{count}" if count >= 1000 end

establishing a connection (in ruby code):

db_user = 'hive' db_pass = 'hivepasshere' db_host = 'localhost' db_name = 'hivemeta'

h = HiveMeta::Connection.new(db_name, db_host, db_user, db_pass)

establishing a connection (environment variables):

when no arguments are passed, the following env variables will be used:

hivemeta_db_host

hivemeta_db_name

hivemeta_db_user

hivemeta_db_pass

to set these in a streaming map/reduce job, use -D arguments like so:

-D hivemeta.db_host=mydbhost \

-D hivemeta.db_name=hivemeta \

-D hivemeta.db_user=hive \

-D hivemeta.db_pass=mydbpass \

the connection will made with those env variables without any other code

h = HiveMeta::Connection.new


hivemeta_query.rb Usage

query by table names

$ hivemeta_query.rb join_test_name join_test_name hdfs://namenode/tmp/join_test_name delimiter: "\t" (ASCII 9) 0 userid # userid 1 name # username

query by table name wildcards

$ hivemeta_query.rb join_test% join_test_address hdfs://namenode/tmp/join_test_address delimiter: "," (ASCII 44) 0 userid # uid 1 address 2 city 3 state

join_test_name hdfs://namenode/tmp/join_test_name delimiter: "\t" (ASCII 9) 0 userid # userid 1 name # username

list the tables using /tmp in HDFS

$ hivemeta_query -l /tmp join_test_address join_test_work my_test_table

view usage information

$ hivemeta_query.rb -h usage: ./hivemeta_query.rb [options] table_name|hdfs_path -h, --help -c, --comments # display comments along with field detail (default) -C, --no-comments # do not display comments with the field detail -l, --list-tables # list matching tables but no detail -f, --list-file-path # list the table HDFS file locations -w, --fit-width # fit the text to the width of the screen (default) -W, --no-fit-width # do not fit the text to the width of the screen -u, --db-user=arg # hive metastore db user (requires read access) -p, --db-pass=arg # hive metastore db password -H, --db-host=arg # host running the hive meta db (default: localhost) -d, --db-name=arg # hive meta db name (default: hivemeta)

Previous:bugtracker