Home > solr-feeder

solr-feeder

Solr-feeder is a project mainly written in ..., it's free.

Little script to feed documents to Solr from a set of files.

= Solr Feeder

Little script to feed documents to Solr from a set of files.

== Example

require 'solr-feeder' require 'hpricot'

SolrFeeder(ARGV) do |filepath|

 xml = open(filepath) { |f| Hpricot.XML(f) }

 # id and url
 id = xml.search('ARTICLE/ARTICLEURL').first.to_plain_text
 add_field 'id', id

 add_field 'text', xml.search('ARTICLE/BODY').first.to_plain_text
 add_field 'title', xml.search('ARTICLE/HEADLINE').first.to_plain_text

end

Save this as articles.rb and call with these arguments: ruby articles.rb --dir /path/to/files --core test --port 8983 --host localhost

This will process all files from dir and send them to the 'test' core of the Solr instance running on localhost:8983.

Each file will be processed by the block in articles.rb. The files can be whatever you like since you control the parsing of those files.

A commit to Solr is sent when all files are processed. You can also choose to send a commit at a specified interval of documents.

Within the block, you have access to an +add_field+ method which allows to add a field to the document that will be sent to Solr. The first parameter to +add_field+ is the name of the field. The second one is the value for this field. For now, it must be a String.

There is also an +add_param+ method available to pass extra parameters to Solr.

If you wanna prevent a file to be sent to Solr, just don't make any call to the +add_field+ method in the block and that file will be skipped.

SolrFeeder(ARGV) do |filepath| next if filepath =~ /.pdf$/

  # Processing for other files...

end

In the previous example, all files ending with .pdf will be skipped because no call to +add_field+ were made for them.

SolrFeeder can also handle tar.gz (and tgz) archives. You need to pass the --archives options and each archive encountered will be extracted and processed. Note that if you do not activate the --recursive switch while using --archives, only the files at the root of the archive will be processed.

== Command-line options [+dir+] Directory where the files to process are. Mandatory [+core+ CORE] Solr core to use. Mandatory [+host+ HOST] Solr host to connect to. Default to 'localhost' [+port+ PORT] Solr port on host. Default to '8983' [+recursive+] Process files in sub directories. Default off [+archives+] Process files in tar.gz and tgz archives. Default off [+max+ NUM] Max files to process from dir. Optional [+commit+ NUM] Send a commit at each nth file. Optional [+help+] Show usage information.

== Installation

Clone this repository and install the gem.

git clone [email protected]:pascaldimassimo/solr-feeder.git cd solr-feeder rake install

== Requirements

  • rsolr
Previous:itunes-habit