Solr-feeder is a project mainly written in ..., it's free.
Little script to feed documents to Solr from a set of files.
= Solr Feeder
Little script to feed documents to Solr from a set of files.
== Example
require 'solr-feeder' require 'hpricot'
SolrFeeder(ARGV) do |filepath|
xml = open(filepath) { |f| Hpricot.XML(f) }
# id and url
id = xml.search('ARTICLE/ARTICLEURL').first.to_plain_text
add_field 'id', id
add_field 'text', xml.search('ARTICLE/BODY').first.to_plain_text
add_field 'title', xml.search('ARTICLE/HEADLINE').first.to_plain_text
end
Save this as articles.rb and call with these arguments: ruby articles.rb --dir /path/to/files --core test --port 8983 --host localhost
This will process all files from dir and send them to the 'test' core of the Solr instance running on localhost:8983.
Each file will be processed by the block in articles.rb. The files can be whatever you like since you control the parsing of those files.
A commit to Solr is sent when all files are processed. You can also choose to send a commit at a specified interval of documents.
Within the block, you have access to an +add_field+ method which allows to add a field to the document that will be sent to Solr. The first parameter to +add_field+ is the name of the field. The second one is the value for this field. For now, it must be a String.
There is also an +add_param+ method available to pass extra parameters to Solr.
If you wanna prevent a file to be sent to Solr, just don't make any call to the +add_field+ method in the block and that file will be skipped.
SolrFeeder(ARGV) do |filepath| next if filepath =~ /.pdf$/
# Processing for other files...
end
In the previous example, all files ending with .pdf will be skipped because no call to +add_field+ were made for them.
SolrFeeder can also handle tar.gz (and tgz) archives. You need to pass the --archives options and each archive encountered will be extracted and processed. Note that if you do not activate the --recursive switch while using --archives, only the files at the root of the archive will be processed.
== Command-line options [+dir+] Directory where the files to process are. Mandatory [+core+ CORE] Solr core to use. Mandatory [+host+ HOST] Solr host to connect to. Default to 'localhost' [+port+ PORT] Solr port on host. Default to '8983' [+recursive+] Process files in sub directories. Default off [+archives+] Process files in tar.gz and tgz archives. Default off [+max+ NUM] Max files to process from dir. Optional [+commit+ NUM] Send a commit at each nth file. Optional [+help+] Show usage information.
== Installation
Clone this repository and install the gem.
git clone [email protected]:pascaldimassimo/solr-feeder.git cd solr-feeder rake install
== Requirements