Commoncrawl is a project mainly written in ..., it's free.
CommonCrawl Hadoop Support Library
The start of the shared commoncrawl code repository.
Please set hadoop.version and hadoop.path in build.properties to point to your version of hadoop.
Once commoncrawl.jar has been built, you can execute a job/sample via the bin/launcher.sh script.
For example, to run the BasicArcFileReaderSample against the ARC file 2010/01/07/18/1262876244253_18.arc.gz in the main commoncrawl bucket, commoncrawl-crawl-002, you would run the following command line:
bin/launcher.sh org.commoncrawl.samples.BasicArcFileReaderSample <
The luancher runs the command in the background. You can monitor progress via either ./logs/<