CommonCrawl Hadoop Support Library

This is the start of the shared CommonCrawl code repository.

Before building, set hadoop.version and hadoop.path in build.properties to point to your version of Hadoop.
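
As a rough sketch of that step, the helper below sets a property in place without disturbing other entries. It assumes build.properties uses plain key=value lines with no spaces around the equals sign. The function name set_prop and the example values in the comments are hypothetical and not part of this repository; substitute your own Hadoop version and install location.

```shell
# Hypothetical helper: set key=value in a properties file,
# replacing an existing entry for that key or appending a new one.
set_prop() {
  file=$1; key=$2; value=$3
  touch "$file"
  # Keep every line that does not begin with "key=" (literal match, not a regex).
  awk -v k="$key" 'index($0, k "=") != 1' "$file" > "$file.tmp"
  printf '%s=%s\n' "$key" "$value" >> "$file.tmp"
  mv "$file.tmp" "$file"
}

# Example usage (placeholder values, adjust to your installation):
#   set_prop build.properties hadoop.version 0.20.2
#   set_prop build.properties hadoop.path /usr/local/hadoop
```

Editing build.properties by hand works just as well; the helper is only convenient when the build is scripted.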

Once commoncrawl.jar has been built, you can execute a job/sample via the bin/launcher.sh script.

For example, to run the BasicArcFileReaderSample against the ARC file 2010/01/07/18/1262876244253_18.arc.gz in the main commoncrawl bucket, commoncrawl-crawl-002, you would run the following command line:

bin/launcher.sh org.commoncrawl.samples.BasicArcFileReaderSample <> <> commoncrawl-crawl-002 2010/01/07/18/1262876244253_18.arc.gz

The launcher runs the command in the background. You can monitor progress via either ./logs/<>.log for LOG output or ./logs/<>_run.log for stdout.
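
Because the job runs in the background, it can be handy to follow the newest log as it is written. The sketch below is one way to do that, assuming it is run from the directory that contains ./logs. The function name latest_log is hypothetical and not part of the launcher.

```shell
# Hypothetical helper: print the most recently modified file in ./logs
# whose name ends with the given suffix (prints nothing if none match).
latest_log() {
  ls -t ./logs/*"$1" 2>/dev/null | head -n 1
}

# Example usage: follow the newest stdout log (Ctrl-C to stop):
#   tail -f "$(latest_log _run.log)"
```

Note that the suffix .log also matches the _run.log files, so pass _run.log when you specifically want the stdout log.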