= Creepy
Python Web Crawler for CS453. Creepy is free software, written mainly in Python and Ruby.
== DESCRIPTION

Creepy is a simple search engine. It will be developed in four stages: (i) web crawler; (ii) data cleanser; (iii) indexer; and (iv) query processor. Each stage is defined by components, some of which are accessible through the Creepy interface and some of which are standalone.
== RUNNING

  $ creepy.py [options]
=== Creepy Search Engine

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -v, --verbose         Verbose Output [default: False]
  -d, --debug           Debug Mode [default: False]
  -S STORAGE, --storage=STORAGE
                        Data Storage location [default: ./storage/]
  -C, --clean_start     Start clean by deleting existing storage location
                        [default: False]
Web Crawler: Takes a set of seed URLs, crawls the web starting from them, and stores the fetched pages in the storage location.
  -c <seeds | seedfile>, --crawl=<seeds | seedfile>
                        Crawl the web
  -T THRESHOLD, --page_threshold=THRESHOLD
                        Max. number of pages to crawl [default: 2500]
  -N NUM_THREADS, --num_threads=NUM_THREADS
                        Number of threads to use [default: 1]
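To illustrate what the crawl stage does, here is a minimal breadth-first crawler sketch in Python. The injectable `fetch` callable, the regex-based link extraction, and the function name are illustrative assumptions, not Creepy's actual implementation; the sketch only mirrors the seed-URL and page-threshold behaviour described above.

```python
from collections import deque
from urllib.parse import urljoin
import re

def crawl(seeds, fetch, page_threshold=2500):
    """Breadth-first crawl starting from `seeds`.

    `fetch` is a callable url -> HTML string (injected so the sketch
    stays testable offline).  Returns {url: html} for every page
    fetched, stopping once `page_threshold` pages are stored.
    """
    frontier = deque(seeds)
    pages = {}
    while frontier and len(pages) < page_threshold:
        url = frontier.popleft()
        if url in pages:
            continue  # already fetched
        html = fetch(url)
        pages[url] = html
        # Naive regex link extraction; a real crawler would use an HTML
        # parser and honour robots.txt.
        for link in re.findall(r'href="([^"]+)"', html):
            frontier.append(urljoin(url, link))
    return pages
```

Passing `site.__getitem__` over a dict of canned pages exercises the same logic without touching the network.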
Data Cleanser: Reads the pages stored by the crawler and performs the requested actions.
  -s, --strip_tags      HTML Tag Stripping
  -t, --tokenize        Tokenize the document
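The two cleansing actions can be sketched in plain Python. The names `strip_tags` and `tokenize` and the token pattern below are illustrative assumptions, not Creepy's code; they show one way the stripping and tokenizing steps might work.

```python
from html.parser import HTMLParser
import re

class TagStripper(HTMLParser):
    """Collects only the text content of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    # Remove markup, keeping just the page text.
    parser = TagStripper()
    parser.feed(html)
    return " ".join(parser.chunks)

def tokenize(text):
    # Lowercase word tokens; a real cleanser might also drop stopwords.
    return re.findall(r"[a-z0-9]+", text.lower())
```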
=== XML Document Graph

  $ rucksack.rb [options]
Options:
  -o, --output FILE     Output resultant XML to FILE
  -i, --indent N        Indent N spaces [default: 2]
  -a, --append DIR      Append DIR to the pid_map path
  --delimiter DEL       pid_map uses delimiter DEL to separate URL and ID
                        [default: "=>"]
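Assuming pid_map entries take the form `URL => ID` (only the `=>` delimiter is stated above; the rest of the format, and the helper name `parse_pid_map`, are guesses for illustration), a consumer of the map could parse it like this:

```python
def parse_pid_map(lines, delimiter="=>"):
    """Parse pid_map lines of the (assumed) form `URL <delimiter> ID`
    into a {url: id} dict; the default delimiter matches rucksack.rb's."""
    mapping = {}
    for line in lines:
        url, _, pid = line.strip().partition(delimiter)
        if pid:  # skip blank or malformed lines
            mapping[url.strip()] = pid.strip()
    return mapping
```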
=== XML Validation

  $ validate.rb
== REQUIRES
== LOCATION

http://github.com/doggles/creepy
== AUTHORS
Travis Hall, [email protected]
Brittany Thompson, [email protected]
Bhadresh Patel, [email protected]