Home > creepy

creepy

Creepy is a project mainly written in PYTHON and RUBY, it's free.

Python Web Crawler for CS453

= Creepy

== DESCRIPTION Creepy is a simple search engine. It will be developed in four stages: (i) web crawler; (ii) data cleanser; (iii) indexer; and (iv) query processor. Each of the stages are defined by components, some of which are accessible through the Creepy interface, some of which are standalone components.

== RUNNING $ creepy.py [options]

===Creepy Search Engine

Options: --version show program's version number and exit -h, --help show this help message and exit -v, --verbose Verbose Output [default: False] -d, --debug Debug Mode [default: False] -S STORAGE, --storage=STORAGE Data Storage location [default: ./storage/] -C, --clean_start Start clean by deleting existing storage location [default: False]

Web Crawler: Crawling the Web: Takes set of seed urls and start crawling the web and stores the pages in the storage location.

-c <seeds | seedfile>, --crawl=<seeds | seedfile>
                    Crawl the web
-T THRESHOLD, --page_threshold=THRESHOLD
                    Max. number of pages to crawl [default: 2500]
-N NUM_THREADS, --num_threads=NUM_THREADS
                    Number of threads to use [default: 1]

Data Cleanser: Data Preparation/Cleansing: Reads pages stored by the crawler and perform requested actions.

-s, --strip_tags    HTML Tag Stripping
-t, --tokenize      Tokenize the document

===XML Document Graph

$ rucksack.rb [options]

Options: -o, --output FILE Output resultant XML to FILE -i, --indent N Indent N spaces Default: 2

-a, --append DIR Append DIR to the pid_map path --delimiter DEL pid_map uses a delimiter DEL to separate URL and ID Default: "=>"

===XML Validation

$ validate.rb

== REQUIRES

  • Python 2.6.1
  • Ruby 1.8.7 or higher
    • Gem: hpricot
    • Gem: builder
    • Gem: libxml-ruby
  • libxml2

== LOCATION http://github.com/doggles/creepy

== AUTHORS

Travis Hall, [email protected]

Brittany Thompson, [email protected]

Bhadresh Patel, [email protected]