slinky

Slinky, a high-performance web crawler / text analytics in Python, Redis, Hadoop, R, Gephi

Copyright (C) 2010, Paco Nathan. This work is licensed under
the BSD License. To view a copy of this license, visit:
http://creativecommons.org/licenses/BSD/
or send a letter to:
Creative Commons, 171 Second Street, Suite 300
San Francisco, California, 94105, USA

@author Paco Nathan [email protected]

Slinky provides an open source, high-performance Web Crawler, plus common Text Analytics, implemented in Python.

  • uses Redis key/value store for both CrawlQueue and PageStore (sketched after this list)
  • uses SQLite to persist crawled URI content
  • uses Neo4j to persist and analyze URI metadata
  • uses Hadoop, R, Gephi for Text Analytics and Link Analytics
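
As a rough sketch of how the CrawlQueue and PageStore can share one Redis instance, using the redis-py client listed under the worker-node installs. The key names and layout below are illustrative assumptions, not Slinky's actual schema:

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def enqueue_url(url):
    # CrawlQueue: a list acts as a FIFO of URIs awaiting fetch;
    # a companion set de-duplicates URIs already seen
    if r.sadd("crawl:seen", url):
        r.rpush("crawl:queue", url)

def store_page(url, status, body):
    # PageStore: one hash per URI holds fetch status and raw content
    # until the "persist" task drains it into SQLite
    r.hset("page:" + url, "status", status)
    r.hset("page:" + url, "body", body)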

This leverages a "Particle Cluster" design pattern. In contrast to MapReduce, a Particle Cluster is particularly well suited for combining highly reliable servers with low-cost, unreliable VMs. In other words, you can take advantage of CPU + memory + I/O on available but relatively ephemeral resources, which might get taken away without notice. For example, in AWS the key/value store could run on a large EC2 node, while the distributed tasks run on Spot Instances, based on pricing and availability. This pattern helps maximize throughput and reliability while minimizing the cost of scale-out for long-running jobs.
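
One way to make that concrete: ephemeral workers take tasks through the reliable Redis node, so a terminated Spot Instance loses at most its in-flight items, which can later be re-queued. A minimal sketch, with hypothetical key names not taken from Slinky:

import redis

r = redis.Redis(host="reliable-kv-node", port=6379, db=0)

def take_task(worker_id):
    # BRPOPLPUSH atomically moves a task onto a per-worker "in flight"
    # list; if the VM vanishes, a janitor job can push those items back
    return r.brpoplpush("crawl:queue", "inflight:" + worker_id, timeout=5)

def ack_task(worker_id, task):
    # drop the task from the in-flight list once it completes
    r.lrem("inflight:" + worker_id, 1, task)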

Required installs for worker nodes:

http://github.com/andymccurdy/redis-py
http://www.crummy.com/software/BeautifulSoup/
http://components.neo4j.org/neo4j.py/
http://jpype.sourceforge.net/
http://henry.precheur.org/python/rfc3339 (already included)
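
A quick way to confirm a worker node is ready (an assumption, not part of Slinky) is to check that each dependency above imports cleanly:

# sanity check for the worker-node installs; Slinky is a Python 2 era
# project, hence the BeautifulSoup 3 and neo4j.py style imports
import redis                              # redis-py client
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3.x parser
import jpype                              # JPype, required by neo4j.py
import neo4j                              # neo4j.py embedded bindings
print "worker-node dependencies OK"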

Additional required installs for server nodes:

http://code.google.com/p/redis/downloads
http://www.sqlite.org/download.html
http://neo4j.org/download/

Usage:

# initialize Redis; run on server node...

cd PATH_TO_REDIS
nohup ./redis-server &
# you probably want to configure it to perform "BGSAVE" periodically
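
For the BGSAVE note above, an illustrative redis.conf excerpt; the thresholds are assumptions to tune against your crawl rate, not Slinky defaults:

# BGSAVE if at least 1 key changed within 900 seconds,
# or at least 10000 keys changed within 300 seconds
save 900 1
save 300 10000
dir /var/lib/redis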

# edit "config.tsv" for your settings...
# e.g., Slinky handles ~100 crawler threads/node, but not in the default config
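
The entries below only show the tab-separated key/value shape implied by the "config" command; the option names are hypothetical placeholders, so check the config.tsv shipped in the repo for the real keys:

# hypothetical config.tsv rows (columns separated by tabs)
num_threads	100
request_timeout	10
polite_delay	2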

# initialize CrawlQueue and PageStore; run from any node...
./src/slinky.py redis_host:port:db flush
./src/slinky.py redis_host:port:db config < config.tsv
./src/slinky.py redis_host:port:db whitelist < whitelist.tsv
./src/slinky.py redis_host:port:db seed < urls.tsv
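
Both input files are one entry per line; the rows below are invented examples, not taken from the repo:

# example whitelist.tsv: domains the crawler may fetch
example.com
example.org

# example urls.tsv: seed URIs to prime the CrawlQueue
http://example.com/
http://example.org/index.html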

# perform a crawl; run this on each worker node...
nohup ./src/slinky.py redis_host:port:db perform &
# will poll/sleep indefinitely; use "kill -9 PID" to terminate
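
Roughly what each "perform" worker does, per the poll/sleep note above; a Python 2 sketch with assumed key names, not the actual source:

import time
import urllib2
import redis
from BeautifulSoup import BeautifulSoup

r = redis.Redis(host="redis_host", port=6379, db=0)

while True:
    uri = r.lpop("crawl:queue")
    if uri is None:
        time.sleep(5)                   # poll/sleep indefinitely
        continue
    html = urllib2.urlopen(uri).read()
    # naive link extraction; the real crawler also applies the whitelist
    for a in BeautifulSoup(html).findAll("a", href=True):
        r.rpush("crawl:queue", a["href"])
    r.hset("page:" + uri, "body", html)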

# persist the crawled URI content; run from any reliable node with attached storage...
nohup ./src/slinky.py redis_host:port:db persist &
# will poll/sleep indefinitely; use "kill -2 PID" to close with no data loss
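
One way to honor "kill -2 PID" with no data loss (an assumed shape for the persist loop, with SQLite as the sink per the design notes above): catch SIGINT, finish the current batch, then exit.

import signal
import sqlite3
import time
import redis

shutting_down = False

def on_sigint(signum, frame):
    global shutting_down
    shutting_down = True      # finish in-flight work, then exit cleanly

signal.signal(signal.SIGINT, on_sigint)

r = redis.Redis(host="redis_host", port=6379, db=0)
db = sqlite3.connect("pages.db")
db.execute("CREATE TABLE IF NOT EXISTS page (uri TEXT PRIMARY KEY, body TEXT)")

while not shutting_down:
    uri = r.spop("page:done")           # assumed set of completed URIs
    if uri is None:
        time.sleep(5)
        continue
    body = r.hget("page:" + uri, "body")
    db.execute("INSERT OR REPLACE INTO page VALUES (?, ?)", (uri, body))
    db.commit()
db.close()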

# analyze the crawled URI metadata; run from any reliable node with attached storage...
nohup ./src/slinky.py redis_host:port:db analyze &
# will poll/sleep indefinitely; use "kill -2 PID" to close with no data loss
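
Downstream, link metadata can be exported as a simple edge list that Gephi imports directly and that Hadoop streaming or R can consume; a sketch assuming a "links:<uri>" key layout holding sets of out-links:

import redis

r = redis.Redis(host="redis_host", port=6379, db=0)

# write one source<TAB>target row per out-link;
# KEYS is fine for an offline export, avoid it on a live hot path
with open("edges.tsv", "w") as out:
    for key in r.keys("links:*"):
        src = key.split(":", 1)[1]
        for dst in r.smembers(key):
            out.write("%s\t%s\n" % (src, dst))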