Threaded-Crawler is a free project written mainly in Python.
Threaded Crawler
This web crawler is designed to be generic and highly configurable. It can quickly traverse sites and pull content based on regular expressions and other selection criteria.
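To illustrate the general idea of threaded traversal with regex-based selection, here is a minimal sketch using only the standard library. It is not this project's actual API: the `crawl` function, its parameters, and the in-memory `PAGES` table are all hypothetical and stand in for real HTTP fetching and the project's own configuration.

```python
import re
import threading
from queue import Queue


def crawl(start_url, fetch, link_re, content_re, num_threads=4):
    """Visit every page reachable from start_url and collect regex matches.

    fetch is a hypothetical hook: any callable mapping a URL to HTML text.
    """
    pending = Queue()
    seen = {start_url}
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = pending.get()
            try:
                if url is None:  # sentinel: no more work for this thread
                    return
                html = fetch(url)
                matches = content_re.findall(html)
                new_links = []
                with lock:
                    results[url] = matches
                    for link in link_re.findall(html):
                        if link not in seen:
                            seen.add(link)
                            new_links.append(link)
                # Queue new work before this item is marked done, so
                # pending.join() cannot return while links are outstanding.
                for link in new_links:
                    pending.put(link)
            except Exception:
                pass  # a page that fails to fetch is skipped in this sketch
            finally:
                pending.task_done()

    pending.put(start_url)
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    pending.join()          # wait until every queued URL has been processed
    for _ in threads:
        pending.put(None)   # one sentinel per worker
    for t in threads:
        t.join()
    return results


# In-memory stand-in for a small site (assumption, for illustration only).
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a> <h1>Home</h1>',
    "/a": '<a href="/">home</a> <h1>Page A</h1>',
    "/b": '<a href="/a">A</a> <h1>Page B</h1>',
}
link_re = re.compile(r'href="([^"]+)"')
title_re = re.compile(r"<h1>(.*?)</h1>")
found = crawl("/", lambda url: PAGES.get(url, ""), link_re, title_re)
```

Here `found` maps each visited URL to the list of `<h1>` titles matched on that page. A real crawler would swap the lambda for an HTTP fetch and would likely use an HTML parser rather than a regex to find links.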
Requirements
Uses BeautifulSoup to parse HTML pages (http://www.crummy.com/software/BeautifulSoup/)
Uses epydoc for documentation
Uses the JobSite common package
python-psycopg2 2.0.8
Development
The 'cmd' script can be used to clean and build docs. Documentation is in doc/API.
Install
python setup.py install
Running
Set the $COMMON environment variable to the path of the common/patterns.py library, or install the library on the default Python path.
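One way a script could honor the $COMMON setting is sketched below. This is an assumption about usage, not code from the project: the helper name `add_common_to_path` is hypothetical, and the sketch assumes $COMMON names a directory to be searched for the library rather than the patterns.py file itself.

```python
import os
import sys


def add_common_to_path(env=None, path=None):
    """Prepend the directory named by $COMMON to the module search path.

    Returns True if $COMMON was set, False if the default Python path
    is left to resolve the library.
    """
    env = os.environ if env is None else env
    path = sys.path if path is None else path
    common = env.get("COMMON")
    if not common:
        return False  # rely on the library being installed normally
    common = os.path.abspath(os.path.expanduser(common))
    if common not in path:
        path.insert(0, common)
    return True
```

Calling `add_common_to_path()` before importing the library leaves `sys.path` untouched when $COMMON is unset, which matches the fallback to an installed copy.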