Webcheck-ruby is a project mainly written in Ruby, based on the MIT license.
A website crawler / validator, designed to be integrated into an automated website integrity test. Inspired by the Python "Webcheck" project (http://arthurdejong.org/webcheck/)
= Webcheck-ruby
Webcheck-ruby is a simple web site crawler / checker, inspired by the Python program Webcheck[http://arthurdejong.org/webcheck/], currently maintained by Arthur de Jong
== Features
== Usage
To crawl a web page:
require 'webcheck' checker = Webcheck.new results = checker.check("http://example.com") print results[:uris404] print results[:urisUnknown] checker.prettyPrint(results)
The default format of the results is a hash with the following keys
[uris404] Links that returned a 404 error [uris200] Links that were followed successfully [urisUnknown] Links that returned an unknown HTTP code. These are in the form of hashes with the keys :code,:uri [checked] A hash of all the URIs checked. The keys are the URIs [urisNonHTTP] Links on the page that led to resources not accessed via the HTTP protocol (such as e-mail and FTP)
== Components
[Webcrawler] A recursive crawling tool. At each step, it retrieves a resource from a URL and feeds it to a block. The block yields links to retrieve on subsequent steps. [ConsistencyChecker] The validation engine. Takes the HTML bodies retrieved by the crawler and determines how to handle each one. [Linkfinder] A link extraction tool, which takes HTML bodies and extracts URLs from them.
== Dependencies
The Linkfinder uses Nokogiri[http://nokogiri.org] to extract links from web pages.
== Author
Copyright (c) 2010 by {Mark T. Tomczak}[http://fixermark.com], released under the MIT license.