Home > webcheck-ruby

webcheck-ruby

Webcheck-ruby is a project mainly written in Ruby, based on the MIT license.

A website crawler / validator, designed to be integrated into an automated website integrity test. Inspired by the Python "Webcheck" project (http://arthurdejong.org/webcheck/)

= Webcheck-ruby

Webcheck-ruby is a simple web site crawler / checker, inspired by the Python program Webcheck[http://arthurdejong.org/webcheck/], currently maintained by Arthur de Jong

== Features

  • fully automated
  • identifies broken links, missing images, missing linked resources, and missing scripts
  • modular design allows for substitution of HTML link extractor, crawler, and page-validation rules

== Usage

To crawl a web page:

require 'webcheck' checker = Webcheck.new results = checker.check("http://example.com") print results[:uris404] print results[:urisUnknown] checker.prettyPrint(results)

The default format of the results is a hash with the following keys

[uris404] Links that returned a 404 error [uris200] Links that were followed successfully [urisUnknown] Links that returned an unknown HTTP code. These are in the form of hashes with the keys :code,:uri [checked] A hash of all the URIs checked. The keys are the URIs [urisNonHTTP] Links on the page that led to resources not accessed via the HTTP protocol (such as e-mail and FTP)

== Components

[Webcrawler] A recursive crawling tool. At each step, it retrieves a resource from a URL and feeds it to a block. The block yields links to retrieve on subsequent steps. [ConsistencyChecker] The validation engine. Takes the HTML bodies retrieved by the crawler and determines how to handle each one. [Linkfinder] A link extraction tool, which takes HTML bodies and extracts URLs from them.

== Dependencies

The Linkfinder uses Nokogiri[http://nokogiri.org] to extract links from web pages.

== Author

Copyright (c) 2010 by {Mark T. Tomczak}[http://fixermark.com], released under the MIT license.