A URL scanner, maintainer, and validator.
webval is a system that will scan documents for fully-qualified HTTP URLs, keeping its database fresh with newly-seen URLs. It can then be requested to validate the URLs, whereby it will attempt to access each URL via an HTTP request and record the response code; it maintains a list of the most recent codes that have been retrieved. Response codes are classified as "good" (URL is correct and a valid page is there) and "bad" (URL is invalid or outdated). By default any code other than a 2xx code is considered bad, but this can be changed (e.g., to ignore 3xx redirection codes).
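For illustration, scanning a line of text for fully-qualified HTTP URLs might be sketched in Python as follows; the regular expression here is a naive assumption for the example, not webval's actual pattern:

```python
import re

# Naive pattern for fully-qualified HTTP URLs; webval's actual pattern
# is not shown in this document, so this regex is illustrative only.
URL_PATTERN = re.compile(r"https?://[^\s<>\"]+")

def scan_line(line):
    """Return all fully-qualified HTTP URLs found in a line of text."""
    return URL_PATTERN.findall(line)

print(scan_line("See http://www.alcyone.com/pyos/webval/ for details."))
# -> ['http://www.alcyone.com/pyos/webval/']
```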
webval can then be used in report mode, where it scans documents for URLs as before but reports invalid URLs (that is, URLs in the database whose number of "bad" codes reaches a certain threshold). These are printed to stderr in a format that shows the file and line number where each URL was seen, so that they can be corrected.
webval's reporting output is designed to be GNU make-friendly; the database itself is a simple text file, containing one record per line, which can easily be grepped and manipulated by hand.
Getting the software
The current version of webval is 1.0.1.
The latest version of the software is available in a tarball here: http://www.alcyone.com/pyos/webval/webval-latest.tar.gz.
The official URL for this Web site is http://www.alcyone.com/pyos/webval/.
Python 2.2 or greater is required. Threading is used by the validator, so Python must be configured with threads.
This code is released under the GPL.
webval has three main modes: scan (the default), report, and validate. Scan and report behave the same, except that in report mode, encountered URLs whose bad codes reach the specified threshold are printed to stderr with an indication of the file and line number where they were encountered; scan prints no output. Use scan when all you wish to do is add URLs to the database and don't care about their status. In both scanning and reporting, the timestamp of when each URL was last seen is updated, and any new URLs are added to the database.
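The report-mode decision described above can be sketched as follows; the function name is illustrative, not webval's internals:

```python
def should_report(recent_codes, threshold):
    """True when at least `threshold` of the recent codes are bad.

    By default webval treats anything outside 2xx as bad (this is
    configurable; see the -b option below).
    """
    bad = sum(1 for code in recent_codes if not 200 <= code < 300)
    return bad >= threshold

print(should_report([200, 404, 500, 404], 3))  # -> True (three bad codes)
```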
To run webval for the first time:
./webval.py bookmarks.html

This will run in scan mode (the default), adding any URLs found to the database. The URLs can then be validated with the -v option:

./webval.py -v

which will check the URLs in the database in batches and print status as it goes. This status can be suppressed with the -q option:

./webval.py -q -v
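Batched, threaded validation (hence the threaded-Python requirement above) can be sketched roughly as follows. This uses the modern Python standard library rather than webval's own Python 2.2-era threading code, and all names here are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def check(url):
    """Fetch a URL and return the HTTP response code, or None on failure."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.status
    except HTTPError as error:
        return error.code  # the server answered, but with an error status
    except URLError:
        return None  # DNS failure, connection refused, and the like

def validate_batch(urls, workers=8):
    """Check a batch of URLs concurrently, returning {url: code-or-None}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(check, urls)))
```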
Reporting can be done with the -r option:

./webval.py -r bookmarks.html
By default, webval maintains the last five validations, and reports URLs as bad when they have failed all five. These can be changed with the -k and -t options, respectively:

./webval.py -k 10 -v
./webval.py -t 5 -r bookmarks.html

would keep 10 results and report an error if five or more were bad.
When validating, each URL in the database is scanned, its HTTP response code collected, and the database is updated. Only a certain number of codes are kept at any given time; the most recent N codes are maintained.
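The bounded history of the most recent N codes can be modeled with a deque, as in this sketch (illustrative only, not webval's storage format):

```python
from collections import deque

class CodeHistory:
    """Keep only the most recent N response codes for a URL (cf. -k)."""

    def __init__(self, keep=5):
        self.codes = deque(maxlen=keep)  # older codes fall off automatically

    def record(self, code):
        self.codes.append(code)

history = CodeHistory(keep=3)
for code in (200, 404, 500, 302):
    history.record(code)
print(list(history.codes))  # -> [404, 500, 302]; the 200 has aged out
```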
By default, any code of 300 or greater is considered "bad"; this includes the 3xx redirection codes, which do not always indicate a bad resource. The boundary can be changed with the -b option:

./webval.py -b 400 -r *.html

would not count the 3xx codes as being errors.
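The good/bad boundary amounts to a single comparison; the function name here is an assumption for illustration:

```python
def is_bad(code, boundary=300):
    """Classify a response code against webval's configurable boundary (-b)."""
    return code >= boundary

print(is_bad(301))                # -> True  (default: redirects count as bad)
print(is_bad(301, boundary=400))  # -> False (the effect of -b 400)
print(is_bad(404, boundary=400))  # -> True
```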
./webval.py -I yahoo.com,google.com bookmarks.html
Other options are available; see below.
The following options can be used to alter the behavior of webval:
HTTP response codes are the standard ones specified by the HTTP RFC, with the addition of 9xx response codes used by webval to indicate meta-errors (e.g., malformed URL, DNS lookup failure, TCP/IP error). The codes are reproduced here as a convenience:
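The 9xx meta-errors correspond to failures that never produce a real HTTP response. As a purely hypothetical illustration (webval's actual 9xx numbering is not reproduced in this excerpt), such failures might be mapped to synthetic codes like so:

```python
import socket
from urllib.error import URLError

# Hypothetical 9xx assignments: webval defines its own numbering,
# which these particular values do not claim to match.
MALFORMED_URL = 901
DNS_FAILURE = 902
NETWORK_ERROR = 903

def meta_code(failure):
    """Map a fetch failure (an exception) to a synthetic 9xx code."""
    if isinstance(failure, ValueError):
        return MALFORMED_URL  # e.g., urlopen rejected a malformed URL
    if isinstance(failure, URLError) and isinstance(failure.reason, socket.gaierror):
        return DNS_FAILURE    # hostname did not resolve
    return NETWORK_ERROR      # any other TCP/IP-level error

print(meta_code(ValueError("unknown url type")))  # -> 901
```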
Version 1.0.1 $Date: 2003/08/23 $ $Author: max $
This document was automatically generated on Sat Aug 23 13:33:15 2003 by HappyDoc version 2.0.1