webval
Summary

A URL scanner, maintainer, and validator.

Overview

webval is a system that scans documents for fully-qualified HTTP URLs, keeping its database fresh with newly-seen URLs. It can then be asked to validate the URLs, whereby it attempts to access each URL via an HTTP request and records the response code; it maintains a list of the most recent codes that have been retrieved. Response codes are classified as "good" (the URL is correct and a valid page is there) or "bad" (the URL is invalid or outdated). By default any code other than a 2xx code is considered bad, but this can be changed (e.g., to ignore 3xx redirection codes).

webval can then be used in report mode, where it scans documents for URLs as before, but reports invalid URLs (that is, URLs in the database whose number of "bad" codes exceeds a certain threshold). These are printed to stderr in a format that shows the file and line number the URLs were seen in, so that they can be corrected. webval's reporting output is designed to be GNU make friendly; the database itself is a simple text file, containing one record per line, which can be easily grepped and manipulated manually.

Getting the software

The current version of webval is 1.0.1.

The latest version of the software is available in a tarball here: http://www.alcyone.com/pyos/webval/webval-latest.tar.gz.

The official URL for this Web site is http://www.alcyone.com/pyos/webval/.

Requirements

Python 2.2 or greater is required. Threading is used by the validator, so Python must be configured with threads.

License

This code is released under the GPL.

Sample usage

webval has three main modes: scan (the default), report, and validate.
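As a rough sketch (not webval's actual implementation), the scan/report behavior described in the overview amounts to finding fully-qualified HTTP URLs line by line and emitting GNU-make-friendly "file:line:" diagnostics on stderr; the regular expression and function names below are illustrative assumptions:

```python
import re
import sys

# Illustrative pattern for fully-qualified HTTP URLs; webval's
# actual pattern may differ.
URL_RE = re.compile(r'https?://[^\s<>"]+')

def scan(filename, lines):
    """Yield (url, filename, lineno) for every URL seen in the document."""
    for lineno, line in enumerate(lines, start=1):
        for match in URL_RE.finditer(line):
            yield match.group(), filename, lineno

def report_bad(url, filename, lineno):
    """Print a diagnostic in the file:line: format that GNU make
    (and editors that parse make output) understand."""
    print("%s:%d: bad URL %s" % (filename, lineno, url), file=sys.stderr)
```

Scan mode would simply record what `scan` yields; report mode would additionally call something like `report_bad` for URLs the database considers invalid.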
Scan and report behave the same, except that in report mode, encountered URLs that have bad codes exceeding the specified threshold are printed to stderr with an indication of the file and line number they were encountered in; scan does not print any output. That is, you can use scan when all you wish to do is add URLs to the database and don't care about their status. In both scanning and reporting, the timestamp of when URLs were last seen is updated, and any new URLs are added to the database.

To run webval for the first time:

    ./webval.py bookmarks.html

This will run in scan mode (the default). Validation is done with the -v option:

    ./webval.py -v

which will check the URLs in the database in batches and print status as it goes. This status can be suppressed with the -q option:

    ./webval.py -q -v

Reporting can be done with the -r option:

    ./webval.py -r bookmarks.html

By default, webval maintains the last five validations, and reports URLs as bad when they have failed all five. These can be changed with the -k and -t options:

    ./webval.py -k 10 -v
    ./webval.py -t 5 -r bookmarks.html

would keep 10 results and report an error if five or more were bad.

When validating, each URL in the database is scanned, its HTTP response code collected, and the database is updated. Only a certain number of codes are kept at any given time; the most recent N codes are maintained.

By default, any code of 300 or greater is considered "bad"; 300 includes the redirection codes, which do not always indicate a bad resource. This can be changed with the -b option:

    ./webval.py -b 400 -r *.html

would not count the 3xx codes as being errors.

The -I option takes a comma-separated list of hosts whose URLs should be ignored:

    ./webval.py -I yahoo.com,google.com bookmarks.html

Other options are available; see below.
References

HTTP response codes are the standard ones specified by the RFC, with the addition of 9xx response codes that are used by webval to indicate meta-errors (e.g., malformed URL, DNS lookup failure, TCP/IP error, etc.). The codes are reproduced here as a convenience:
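In modern Python (rather than the Python 2.2 webval itself targets), a single validation step in the spirit of the 9xx meta-errors described above might look like this. The specific 9xx values are hypothetical placeholders, not webval's actual assignments:

```python
import socket
import urllib.error
import urllib.request

META_NETWORK_ERROR = 901   # hypothetical: DNS lookup or TCP/IP failure
META_MALFORMED_URL = 902   # hypothetical: URL could not be parsed

def validate(url, timeout=10):
    """Return the HTTP response code for url, or a 9xx meta-code
    when no HTTP response could be obtained at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status     # e.g. 200
    except urllib.error.HTTPError as error:
        return error.code              # e.g. 404, 500
    except (urllib.error.URLError, socket.error):
        return META_NETWORK_ERROR
    except ValueError:
        return META_MALFORMED_URL
```

Recording a meta-code in the same slot as a real HTTP status lets the good/bad classification treat "could not even connect" uniformly with "server said 404".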
Wish list
Release history
Author

This module was written by Erik Max Francis. If you use this software, have suggestions for future releases, or have bug reports, I'd love to hear about it.

Version

Version 1.0.1

$Date: 2003/08/23 $ $Author: max $