webval  

Summary

A URL scanner, maintainer, and validator.

Overview

webval is a system that scans documents for fully-qualified HTTP URLs, keeping its database fresh with newly seen URLs. It can then be asked to validate the URLs, whereupon it attempts to access each URL via an HTTP request and records the response code; it maintains a list of the most recent codes retrieved. Response codes are classified as "good" (the URL is correct and a valid page is there) or "bad" (the URL is invalid or outdated). By default any code other than a 2xx code is considered bad, but this can be changed (e.g., to ignore 3xx redirection codes).
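
For illustration, here is a minimal Python sketch of the validation idea: probe a URL, record the response code, and classify it against a cutoff. This is not webval's actual code (webval targets Python 2.2; the sketch uses the modern standard library for brevity), and the 999 fallback mirrors webval's metaerror convention described under References.

        import http.client
        import urllib.parse

        def check_url(url, bad_cutoff=300, timeout=30):
            """Return (code, is_bad) for one fully-qualified HTTP URL."""
            parts = urllib.parse.urlsplit(url)
            try:
                conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
                conn.request("HEAD", parts.path or "/")
                code = conn.getresponse().status
                conn.close()
            except Exception:
                code = 999  # webval-style metaerror code for any local failure
            return code, code >= bad_cutoff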

webval can then be used in report mode, where it scans documents for URLs as before, but reports invalid URLs (that is, URLs in the database whose number of "bad" codes meets a certain threshold). These are printed to stderr in a format that shows the file and line number the URLs were seen in, so that they can be corrected.

webval's reporting output is designed to be GNU make friendly; the database itself is a simple text file, containing one record per line, which can be easily grepped and manipulated manually.
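
The exact record layout is not documented here, so the following sketch assumes a hypothetical whitespace-separated layout (URL, last-seen timestamp, then the recent codes) purely to show how easily such a one-record-per-line file can be processed:

        # Hypothetical record layout, for illustration only; webval's
        # real format may differ:
        #   <url> <last-seen> <code> <code> ...
        def bad_urls(db_path, bad_cutoff=300, threshold=5):
            """Yield URLs whose recent bad-code count meets the threshold."""
            with open(db_path) as db:
                for line in db:
                    fields = line.split()
                    url, codes = fields[0], [int(c) for c in fields[2:]]
                    if sum(1 for c in codes if c >= bad_cutoff) >= threshold:
                        yield url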

Getting the software

The current version of webval is 1.0.1.

The latest version of the software is available in a tarball here: http://www.alcyone.com/pyos/webval/webval-latest.tar.gz.

The official URL for this Web site is http://www.alcyone.com/pyos/webval/.

Requirements

Python 2.2 or greater is required. Threading is used by the validator, so Python must be configured with threads.

License

This code is released under the GPL.

Sample usage

webval has three main modes: scan (the default), report, and validate. Scan and report behave the same, except that in report mode, encountered URLs whose bad-code count meets the specified threshold are printed to stderr with an indication of the file and line number they were encountered in; scan prints no output, so use it when all you wish to do is add URLs to the database and don't care about their status. In both scanning and reporting, the timestamp of when URLs were last seen is updated, and any new URLs are added to the database.

To run webval for the first time:

        ./webval.py bookmarks.html

This will run in scan mode (-s). To then validate these URLs, use:

        ./webval.py -V

which will check the URLs in the database in batches and print status as it goes. This status can be suppressed with the -q option (e.g., for a cron job):

        ./webval.py -q -V
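
Validation checks many URLs concurrently (see the -j option below). Here is a sketch of the same batching idea, using the modern concurrent.futures module rather than whatever webval itself does with Python 2.2 threads, and reusing check_url from the Overview sketch:

        from concurrent.futures import ThreadPoolExecutor

        def validate(urls, jobs=10, quiet=False):
            """Probe URLs in parallel, printing status unless quiet."""
            with ThreadPoolExecutor(max_workers=jobs) as pool:
                for url, (code, bad) in zip(urls, pool.map(check_url, urls)):
                    if not quiet:
                        print(url, code, "bad" if bad else "good")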

Reporting can be done with the -r option:

        ./webval.py -r bookmarks.html

By default, webval maintains the last five validations, and reports URLs as bad when they have failed all five. These can be changed with the -k and -t options, respectively:

        ./webval.py -k 10 -V
        ./webval.py -t 5 -r bookmarks.html

would keep 10 results and report an error if five or more were bad.

When validating, each URL in the database is checked, its HTTP response code collected, and the database updated. Only the most recent N codes are kept at any given time (N is set with the -k option).
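
A bounded container makes the "most recent N" behavior trivial; for example (a sketch, not webval's actual storage):

        from collections import deque

        codes = deque(maxlen=5)        # e.g. -k 5
        for code in (200, 404, 404, 404, 404, 404):
            codes.append(code)         # oldest code is discarded once full
        print(list(codes))             # [404, 404, 404, 404, 404]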

By default, any code of 300 or greater is considered "bad"; this includes the 3xx redirection codes, which do not always indicate a bad resource. The cutoff can be changed with the -b option:

        ./webval.py -b 400 -r *.html

would not count the 3xx codes as being errors.
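
The effect of the cutoff on classification is just a numeric comparison; for instance:

        # With the default cutoff of 300, a 301 counts as bad;
        # raising the cutoff to 400 excludes the 3xx codes.
        for cutoff in (300, 400):
            print(cutoff, [code >= cutoff for code in (200, 301, 404)])
        # 300 [False, True, True]
        # 400 [False, False, True]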

The -I option will ignore URLs containing certain fragments:

        ./webval.py -I yahoo.com,google.com bookmarks.html
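
The matching is plain substring containment, as this sketch suggests (again illustrative, not webval's code):

        ignored = "yahoo.com,google.com".split(",")

        def is_ignored(url):
            """True if the URL contains any ignored fragment."""
            return any(fragment in url for fragment in ignored)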

Other options are available; see below.

Invocation

The following options can be used to alter the behavior of webval:

-b/--bad-code (code)
The numerically lowest code that is considered "bad" for the sake of reporting. By default this is 300; set it to 400 if you do not wish to consider redirections as errors. In conjunction with -r.
-h/--help
Print usage and exit.
-j/--jobs (threads)
The number of simultaneous threads to have actively checking URLs while validating. Defaults to 10. In conjunction with -V.
-k/--codes (count)
The number of codes to maintain in the database. By default this is 5. In conjunction with -V.
-q/--quiet
Suppress printing status while validating; validation proceeds silently and no output is generated on stdout. In conjunction with -V.
-r/--report
Report mode, where URLs are scanned and added, but are also checked against the database; bad URLs are printed with the file and line number they were found in.
-s/--scan
Go into scan mode, where URLs in the specified files are silently scanned and added to the database. This is the default mode.
-t/--threshhold (count)
The number of codes which must be considered "bad" before a URL is reported as being invalid. By default this is 5, the same as the number of maintained codes (i.e., 100% of the validations must have resulted in a bad code for the URL to be considered invalid). In conjunction with -r.
-v/--version
Print version and exit.
-A/--agent (agent)
The User-Agent that webval will report itself as when contacting remote hosts via HTTP. Defaults to WebVal/1.0.1. In conjunction with -V.
-I/--ignore (strings)
Ignore URLs containing one of the comma-separated list of strings provided. This will only preclude the inclusion of new URLs containing the substrings; URLs already in the database matching them will not be removed. In conjunction with -s or -r.
-M/--method (method)
The HTTP method to use when contacting hosts; should be HEAD or GET. By default HEAD is used; GET involves more bandwidth usage, but HEAD more obviously identifies the request as an automated URL scan. In conjunction with -V.
-T/--timeout (seconds)
The per-thread timeout to use when scanning URLs. Any server which takes longer than this to complete will be logged with a 999 code. In conjunction with -V.
-V/--validate
Go through the database and check each URL manually, updating its status in the database.

References

HTTP response codes are the standard ones specified by the HTTP/1.1 RFC (RFC 2616), with the addition of 9xx response codes that webval uses to indicate metaerrors (e.g., malformed URL, DNS lookup failure, TCP/IP error). The codes are reproduced here as a convenience:

1xx Informational
100 Continue; 101 Switching Protocols;
2xx Successful
200 OK; 201 Created; 202 Accepted; 203 Non-Authoritative Information; 204 No Content; 205 Reset Content; 206 Partial Content;
3xx Redirection
300 Multiple Choices; 301 Moved Permanently; 302 Found; 303 See Other; 304 Not Modified; 305 Use Proxy; 306 (Unused); 307 Temporary Redirect;
4xx Client Error
400 Bad Request; 401 Unauthorized; 402 Payment Required; 403 Forbidden; 404 Not Found; 405 Method Not Allowed; 406 Not Acceptable; 407 Proxy Authentication Required; 408 Request Timeout; 409 Conflict; 410 Gone; 411 Length Required; 412 Precondition Failed; 413 Request Entity Too Large; 414 Request-URI Too Long; 415 Unsupported Media Type; 416 Requested Range Not Satisfiable; 417 Expectation Failed;
5xx Server Error
500 Internal Server Error; 501 Not Implemented; 502 Bad Gateway; 503 Service Unavailable; 504 Gateway Timeout; 505 HTTP Version Not Supported;
9xx webval Metaerror (nonstandard)
901 Socket error; 902 TCP/IP I/O error; 903 Bad port; 999 Timeout;
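
How local failures might map onto the 9xx codes can be sketched as follows; the exception-to-code assignments here are illustrative guesses, not webval's actual ones:

        import http.client
        import socket
        import urllib.parse

        def probe(url, method="HEAD", timeout=30):
            """Return an HTTP response code, or a webval-style 9xx code."""
            parts = urllib.parse.urlsplit(url)
            try:
                conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
                conn.request(method, parts.path or "/")
                return conn.getresponse().status
            except socket.timeout:
                return 999  # timeout
            except http.client.InvalidURL:
                return 903  # bad port
            except socket.gaierror:
                return 901  # socket error (e.g., DNS lookup failure)
            except OSError:
                return 902  # TCP/IP I/O error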

Wish list

  • Following URLs for 3xx codes would be useful, with special notations for "URL changed but still valid" and "URL changed and no longer valid anyway."

  • An additional mode to expire URLs which haven't been scanned in a while.

Release history

  • 1.0.1; 2003 Aug 23. Swap -v (validate) and -V (version information) options.

  • 1.0; 2002 Aug 18. Initial release.

Author

This module was written by Erik Max Francis. If you use this software, have suggestions for future releases, or find bugs, I'd love to hear about it.

Version

Version 1.0.1 $Date: 2003/08/23 $ $Author: max $
