
4.3 RootNode specifications

                   

The RootNode specification facility described in Section 4.2 provides a basic set of default enumeration actions for RootNodes. Often it is useful to enumerate beyond the default limits -- for example, to increase the enumeration limit beyond 250 URLs, or to allow site boundaries to be crossed when enumerating HTML links. Starting with Harvest Version 1.1, it is possible to specify these and other aspects of enumeration, using the following syntax (which is backwards-compatible with Harvest Version 1.0):

        <RootNodes>
        URL EnumSpec
        URL EnumSpec
        ...
        </RootNodes>

where EnumSpec is on a single line (using ``\'' to escape linefeeds), with the following syntax:

        URL=Number[,URL-Filter-filename]  \
        Host=Number[,Host-Filter-filename] \
        Access=TypeList \
        Delay=Number \
        Depth=Number

The EnumSpec modifiers are all optional, and have the following meanings:

URL-Max
The number on the right hand side of the ``URL='' expression sets the maximum number of LeafNode URLs to generate, at all levels of depth, from the current URL. Note that URL-Max limits how many URLs are generated during the enumeration; it is not a limit on how many URLs can pass through the candidate selection phase (see Section 4.4.4).

URL-Filter-filename
This is the name of a file containing a set of regular expression filters (discussed below) to allow or deny particular LeafNodes in the enumeration.

Host-Max
The number on the right hand side of the ``Host='' expression sets the maximum number of hosts that will be touched during the RootNode enumeration. Note: Prior to Harvest Version 1.2 the ``Host=...'' line was called ``Site=...''. We changed the name to ``Host='' because it is more intuitively meaningful (being a host count limit, not a site count limit). For backwards compatibility with older Gatherer configuration files, we will continue to treat ``Site='' as an alias for ``Host=''.

Host-Filter-filename
This is the name of a file containing a set of regular expression filters to allow or deny particular hosts in the enumeration. Each expression can specify both a host name and a port number (in case you have multiple servers running on different ports of the same host and you want to index only one). The syntax is ``hostname:port''.

Access
If the RootNode is an HTTP URL, you can specify which access methods the enumeration is allowed to follow. Valid access method types are: FILE, FTP, Gopher, HTTP, News, Telnet, or WAIS. Use a ``|'' character between type names to allow multiple access methods. For example, ``Access=HTTP|FTP|Gopher'' will follow HTTP, FTP, and Gopher URLs while enumerating an HTTP RootNode URL.

Delay
This is the number of seconds to wait between server contacts.

Depth
This is the maximum number of levels of enumeration that will be followed during gathering. Depth=0 means that there is no limit to the depth of the enumeration. Depth=1 means the specified URL will be retrieved, and all the URLs referenced by the specified URL will be retrieved; and so on for higher Depth values. In other words, the enumeration will follow links up to Depth steps away from the specified URL.
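
The Depth rule can be pictured as a breadth-first walk that stops expanding links once it is Depth steps away from the RootNode. The following Python sketch is illustrative only and is not how the Gatherer is implemented: the enumerate_urls function and the in-memory links table are hypothetical stand-ins for fetching pages and extracting their links, and counting the RootNode itself toward URL-Max is an assumption.

```python
from collections import deque

def enumerate_urls(root, links, depth=0, url_max=250):
    """Breadth-first enumeration starting at root.

    links   -- hypothetical dict mapping a URL to the URLs it references
    depth   -- 0 means no depth limit; N means follow links up to N steps from root
    url_max -- stop once this many URLs have been generated (assumed to include root)
    """
    seen = {root}
    order = [root]
    queue = deque([(root, 0)])
    while queue and len(order) < url_max:
        url, level = queue.popleft()
        if depth != 0 and level >= depth:
            continue  # this URL is at the depth limit, so its links are not followed
        for target in links.get(url, []):
            if target not in seen and len(order) < url_max:
                seen.add(target)
                order.append(target)
                queue.append((target, level + 1))
    return order
```

With links = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}, Depth=1 yields a, b, and c; Depth=2 adds d; Depth=0 follows the chain to the end.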

By default, URL-Max is 250, URL-Filter imposes no limit, Host-Max is 1, Host-Filter imposes no limit, Access is HTTP only, Delay is 1 second, and Depth is zero (no depth limit). There is no way to specify an unlimited value for URL-Max or Host-Max.
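
To make the interaction between the modifiers and their defaults concrete, here is a minimal Python sketch that reads an EnumSpec string and fills in the defaults listed above. It is not the Gatherer's own parser: the parse_enumspec function, the result keys URLFilter and HostFilter, and the error handling are all assumptions made for illustration.

```python
def parse_enumspec(spec):
    """Parse an EnumSpec string into a dict, filling in the documented defaults."""
    result = {
        "URL": 250, "URLFilter": None,    # URL-Max and URL-Filter-filename
        "Host": 1, "HostFilter": None,    # Host-Max and Host-Filter-filename
        "Access": ["HTTP"],
        "Delay": 1,
        "Depth": 0,                       # 0 means no depth limit
    }
    # A backslash at end of line escapes the linefeed, so join such lines first.
    for token in spec.replace("\\\n", " ").split():
        key, _, value = token.partition("=")
        if key == "Site":                 # pre-1.2 alias for Host
            key = "Host"
        if key in ("URL", "Host"):
            number, _, filter_file = value.partition(",")
            result[key] = int(number)
            result[key + "Filter"] = filter_file or None
        elif key == "Access":
            result["Access"] = value.split("|")
        elif key in ("Delay", "Depth"):
            result[key] = int(value)
        else:
            raise ValueError("unknown EnumSpec modifier: " + key)
    return result
```

For instance, parse_enumspec("URL=100 Depth=2") keeps Host-Max at 1, Access at HTTP only, and Delay at 1 second, while raising URL-Max to 100 and limiting the depth to 2.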

A filter file has the following syntax:

        Deny  regex
        Allow regex

Note that regex uses the standard UNIX ``regex'' syntax (as defined by the POSIX standard), not the csh ``globbing'' syntax. For example, you would use ``.*abc'' to indicate any string ending with ``abc'', not ``*abc''.

For example, the following URL-Filter file allows all URLs except those matching the regular expression ``/gatherers/.*'':

        Deny  /gatherers/.*
        Allow .*

The URL-Filter regular expressions are matched only on the name portion of each URL. Host-Filter regular expressions are matched on the ``hostname:port'' portion of each URL. The order of the Allow and Deny entries is important, since the filters are applied sequentially from first to last. So, for example, if you list ``Allow .*'' first, no subsequent Deny expressions will be used, since this Allow filter matches every entry.
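
The first-match behavior described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the Gatherer's code: parse_filter and is_allowed are hypothetical names, Python's re module stands in for the UNIX regex library, patterns are searched anywhere in the candidate string, and the result when no rule matches (the default argument) is a guess, since that case is not described here.

```python
import re

def parse_filter(text):
    """Turn filter-file text into an ordered list of (action, compiled regex)."""
    rules = []
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in ("Allow", "Deny"):
            rules.append((parts[0], re.compile(parts[1].strip())))
    return rules

def is_allowed(rules, candidate, default=True):
    """Apply the rules in order; the first rule that matches decides the outcome.

    candidate -- the string being filtered: the name portion of a URL for a
                 URL-Filter, or "hostname:port" for a Host-Filter
    default   -- assumed result when no rule matches
    """
    for action, pattern in rules:
        if pattern.search(candidate):
            return action == "Allow"
    return default
```

Applied to the example file above, "/gatherers/index.html" is denied by the first rule and every other name falls through to ``Allow .*''. Swapping the two lines makes ``Allow .*'' match first, so the Deny rule is never reached.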








Darren Hardy
Mon Apr 3 15:22:37 MDT 1995