Starting with Version 1.2, Harvest summarizes HTML using the generic SGML summarizer described in Section 4.4.2. Below is the default SGML-to-SOIF table used by the HTML summarizer. The pathname to this file is $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl. Individual Gatherers may do customized HTML summarizing by placing a modified version of this file in the Gatherer lib directory.
HTML ELEMENT    SOIF ATTRIBUTES
------------    -----------------------
<A>             keywords,parent
<A:HREF>        url-references
<ADDRESS>       address
<B>             keywords,parent
<BODY>          body
<CITE>          references
<CODE>          ignore
<EM>            keywords,parent
<H1>            headings
<H2>            headings
<H3>            headings
<H4>            headings
<H5>            headings
<H6>            headings
<HEAD>          head
<I>             keywords,parent
<IMG:SRC>       images
<META:CONTENT>  $NAME
<STRONG>        keywords,parent
<TITLE>         title
<TT>            keywords,parent
<UL>            keywords,parent
In HTML, the document title is written as:
<TITLE>My Home Page</TITLE>
The above translation table will place this in the SOIF summary as:
title{13}: My Home Page
Note that "keywords,parent" occurs frequently in the table. For any specially marked text (bold, emphasized, hypertext links, etc.), the words are copied into the keywords attribute and also left in the content of the parent element. This keeps the body of the text readable, because the marked words are not removed from it.
Any text that appears inside a pair of CODE tags will not show up in the summary, because the table specifies "ignore" as its SOIF attribute.
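The "keywords,parent" and "ignore" rules can be illustrated with a short Python sketch. This is not Harvest's SGML summarizer: it relies on Python's standard html.parser module, and the names TABLE, TinySummarizer, and summarize are hypothetical, covering only a few rows of the table above.

```python
from html.parser import HTMLParser

# Hypothetical subset of the translation table (lowercase tag -> rule).
TABLE = {
    "b": "keywords,parent",
    "em": "keywords,parent",
    "code": "ignore",
    "body": "body",
}


class TinySummarizer(HTMLParser):
    """Illustrative only; not the Harvest implementation."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements that have a table entry
        self.out = {}    # SOIF attribute name -> list of text pieces

    def handle_starttag(self, tag, attrs):
        if tag in TABLE:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        # Text inside any ignored element is dropped entirely.
        if any(TABLE[tag] == "ignore" for tag in self.stack):
            return
        # Walk from the innermost open element outward.
        for tag in reversed(self.stack):
            rule = TABLE[tag]
            if rule == "keywords,parent":
                # Copy into keywords, then keep going so the text
                # also stays in the parent element's content.
                self.out.setdefault("keywords", []).append(text)
                continue
            self.out.setdefault(rule, []).append(text)
            return


def summarize(html):
    parser = TinySummarizer()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.out.items()}


print(summarize("<body>Harvest is <b>fast</b> and <code>x = 1</code> small</body>"))
# {'body': 'Harvest is fast and small', 'keywords': 'fast'}
```

The bold word appears in both keywords and body, while the CODE text appears in neither.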
URLs in HTML anchors are written as:
<A HREF="http://harvest.cs.colorado.edu/">
The specification for <A:HREF> in the above translation table causes this to appear as:
url-references{32}: http://harvest.cs.colorado.edu/
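As a rough illustration of how an ELEMENT:ATTRIBUTE entry such as <A:HREF> selects an attribute value rather than element content, consider the following Python sketch. It assumes Python's standard html.parser module; ATTR_TABLE, AttrCollector, and collect_attrs are hypothetical names, the file name logo.gif is made up, and the sketch does not reproduce the size field shown in braces in the SOIF output.

```python
from html.parser import HTMLParser

# Hypothetical subset of the table: (element, attribute) -> SOIF attribute.
ATTR_TABLE = {
    ("a", "href"): "url-references",
    ("img", "src"): "images",
}


class AttrCollector(HTMLParser):
    """Illustrative only; not the Harvest implementation."""

    def __init__(self):
        super().__init__()
        self.out = {}  # SOIF attribute name -> list of values

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lowercase.
        for name, value in attrs:
            soif_name = ATTR_TABLE.get((tag, name))
            if soif_name and value:
                self.out.setdefault(soif_name, []).append(value)


def collect_attrs(html):
    parser = AttrCollector()
    parser.feed(html)
    parser.close()
    return parser.out


print(collect_attrs('<A HREF="http://harvest.cs.colorado.edu/">Harvest</A>'
                    '<IMG SRC="logo.gif">'))
# {'url-references': ['http://harvest.cs.colorado.edu/'], 'images': ['logo.gif']}
```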
One of the most useful HTML tags is META. This allows the document writer to include arbitrary metadata in an HTML document. A typical use of the META element is:
<META NAME="author" CONTENT="Joe T. Slacker">
By specifying "<META:CONTENT> $NAME" in the translation table, this comes out as:
author{15}: Joe T. Slacker
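The $NAME substitution can be sketched in Python as follows: the value of the NAME attribute becomes the SOIF attribute name, and the value of CONTENT becomes its value. This is only an approximation of the behavior described above, assuming Python's standard html.parser module; MetaCollector and collect_meta are hypothetical names.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Illustrative only; not the Harvest implementation."""

    def __init__(self):
        super().__init__()
        self.out = {}  # SOIF attribute name -> list of CONTENT values

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Only META elements carrying both NAME and CONTENT are used here.
        if name and content:
            self.out.setdefault(name, []).append(content)


def collect_meta(html):
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.out


print(collect_meta('<META NAME="author" CONTENT="Joe T. Slacker">'))
# {'author': ['Joe T. Slacker']}
```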
HTML authors can easily add a list of keywords to their documents:
<META NAME="keywords" CONTENT="word1 word2">
<META NAME="keywords" CONTENT="word3 word4">