Pre-Filter Programming

Table of Contents

Pre-Filter Programming
Defining the XTF Namespace
Adding or Marking Meta-Data
Preventing Text from being Indexed
Sectioning Documents
Relevance Boost
Pre-Filters and Lazy-Tree Building
Controlling Proximity
Summary

This section describes how to program XTF Pre-Filter stylesheets. If you want to skip the tutorial, you can check out the Reference section or some of the default pre-filters provided with XTF: teiPreFilter.xsl, eadPreFilter.xsl, etc.

The primary purpose of a textIndexer Pre-Filter is to add to or otherwise augment a document prior to indexing it. The pre-filter used for any particular document is defined by the Document Selector. The main aspects of programming a pre-filter are described in the following subsections.

Defining the XTF Namespace

For the textIndexer pre-filter to work properly, an xtf: namespace must be declared at the top of the pre-filter. To do this, simply add the following attribute to the <xsl:stylesheet> tag at the top of the pre-filter:
xmlns:xtf="http://cdlib.org/xtf"
Defining an xtf: namespace in this way and then prefixing textIndexer specific attributes with it allows the textIndexer to distinguish its own attributes from other ones in the filtered document.
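
As a minimal sketch of where the declaration goes, the skeleton below shows a complete but otherwise empty pre-filter. The version="2.0" setting is an assumption, based on the select attribute that the examples in this section use on <xsl:attribute>. The identity template is a common XSLT idiom added here for illustration, not something the textIndexer requires.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative pre-filter skeleton; not taken from the XTF distribution. -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xtf="http://cdlib.org/xtf">

    <!-- Identity rule: copy every node through unchanged unless a more
         specific template overrides it. -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
```

With this in place, templates that add xtf: attributes can be placed after the identity rule.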

Adding or Marking Meta-Data

Recording "meta-data" for a document is typically a very important task for the pre-filter. Meta-data is simply information about a document that is not part of the document text itself. The author name, document publication date, and subject are all examples of meta-data. The textIndexer system supports the concept of meta-data through the use of the xtf:meta attribute. Using the pre-filter to add this attribute to a tag causes the name of the tag and its contents to be recorded in a special meta-data section of the index for the document. For example:
<xsl:template match="PublicationInfo">
 
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:attribute name="xtf:meta" select="'true'"/>
        <xsl:apply-templates/>
    </xsl:copy>
 
</xsl:template>
This snippet of pre-filter code would take any tag with the name PublicationInfo and add an xtf:meta attribute to it, thus telling the textIndexer to add the publication info to the meta-data index for the current document rather than the main text index. Once meta-data has been recorded for a document, it can be searched by modifying the crossQuery servlet's Query Parser Stylesheet to generate meta search requests. Doing so is described in detail in the Query Parser Programming section below.

Note: If you mark a section of text with the xtf:meta attribute, it will not be included in the full text index of that document (accessed by querying the text field). If you want a given piece of text to appear in both the meta-data and full-text indexes, make two copies of it, marking one with xtf:meta and not marking the other.
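
A minimal sketch of the two-copy approach described in the note, assuming a hypothetical source tag named Abstract. The fragment belongs inside a pre-filter stylesheet that declares the xsl: and xtf: namespaces.

```xslt
<xsl:template match="Abstract">

    <!-- Copy 1: left unmarked, so its text goes into the full-text index. -->
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates/>
    </xsl:copy>

    <!-- Copy 2: marked with xtf:meta, so it is recorded as meta-data. -->
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:attribute name="xtf:meta" select="'true'"/>
        <xsl:apply-templates/>
    </xsl:copy>

</xsl:template>
```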

For a real-world example, take a look at this use in the default EAD pre-filter: eadPreFilter.xsl - origination.

Another way in which meta-data for a document can be used is as a sort key. Sort keys are used by the crossQuery servlet to reorder how query matches are displayed for the user. To use a meta-field as a sort key, its contents must not be tokenized (i.e. it must not be broken up into words). Since tokenizing is turned on by default to make a meta-data field searchable, the pre-filter code that processes meta-data must explicitly turn tokenizing off. This is accomplished as follows:
<xsl:template match="PublicationInfo">
 
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:attribute name="xtf:meta" select="'true'"/>
        <xsl:attribute name="xtf:tokenize" select="'no'"/>
        <xsl:apply-templates/>
    </xsl:copy>
 
</xsl:template>
Like the previous example, this pre-filter code would take any tag with the name PublicationInfo and mark it as meta-data. But the addition of the xtf:tokenize="no" attribute disables tokenizing so that the meta-data can be used as a sort key by the crossQuery servlet.

For a real-world example of marking sortable meta-data, see the common pre-filter code supplied with XTF: preFilterCommon.xsl - sort title.

It is important to note that since meta-data must be tokenized to be searchable, and it must not be tokenized to be used as a sort key, meta-based searching and sorting operations on the same exact field name are mutually exclusive. If you want to perform both searching and sorting on a collection of meta-data, you'll need to add code to your pre-filter to produce two copies of that meta-data with separate field names: one copy for searching that is tokenized, and one copy for sorting that is not tokenized.
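
One way to produce the two copies might look like the sketch below. The source tag Title and the field names title and sort-title are hypothetical, chosen only for illustration. The fragment assumes a stylesheet that declares the xsl: and xtf: namespaces.

```xslt
<xsl:template match="Title">

    <!-- Searchable copy: tokenized by default, recorded under "title". -->
    <title xtf:meta="true">
        <xsl:value-of select="."/>
    </title>

    <!-- Sortable copy: not tokenized, recorded under a separate name. -->
    <sort-title xtf:meta="true" xtf:tokenize="no">
        <xsl:value-of select="."/>
    </sort-title>

</xsl:template>
```

Because the two copies use different tag names, they are recorded as different meta-data fields, so a query could search one and sort on the other.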

Preventing Text from being Indexed

There will be times when the text within certain tags in the XML representation of a document should not be indexed (for example, versioning information about the original XML file format). The XSLT pre-filter can prevent such tags and their associated text from being indexed. There are two possible ways to do this:
  1. Standard XSLT programming can be used to eliminate the tag and its text entirely.
  2. A special noindex attribute can be added to the tag to tell the textIndexer to ignore its contents when indexing.
Eliminating certain tags through the use of standard XSLT techniques has the advantage that it saves space. The tag's text is not added to the search index, nor is it stored by the fast retrieval database for later display by the dynaXML servlet. By contrast, the noindex attribute simply prevents the tag's text from being indexed. The text is still stored in the fast retrieval database so that the dynaXML servlet can display the text if necessary.
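
As a sketch of the first approach, an empty template is the usual XSLT way to drop an element together with its contents. The tag name RevisionHistory is hypothetical.

```xslt
<!-- An empty template produces no output, so RevisionHistory elements and
     everything inside them are left out of the filtered document. -->
<xsl:template match="RevisionHistory"/>
```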

A snippet of code showing the use of the noindex attribute can be found in the sample preFilter.xml file that is included with the default XTF installation. It looks as follows:
<xsl:template match="teiHeader">
 
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:attribute name="xtf:noindex" select="'true'"/>
        <xsl:apply-templates/>
    </xsl:copy>
 
</xsl:template>
Notice that the noindex attribute when used in the pre-filter is prefixed with xtf: . This is the namespace used in XTF tags and attributes to prevent collisions with similarly named tags and attributes defined by other programs.

Finally, it should be mentioned that the noindex attribute has two forms:
noindex = true/yes, false/no
and
index   = false/no, true/yes
Both forms enable or disable indexing, but their logic is inverted so that the XSLT programmer can choose the wording that makes the most sense in any given situation.
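
For illustration, the teiHeader example could be written with the inverted form as sketched below. This assumes the index form takes the same xtf: prefix as noindex.

```xslt
<xsl:template match="teiHeader">

    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <!-- Intended to have the same effect as xtf:noindex="true". -->
        <xsl:attribute name="xtf:index" select="'no'"/>
        <xsl:apply-templates/>
    </xsl:copy>

</xsl:template>
```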

For a real-world example of suppressing indexing for a section of text, see the TEI pre-filter code supplied with XTF: teiPreFilter.xsl - ignored elements.

Sectioning Documents

Another attribute that can be added to document tags is the xtf:sectionType attribute. This attribute allows you to assign names to tags within a document. Doing so permits advanced user queries that search for text only in specific section types. Consider the following example:
<xsl:template match="ChapterTitle">
 
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:attribute name="xtf:sectionType" select="'ChapterTitle'"/>
        <xsl:apply-templates/>
    </xsl:copy>
 
</xsl:template>
This XSLT code simply labels the text indexed for a chapter title with a "ChapterTitle" section type. With the text labeled in this manner, the query page presented to the user could provide an advanced search option to look for text only in chapter titles. We talk more about how to actually do this in the section below on programming the crossQuery servlet's Query Parser Stylesheet.

For a real-world example of marking section types, see the TEI pre-filter supplied with XTF: teiPreFilter.xsl - sectionType.

One other thing to mention about sectionType attributes is that they may be used in nested tags. The textIndexer maintains an internal stack of nested section types, and correctly restores previous section types when a given section/tag ends. If you are trying to represent hierarchical information in a section type, you can use the xtf:sectionTypeAdd attribute on child elements to append to their parents' section type information.
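
A rough sketch of nesting, using hypothetical Chapter and Title tags and hypothetical section type values. How the textIndexer combines the parent and child values is not shown here, so check the Reference section for the exact result.

```xslt
<!-- Parent: sets the section type for the whole chapter. -->
<xsl:template match="Chapter">
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:attribute name="xtf:sectionType" select="'chapter'"/>
        <xsl:apply-templates/>
    </xsl:copy>
</xsl:template>

<!-- Child: adds to the section type inherited from the enclosing Chapter
     rather than replacing it. -->
<xsl:template match="Chapter/Title">
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:attribute name="xtf:sectionTypeAdd" select="'title'"/>
        <xsl:apply-templates/>
    </xsl:copy>
</xsl:template>
```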

Relevance Boost

There may be times when it is useful to either boost or de-emphasize the relevance of text in a particular part of a document. Consider the case where you have a document that is a book of quotations. In such a document, it might make sense to boost the relevance of the text in the actual quotations as compared to any text that discusses the quotations. To facilitate this, the textIndexer pre-filter provides a wordBoost attribute. The following example illustrates its use:
<xsl:template match="Quotation">
 
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:attribute name="xtf:wordboost" select="1.5"/>
        <xsl:apply-templates/>
    </xsl:copy>
 
</xsl:template>
This XSLT code simply boosts text found in Quotation tags to be 1.5 times more relevant than non-boosted text in the document. Conversely, to de-emphasize text simply use a value between zero and one (e.g., a boost of 0.5 would make text half as relevant when searching.)

As with section attributes, the wordboost attribute may be used in nested tags. The textIndexer maintains an internal stack of nested boost values, and correctly restores previous values when a given section/tag ends. Note however that boost values in nested tags do not accumulate. That is, a tag with a boost value of 1.5 will boost the relevance of its words by 1.5, regardless of the boost values applied to any tags that contain it.

Pre-Filters and Lazy-Tree Building

The XTF system makes use of Lazy Tree files to help speed document retrieval and to help the dynaXML servlet highlight search results in context. By default, lazy tree files are generated at index time by the textIndexer. However, through the use of the -nobuildlazy command-line argument, the textIndexer can be instructed not to build the lazy tree files. In this case, the dynaXML servlet will build the lazy tree files when it needs them.

If the -nobuildlazy command-line argument is used to delay the building of lazy tree information until document retrieval, it is imperative that the pre-filter specified by the docSelector.xsl stylesheet is the same one specified by dynaXML's docReqParser.xsl stylesheet. If the two stylesheets use different pre-filters, the search result information generated by the crossQuery servlet will not match the highlighting information in the lazy tree files generated by the dynaXML servlet, and chaos will ensue.

Controlling Proximity

Note: Proximity control is rarely needed, but is included here for completeness.

If an XTF user specifies a list of words to search for, the crossQuery servlet will rank any matching words that are closer together as better matches than ones that are far apart. This is what is known as proximity searching.

There are times, however, when simple proximity matches will produce undesired results. For example, consider the case where a query matches some words in two different places in a document. For the first match, the words are very close together but fall in two different chapter tags. For the second match, the words are all in the same chapter, but slightly further apart. In this case, the proximity search mechanism will incorrectly give a higher score to the match whose words are closer together but split across two chapters.

To correct for these kinds of situations, the pre-filter can insert a proximity break attribute into a tag. Doing so effectively puts an infinite distance between the tag with the break and the text before it, thus entirely preventing proximity matches from being found that span the two tags. For example, to solve the "proximity across chapters" problem described above, a pre-filter might include some code like this:
<xsl:template match="chapter">
 
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:attribute name="xtf:proximitybreak" select="'true'"/>
        <xsl:apply-templates/>
    </xsl:copy>
 
</xsl:template>
In this example, the important code is the match attribute on the <xsl:template> tag and the <xsl:attribute> line. The match attribute tells the pre-filter to look for "chapter" tags, and when it finds one, the <xsl:attribute> line adds a proximity break attribute. Adding this code to the pre-filter would ensure that proximity matches are never found that span two "chapter" tags.

Sometimes it may still be desirable to find proximity matches across sections, but de-emphasize them compared to matches found entirely within a section. In this case, the sectionBump attribute can be used in place of a proximity break. Unlike the proximityBreak attribute, the sectionBump attribute can be told how much distance (as a number of words) to introduce between two adjacent sections. For example, this code:
<xsl:template match="chapter">
 
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:attribute name="xtf:sectionBump" select="10"/>
        <xsl:apply-templates/>
    </xsl:copy>
 
</xsl:template>
would separate adjacent chapters from each other by ten words. Proximity matches across chapters would still be found, but they would be considered 10 words further apart (and therefore less relevant) than similar matches found entirely within a single section.

Just as it may be desirable to de-emphasize proximity matches across adjacent sections, it may also be desirable to control proximity matches across sentence boundaries. To accomplish this, the sentenceBump attribute can be added to a tag like this:
<xsl:template match="DocText">
 
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:attribute name="xtf:sentenceBump" select="5"/>
        <xsl:apply-templates/>
    </xsl:copy>
 
</xsl:template>
In this example, a hypothetical tag under which all other document tags and text exist has its sentenceBump value set to 5 words. This effectively separates the end of one sentence from the beginning of the next by five words. Doing so makes proximity matches across sentences less relevant than a similar proximity match entirely within a single sentence.

Note that the default stylesheets provided with XTF, and most installations of XTF, do not use any of these proximity control attributes. Most people seem to find that the default proximity behavior is fine.

Summary

In closing, a few additional facts should be mentioned about the attributes supported by the XTF system:
  1. All the examples above show the textIndexer pre-filter adding attributes to the XML representation for a document. However, for native XML documents, the attributes could have simply been embedded in the original source document tags. The disadvantage of doing so, however, is that the attributes in every XML document would need to be updated whenever indexing changes are made to the XTF system.
  2. A single tag can be assigned more than one attribute. For example, a tag could be assigned both a word boost and a section type if desired. Note however that some combinations (like sectionType + proximitybreak) are redundant and unnecessary.
  3. Currently, when the xtf:meta attribute is added to a tag, all the text-related XTF-specific attributes (such as wordboost, proximitybreak, and sectionType) are ignored.
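
To illustrate the second point, the sketch below assigns both a section type and a word boost to the same tag. The tag name and values are reused from the earlier examples and are illustrative only. The fragment assumes a stylesheet that declares the xsl: and xtf: namespaces.

```xslt
<xsl:template match="ChapterTitle">

    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <!-- Both attributes are added before any child content is copied. -->
        <xsl:attribute name="xtf:sectionType" select="'ChapterTitle'"/>
        <xsl:attribute name="xtf:wordboost" select="1.5"/>
        <xsl:apply-templates/>
    </xsl:copy>

</xsl:template>
```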

You're now prepared to check out the Reference section or some of the default pre-filters provided with XTF (teiPreFilter.xsl, eadPreFilter.xsl), or to start working on the code in your own installation under the style/textIndexer directory.