Term and Hit Marking

Table of Contents

Term and Hit Marking
Snippet Formation and Marking
Hits in their Original Context
Spanning XML Tags
Special Rules for Marking Stop Words

When a meta-data field, snippet, or marked-up document is fed into a crossQuery or dynaXML formatting stylesheet, XTF inserts XML tags that mark matched terms in context and delimit the extent of each full-text match, also called a "text hit".

XTF Definition:
text hit, n. A consecutive span of words in a document that matches a text query. (Note that there may be many hits per document.)

This section covers details of which terms are marked where, what "snippets" are and how they are formed, and how hits are marked, both within snippets and in their original context.

Snippet Formation and Marking

Hits found in the full text of a document are, of course, surrounded by other text that can help the user decide if the hit is useful to them. XTF provides this context by calculating a "snippet" for each hit within a document.

XTF Definition:
snippet, n. A section of the source text or a meta-data field, surrounding and including a match (or "hit").

A query may specify the optimal length (in characters) of a snippet, and the system will get as close as it can to that length without exceeding it. The default is 80 characters.
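To make the length limit concrete, here is a minimal sketch of one way to pick a context window around a hit without exceeding a character budget. It is an illustration only and is not XTF's actual code: the function name form_snippet, the even split of context before and after the hit, and the trimming to whole words are all assumptions.

```python
def form_snippet(text, hit_start, hit_end, max_chars=80):
    """Return a window of `text` around text[hit_start:hit_end].

    Hypothetical helper: the result is at most max_chars long unless
    the hit alone is longer, in which case the bare hit is returned.
    """
    hit_len = hit_end - hit_start
    if hit_len >= max_chars:
        return text[hit_start:hit_end]

    # Share the remaining budget between left and right context.
    budget = max_chars - hit_len
    left = min(hit_start, budget // 2)
    right = min(len(text) - hit_end, budget - left)
    # Hand any budget the right side could not use back to the left.
    left = min(hit_start, budget - right)
    start, end = hit_start - left, hit_end + right

    # Shrink inward to word boundaries so no word is cut in half.
    if start > 0:
        while start < hit_start and not text[start - 1].isspace():
            start += 1
    if end < len(text):
        while end > hit_end and not text[end].isspace():
            end -= 1
    return text[start:end].strip()


text = "The dog chewed on the skeleton's leg bone."
start = text.index("dog")
short = form_snippet(text, start, start + len("dog"), max_chars=20)
full = form_snippet(text, start, start + len("dog"))
print(short)  # The dog chewed on
print(full)   # The dog chewed on the skeleton's leg bone.
```

Because the window only ever shrinks after the budget is applied, the result stays within the requested length, matching the "as close as it can without exceeding it" behavior described above.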

The process used to form snippets is fairly simple:
Note that many hits contain words that are not query terms and so are not marked with <term>. For instance, a search for "dog NEAR skeleton NEAR bone" on the text "The dog chewed on the skeleton's leg bone." would yield:
The <hit><term>dog</term> chewed on the
<term>skeleton's</term> leg <term>bone</term></hit>
Snippets are provided to the Result Formatter stylesheet in crossQuery, which displays a summary of hits across all documents, and to the Document Formatter stylesheet in dynaXML, which displays a ranked list of snippets within a single document.
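As a rough illustration of what a consumer of this markup sees, the sketch below parses the "dog" example with Python's standard library and pulls out the snippet text, the hit text, and the marked terms. The <snippet> wrapper element and the describe_snippet helper are assumptions added so the fragment parses as a single element; a real stylesheet would do the equivalent in XSLT.

```python
import xml.etree.ElementTree as ET

# The <snippet> wrapper is an assumption so the fragment has one root element.
SNIPPET = (
    "<snippet>The <hit><term>dog</term> chewed on the "
    "<term>skeleton's</term> leg <term>bone</term></hit>.</snippet>"
)


def describe_snippet(xml_text):
    """Return the plain snippet text, the hit text, and the marked terms."""
    root = ET.fromstring(xml_text)
    hit = root.find("hit")
    return {
        "snippet": "".join(root.itertext()),
        "hit": "".join(hit.itertext()),
        "terms": [term.text for term in hit.iter("term")],
    }


info = describe_snippet(SNIPPET)
print(info["snippet"])  # The dog chewed on the skeleton's leg bone.
print(info["hit"])      # dog chewed on the skeleton's leg bone
print(info["terms"])    # ['dog', "skeleton's", 'bone']
```

The words "chewed on the" and "leg" fall inside the hit but carry no <term> tag, which is the distinction the example above is making.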

Hits in their Original Context

In addition to snippets, dynaXML also inserts <hit> tags and <term> tags into the original XML document before feeding it to the Document Formatter stylesheet. This allows the stylesheet to display the document contents with the hits and terms highlighted in their original context.

The tags are identical to those contained in snippets, except that no surrounding text needs to be added, because the hits are marked in their original context. Additional attributes are added to each hit, giving its score, rank, and hit number.

Note that query terms are marked everywhere they appear in the document text, not just within <hit> markers. This gives the stylesheet the option of highlighting them inside and/or outside hits.
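The following sketch shows how a consumer might separate terms that fall inside a hit from those that only appear elsewhere in the text. The sample document, its <p> wrapper, and the split_terms helper are made up for illustration; only the <hit>, <more>, and <term> tag names come from this section.

```python
import xml.etree.ElementTree as ET

# Made-up sample for the query "dog NEAR bone"; the <p> wrapper is illustrative.
DOC = (
    "<p>A <term>dog</term> barked. The <hit><term>dog</term> chewed on the "
    "<term>bone</term></hit> all day.</p>"
)

# <more> marks continuation stretches of a hit that crosses element
# boundaries (see Spanning XML Tags), so terms inside it count as in a hit.
HIT_TAGS = ("hit", "more")


def split_terms(xml_text):
    """Return (terms inside hits, terms outside hits) in document order."""
    root = ET.fromstring(xml_text)
    inside = {
        term
        for el in root.iter()
        if el.tag in HIT_TAGS
        for term in el.iter("term")
    }
    inside_text, outside_text = [], []
    for term in root.iter("term"):
        (inside_text if term in inside else outside_text).append(term.text)
    return inside_text, outside_text


inside_terms, outside_terms = split_terms(DOC)
print(inside_terms)   # ['dog', 'bone']
print(outside_terms)  # ['dog']
```

A stylesheet could apply one style to the first list and a lighter style to the second, or ignore the second list entirely.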

Spanning XML Tags

A complication arises when a hit crosses the boundary between two XML elements in the original source text. XTF takes care not to alter the structure of the document, so hit tags in this circumstance are inserted in a special way. Essentially, the hit is divided into several stretches of unbroken text. The first stretch is marked with an <xtf:hit> tag with its continues attribute set to "yes". The subsequent stretches are each marked with an <xtf:more> tag; the continues attribute for each is "yes" except for the last, which is "no". (The examples in this section omit the xtf: prefix for brevity.)

The goal is to allow the Document Formatter stylesheet to present a seamless interface to the end-user, who is probably unaware of the underlying structure of a document and is only concerned with where the hits fall.

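For example, say we had the following source text, which uses <i> to mark an italicized section:
The hungry plant yearned for <i>human flesh</i> to fill its bottomless gullet.
If the user searched for "plant NEAR human", they would expect the text from "plant" through "human" to be highlighted as a single hit, even though it crosses into the italicized section.
To support this sort of behavior, XTF would mark up the source document like this (broken into multiple lines and indented for clarity):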
The hungry
<hit hitNum="1" continues="yes">
  <term>plant</term> yearned for
</hit>
<i>
  <more hitNum="1" continues="no">
    <term>human</term>
  </more> flesh
</i>
to fill its bottomless gullet.
Let's consider another example. If the user searched for "plant NEAR bottomless", the resulting hit would span the entire <i> element. A reasonable Document Formatter would highlight everything from "plant" through "bottomless" as one hit, italicized words included.
Here is how XTF would mark up this example:
The hungry
<hit hitNum="1" continues="yes">
  <term>plant</term> yearned for
</hit>
<i>
  <more hitNum="1" continues="yes">
    human flesh
  </more>
</i>
<more hitNum="1" continues="no">
  to fill its <term>bottomless</term>
</more> gullet.
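To show how the pieces of a spanning hit fit back together, here is a small sketch that gathers every <hit> and <more> stretch by hitNum, in document order, and joins them into the full hit text. The <p> wrapper and the collect_hits helper are assumptions for illustration; the markup inside the wrapper is the example above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The markup from the example above, wrapped in an assumed <p> root element.
MARKED = """<p>The hungry
<hit hitNum="1" continues="yes">
  <term>plant</term> yearned for
</hit>
<i>
  <more hitNum="1" continues="yes">
    human flesh
  </more>
</i>
<more hitNum="1" continues="no">
  to fill its <term>bottomless</term>
</more> gullet.</p>"""


def collect_hits(xml_text):
    """Map each hitNum to the full text of that hit, with whitespace collapsed."""
    root = ET.fromstring(xml_text)
    pieces = defaultdict(list)
    # root.iter() walks the tree in document order, so stretches arrive in sequence.
    for el in root.iter():
        if el.tag in ("hit", "more"):
            pieces[el.get("hitNum")].append("".join(el.itertext()))
    return {num: " ".join("".join(parts).split()) for num, parts in pieces.items()}


hits = collect_hits(MARKED)
print(hits)  # {'1': 'plant yearned for human flesh to fill its bottomless'}
```

Grouping by hitNum rather than by nesting is what lets the stretches be reunited even though they sit in different parent elements.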


Special Rules for Marking Stop Words

When stop words are part of the query, special rules are applied when marking them in snippets and in their original document context: