Term and Hit Marking
When a meta-data field, snippet, or marked-up document is fed into a crossQuery or dynaXML formatting stylesheet, XTF inserts XML tags that mark matched terms in context and delimit the extent of each full-text match, also called a "text hit".
XTF Definition:
text hit, n. A consecutive span of words in a document that matches a text query. (Note that there may be many hits per document.)
This section covers details of which terms are marked where, what "snippets" are and how they are formed, and how hits are marked, both within snippets and in their original context.
Snippet Formation and Marking
Hits found in the full text of a document are, of course, surrounded by other text that can help the user decide if the hit is useful to them.
XTF provides this context by calculating a "snippet" for each hit within a document.
XTF Definition:
snippet, n. A section of the source text or a meta-data field, surrounding and including a match (or "hit").
A query may specify the target length (in characters) of a snippet, and the system will get as close as it can to that length without exceeding it. The default is 80 characters.
The process used to form snippets is fairly simple:
- First, the Text Engine locates the matching text and surrounds it with a <hit> tag.
- Next, the engine adds words found in the source document before and after the hit, until it cannot add any more without exceeding the specified snippet size. It attempts to equalize the amount of context added before vs. after the hit.
- Finally, each matching term is marked with a <term> tag.
Note that many hits will contain words that aren't part of the query and thus won't be marked with <term>. For instance, a search for "dog NEAR skeleton NEAR bone" on the text "The dog chewed on the skeleton's leg bone." would yield:
The <hit><term>dog</term> chewed on the <term>skeleton's</term> leg <term>bone</term></hit>
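The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not XTF's actual Text Engine code: the function name form_snippet, the whitespace tokenizer, the crude possessive handling, and the rule that a hit runs from the first matching word to the last are all assumptions made for the example.

```python
import re

def form_snippet(text, query_terms, max_chars=80):
    """Hypothetical helper: return text around the hit, tagged like a snippet."""
    words = text.split()
    terms = {t.lower() for t in query_terms}

    def core(word):
        # Lowercase, trim surrounding punctuation, drop a possessive "'s".
        w = re.sub(r"^\W+|\W+$", "", word.lower())
        return w[:-2] if w.endswith("'s") else w

    matched = [i for i, w in enumerate(words) if core(w) in terms]
    if not matched:
        return None
    # Assumption: the hit runs from the first matching word to the last.
    first, last = matched[0], matched[-1]

    def plain_len(a, b):
        # Length in characters of words[a..b], not counting any tags.
        return len(" ".join(words[a:b + 1]))

    lo, hi = first, last
    while True:
        # Try the side with fewer context words first, to keep both sides balanced.
        sides = ("left", "right") if first - lo <= hi - last else ("right", "left")
        for side in sides:
            if side == "left" and lo > 0 and plain_len(lo - 1, hi) <= max_chars:
                lo -= 1
                break
            if side == "right" and hi < len(words) - 1 and plain_len(lo, hi + 1) <= max_chars:
                hi += 1
                break
        else:
            break  # neither side can grow without exceeding max_chars

    out = []
    for i in range(lo, hi + 1):
        piece = words[i]
        if core(piece) in terms:
            # Keep leading/trailing punctuation outside the <term> tag.
            lead, body, trail = re.match(r"^(\W*)(.*?)(\W*)$", piece).groups()
            piece = f"{lead}<term>{body}</term>{trail}"
        if i == first:
            piece = "<hit>" + piece
        if i == last:
            piece += "</hit>"
        out.append(piece)
    return " ".join(out)

print(form_snippet("The dog chewed on the skeleton's leg bone.",
                   ["dog", "skeleton", "bone"]))
```

Because this sketch tokenizes on whitespace, the trailing period of "bone." ends up inside the <hit> element; the real engine's tokenization and term matching are more sophisticated than this.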
Snippets are provided to the Result Formatter stylesheet in crossQuery for displaying a summary of hits in all documents, and to the Document Formatter stylesheet in dynaXML to display a ranked list of snippets in a single document.
Hits in their Original Context
In addition to snippets, dynaXML also inserts <hit> tags and <term> tags into the original XML document before feeding it to the Document Formatter stylesheet. This allows the stylesheet to display the document contents with the hits and terms highlighted in their original context.
The tags are identical to those contained in snippets, except that no surrounding text needs to be inserted, since the hits are marked in their original context. Additional attributes are added to each hit, giving its score, rank, and hit number.
Note that query terms are marked everywhere they appear in the document text, not just within <hit> markers. This gives the stylesheet the option of highlighting them inside and/or outside hits.
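To illustrate the choice this gives a formatter, here is a small Python sketch that separates terms found inside hits from terms found elsewhere. A real Document Formatter would be an XSLT stylesheet; the sample fragment, the <p> wrapper, and the function name split_terms are made up for the example, and the tags are written without a namespace prefix.

```python
import xml.etree.ElementTree as ET

# Made-up sample of a marked-up document fragment (not produced by XTF).
SAMPLE_DOC = (
    '<p>A <term>dog</term> barked. The <hit hitNum="1"><term>dog</term> '
    'chewed on the <term>bone</term></hit>.</p>'
)

def split_terms(xml_text):
    """Return (terms inside hits, terms outside hits) in document order."""
    inside, outside = [], []

    def walk(elem, in_hit):
        # <more> continues a hit that spans element boundaries.
        in_hit = in_hit or elem.tag in ("hit", "more")
        if elem.tag == "term":
            (inside if in_hit else outside).append(elem.text)
        for child in elem:
            walk(child, in_hit)

    walk(ET.fromstring(xml_text), False)
    return inside, outside

inside, outside = split_terms(SAMPLE_DOC)
print(inside, outside)  # ['dog', 'bone'] ['dog']
```

A formatter could, for instance, render the first list in bold and the second with a lighter highlight, or ignore terms outside hits entirely.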
Spanning XML Tags
A complication arises when a hit crosses the boundary between two XML elements in the original source text.
XTF takes care not to alter the structure of the document, so hit tags in this circumstance are inserted in a special way. Essentially, the hit is divided into several stretches of unbroken text. The first stretch is marked with an <xtf:hit> tag whose continues attribute is set to "yes". Each subsequent stretch is marked with an <xtf:more> tag; its continues attribute is "yes" for every stretch except the last, where it is "no".
The goal is to allow the Document Formatter stylesheet to present a seamless interface to the end-user, who is probably unaware of the underlying structure of a document and is only concerned with where the hits fall.
For example, say we had the following source text containing <i> to mark an italicized section:
- The hungry plant yearned for <i>human flesh</i> to fill its bottomless gullet.
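If the user searched for "plant NEAR human", they would expect results something like this, with the hit ("plant yearned for human") shown as one continuous highlighted span and the italics preserved:
- The hungry plant yearned for human flesh to fill its bottomless gullet.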
To support this sort of behavior, XTF would mark up the source document like this (broken into multiple lines and indented for clarity):
The hungry
<hit hitNum="1" continues="yes">
    <term>plant</term> yearned for
</hit>
<i>
    <more hitNum="1" continues="no">
        <term>human</term>
    </more> flesh
</i>
to fill its bottomless gullet.
Let's consider another example. If the user searched for "plant NEAR bottomless", the resulting hit would span the entire <i> element. Reasonable results from the Document Formatter, with the whole hit from "plant" through "bottomless" highlighted, could be:
- The hungry plant yearned for human flesh to fill its bottomless gullet.
Here is how XTF would mark up this example:
The hungry
<hit hitNum="1" continues="yes">
    <term>plant</term> yearned for
</hit>
<i>
    <more hitNum="1" continues="yes">
        human flesh
    </more>
</i>
<more hitNum="1" continues="no">
    to fill its <term>bottomless</term>
</more> gullet.
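A consumer of this markup can stitch the stretches of a hit back together by grouping on hitNum. The Python sketch below shows one way to do that. It assumes unprefixed tag names as in the examples above, wraps the fragment in a <p> root element so that it parses, and joins stretches with a single space; collect_hits and SAMPLE are names invented for the illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The second example above, wrapped in an assumed <p> root element.
SAMPLE = (
    '<p>The hungry '
    '<hit hitNum="1" continues="yes"><term>plant</term> yearned for </hit>'
    '<i><more hitNum="1" continues="yes">human flesh</more></i>'
    '<more hitNum="1" continues="no"> to fill its <term>bottomless</term></more>'
    ' gullet.</p>'
)

def collect_hits(xml_text):
    """Map each hitNum to the full text of that hit, across element boundaries."""
    pieces = defaultdict(list)
    # iter() walks the tree in document order, so stretches arrive in sequence.
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag in ("hit", "more"):
            pieces[elem.get("hitNum")].append("".join(elem.itertext()))
    # Join the stretches and normalize runs of whitespace.
    return {num: " ".join(" ".join(parts).split()) for num, parts in pieces.items()}

print(collect_hits(SAMPLE))
# {'1': 'plant yearned for human flesh to fill its bottomless'}
```

Grouping by hitNum rather than relying on adjacency keeps the logic simple even when unrelated elements sit between two stretches of the same hit.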
Special Rules for Marking Stop Words
When stop words are part of the query, special rules are applied when marking them in snippets and in their original document context:
- Within a snippet, stop words are only marked with a <term> tag when they appear within a <hit>, and then only as part of an adjoining non-stop-word that actually contributed to the match.
- Likewise, when the original XML document is marked up for the Document Formatter, stop words are only marked within a <hit>, and only as part of an adjoining term that contributed to the match.