org.cdlib.xtf.textIndexer
Class XtfSpecialTokensFilter
Object
TokenStream
TokenFilter
XtfSpecialTokensFilter
public class XtfSpecialTokensFilter
- extends TokenFilter
The XtfSpecialTokensFilter class is used by the XTFTextAnalyzer class to
convert special "bump" count values in text chunks to actual position
increments for words prior to adding them to a Lucene index.
Lucene adds words to an index database by first converting a contiguous
chunk of text into a list of discrete words (tokens, in Lucene parlance).
Then, when the Lucene IndexWriter.addDocument() function is called, Lucene
traverses the list of tokens and calls an instance of a TokenFilter-derived
class to pre-process each token. The resulting output from the filter is
what Lucene actually adds to the database.
Each token entry in the list consists of the token (word) itself and its
position increment from the previous token (referred to as "word bump" in
other text indexer related classes). Since a special bump count value in
the original text looks like any other token to Lucene, Lucene simply passes
it on to the XtfSpecialTokensFilter to pre-process. The filter recognizes
the special token, removes it from the token list, converts it to a number,
and sets that number as the position increment for the first non-special
token that follows. The output of the XtfSpecialTokensFilter is then a list
of actual tokens to be indexed and their associated position increments.
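The conversion described above can be illustrated with a minimal, Lucene-free sketch. It is not the XTF implementation: the class name BumpFilterSketch, the OutToken type, and the "bump=" prefix are all hypothetical stand-ins, since the real marker format is not shown in this description. The sketch also assumes that an ordinary token has a position increment of 1 and that, if several bump tokens appear in a row, the last one wins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: models the bump-to-position-increment conversion
// without depending on Lucene or on the real XTF marker format.
class BumpFilterSketch {
    // Hypothetical prefix identifying a special bump token (an assumption).
    static final String BUMP_PREFIX = "bump=";

    // A token paired with its position increment from the previous token.
    static final class OutToken {
        final String text;
        final int positionIncrement;

        OutToken(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    // Removes bump tokens and applies each bump value to the first
    // non-special token that follows it.
    static List<OutToken> filter(List<String> tokens) {
        List<OutToken> out = new ArrayList<>();
        int nextIncrement = 1; // assumed default increment for ordinary tokens
        for (String token : tokens) {
            if (token.startsWith(BUMP_PREFIX)) {
                // Special token: convert to a number and hold it for the next word.
                nextIncrement = Integer.parseInt(token.substring(BUMP_PREFIX.length()));
                continue; // the bump token itself is never emitted
            }
            out.add(new OutToken(token, nextIncrement));
            nextIncrement = 1; // reset once the bump has been applied
        }
        return out; // a trailing bump with no following word is dropped
    }

    public static void main(String[] args) {
        for (OutToken t : filter(Arrays.asList("quick", "bump=5", "fox", "jumps"))) {
            System.out.println(t.text + " +" + t.positionIncrement);
        }
    }
}
```

In this sketch, the input "quick", "bump=5", "fox", "jumps" yields three output tokens, with "fox" carrying a position increment of 5 and the other two carrying 1.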
For more information on word bump and virtual words, see the
XMLTextProcessor class and its member function insertVirtualWords().
Field Summary |
private String |
srcText
A reference to the original contiguous text to which the input token list
corresponds. |
Fields inherited from class TokenFilter |
input |
Constructor Summary |
XtfSpecialTokensFilter(TokenStream srcTokens,
String srcText)
Constructor for the XtfSpecialTokensFilter . |
Method Summary |
Token |
next()
Return the next output token from this filter. |
Methods inherited from class TokenFilter |
close |
Methods inherited from class Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
srcText
private String srcText
- A reference to the original contiguous text to which the input token list
corresponds. See the constructor for more about how this reference is used.
XtfSpecialTokensFilter
public XtfSpecialTokensFilter(TokenStream srcTokens,
String srcText)
- Constructor for the XtfSpecialTokensFilter.
- Parameters:
srcTokens
- The source token stream to filter.
srcText
- The original source text chunk from which the source
token stream was derived.
- Notes:
- This class stores a reference to the original chunk of text from which
the source token stream is derived. This is so that the filter can
perform look-back and look-ahead operations to identify special tokens
by their markers. This is necessary because the standard tokenizer
that creates the source token stream for this filter considers our
markers to be punctuation rather than part of the token, and strips
them out.
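As a rough illustration of the look-back and look-ahead idea, the sketch below checks the characters in the source text immediately before and after a token's offsets. It is only a sketch under stated assumptions: the class name MarkerLookupSketch, the isSpecial helper, and the '{' and '}' marker characters are hypothetical, since the actual XTF markers are not given here. It also assumes each token carries start and end character offsets into the source text.

```java
// Illustrative only: shows how a filter might consult the original text
// to recover markers that a tokenizer stripped as punctuation.
class MarkerLookupSketch {
    // Hypothetical marker characters (assumptions, not the real XTF markers).
    static final char OPEN_MARKER = '{';
    static final char CLOSE_MARKER = '}';

    // Returns true if the token spanning [start, end) in srcText is
    // immediately preceded and followed by the marker characters.
    static boolean isSpecial(String srcText, int start, int end) {
        // Look back: the character just before the token.
        boolean markedBefore = start > 0 && srcText.charAt(start - 1) == OPEN_MARKER;
        // Look ahead: the character just after the token.
        boolean markedAfter = end < srcText.length() && srcText.charAt(end) == CLOSE_MARKER;
        return markedBefore && markedAfter;
    }

    public static void main(String[] args) {
        String src = "one {5} two";
        System.out.println(isSpecial(src, 5, 6));  // token "5"
        System.out.println(isSpecial(src, 8, 11)); // token "two"
    }
}
```

Here the tokenizer would hand the filter a bare "5", indistinguishable from a word that happens to be a number; only the surrounding characters in the original text reveal that it was marked as special.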
next
public Token next()
throws IOException
- Return the next output token from this filter.
Called by Lucene to retrieve the next non-special token from this filter.
- Specified by:
next
in class TokenStream
- Returns:
- The next non-special token output by this filter.
- Throws:
IOException
- Any exceptions generated by the look-back/look-ahead
character processing performed by this function.
- Notes:
- For more information about the filtering performed by this function,
see the XtfSpecialTokensFilter class description.