org.cdlib.xtf.textIndexer
Class XTFTextAnalyzer

Object
  extended by Analyzer
      extended by XTFTextAnalyzer

public class XTFTextAnalyzer
extends Analyzer

The XTFTextAnalyzer class performs the task of breaking up a contiguous chunk of text into a list of separate words (tokens in Lucene parlance). Lucene then iterates through the resulting list of words and adds them to its database for an index.

Within this analyzer, there are six main phases:

Tokenizing
The first phase is the conversion of the contiguous text into a list of separate tokens. This step is performed by the FastTokenizer class, which uses a set of rules to separate words in western text from the spacing and punctuation that normally accompany them. The FastTokenizer class uses the same basic tokenizing rules as the Lucene StandardAnalyzer class, but has been optimized for speed.
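The basic idea can be sketched in plain Java. This is not the actual FastTokenizer (whose rules also handle cases like acronyms and hyphenation); it is a minimal illustration of rule-based word separation:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal illustrative tokenizer: keeps runs of letters and digits as
// tokens and discards the spacing and punctuation between them. The real
// FastTokenizer applies a richer rule set; this only conveys the idea.
public class SimpleTokenizer {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                cur.append(c);             // still inside a word
            } else if (cur.length() > 0) {
                tokens.add(cur.toString()); // word boundary reached
                cur.setLength(0);
            }
        }
        if (cur.length() > 0)
            tokens.add(cur.toString());
        return tokens;
    }
}
```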

Special Token Filtering
In the process of creating chunks of text for indexing, the Text Indexer program inserts virtual words and other special tokens that help it to relate the chunks of text stored in the Lucene index back to its original XML source text. The XTFTextAnalyzer looks for those special tokens, removes them from the token list, and translates them into position increments for the first non-special tokens they precede. For more about special token filtering, see the XtfSpecialTokensFilter class.
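The remove-and-translate step can be sketched as follows. The marker character and token representation here are hypothetical stand-ins, not the actual XtfSpecialTokensFilter conventions; the point is that each dropped special token widens the position gap before the next real token, so phrase positions stay aligned with the source:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative special-token filtering: tokens starting with a
// (hypothetical) marker character are removed, and each removal adds one
// to the position increment carried by the next real token.
public class SpecialTokenSketch {
    static final char MARKER = '\u00A6';  // hypothetical special-token marker

    public static class Tok {
        public final String text;
        public final int posIncrement;
        public Tok(String text, int posIncrement) {
            this.text = text;
            this.posIncrement = posIncrement;
        }
    }

    public static List<Tok> filter(List<String> in) {
        List<Tok> out = new ArrayList<>();
        int pending = 1;  // increment to assign to the next real token
        for (String t : in) {
            if (!t.isEmpty() && t.charAt(0) == MARKER) {
                pending++;                // special token: drop, widen gap
            } else {
                out.add(new Tok(t, pending));
                pending = 1;
            }
        }
        return out;
    }
}
```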

Lowercase Conversion
The next step performed by the XTFTextAnalyzer is to convert all the remaining tokens in the token list to lowercase. Converting both indexed words and search phrases to lowercase has the effect of making searches case insensitive.

Plural and Accent Folding
Next the XTFTextAnalyzer converts plural words to singular form using a WordMap, and strips diacritics from words using a CharMap. These conversions can yield more complete search results.
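Conceptually, the two foldings look like this. The actual WordMap and CharMap load their mappings from configuration files (see IndexInfo.pluralMapPath and IndexInfo.accentMapPath); here a tiny hard-coded map and standard Unicode normalization stand in for them:

```java
import java.text.Normalizer;
import java.util.Map;

// Illustrative plural and accent folding. A hard-coded plural map and
// NFD normalization stand in for the configurable WordMap/CharMap.
public class FoldingSketch {
    static final Map<String, String> PLURALS =
        Map.of("cats", "cat", "children", "child");

    public static String fold(String word) {
        String singular = PLURALS.getOrDefault(word, word);
        // Strip diacritics: decompose, then drop combining marks.
        return Normalizer.normalize(singular, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }
}
```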

Stop-Word Filtering
The next step performed by the XTFTextAnalyzer is to remove certain words called stop-words. Stop-words are words that by themselves are not worth indexing, such as a, the, and, of, etc. These words appear so many times in English text that indexing all their occurrences just slows down searching for other words without providing any real value. Consequently, they are filtered out of the token list.

It should be noted, however, that while stop-words are filtered, they are not simply omitted from the database. This is because stop-words do impart special meaning when they appear in certain phrases or titles. For example, in Man of War the word of doesn't simply act as a preposition, but rather helps form the common name for a type of jellyfish. Similarly, the word and in the phrase black and white doesn't simply join black and white, but forms a phrase meaning a condition where no ambiguity exists. In these cases it is important to preserve the stop-words, because ignoring them would produce undesired matches. For example, in a search for the words "man of war" (meaning the jellyfish), ignoring stop-words would produce "man and war", "man in war", and "man against war" as undesired matches.

To record stop-words in special phrases without slowing searching, the XTFTextAnalyzer performs an operation called bi-gramming as part of its stop-word filtering. For more details about how bi-grams actually work, see the BigramStopFilter class.
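The pairing idea behind bi-gramming can be sketched as follows. Instead of indexing a stop-word on its own, it is glued to its neighboring real words, so "man of war" yields compound tokens that preserve the phrase while keeping the bare stop-word out of the index. The separator character is a hypothetical choice, and the real BigramStopFilter also manages token positions; consult that class for the actual behavior:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative bi-gramming: real words pass through unchanged; each
// stop-word is emitted only glued to its neighbors.
public class BigramSketch {
    public static List<String> bigram(List<String> words, Set<String> stopSet) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String w = words.get(i);
            if (!stopSet.contains(w)) {
                out.add(w);  // real word: index as-is
            } else {
                if (i > 0 && !stopSet.contains(words.get(i - 1)))
                    out.add(words.get(i - 1) + "~" + w);  // pair with previous
                if (i + 1 < words.size())
                    out.add(w + "~" + words.get(i + 1));  // pair with next
            }
        }
        return out;
    }
}
```

With this sketch, a phrase query for "man of war" can match the compound tokens exactly, while a query for just "of" finds nothing, since the bare stop-word was never indexed.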

Adding End Tokens
As a final step, the analyzer double-indexes the first and last tokens of fields that contain the special start-of-field and end-of-field characters. Essentially, those tokens are indexed with and without the markers. This enables exact matching at query time, since Lucene offers no other way to determine the end of a field. Note that this processing is only performed on non-text fields (i.e., meta-data fields).
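The double-indexing step can be sketched like this. The marker characters chosen here are hypothetical placeholders for the actual start-of-field and end-of-field characters; the point is simply that a marked token is emitted twice, once stripped and once as-is:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative end-token handling for meta-data fields: any token carrying
// a (hypothetical) start/end marker is indexed both without and with the
// marker, enabling exact whole-field matches at query time.
public class EndTokenSketch {
    static final char START = '\u2039', END = '\u203A';  // hypothetical markers

    public static List<String> addEndTokens(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            boolean marked = t.indexOf(START) >= 0 || t.indexOf(END) >= 0;
            if (marked)
                out.add(t.replace(String.valueOf(START), "")
                         .replace(String.valueOf(END), ""));  // plain form
            out.add(t);  // original form, markers intact
        }
        return out;
    }
}
```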

Once the XTFTextAnalyzer has completed its work, it returns the final list of tokens back to Lucene to be added to the index database.


Field Summary
private  CharMap accentMap
          The set of accented chars to remove diacritics from
private  HashSet facetFields
          List of fields marked as "facets", which thus receive special tokenization.
private  HashSet misspelledFields
          List of fields marked as possibly misspelled, which thus don't get added to the spelling correction dictionary.
private  WordMap pluralMap
          The set of words to change from plural to singular
private  SpellWriter spellWriter
          If building a spelling correction dictionary, this is the writer
private  String srcText
          A reference to the contiguous source text block to be tokenized and filtered.
private  Set stopSet
          The list of stop-words currently set for this filter.
 
Constructor Summary
XTFTextAnalyzer(Set stopSet, WordMap pluralMap, CharMap accentMap)
          Constructor.
 
Method Summary
 void addFacetField(String fieldName)
          Mark a field as a "facet field" that will receive special tokenization to deal with hierarchy.
 void addMisspelledField(String fieldName)
          Mark a field as a "misspelled field" that won't be added to the spelling correction dictionary.
 void clearFacetFields()
          Clears the list of fields marked as facets.
 void clearMisspelledFields()
          Clears the list of fields marked as misspelled.
 void setSpellWriter(SpellWriter writer)
          Sets a writer to receive tokenized words just before they are indexed.
 TokenStream tokenStream(String fieldName, Reader reader)
          Convert a chunk of contiguous text to a list of tokens, ready for indexing.
 
Methods inherited from class Analyzer
getPositionIncrementGap
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

stopSet

private Set stopSet
The list of stop-words currently set for this filter.


pluralMap

private WordMap pluralMap
The set of words to change from plural to singular


accentMap

private CharMap accentMap
The set of accented chars to remove diacritics from


srcText

private String srcText
A reference to the contiguous source text block to be tokenized and filtered. (Used by the tokenStream() method to read the source text for filter operations in random access fashion.)


facetFields

private HashSet facetFields
List of fields marked as "facets", which thus receive special tokenization.


misspelledFields

private HashSet misspelledFields
List of fields marked as possibly misspelled, which thus don't get added to the spelling correction dictionary.


spellWriter

private SpellWriter spellWriter
If building a spelling correction dictionary, this is the writer

Constructor Detail

XTFTextAnalyzer

public XTFTextAnalyzer(Set stopSet,
                       WordMap pluralMap,
                       CharMap accentMap)
Constructor.

This constructor creates an XTFTextAnalyzer and initializes its member variables.

Parameters:
stopSet - The set of stop-words to be used when filtering text. For more information about stop-words, see the XTFTextAnalyzer class description.
pluralMap - The set of plural words to de-pluralize when filtering text. See IndexInfo.pluralMapPath for more information.
accentMap - The set of accented chars to remove diacritics from when filtering text. See IndexInfo.accentMapPath for more information.
Notes:
Use this constructor to initialize an instance of XTFTextAnalyzer and pass it to a Lucene IndexWriter instance. Lucene will then call the tokenStream() method each time a chunk of text is added to the index.

Method Detail

clearFacetFields

public void clearFacetFields()
Clears the list of fields marked as facets. Facet fields receive special tokenization.


addFacetField

public void addFacetField(String fieldName)
Mark a field as a "facet field" that will receive special tokenization to deal with hierarchy.

Parameters:
fieldName - Name of the field to consider a facet field.

clearMisspelledFields

public void clearMisspelledFields()
Clears the list of fields marked as misspelled. Misspelled fields are not added to the spelling correction dictionary.


addMisspelledField

public void addMisspelledField(String fieldName)
Mark a field as a "misspelled field" that won't be added to the spelling correction dictionary.

Parameters:
fieldName - Name of the field to consider a misspelled field.

setSpellWriter

public void setSpellWriter(SpellWriter writer)
Sets a writer to receive tokenized words just before they are indexed. Use this to build a spelling correction dictionary at index time.

Parameters:
writer - The writer to add words to

tokenStream

public TokenStream tokenStream(String fieldName,
                               Reader reader)
Convert a chunk of contiguous text to a list of tokens, ready for indexing.

Specified by:
tokenStream in class Analyzer
Parameters:
fieldName - The name of the Lucene database field that the resulting tokens will be placed in. Used to decide which filters need to be applied to the text.
reader - A Reader object from which the source text may be read.
Returns:
A filtered TokenStream containing the tokens that should be indexed by the Lucene database.