org.apache.lucene.spelt
Class SpellReader

Object
  extended by SpellReader

public class SpellReader
extends Object

Reads a spelling dictionary created by SpellWriter, and provides fast single- and multi-word spelling suggestions. Typical usage:

  1. First, open a new reader.
  2. For each potentially misspelled query, gather the keywords and get suggestions for them.
  3. When done with all queries, close() the reader.

Inspired by, and very distantly based on, code by Nicolas Maisonneuve and David Spencer.

Author:
Martin Haye

Nested Class Summary
private  class SpellReader.Phrase
          Track an ordered group of words.
private  class SpellReader.Word
          Keeps track of a single word, either an original or suggested word.
private static class SpellReader.WordQueue
          Queue of words, ordered by score and then frequency
 
Field Summary
private  PrintWriter debugWriter
          Where to send debugging info (or null for none)
private  CharsetDecoder edMapDecoder
          Charset decoder for reading edit map entries
private  RandomAccessFile edMapFile
          File for reading edit map entries
private  IntList edMapKeys
          Keys in the edit map file
private  IntList edMapPosns
          Positions in the edit map file
private  int[] freqSamples
          Frequencies from the term data, sampled at 5 levels
private  FreqData pairFreqs
          Pair frequency data
(package private)  File spellDir
          The spell index directory
private  Pattern splitPat
          Pattern used for splitting up lines delimited by bars
private  Set stopSet
          Set of stop-words to use during spell correction, or null for none
private  WordEquiv wordEquiv
          Word equivalency checker
private  FreqData wordFreqs
          Word frequency data
 
Constructor Summary
private SpellReader()
          Private constructor -- use open(File) instead.
 
Method Summary
private  String calcMetaphone(String word)
           
 void close()
          Closes any open files and/or resources associated with the SpellReader
private  int comboChar(int c)
           
private  int comboKey(String word, int p0, int p1, int p2, int p3)
          Calculate a four letter key for the given word, by sticking together characters from the given positions.
protected  void finalize()
           
private  void findCloseWords(SpellReader.Word orig, int minFreq, SpellReader.WordQueue queue)
          Find words "close" to the given one, and add them to a queue.
 boolean inDictionary(String word)
          Check if the given word is in the spelling dictionary
static boolean isValidDictionary(File spellDir)
          Check if there's a valid dictionary in the given directory
private  void loadFreqSamples()
          Get the term frequency sample array for our dictionary.
private  void loadWordFreqs()
          Load the word frequency data for our dictionary.
private  SpellReader.Phrase max(SpellReader.Phrase orig, SpellReader.Phrase test)
          Return the better of two phrases (an original phrase vs. a test phrase).
static SpellReader open(File spellIndexDir)
          Open a reader for the given spelling index directory.
private  void openEdmap()
          Read the index for the edit map file
private  void openPairFreqs()
           
private  boolean readEdKey(SpellReader.Word orig, int key, int minFreq, LongSet checked, SpellReader.WordQueue queue)
          Read the list of edit-map words for the given 4-character key.
private  float scorePair(SpellReader.Word sugg1, SpellReader.Word sugg2)
          Calculate a score for a suggested replacement for a given word.
 void setDebugWriter(PrintWriter w)
          Establishes a destination for detailed debugging output
 void setStopwords(Set set)
          Establishes a list of stopwords (e.g. "the", "and", "an", etc.).
 void setWordEquiv(WordEquiv eq)
          Establishes a word equivalency checker.
private  SpellReader.Phrase subJoin(SpellReader.Phrase in, int pos1, int pos2)
          Consider joining the first two words together
private  SpellReader.Phrase subPair(SpellReader.Phrase in, int pos1, int pos2)
          Consider a set of changes to the pair of words at the given position.
private  SpellReader.Phrase subPairs(SpellReader.Phrase in)
          Consider pair-wise changes at each position.
private  SpellReader.Phrase subSplit(SpellReader.Phrase in, int pos)
          Consider splitting a word
private  SpellReader.Phrase subWord(SpellReader.Phrase in, int pos)
          Substitute a single word at the given position, trying to improve the score.
 String[] suggestKeywords(String[] terms)
          Keyword-oriented spelling suggestion mechanism.
private  SpellReader.Word[] suggestSimilar(SpellReader.Word word, int numSugg, int minFreq)
          Suggest similar words to a given original word.
 String[] suggestSimilar(String str, int numSugg)
          Suggest similar words to a given original word, but not including the word itself.
 
Methods inherited from class Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

spellDir

File spellDir
The spell index directory


edMapKeys

private IntList edMapKeys
Keys in the edit map file


edMapPosns

private IntList edMapPosns
Positions in the edit map file


edMapFile

private RandomAccessFile edMapFile
File for reading edit map entries


edMapDecoder

private CharsetDecoder edMapDecoder
Charset decoder for reading edit map entries


pairFreqs

private FreqData pairFreqs
Pair frequency data


wordFreqs

private FreqData wordFreqs
Word frequency data


freqSamples

private int[] freqSamples
Frequencies from the term data, sampled at 5 levels


debugWriter

private PrintWriter debugWriter
Where to send debugging info (or null for none)


splitPat

private final Pattern splitPat
Pattern used for splitting up lines delimited by bars


stopSet

private Set stopSet
Set of stop-words to use during spell correction, or null for none


wordEquiv

private WordEquiv wordEquiv
Word equivalency checker

Constructor Detail

SpellReader

private SpellReader()
Private constructor -- use open(File) instead.

Method Detail

isValidDictionary

public static boolean isValidDictionary(File spellDir)
Check if there's a valid dictionary in the given directory


open

public static SpellReader open(File spellIndexDir)
                        throws IOException
Open a reader for the given spelling index directory. By default, the reader does no stop-word processing and uses the default word equivalency (case-insensitive comparison only). If a stopword set was used when building the dictionary, you must supply the same set by calling setStopwords(Set). To specify a non-default word equivalency, call setWordEquiv(WordEquiv).

Parameters:
spellIndexDir - directory containing the spelling dictionary
Throws:
IOException

setStopwords

public void setStopwords(Set set)
Establishes a list of stopwords (e.g. "the", "and", "an", etc.). This list should be identical to that which was used to create the dictionary.

Parameters:
set - Set of stop-words; all should be lower-case.
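
Because every stop-word is expected to be lower-case, it can help to normalize the set before handing it over. The following is a minimal sketch, not part of the library; the class name StopwordSets and the method lowerCased are hypothetical helpers introduced only for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of SpellReader) that builds a lower-cased stop-word set. */
final class StopwordSets {
    private StopwordSets() {}

    /** Returns a new set containing each given word converted to lower case. */
    static Set<String> lowerCased(String... words) {
        Set<String> set = new HashSet<>();
        for (String w : words) {
            // Locale.ROOT avoids locale-specific surprises such as the Turkish dotless i.
            set.add(w.toLowerCase(Locale.ROOT));
        }
        return set;
    }
}
```

The resulting set could then be passed to setStopwords(Set), provided it matches the set used when the dictionary was built.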

setWordEquiv

public void setWordEquiv(WordEquiv eq)
Establishes a word equivalency checker. This is used to prevent the correction algorithm from making suggestions that won't change the query result. For instance, if words in the main index are all converted from plural to singular, it would be silly for the checker to suggest "cats" to replace "cat".

Parameters:
eq - the equivalency checker to use
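
To illustrate the idea of equivalency filtering, here is a small self-contained sketch. It does not use the real WordEquiv type, whose interface is not shown on this page; instead it assumes a java.util.function.BiPredicate as a stand-in, and the plural-stripping rule is a deliberately naive, hypothetical example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.BiPredicate;

/** Illustration only: a stand-in for an equivalency check, not the library's WordEquiv. */
final class EquivFilterSketch {
    private EquivFilterSketch() {}

    /** Hypothetical rule: words are equivalent if they match ignoring case and one trailing "s". */
    static final BiPredicate<String, String> PLURAL_INSENSITIVE =
        (a, b) -> stem(a).equals(stem(b));

    private static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        boolean plural = lower.length() > 1 && lower.endsWith("s");
        return plural ? lower.substring(0, lower.length() - 1) : lower;
    }

    /** Keeps only the candidates that are NOT equivalent to the original word. */
    static List<String> dropEquivalents(String original,
                                        List<String> candidates,
                                        BiPredicate<String, String> equiv) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (!equiv.test(original, candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

Under this rule, "cats" and "Cat" would both be discarded as replacements for "cat", since neither would change the query result.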

openEdmap

private void openEdmap()
                throws IOException
Read the index for the edit map file

Throws:
IOException

close

public void close()
           throws IOException
Closes any open files and/or resources associated with the SpellReader

Throws:
IOException

setDebugWriter

public void setDebugWriter(PrintWriter w)
Establishes a destination for detailed debugging output


readEdKey

private boolean readEdKey(SpellReader.Word orig,
                          int key,
                          int minFreq,
                          LongSet checked,
                          SpellReader.WordQueue queue)
                   throws IOException
Read the list of edit-map words for the given 4-character key.

Parameters:
orig - the original word being considered
key - the 4-char key to look up
minFreq - minimum frequency of words to be queued
checked - set of words that have already been considered
queue - receives the resulting words
Returns:
true iff the key was found
Throws:
IOException

findCloseWords

private void findCloseWords(SpellReader.Word orig,
                            int minFreq,
                            SpellReader.WordQueue queue)
                     throws IOException
Find words "close" to the given one, and add them to a queue. Here, "close" means approximately that the first six characters are within an edit distance of 2. More precisely, we iterate over all possible 4-letter keys that can be constructed by deleting two of the first six characters of the word, and for each key we add all words that share it.

Throws:
IOException
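
The key enumeration described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: keys are plain strings rather than packed ints, and words shorter than six characters are padded with spaces, which is a choice made here only to keep the sketch simple. The class name EditKeySketch is hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Sketch of "delete two of the first six characters" key enumeration. */
final class EditKeySketch {
    private static final int PREFIX_LEN = 6;

    private EditKeySketch() {}

    /** Returns every distinct 4-character key for the given word. */
    static Set<String> keysFor(String word) {
        // Take the first six characters; pad short words with spaces (assumption).
        StringBuilder prefix = new StringBuilder(
            word.length() > PREFIX_LEN ? word.substring(0, PREFIX_LEN) : word);
        while (prefix.length() < PREFIX_LEN) {
            prefix.append(' ');
        }

        Set<String> keys = new LinkedHashSet<>();
        // Choose the two positions to delete: C(6,2) = 15 combinations.
        for (int skip1 = 0; skip1 < PREFIX_LEN; skip1++) {
            for (int skip2 = skip1 + 1; skip2 < PREFIX_LEN; skip2++) {
                StringBuilder key = new StringBuilder(4);
                for (int i = 0; i < PREFIX_LEN; i++) {
                    if (i != skip1 && i != skip2) {
                        key.append(prefix.charAt(i));
                    }
                }
                keys.add(key.toString());
            }
        }
        return keys;
    }
}
```

With this scheme, "spelling" and the misspelling "speling" both produce the key "spel", so a lookup on that key would bring either word in as a candidate for the other.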

comboKey

private int comboKey(String word,
                     int p0,
                     int p1,
                     int p2,
                     int p3)
Calculate a four letter key for the given word, by sticking together characters from the given positions.
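
One plausible way to "stick together" four characters into an int is to give each character 8 bits. The sketch below assumes exactly that layout, masks each character to its low byte, and substitutes a space for positions past the end of the word; all three choices are assumptions for illustration, and the actual packing used by this class is not documented here. The class name ComboKeySketch is hypothetical.

```java
/** Sketch only: packs four characters of a word into a single int key. */
final class ComboKeySketch {
    private ComboKeySketch() {}

    /** Returns the character at pos, or a space if pos is past the end (assumption). */
    private static int charAt(String word, int pos) {
        return pos < word.length() ? word.charAt(pos) : ' ';
    }

    /** Reduces a character to its low 8 bits so four of them fit in 32 bits (assumption). */
    private static int lowByte(int c) {
        return c & 0xFF;
    }

    /** Combines the characters at positions p0..p3, most significant byte first. */
    static int key(String word, int p0, int p1, int p2, int p3) {
        return (lowByte(charAt(word, p0)) << 24)
             | (lowByte(charAt(word, p1)) << 16)
             | (lowByte(charAt(word, p2)) << 8)
             |  lowByte(charAt(word, p3));
    }
}
```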


comboChar

private int comboChar(int c)

inDictionary

public boolean inDictionary(String word)
                     throws IOException
Check if the given word is in the spelling dictionary

Throws:
IOException

suggestSimilar

public String[] suggestSimilar(String str,
                               int numSugg)
                        throws IOException
Suggest similar words to a given original word, but not including the word itself.

Throws:
IOException

suggestSimilar

private SpellReader.Word[] suggestSimilar(SpellReader.Word word,
                                          int numSugg,
                                          int minFreq)
                                   throws IOException
Suggest similar words to a given original word. A minimum frequency limit is enforced.

Throws:
IOException

suggestKeywords

public String[] suggestKeywords(String[] terms)
                         throws IOException
Keyword-oriented spelling suggestion mechanism. For an ordered list of terms, come up with suggestions that have a good chance of improving the precision and/or recall.

Parameters:
terms - Ordered list of query terms
Returns:
An array with one entry per term. An entry equal to the original term means there was no better suggestion for it. A null entry means it is suggested that the term be deleted. If the returned array itself is null, there were no suggestions at all.
Throws:
IOException
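
The return conventions above can be handled with a small helper. This is a minimal sketch, not part of the library: the class SuggestionMerger and its apply method are hypothetical, and the suggestions argument stands for whatever array suggestKeywords(terms) returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that turns a suggestKeywords-style result into a corrected term list. */
final class SuggestionMerger {
    private SuggestionMerger() {}

    /**
     * @param terms       the original query terms
     * @param suggestions one entry per term, or null if there were no suggestions at all
     * @return the terms to use for the corrected query
     */
    static List<String> apply(String[] terms, String[] suggestions) {
        if (suggestions == null) {
            // No suggestions at all: keep the query as it was.
            return new ArrayList<>(Arrays.asList(terms));
        }
        List<String> corrected = new ArrayList<>();
        for (String suggestion : suggestions) {
            // A null entry means the term should be deleted, so it is skipped.
            if (suggestion != null) {
                corrected.add(suggestion);
            }
        }
        return corrected;
    }
}
```

A caller could compare the returned list with the original terms to decide whether to show a "did you mean" prompt.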

subWord

private SpellReader.Phrase subWord(SpellReader.Phrase in,
                                   int pos)
                            throws IOException
Substitute a single word at the given position, trying to improve the score.

Parameters:
in - the best we've done so far
pos - position to substitute at
Returns:
the best we can do at that position
Throws:
IOException

max

private SpellReader.Phrase max(SpellReader.Phrase orig,
                               SpellReader.Phrase test)
                        throws IOException
Return the better of two phrases (an original phrase vs. a test phrase). If a debug stream has been specified, output debug info too.

Throws:
IOException

subPairs

private SpellReader.Phrase subPairs(SpellReader.Phrase in)
                             throws IOException
Consider pair-wise changes at each position.

Throws:
IOException

subPair

private SpellReader.Phrase subPair(SpellReader.Phrase in,
                                   int pos1,
                                   int pos2)
                            throws IOException
Consider a set of changes to the pair of words at the given position.

Parameters:
in - the current best we've found
pos1 - first position to consider
pos2 - second position to consider
Returns:
new best
Throws:
IOException

subSplit

private SpellReader.Phrase subSplit(SpellReader.Phrase in,
                                    int pos)
                             throws IOException
Consider splitting a word

Throws:
IOException

subJoin

private SpellReader.Phrase subJoin(SpellReader.Phrase in,
                                   int pos1,
                                   int pos2)
                            throws IOException
Consider joining the first two words together

Throws:
IOException

scorePair

private float scorePair(SpellReader.Word sugg1,
                        SpellReader.Word sugg2)
                 throws IOException
Calculate a score for a suggested replacement for a given word.

Throws:
IOException

loadFreqSamples

private void loadFreqSamples()
                      throws IOException
Get the term frequency sample array for our dictionary.

Throws:
IOException

loadWordFreqs

private void loadWordFreqs()
                    throws IOException
Load the word frequency data for our dictionary.

Throws:
IOException

openPairFreqs

private void openPairFreqs()
                    throws IOException
Throws:
IOException

finalize

protected void finalize()
                 throws Throwable
Overrides:
finalize in class Object
Throws:
Throwable

calcMetaphone

private String calcMetaphone(String word)