org.apache.lucene.spelt
Class SpellWriter

Object
  extended by SpellWriter

public class SpellWriter
extends Object

Writes spelling dictionaries, which can later be used by SpellReader to obtain spelling suggestions. Provides efficient, high-volume updates to a spelling correction dictionary. Typical steps for creating a dictionary:

  1. First, open a new writer with open(File).
  2. Repeatedly queue words to be added to the dictionary with queueWord(String). This writes the words and word pairs to simple queue files on disk.
  3. Optionally flush the queued words with flushQueuedWords(), processing them into a final dictionary.
  4. Finally, close() the writer.
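
The calling sequence above can be sketched as follows. So that the example runs without the library on the classpath, it uses a hypothetical stand-in interface (DictionarySink) and a recording implementation (RecordingSink); neither is part of this package. They only mirror the names of four documented SpellWriter methods, and unlike the real methods they do not throw IOException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in mirroring four documented SpellWriter method names.
interface DictionarySink {
    void queueWord(String word);
    void queueBreak();
    void flushQueuedWords();
    void close();
}

// Records each call instead of writing a dictionary (illustration only).
class RecordingSink implements DictionarySink {
    final List<String> calls = new ArrayList<>();
    public void queueWord(String word) { calls.add("word:" + word); }
    public void queueBreak()           { calls.add("break"); }
    public void flushQueuedWords()     { calls.add("flush"); }
    public void close()                { calls.add("close"); }
}

class DictionaryWorkflow {
    // Steps 2-4: queue every word, mark sentence ends, flush once, then close.
    static void build(DictionarySink sink, List<List<String>> sentences) {
        try {
            for (List<String> sentence : sentences) {
                for (String word : sentence) {
                    sink.queueWord(word);
                }
                sink.queueBreak(); // avoid pairing words across sentences
            }
            sink.flushQueuedWords();
        } finally {
            sink.close(); // always release the writer, even on failure
        }
    }

    public static void main(String[] args) {
        RecordingSink sink = new RecordingSink();
        build(sink, Arrays.asList(
                Arrays.asList("hello", "world"),
                Arrays.asList("goodbye")));
        System.out.println(sink.calls);
    }
}
```

With the real class, the writer would come from SpellWriter.open(File) (step 1), and the calls would need to handle IOException.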

Inspired by and very distantly based on Nicolas Maisonneuve / David Spencer code.

Author:
Martin Haye

Field Summary
private static int DEFAULT_MIN_PAIR_FREQ
          Default minimum pair frequency = 2
private static int DEFAULT_MIN_WORD_FREQ
          Default minimum word frequency = 2
private static DoubleMetaphone doubleMetaphone
          Used for calculating double metaphone keys
private  StringBuffer edmapBuf
          String buffer for edmap pairs
private  File edmapFile
          File containing edit map data
private  File freqFile
          File containing compiled word frequencies
private  char[] keyChars
          Character array for forming combo keys
private static int MAX_RECENT_PAIRS
          Max # of pairs to hash before flushing
private static int MAX_RECENT_WORDS
          How large to make the cache of recently added words
private  int minPairFreq
          Minimum frequency for pairs to retain
private  int minWordFreq
          Minimum frequency for words to retain
private  File pairFreqFile
          File containing compiled pair frequency data
private  File pairQueueFile
          File to queue pairs into
private  PrintWriter pairQueueWriter
          For writing to the pair queue
private  String prevWord
          The previous word queued, or null if none (or a break was queued)
private  HashMap<String,Integer> recentPairs
          For counting pair frequencies prior to write
private  HashMap<String,Integer> recentWords
          For counting word frequencies prior to write
private  File sampleFile
          File containing frequency sample data
private  int SORT_MEM_LIMIT
          Memory limit for sorting
private  File spellIndexDir
          Directory to store the spelling dictionary in
(package private)  Pattern splitPat
          Used for splitting lines delimited with bar
private  Set stopSet
          Set of stop words in use; default is null for no stop set
private  File wordQueueFile
          File to queue words into
private  PrintWriter wordQueueWriter
          For writing to the word queue
 
Constructor Summary
private SpellWriter()
          Private constructor -- do not construct directly; rather, use the static open(File) method.
 
Method Summary
private  void addCombo(String word, FileSorter edmapSorter, int p0, int p1, int p2, int p3)
          Add a combination of letters to the edit map
private  void addCombos(String word, FileSorter edMapSorter)
          Add combinations of the first six letters of the word, capturing all the possibilities that represent an edit distance of 2 or less.
 boolean anyWordsQueued()
          Check if any words are queued for add.
static String calcMetaphone(String word)
           
 void clearDictionary()
          Delete all words in the dictionary (including those queued on disk)
 void close()
          Closes all files.
private  void closeQueueWriters()
          Closes the queue writers if either is open
private  char comboChar(char c)
           
private  char[] comboKey(String word, int p0, int p1, int p2, int p3)
          Calculate a key from the given characters of the word.
private  void condenseEdmapKey(String key, ArrayList<String> words, Writer out)
          Perform prefix compression on a list of words for a single edit map key.
private  void deleteFile(File file)
          Attempt to delete (and at least truncate) the given file.
protected  void finalize()
           
private  void flushPhase1(ProgressTracker prog)
          Performs the word-adding phase of the flush procedure.
private  void flushPhase2(ProgressTracker prog)
          Performs the pair-adding phase of the flush procedure.
 void flushQueuedWords()
          Ensures that all words in the queue are written to the dictionary on disk.
 void flushQueuedWords(ProgressTracker prog)
          Ensures that all words in the queue are written to the dictionary on disk.
private  void flushRecentPairs()
          Flush any accumulated pairs, with their counts.
private  void flushRecentWords()
          Flush any accumulated words, with their counts.
static SpellWriter open(File spellIndexDir)
          Creates a SpellWriter, and establishes the directory to store the dictionary in.
private  void openInternal(File spellIndexDir)
          Establishes the directory to store the dictionary in.
private  void openPairQueueWriter()
          Opens the pair queue writer.
private  void openWordQueueWriter()
          Opens the word queue writer.
 void queueBreak()
          Called to signal a break in the text, to inform the spell checker to avoid pairing the previous word with the next one.
 void queueWord(String word)
          Queue the given word.
private  void readFreqs(File inFile, FileSorter out, ProgressTracker prog)
          Read an existing frequency file, and add it to a file sorter.
private  void replaceFile(File oldFile, File newFile)
          Replace an old file with a new one
 void setMinPairFreq(int freq)
          Establish a minimum pair frequency.
 void setMinWordFreq(int freq)
          Establish a minimum word frequency.
 void setStopwords(Set set)
          Establishes a set of stop words (e.g. "the", "and", "a", etc.) to receive special handling.
private  void writeEdMap(FileSorter edmapSorter, File outFile, ProgressTracker prog)
          Write out a prefix-compressed edit-distance map, which also contains term frequencies.
private  void writeFreqs(File outFile, FileSorter freqSorter, IntList allFreqs, FileSorter edmapSorter, ProgressTracker prog)
          Write out frequency data, in sorted order.
private  void writeFreqSamples(IntList allFreqs, File file, ProgressTracker prog)
          Write term frequency samples to the given file.
 
Methods inherited from class Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

spellIndexDir

private File spellIndexDir
Directory to store the spelling dictionary in


stopSet

private Set stopSet
Set of stop words in use; default is null for no stop set


wordQueueFile

private File wordQueueFile
File to queue words into


prevWord

private String prevWord
The previous word queued, or null if none (or a break was queued)


pairQueueFile

private File pairQueueFile
File to queue pairs into


freqFile

private File freqFile
File containing compiled word frequencies


sampleFile

private File sampleFile
File containing frequency sample data


edmapFile

private File edmapFile
File containing edit map data


pairFreqFile

private File pairFreqFile
File containing compiled pair frequency data


wordQueueWriter

private PrintWriter wordQueueWriter
For writing to the word queue


pairQueueWriter

private PrintWriter pairQueueWriter
For writing to the pair queue


MAX_RECENT_WORDS

private static final int MAX_RECENT_WORDS
How large to make the cache of recently added words

See Also:
Constant Field Values

recentWords

private HashMap<String,Integer> recentWords
For counting word frequencies prior to write


MAX_RECENT_PAIRS

private static final int MAX_RECENT_PAIRS
Max # of pairs to hash before flushing

See Also:
Constant Field Values

recentPairs

private HashMap<String,Integer> recentPairs
For counting pair frequencies prior to write


DEFAULT_MIN_WORD_FREQ

private static final int DEFAULT_MIN_WORD_FREQ
Default minimum word frequency = 2

See Also:
Constant Field Values

minWordFreq

private int minWordFreq
Minimum frequency for words to retain


DEFAULT_MIN_PAIR_FREQ

private static final int DEFAULT_MIN_PAIR_FREQ
Default minimum pair frequency = 2

See Also:
Constant Field Values

minPairFreq

private int minPairFreq
Minimum frequency for pairs to retain


doubleMetaphone

private static DoubleMetaphone doubleMetaphone
Used for calculating double metaphone keys


splitPat

Pattern splitPat
Used for splitting lines delimited with bar


SORT_MEM_LIMIT

private int SORT_MEM_LIMIT
Memory limit for sorting


keyChars

private char[] keyChars
Character array for forming combo keys


edmapBuf

private StringBuffer edmapBuf
String buffer for edmap pairs

Constructor Detail

SpellWriter

private SpellWriter()
Private constructor -- do not construct directly; rather, use the static open(File) method.

Method Detail

open

public static SpellWriter open(File spellIndexDir)
                        throws IOException
Creates a SpellWriter, and establishes the directory to store the dictionary in. If you want stop-words to be recognized and discarded (especially important if the dictionary will be large), call setStopwords(Set) after opening a writer. The minimum word frequency defaults to 2; to override it, call setMinWordFreq(int). A similar threshold exists for pairs: the minimum pair frequency also defaults to 2, and can be overridden with setMinPairFreq(int).

Parameters:
spellIndexDir - Directory in which to store the spelling dictionary
Throws:
IOException

openInternal

private void openInternal(File spellIndexDir)
                   throws IOException
Establishes the directory to store the dictionary in.

Throws:
IOException

setStopwords

public void setStopwords(Set set)
Establishes a set of stop words (e.g. "the", "and", "a", etc.) to receive special handling. This can significantly decrease the size of the dictionary.

Parameters:
set - the set of stop words to use

setMinWordFreq

public void setMinWordFreq(int freq)
Establish a minimum word frequency. When the in-memory cache is flushed to disk (every 20,000 words or so), words with a frequency below this threshold are discarded; those at or above it are written to the disk queue.

Parameters:
freq - the new minimum word frequency
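
A minimal sketch of the cache-and-threshold behavior described above, not the class's actual code. The class name WordCountCache, the list standing in for the disk queue, and the bar-delimited "word|count" line format are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: a bounded word-count cache with a frequency cutoff.
class WordCountCache {
    private final int maxEntries; // cache size that triggers a flush (assumed)
    private final int minFreq;    // counts below this are discarded on flush
    private final Map<String, Integer> recent = new HashMap<>();
    final List<String> queue = new ArrayList<>(); // stand-in for the disk queue

    WordCountCache(int maxEntries, int minFreq) {
        this.maxEntries = maxEntries;
        this.minFreq = minFreq;
    }

    // Count one occurrence; flush when the cache reaches its size limit.
    void add(String word) {
        recent.merge(word, 1, Integer::sum);
        if (recent.size() >= maxEntries) {
            flush();
        }
    }

    // Keep entries at or above the threshold, drop the rest, empty the cache.
    void flush() {
        for (Map.Entry<String, Integer> e : recent.entrySet()) {
            if (e.getValue() >= minFreq) {
                queue.add(e.getKey() + "|" + e.getValue());
            }
        }
        recent.clear();
    }

    public static void main(String[] args) {
        WordCountCache cache = new WordCountCache(100, 2);
        for (String w : "the cat saw the other cat once".split(" ")) {
            cache.add(w);
        }
        cache.flush();
        System.out.println(cache.queue); // only words seen at least twice remain
    }
}
```

One consequence of this design is that a rare word is counted per cache window: occurrences split across two flushes can each fall below the threshold and be dropped.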

setMinPairFreq

public void setMinPairFreq(int freq)
Establish a minimum pair frequency. When the in-memory cache is flushed to disk (every 200,000 pairs or so), pairs with a frequency below this threshold are discarded; those at or above it are written to the disk queue.

Parameters:
freq - the new minimum pair frequency

close

public void close()
           throws IOException
Closes all files. Does NOT write queued words (they stay queued on disk).

Throws:
IOException

clearDictionary

public void clearDictionary()
                     throws IOException
Delete all words in the dictionary (including those queued on disk)

Throws:
IOException

queueWord

public void queueWord(String word)
               throws IOException
Queue the given word. The queue can later be flushed by calling flushQueuedWords(); this is typically put off until the end of an indexing run.

Throws:
IOException

queueBreak

public void queueBreak()
Called to signal a break in the text, to inform the spell checker to avoid pairing the previous word with the next one. This should be called at the start or end of a section or field, and at the start or end of each sentence.
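
The effect of a break on pairing can be sketched as below. This is an illustration based on the prevWord field description ("null if none, or a break was queued"); the class name PairTracker and the "first|second" pair format are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of how a break prevents pairing across boundaries.
class PairTracker {
    private String prevWord = null; // null at the start or right after a break
    final List<String> pairs = new ArrayList<>();

    // Pair the word with its predecessor, if there is one.
    void queueWord(String word) {
        if (prevWord != null) {
            pairs.add(prevWord + "|" + word);
        }
        prevWord = word;
    }

    // Forget the previous word so the next word starts a fresh sequence.
    void queueBreak() {
        prevWord = null;
    }

    public static void main(String[] args) {
        PairTracker t = new PairTracker();
        t.queueWord("new");
        t.queueWord("york");
        t.queueBreak();
        t.queueWord("city");
        t.queueWord("hall");
        System.out.println(t.pairs); // no "york|city" pair is recorded
    }
}
```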


flushRecentPairs

private void flushRecentPairs()
                       throws IOException
Flush any accumulated pairs, with their counts. For efficiency, skip any pair that appeared only once.

Throws:
IOException

flushRecentWords

private void flushRecentWords()
                       throws IOException
Flush any accumulated words, with their counts.

Throws:
IOException

anyWordsQueued

public boolean anyWordsQueued()
                       throws IOException
Check if any words are queued for add.

Throws:
IOException

flushQueuedWords

public void flushQueuedWords()
                      throws IOException
Ensures that all words in the queue are written to the dictionary on disk. Note that this can take quite some time; if you want to print out progress messages during the process, use flushQueuedWords(ProgressTracker) below.

Throws:
IOException

flushQueuedWords

public void flushQueuedWords(ProgressTracker prog)
                      throws IOException
Ensures that all words in the queue are written to the dictionary on disk.

Parameters:
prog - A tracker that will be called periodically during the process; generally you'll want to supply one that prints out progress messages. If null, no progress will be reported.
Throws:
IOException

flushPhase1

private void flushPhase1(ProgressTracker prog)
                  throws IOException
Performs the word-adding phase of the flush procedure.

Throws:
IOException - if something goes wrong

readFreqs

private void readFreqs(File inFile,
                       FileSorter out,
                       ProgressTracker prog)
                throws IOException
Read an existing frequency file, and add it to a file sorter.

Throws:
IOException

writeFreqs

private void writeFreqs(File outFile,
                        FileSorter freqSorter,
                        IntList allFreqs,
                        FileSorter edmapSorter,
                        ProgressTracker prog)
                 throws IOException
Write out frequency data, in sorted order.

Throws:
IOException
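
As a rough illustration of combining frequency records in sorted order, the sketch below sorts bar-delimited lines by word and sums the counts of adjacent duplicates. The "word|count" line format and the class name FreqMerger are assumptions, and an in-memory sort stands in for FileSorter, whose API is not documented on this page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration: merge "word|count" lines into one line per word.
class FreqMerger {
    static final Pattern SPLIT = Pattern.compile("\\|"); // bar delimiter

    static List<String> merge(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        // Sort by the word field so duplicate words become adjacent.
        sorted.sort(Comparator.comparing((String l) -> SPLIT.split(l)[0]));

        List<String> out = new ArrayList<>();
        String curWord = null;
        int curCount = 0;
        for (String line : sorted) {
            String[] parts = SPLIT.split(line);
            String word = parts[0];
            int count = Integer.parseInt(parts[1]);
            if (!word.equals(curWord)) {
                if (curWord != null) {
                    out.add(curWord + "|" + curCount); // emit the finished word
                }
                curWord = word;
                curCount = 0;
            }
            curCount += count;
        }
        if (curWord != null) {
            out.add(curWord + "|" + curCount); // emit the last word
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("the|5");
        lines.add("cat|2");
        lines.add("the|1");
        System.out.println(merge(lines));
    }
}
```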

addCombos

private void addCombos(String word,
                       FileSorter edMapSorter)
                throws IOException
Add combinations of the first six letters of the word, capturing all the possibilities that represent an edit distance of 2 or less.

Throws:
IOException
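
One way to read this, given that addCombo takes four positions (p0 to p3), is that each key is built from four of the first six letters, so a misspelling within a couple of edits tends to share at least one key with the intended word. The sketch below enumerates those four-letter keys under that assumption. The class name ComboKeys and the fallback for words shorter than four letters are hypothetical, and any character normalization done by comboChar is not modeled.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration: keys made from 4 of the first 6 letters of a word.
class ComboKeys {
    static Set<String> keysFor(String word) {
        Set<String> keys = new LinkedHashSet<>();
        int n = Math.min(6, word.length());
        if (n < 4) {
            keys.add(word); // assumed fallback: too short to drop any letters
            return keys;
        }
        // Choose increasing positions p0 < p1 < p2 < p3 within the first n letters.
        for (int p0 = 0; p0 < n - 3; p0++) {
            for (int p1 = p0 + 1; p1 < n - 2; p1++) {
                for (int p2 = p1 + 1; p2 < n - 1; p2++) {
                    for (int p3 = p2 + 1; p3 < n; p3++) {
                        keys.add("" + word.charAt(p0) + word.charAt(p1)
                                    + word.charAt(p2) + word.charAt(p3));
                    }
                }
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        Set<String> shared = new LinkedHashSet<>(keysFor("planet"));
        shared.retainAll(keysFor("plaent")); // transposed letters
        System.out.println(shared);          // keys common to both spellings
    }
}
```

For a six-letter prefix with distinct letters this yields 15 keys (6 choose 4); repeated letters can produce fewer because duplicates collapse in the set.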

addCombo

private void addCombo(String word,
                      FileSorter edmapSorter,
                      int p0,
                      int p1,
                      int p2,
                      int p3)
               throws IOException
Add a combination of letters to the edit map

Throws:
IOException

comboKey

private char[] comboKey(String word,
                        int p0,
                        int p1,
                        int p2,
                        int p3)
Calculate a key from the given characters of the word.


comboChar

private char comboChar(char c)

writeFreqSamples

private void writeFreqSamples(IntList allFreqs,
                              File file,
                              ProgressTracker prog)
                       throws IOException
Write term frequency samples to the given file.

Throws:
IOException

writeEdMap

private void writeEdMap(FileSorter edmapSorter,
                        File outFile,
                        ProgressTracker prog)
                 throws IOException
Write out a prefix-compressed edit-distance map, which also contains term frequencies.

Throws:
IOException

condenseEdmapKey

private void condenseEdmapKey(String key,
                              ArrayList<String> words,
                              Writer out)
                       throws IOException
Perform prefix compression on a list of words for a single edit map key.

Throws:
IOException
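
Prefix compression of a sorted word list is commonly done by storing, for each word, how many leading characters it shares with the previous word plus the remaining suffix. The sketch below shows that general technique. The "count:suffix" record format and the class name FrontCoding are assumptions; the actual on-disk layout of the edit map is not specified on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of prefix (front) compression on sorted words.
class FrontCoding {
    // Encode each word as "<shared prefix length>:<remaining suffix>".
    static List<String> compress(List<String> sortedWords) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String w : sortedWords) {
            int max = Math.min(prev.length(), w.length());
            int shared = 0;
            while (shared < max && prev.charAt(shared) == w.charAt(shared)) {
                shared++;
            }
            out.add(shared + ":" + w.substring(shared));
            prev = w;
        }
        return out;
    }

    // Rebuild the words by reusing the prefix of the previously decoded word.
    static List<String> expand(List<String> coded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String c : coded) {
            int colon = c.indexOf(':'); // first colon ends the numeric count
            int shared = Integer.parseInt(c.substring(0, colon));
            String w = prev.substring(0, shared) + c.substring(colon + 1);
            out.add(w);
            prev = w;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("plan", "plane", "planet", "plant");
        List<String> coded = compress(words);
        System.out.println(coded);
        System.out.println(expand(coded));
    }
}
```

The scheme pays off when the input is sorted, since neighbors then share long prefixes; words grouped under a single edit map key are likely to be similar, which suits this kind of encoding.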

deleteFile

private void deleteFile(File file)
                 throws IOException
Attempt to delete (and at least truncate) the given file.

Throws:
IOException

replaceFile

private void replaceFile(File oldFile,
                         File newFile)
Replace an old file with a new one


flushPhase2

private void flushPhase2(ProgressTracker prog)
                  throws IOException
Performs the pair-adding phase of the flush procedure.

Throws:
IOException

openWordQueueWriter

private void openWordQueueWriter()
                          throws IOException
Opens the word queue writer.

Throws:
IOException

openPairQueueWriter

private void openPairQueueWriter()
                          throws IOException
Opens the pair queue writer.

Throws:
IOException

closeQueueWriters

private void closeQueueWriters()
                        throws IOException
Closes the queue writers if either is open

Throws:
IOException

calcMetaphone

public static String calcMetaphone(String word)

finalize

protected void finalize()
                 throws Throwable
Overrides:
finalize in class Object
Throws:
Throwable