Spelling Correction (under the hood)
Everybody likes Google's "did you mean" suggestions. Users often misspell words when they're querying an XTF index, and it would be nice if the system could catch the most obvious errors and automatically suggest an appropriate spelling correction. The XTF team did extensive work to create a fast and accurate facility for doing this, one that requires minimal effort from those deploying XTF.
In the following sections, we'll discuss the guts of XTF's spelling correction system, detailing some strategies that were considered and the final strategy selected, the methods that XTF uses to come up with high-quality suggestions, and how the dictionary is created and stored. If you're looking for information on configuring and using the system, see the
Programming Guide.
Choosing an Index-based Strategy
We considered three strategies for spelling correction, each deriving suggestions from a different kind of source data.
- Fixed Dictionary: Available software, such as GNU Aspell, makes spelling corrections based on one or more fixed, pre-constructed dictionaries. But the bibliographic data in our test bed and our sample queries from a live university library catalog were multilingual and contained a substantial proportion of proper nouns (e.g. names of people or places). This ruled out a fixed dictionary, since many of these proper nouns and foreign words wouldn't be present in any standard dictionary.
- Query Logs: In this method, we would compare the user's query to an extensive set of prior queries, and suggest alternative queries which resulted in more hits. Unfortunately, our test system had a limited number of users, and there was no feasible way to get hold of the millions of representative queries this strategy would require, so it was ruled out as well.
- Index-based Dictionary: Here we would make suggestions using an algorithm that draws on terms and term frequency data from the XTF index itself. It resembles the first approach except that the dictionary is dynamically derived from the actual documents and their text.
Because of the issues identified with the other strategies, we opted to pursue the index-based approach. We feel it is best for most collections in most situations, as it adapts to the documents most germane to the application and users, and doesn't require a long query history to become effective.
We set ourselves a goal of getting the correct suggestion in the #1 position 80% of the time, which seemed a good threshold for the system to be truly useful. With several iterations and many tweaks to the algorithm, we were able to achieve this goal for our sample data set and our sample queries (drawn from real query logs). We have a fair degree of confidence that the algorithm is quite general and should be applicable to many other data sets and types of queries.
Correction Algorithm
XTF builds a spelling correction dictionary at index-time. If a query is sent to
crossQuery and results in only a small number of hits (the threshold is configurable), XTF consults the dictionary to make an alternative spelling suggestion. Here is a brief outline of the algorithm for making a suggestion:
- For each word in the query, make a list of candidate suggestions.
- Rank the suggestions and pick the best one (details below).
- For multi-word queries, consider pair frequencies to improve quality.
- Suppress near-identical suggestions, and ensure that the final suggestion increases the number of hits.
Each of these will be considered in more detail below.
Which Words to Consider?
This algorithm relies (as most spelling algorithms do) on a sort of "shotgun" approach: for each potentially misspelled word in the query, the algorithm builds a long list of "candidate" words, that is, words that could be suggested as a correction for the original misspelled word. The list of candidates is then ranked using a scoring formula, and the top-ranked candidate is presented to the user.
One might naively consider every word in the dictionary as a candidate. Unfortunately, the cost of scoring each one becomes prohibitive once the dictionary grows beyond about ten thousand words. So a strategy is needed to quickly come up with a list of a few hundred pretty good candidates. XTF uses a novel approach that gives good speed and accuracy.
We began with a base of existing Java spelling correction code that had been contributed to the Lucene project (written by Nicolas Maisonneuve, based on code originally contributed by David Spencer). The base Lucene algorithm first breaks up the word we're looking for into 2, 3, or 4-character
n-grams (for instance, the word primer might end up as:
~pri prim rime imer mer~ ). Next, it performs a Lucene OR query on the dictionary (also built with n-grams), retaining the top 100 hits (where a hit represents a correctly spelled word that shares some n-grams with the target word). Finally, it ranks the suggestions according to their
edit distance to the original misspelled word. Those that are closest appear at the top. ("Edit distance" is a standard numerical measure of how far two words are from each other and is defined as the number of insert, replace, and delete operations needed to transform one word into the other.)
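For reference, a minimal implementation of that standard measure might look like the following. This is the textbook dynamic-programming version, not XTF's tuned variant (which, as discussed below, discounts transpositions and double-letter errors):

```java
// Classic Levenshtein edit distance: the number of insert, replace, and
// delete operations needed to transform one word into the other.
// (XTF's scoring uses a tuned variant with cheaper transpositions and
// double-letter errors; this is just the textbook algorithm.)
public static int editDistance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++)
        prev[j] = j;                     // cost of building b's prefix from ""
    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i;                     // cost of deleting a's prefix
        for (int j = 1; j <= b.length(); j++) {
            int replace = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
            int delete  = prev[j] + 1;
            int insert  = curr[j - 1] + 1;
            curr[j] = Math.min(replace, Math.min(delete, insert));
        }
        int[] tmp = prev; prev = curr; curr = tmp;   // roll the rows
    }
    return prev[b.length()];
}
```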
Unfortunately, the base method was quite slow, and often didn't find the correct word, especially in the case of short words. However, we made one critical observation: In perhaps 85-90% of the cases we examined, the correct word had an edit distance of 1 or 2 from the misspelled query word; another 5-10% had an edit distance of 3, and the extra edit was usually toward the end of the word. This observation intuitively rings true: the suggested word should be relatively "close" to the query word and those that are "far" away needn't even be considered.
Still, one wouldn't want to consider all possible one- and two-letter edits as it would still take a long time to check if all those possible edits were actually in the dictionary. Instead, XTF checks all words in which four of the first six characters match in order. This effectively checks for an edit distance of two or less at the start of the word.
Take for example the word
GLOBALISM. Here are the 15 keys that can be created by deleting two of the first six characters:
- OBAL (deleted characters 1 and 2)
- LBAL (deleted characters 1 and 3)
- LOAL (deleted characters 1 and 4)
- LOBL (deleted characters 1 and 5)
- LOBA (deleted characters 1 and 6)
- GBAL (deleted characters 2 and 3)
- GOAL (deleted characters 2 and 4)
- GOBL (deleted characters 2 and 5)
- GOBA (deleted characters 2 and 6)
- GLAL (deleted characters 3 and 4)
- GLBL (deleted characters 3 and 5)
- GLBA (deleted characters 3 and 6)
- GLOL (deleted characters 4 and 5)
- GLOA (deleted characters 4 and 6)
- GLOB (deleted characters 5 and 6)
So, XTF checks each of the 15 possible 4-letter keys for a given query word, and makes a merged list of all the words that share those same keys. This combined list is usually only a few hundred words long, and almost always contains within it the golden "correct" word we're looking for.
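A sketch of how those keys might be generated (hypothetical code following the description above; the real XTF code must also handle words shorter than six characters, which this sketch glosses over):

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Generate the "edit keys" for a word: take the first six characters and
// delete every combination of two of them, yielding 15 four-letter keys.
public static Set<String> editKeys(String word) {
    String prefix = word.length() > 6 ? word.substring(0, 6) : word;
    Set<String> keys = new LinkedHashSet<String>();
    for (int i = 0; i < prefix.length(); i++) {
        for (int j = i + 1; j < prefix.length(); j++) {
            StringBuilder key = new StringBuilder(prefix);
            key.deleteCharAt(j);   // delete the later position first so that
            key.deleteCharAt(i);   // index i is still valid
            keys.add(key.toString());
        }
    }
    return keys;
}

// editKeys("globalism") yields the 15 keys listed above:
// obal, lbal, loal, lobl, loba, gbal, goal, gobl, goba,
// glal, glbl, glba, glol, gloa, glob
```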
Ranking the Candidates
Given a list of candidate words, how does one find the "right" suggestion? This is the heart of most spelling algorithms and the area that needs the most "tweaking" to achieve good results. XTF's ranking algorithm is no exception, and makes decisions by assigning a score to each candidate word. The score is a sum of the following factors:
- Edit distance. The base score is calculated from the edit distance between the query word and the candidate word. The usual edit distance algorithm (which considers only insertion, deletion, and replacement) is supplemented by reducing the "cost" of transposition and double-letter errors. Those types of errors are very common spelling mistakes, and reducing their cost was shown by our testing to improve suggestion accuracy. In math terms, the base score is calculated as:
1.0 - (editDistance / queryWord.length).
- Metaphone. Originally developed by Lawrence Philips, the Double Metaphone algorithm is a simple phonetic mapping that transforms any English word into a 4-character code (called a "metaphone"). Words that have the same code are assumed to be phonetically similar. In XTF's spelling system, if the metaphone of the candidate word matches that of the query word, the score is nudged upward by 0.1. (Note: Philips' algorithm actually produces two codes, primary and secondary, for a given input word; XTF only compares the primary code.)
- First/last letter. If the first and last letters of the candidate word match those of the query word, then the score is nudged upward by 0.1. Again, testing showed that this approach boosted overall accuracy.
- Frequency. Because the indexes do contain misspelled words, we further boost the score if the suggested word is very common. The boost ranges from 0.01 to 0.2, and is arrived at by a slightly complex method, but basically more frequent candidate words get higher boosts. This has the effect of favoring common words over uncommon words (other factors being equal), which helps to avoid ridiculous suggestions.
The original query word itself is always considered as one of the candidates. This reduces the problem of suggesting replacements for correctly spelled words. However, the query word's score is reduced slightly, so that a very similar but more common word can still win out.
In summary, for a single-word query, the list of all candidates is ranked by score, and the one with the highest score wins and is considered the "best" suggestion for the user.
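Putting these factors together, candidate scoring might be sketched as follows. This is a simplified illustration using the boost values quoted above, not XTF's exact code; the edit distance and metaphone comparison are passed in precomputed (a Double Metaphone implementation such as the one in Apache Commons Codec could supply the latter), and the frequency boost is assumed to have already been derived from the word's frequency:

```java
// Rough sketch of the candidate score: a base score from edit distance,
// plus small boosts for a metaphone match, matching first/last letters,
// and high word frequency.
public static double scoreCandidate(String queryWord, String candidate,
                                    int editDistance, boolean metaphonesMatch,
                                    double frequencyBoost /* 0.01 .. 0.2 */) {
    // Base score: closer words (smaller edit distance) score higher.
    double score = 1.0 - ((double) editDistance / queryWord.length());

    // Phonetic similarity: same (primary) metaphone code.
    if (metaphonesMatch)
        score += 0.1;

    // Same first and last letter as the query word.
    if (!candidate.isEmpty() && !queryWord.isEmpty()
        && candidate.charAt(0) == queryWord.charAt(0)
        && candidate.charAt(candidate.length() - 1) == queryWord.charAt(queryWord.length() - 1))
        score += 0.1;

    // Favor common words over rare ones (other factors being equal).
    score += frequencyBoost;

    return score;
}
```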
Multi-word Correction
But what about queries with multiple words? Testing showed that considering each word of a phrase independently and concatenating the result got plenty of important cases wrong. For instance, consider the query
untied states. Actually, each of these words is spelled correctly, but it's clear the user probably meant to query for
united states. Also, consider the word
harrypotter... the best single-word replacement might be
hairy, but that's not what the user meant. How do we go beyond single-word suggestions?
We need to know more than just the frequency of the words in the index; we need to know how often they occur
together. So when XTF builds the spelling dictionary, it additionally tracks the frequency of each pair of words as they occur in sequence.
Using the pair frequency data, we can take a more sophisticated approach to multi-word correction. Specifically, XTF tries the following additional strategies to see if they yield better suggestions:
- For each single word (e.g. harrypotter) we consider splitting it into two words and checking whether any of the resulting pairs has a high frequency. For example we'd check <har rypotter>, <harr ypotter>, <harry potter>, etc., and get a solid hit on that last one. A "hit" of this kind translates into a score boost based on the pair frequency.
- For each consecutive pair of words, we fetch the top 100 single-word suggestions for both words in the pair. Then we consider all possible combinations of a word from the first list vs. a word from the second list (that's 10,000 combinations) to see if any of the combinations are high-frequency pairs. So in the example of <untied states>, one of the suggestions for untied will certainly be united and we'll discover that <united states> as a pair has a high frequency. So we boost the score for united, and it ends up winning out against the other candidates.
- Finally, we consider joining each pair of words together. So if a user accidentally queries for <uni lateralism>, we have a good chance of suggesting unilateralism.
All of these strategies are attempted and, just as in single-word candidate selection, each phrase candidate is scored and ranked. The highest scoring phrase wins.
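As a rough sketch of the first strategy (splitting a single word), the following hypothetical method tries every split point and returns the split whose halves form the most frequent known pair. The pairFreq map is only a stand-in for XTF's pair frequency table, keyed here by "word1 word2":

```java
import java.util.Map;

// Try every split point of a word (e.g. "harrypotter" -> "har rypotter",
// "harr ypotter", "harry potter", ...) and return the split whose two
// halves occur together most often, or null if no split forms a known pair.
public static String bestSplit(String word, Map<String, Integer> pairFreq) {
    String best = null;
    int bestFreq = 0;
    for (int i = 1; i < word.length(); i++) {
        String candidate = word.substring(0, i) + " " + word.substring(i);
        int freq = pairFreq.getOrDefault(candidate, 0);
        if (freq > bestFreq) {
            bestFreq = freq;
            best = candidate;
        }
    }
    return best;
}
```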
Sanity Checks
Despite all the above efforts, sometimes the spelling system makes bad suggestions. A couple of methods are used to minimize these.
First, a filter is applied to avoid suggesting "equivalent" words. In XTF, the indexer performs several mapping functions, such as converting plural words to singular words, and mapping accented characters to their unaccented equivalents. It would be silly for the spelling system to suggest
cat if the user entered
cats, even if
cat would normally be considered a better suggestion because it has higher frequency. The final suggestion is checked, and if it's equivalent (i.e. maps to the same words) to the user's initial query, no suggestion is made.
Second, it's quite possible for the spelling correction system to make a suggestion that will yield fewer results than the user's original query. While this isn't common, it happens often enough that it could be annoying. So after the spelling system comes up with a suggestion, XTF automatically runs the resulting modified query, and if fewer results are obtained, the suggestion is simply suppressed.
Dictionary Creation
Now we turn to the dictionary used by the algorithm above to produce suggestions. How is it created during the XTF indexing process? Let's find out.
Incremental Approach
One of the best features of the XTF indexer is incremental indexing, the ability to quickly add a few documents to an existing index. So we needed an incremental spelling dictionary build process to stay true to the indexer's design. To do this we phase the work: a low-overhead collection pass added to the main indexing process, followed by an intensive dictionary-generation phase that is optimized as much as possible.
During the main index run, XTF simply collects information on which words are present, their frequencies, and the frequency of pairs. Data is collected in a fairly small RAM cache and periodically sorted and written to disk. Two filters ensure that we avoid accumulating counts for rare words and rare word pairs: words that occur only once or twice are disregarded (though this limit is configurable), and likewise, pairs that occur only once or twice are not written to disk. Collection adds minimal CPU overhead to indexing.
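The collection pass can be pictured roughly as a bounded in-memory counter that spills sorted, filtered batches to disk. The following is only a simplified sketch (the WordCountCollector class, the 100,000-entry flush threshold, and the file handling are invented for illustration; XTF's actual collection code differs, and pair counts would be handled analogously):

```java
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Accumulate word counts in a bounded RAM cache and periodically spill a
// sorted batch to disk.
public class WordCountCollector {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();
    private final PrintWriter out;

    public WordCountCollector(PrintWriter out) { this.out = out; }

    public void addWord(String word) {
        counts.merge(word, 1, Integer::sum);
        if (counts.size() >= 100_000)   // cache full: spill to disk
            flush();
    }

    public void flush() {
        // Sort the batch so later merging of batches is cheap.
        for (Map.Entry<String, Integer> e : new TreeMap<String, Integer>(counts).entrySet()) {
            // Skip words seen only once or twice in this batch
            // (XTF's actual rare-word filter is configurable).
            if (e.getValue() > 2)
                out.println(e.getKey() + "|" + e.getValue());
        }
        counts.clear();
    }
}
```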
Then comes the dictionary creation phase, which processes the queued word and pair counts to form the final dictionary (this can optionally be delayed until another index run). Here are the processing steps:
- Read in the list of words and their frequencies, merging any existing list with new words added since the last dictionary build.
- Sort the word list and sum up frequency counts for identical words.
- Sample the frequency distribution of the words. This enables the correction algorithm to quickly decide how much boost a given word should receive by ranking its frequency in relation to the rest of the words in the dictionary.
- Create the "edit map" file. For each word, calculate each of the 15 edit keys, and add the word to the list for that key. Key calculation was discussed above. This edit map file is created and then sorted on disk with lists for duplicate entries being merged together.
- Finally, any existing pair data from a previous dictionary creation run is read in and then new pairs are added from the most recent main index phase. Unlike the other parts of dictionary creation, this work is all done in RAM, for reasons outlined in the next section covering data structures.
Data Structures
The data structures used to store the dictionary are motivated by the needs of the spelling correction algorithm. In particular, it needs the following information to make good suggestions:
- The list of words that are "similar" to a given word (i.e. whose beginning is an edit distance of 2 or less).
- The frequency of a given candidate word (i.e. how often that word occurs within the documents in an index.)
- The frequency of a given pair of words (i.e. how often the two words appear together, in sequence, in the index.)
To support these needs, the dictionary consists of three main data structures, each of which is discussed below.
Edit Map
Since at most 15 keys need to be read for a given input word, this data structure is mainly disk-based (it's never read entirely into RAM). The disk file consists of one line per 4-letter key, listing all the words that share that key. At the end of the file is an index giving the length (in bytes) for each key, so that the correction engine can quickly and randomly access the entries.
The words in each list are
prefix-compressed to conserve disk space. This is a way of compressing a list of words when many of them share prefixes. For example, say we have three words in the list for a key:
apple application aplomb
We always store the first word in its entirety; each word after that is stored as the number of characters it has in common with the previous word, plus the characters it doesn't share. For long lists of similar words, the compression becomes quite significant. In our example the compressed list is:
apple|4ication|2lomb
Here are some lines from a real edit map file (the 4-letter key begins each line):
- abrd|aboard|2road|2surd|6ist|7ty|6ly
- abre|abbreviated|9ions|0fabre|0laborer|7s
- abrg|aboriginal
- abrh|abraham|7son
- abri|aboriginal|4tion|8s|0cambridge|0fabric|6ated|0labyrinth
- ... and the random-access index at the end of the file ...
- abrd|37
- abre|42
- abrg|16
- abrh|18
- abri|61
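To make the format concrete, here is a small sketch (not XTF's own reader) that decodes one prefix-compressed line back into full words, assuming the key|word|NNsuffix|... layout shown above:

```java
import java.util.ArrayList;
import java.util.List;

// Decode one prefix-compressed edit map line, e.g.
//   "abrd|aboard|2road|2surd|6ist|7ty|6ly"
// back into: aboard, abroad, absurd, absurdist, absurdity, absurdly.
public static List<String> decodeEditMapLine(String line) {
    String[] fields = line.split("\\|");
    List<String> words = new ArrayList<String>();
    String prev = fields[1];               // fields[0] is the 4-letter key
    words.add(prev);
    for (int i = 2; i < fields.length; i++) {
        // Leading digits = number of characters shared with the previous word.
        int d = 0;
        while (d < fields[i].length() && Character.isDigit(fields[i].charAt(d)))
            d++;
        int shared = Integer.parseInt(fields[i].substring(0, d));
        prev = prev.substring(0, shared) + fields[i].substring(d);
        words.add(prev);
    }
    return words;
}
```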
Word Frequency Table
On disk this is stored as a simple text file with one line per word giving its frequency. The lines are sorted in ascending word order. This structure is read completely into RAM by the correction engine, as we need to potentially evaluate tens of thousands of candidate words per second in a high-volume application.
Here are some lines from a real word frequency file:
- aboard|10
- abolished|19
- abolishing|9
- abolish|8
- abolition|34
- abomination|6
Pair Frequency Table
The correction engine needs to check the frequency of hundreds of thousands of word pairs per second. This implies a need for extremely fast access, so we need to pull the entire data structure into RAM and search it very quickly.
The table exists only in binary (rather than text) form, in a very simple structure. A "hash code" is computed for each pair of words. The hash code is a 64-bit integer that does a good job of characterizing the pair; for two different pairs, the chance of getting the same hash code is vanishingly small. The structure consists of a large array of all the pairs' hash codes, sorted numerically, plus a frequency count per pair. This sorted data is amenable to a fast in-memory binary search.
Here's the disk layout:
- 8 bytes: Magic number (identifies this as a pair frequency file)
- 4 bytes: Total number of pairs in the file
- 8 bytes: Hash code of pair 1
- 4 bytes: ... and frequency of pair 1
- 8 bytes: Hash code of pair 2
- 4 bytes: ... and frequency of pair 2
- 8 bytes: Hash code of pair 3
- ... etc.
As you can see, we store each pair with exactly 12 bytes: 8 bytes for the 64-bit hash code, and 4 bytes for the 32-bit count. Working with fixed-size chunks makes the code simple, and also keeps the pair data file (and corresponding RAM footprint) relatively small.
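To illustrate, here is a hedged sketch of the in-memory table and its lookup. The parallel arrays mirror the 12-byte records above; the FNV-1a hash is only a stand-in, since the actual 64-bit hash function XTF uses isn't described here:

```java
import java.util.Arrays;

// Sketch of the in-memory pair frequency table: a sorted array of 64-bit
// pair hashes and a parallel array of 32-bit counts, searched with a
// binary search.
public class PairFreqTable {
    private final long[] hashes;   // sorted ascending
    private final int[] freqs;     // freqs[i] belongs to hashes[i]

    public PairFreqTable(long[] sortedHashes, int[] freqs) {
        this.hashes = sortedHashes;
        this.freqs = freqs;
    }

    // 64-bit FNV-1a hash of "word1|word2" (stand-in for XTF's pair hash).
    public static long pairHash(String word1, String word2) {
        long h = 0xcbf29ce484222325L;
        for (char c : (word1 + "|" + word2).toCharArray()) {
            h ^= c;
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Returns the pair's frequency, or 0 if the pair isn't in the table.
    public int frequency(String word1, String word2) {
        int pos = Arrays.binarySearch(hashes, pairHash(word1, word2));
        return pos >= 0 ? freqs[pos] : 0;
    }
}
```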