org.cdlib.xtf.textEngine
Class MoreLikeThisQuery

Object
  extended by Query
      extended by MoreLikeThisQuery
All Implemented Interfaces:
Serializable, Cloneable

public class MoreLikeThisQuery
extends Query

Runs the sub-query and takes its first matching document as the "target". It then determines the most "interesting" terms in the target document and queries on those terms to find more documents like the target. The target document itself is NOT included in the results.
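
The flow described above can be pictured with a small, self-contained sketch. It is not the XTF implementation: the class MoreLikeThisSketch, the in-memory list of token lists standing in for an index, and the tf * idf weighting are all assumptions made for illustration.

```java
import java.util.*;

/** Hypothetical in-memory stand-in for the index; not part of XTF. */
final class MoreLikeThisSketch {
    /** Returns ids of documents similar to the target, best first, never including the target. */
    static List<Integer> moreLikeThis(List<List<String>> docs, int targetDoc, int maxQueryTerms) {
        // 1. Count how often each term occurs in the target document.
        Map<String, Integer> termFreq = new HashMap<>();
        for (String term : docs.get(targetDoc)) {
            termFreq.merge(term, 1, Integer::sum);
        }

        // 2. Score each term by tf * idf (assumed weighting), so terms that are
        //    frequent in the target but rare elsewhere count as "interesting".
        Map<String, Double> termScore = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            int docFreq = 0;
            for (List<String> doc : docs) {
                if (doc.contains(e.getKey())) docFreq++;
            }
            double idf = Math.log((double) docs.size() / (docFreq + 1)) + 1.0;
            termScore.put(e.getKey(), e.getValue() * idf);
        }

        // 3. Keep only the best-scoring terms (ties broken alphabetically).
        List<String> sorted = new ArrayList<>(termScore.keySet());
        sorted.sort((a, b) -> {
            int byScore = Double.compare(termScore.get(b), termScore.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        List<String> top = sorted.subList(0, Math.min(maxQueryTerms, sorted.size()));

        // 4. Score every other document by the query terms it contains.
        Map<Integer, Double> hits = new HashMap<>();
        for (int doc = 0; doc < docs.size(); doc++) {
            if (doc == targetDoc) continue; // the target itself is never returned
            double score = 0;
            for (String term : top) {
                if (docs.get(doc).contains(term)) score += termScore.get(term);
            }
            if (score > 0) hits.put(doc, score);
        }
        List<Integer> ranked = new ArrayList<>(hits.keySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(hits.get(b), hits.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ranked;
    }
}
```

Calling moreLikeThis(docs, 0, 3) treats document 0 as the target and returns only the other documents that share at least one of its three best terms.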

See Also:
Serialized Form

Nested Class Summary
private static class MoreLikeThisQuery.Flt
          Used for scores and to avoid renewing Floats.
private static class MoreLikeThisQuery.Int
          Used for frequencies and to avoid renewing Integers.
 class MoreLikeThisQuery.MoreLikeWrapper
          Exclude the target document from the set.
private static class MoreLikeThisQuery.QueryWord
           
private static class MoreLikeThisQuery.QueryWordQueue
          PriorityQueue that orders query words by score.
 
Field Summary
private  CharMap accentMap
           
private  boolean boost
          Should we apply a boost to the Query based on the scores?
private  Map boostMap
          Boost values for the fields
private  float[] fieldBoosts
          Boost value per field.
private  String[] fieldNames
          Field name(s) we'll analyze.
private  int maxDocFreq
          Ignore words which occur in at least this many docs.
private  int maxNumTokensParsed
          The maximum number of tokens to parse in each example doc field that is not stored with TermVector support
private  int maxQueryTerms
          Don't return a query longer than this.
private  int maxWordLen
          Ignore words if greater than this len.
private  int minDocFreq
          Ignore words which do not occur in at least this many docs.
private  int minTermFreq
          Ignore words less frequent than this.
private  int minWordLen
          Ignore words if less than this len.
private  WordMap pluralMap
           
private  Similarity similarity
          For idf() calculations.
private  Set stopSet
           
private  Query subQuery
           
private  int targetDoc
           
 
Constructor Summary
MoreLikeThisQuery(Query subQuery)
          Constructs a more-like-this query whose target is the first document matched by subQuery.
 
Method Summary
private  void addTermFrequencies(TokenStream tokens, String field, Map termFreqMap)
          Adds the frequencies of the terms read from tokens to termFreqMap.
private  Map condenseTerms(IndexReader indexReader, Map words)
          Condense the same term in multiple fields into a single term with a total score.
private  Query createQuery(IndexReader indexReader, PriorityQueue q)
          Create the "more like" query from a PriorityQueue.
private  PriorityQueue createQueue(IndexReader indexReader, Map words)
          Create a PriorityQueue from a word->tf map.
 float[] getFieldBoosts()
           
 String[] getFieldNames()
           
 Query getSubQuery()
          Retrieve the sub-query
protected  boolean isNoiseWord(String term)
          Determines whether the passed term is a noise word, that is, unlikely to be of interest in "more like" comparisons.
private  PriorityQueue retrieveTerms(IndexReader indexReader, int docNum, Analyzer analyzer)
          Find words for a more-like-this query former.
 Query rewrite(IndexReader reader)
          Generate a query that will produce "more documents like" the first in the sub-query.
 void setAccentMap(CharMap map)
          Establish the accent map in use
 void setBoost(boolean boost)
          Should we apply a boost to the Query based on the scores?
 void setFieldBoosts(float[] fieldBoosts)
          Boost value per field
 void setFieldNames(String[] fieldNames)
          Field name(s) we'll analyze.
 void setMaxDocFreq(int maxDocFreq)
          Ignore words which occur in at least this many docs.
 void setMaxNumTokensParsed(int maxNumTokensParsed)
          The maximum number of tokens to parse in each example doc field that is not stored with TermVector support
 void setMaxQueryTerms(int maxQueryTerms)
          Don't return a query longer than this.
 void setMaxWordLen(int maxWordLen)
          Ignore words if greater than this len.
 void setMinDocFreq(int minDocFreq)
          Ignore words which do not occur in at least this many docs.
 void setMinTermFreq(int minTermFreq)
          Ignore words less frequent than this.
 void setMinWordLen(int minWordLen)
          Ignore words if less than this len.
 void setPluralMap(WordMap map)
          Establish the plural map in use
 void setStopWords(Set set)
          Establish the set of stop words to ignore
 void setSubQuery(Query subQuery)
          Set the sub-query
 String toString(String field)
          Prints a user-readable version of this query.
 
Methods inherited from class Query
clone, combine, createWeight, extractTerms, getBoost, getSimilarity, mergeBooleanQueries, setBoost, toString, weight
 
Methods inherited from class Object
equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

subQuery

private Query subQuery

targetDoc

private int targetDoc

stopSet

private Set stopSet

pluralMap

private WordMap pluralMap

accentMap

private CharMap accentMap

minTermFreq

private int minTermFreq
Ignore words less frequent than this.


minDocFreq

private int minDocFreq
Ignore words which do not occur in at least this many docs.


maxDocFreq

private int maxDocFreq
Ignore words which occur in at least this many docs.


boost

private boolean boost
Should we apply a boost to the Query based on the scores?


fieldNames

private String[] fieldNames
Field name(s) we'll analyze.


fieldBoosts

private float[] fieldBoosts
Boost value per field.


boostMap

private Map boostMap
Boost values for the fields


maxNumTokensParsed

private int maxNumTokensParsed
The maximum number of tokens to parse in each example doc field that is not stored with TermVector support


minWordLen

private int minWordLen
Ignore words if less than this len.


maxWordLen

private int maxWordLen
Ignore words if greater than this len.


maxQueryTerms

private int maxQueryTerms
Don't return a query longer than this.


similarity

private Similarity similarity
For idf() calculations.

Constructor Detail

MoreLikeThisQuery

public MoreLikeThisQuery(Query subQuery)
Constructs a more-like-this query. The first document matched by subQuery serves as the target document.

Method Detail

getSubQuery

public Query getSubQuery()
Retrieve the sub-query


setSubQuery

public void setSubQuery(Query subQuery)
Set the sub-query


setStopWords

public void setStopWords(Set set)
Establish the set of stop words to ignore


setPluralMap

public void setPluralMap(WordMap map)
Establish the plural map in use


setAccentMap

public void setAccentMap(CharMap map)
Establish the accent map in use


setMaxDocFreq

public void setMaxDocFreq(int maxDocFreq)
Ignore words which occur in at least this many docs.


setFieldNames

public void setFieldNames(String[] fieldNames)
Field name(s) we'll analyze.


getFieldNames

public String[] getFieldNames()

setFieldBoosts

public void setFieldBoosts(float[] fieldBoosts)
Boost value per field


getFieldBoosts

public float[] getFieldBoosts()

setMaxNumTokensParsed

public void setMaxNumTokensParsed(int maxNumTokensParsed)
The maximum number of tokens to parse in each example doc field that is not stored with TermVector support


setMaxQueryTerms

public void setMaxQueryTerms(int maxQueryTerms)
Don't return a query longer than this.


setMaxWordLen

public void setMaxWordLen(int maxWordLen)
Ignore words if greater than this len.


setMinDocFreq

public void setMinDocFreq(int minDocFreq)
Ignore words which do not occur in at least this many docs.


setMinTermFreq

public void setMinTermFreq(int minTermFreq)
Ignore words less frequent than this.


setMinWordLen

public void setMinWordLen(int minWordLen)
Ignore words if less than this len.


setBoost

public void setBoost(boolean boost)
Should we apply a boost to the Query based on the scores?


rewrite

public Query rewrite(IndexReader reader)
              throws IOException
Generate a query that will produce "more documents like" the first in the sub-query.

Overrides:
rewrite in class Query
Throws:
IOException

createQuery

private Query createQuery(IndexReader indexReader,
                          PriorityQueue q)
                   throws IOException
Create the "more like" query from a PriorityQueue.

Throws:
IOException
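
As a rough illustration of turning scored words into a query, the sketch below takes the best-first words, stops at maxQueryTerms, and, when boosting is on, weights each word by its score relative to the best score. The class CreateQuerySketch, the list-of-entries input, the relative-score boost, and the term^boost string output are assumptions; the real method builds a Query object from a PriorityQueue.

```java
import java.util.*;

/** Hypothetical sketch; not the XTF implementation. */
final class CreateQuerySketch {
    /**
     * Builds query clauses from words already ordered best first.
     * Assumption: the boost of a word is its score divided by the best score.
     */
    static List<String> createQuery(List<Map.Entry<String, Double>> bestFirst,
                                    int maxQueryTerms, boolean boost) {
        List<String> clauses = new ArrayList<>();
        double bestScore = -1;
        for (Map.Entry<String, Double> word : bestFirst) {
            if (clauses.size() >= maxQueryTerms) break; // don't return a longer query
            if (bestScore < 0) bestScore = word.getValue(); // first entry is the best
            if (boost && bestScore > 0) {
                clauses.add(word.getKey() + "^" + (word.getValue() / bestScore));
            } else {
                clauses.add(word.getKey());
            }
        }
        return clauses;
    }
}
```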

createQueue

private PriorityQueue createQueue(IndexReader indexReader,
                                  Map words)
                           throws IOException
Create a PriorityQueue from a word->tf map.

Parameters:
words - a map of words keyed on the word(String) with Int objects as the values.
Throws:
IOException
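
The sketch below shows one way a word->tf map could become a score-ordered queue. The thresholds follow the field descriptions for minTermFreq, minDocFreq and maxDocFreq; everything else is an assumption: the class TermQueueSketch, the ScoredWord type (a hypothetical counterpart of the private QueryWord class), the docFreq map standing in for the IndexReader, and the tf * idf score.

```java
import java.util.*;

/** Hypothetical sketch; not the XTF implementation. */
final class TermQueueSketch {
    /** A word with its score. */
    static final class ScoredWord {
        final String word;
        final double score;
        ScoredWord(String word, double score) {
            this.word = word;
            this.score = score;
        }
    }

    /** Builds a queue whose head is the highest-scoring word. */
    static PriorityQueue<ScoredWord> createQueue(Map<String, Integer> termFreq,
                                                 Map<String, Integer> docFreq,
                                                 int numDocs, int minTermFreq,
                                                 int minDocFreq, int maxDocFreq) {
        Comparator<ScoredWord> bestFirst = (a, b) -> Double.compare(b.score, a.score);
        PriorityQueue<ScoredWord> queue = new PriorityQueue<>(bestFirst);
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            int tf = e.getValue();
            if (tf < minTermFreq) continue;  // less frequent than minTermFreq
            int df = docFreq.getOrDefault(e.getKey(), 0);
            if (df < minDocFreq) continue;   // not in at least minDocFreq docs
            if (df >= maxDocFreq) continue;  // in at least maxDocFreq docs
            // Assumed idf formula: log(numDocs / (df + 1)) + 1.
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
            queue.add(new ScoredWord(e.getKey(), tf * idf));
        }
        return queue;
    }
}
```

Polling the queue repeatedly yields words from most to least interesting, which is the order a query builder would want when it has to stop after a fixed number of terms.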

condenseTerms

private Map condenseTerms(IndexReader indexReader,
                          Map words)
                   throws IOException
Condense the same term in multiple fields into a single term with a total score.

Parameters:
words - a map of words keyed on the word(String) with Int objects as the values.
Throws:
IOException
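
Condensing can be sketched as summing, for each term, the scores it received in every field. The class CondenseTermsSketch and the nested field -> term -> score map are assumptions chosen for illustration; the actual method works on an untyped Map and an IndexReader.

```java
import java.util.*;

/** Hypothetical sketch; not the XTF implementation. */
final class CondenseTermsSketch {
    /** Sums each term's per-field scores into a single total per term. */
    static Map<String, Double> condenseTerms(Map<String, Map<String, Double>> scoresByField) {
        Map<String, Double> totals = new TreeMap<>();
        for (Map<String, Double> fieldScores : scoresByField.values()) {
            for (Map.Entry<String, Double> e : fieldScores.entrySet()) {
                // Add this field's score to whatever the term has accumulated so far.
                totals.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return totals;
    }
}
```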

retrieveTerms

private PriorityQueue retrieveTerms(IndexReader indexReader,
                                    int docNum,
                                    Analyzer analyzer)
                             throws IOException
Find words for a more-like-this query former.

Parameters:
docNum - the id of the lucene document from which to find terms
Throws:
IOException

addTermFrequencies

private void addTermFrequencies(TokenStream tokens,
                                String field,
                                Map termFreqMap)
                         throws IOException
Adds the frequencies of the terms read from tokens to termFreqMap.

Parameters:
tokens - a source of tokens
field - Specifies the field being tokenized
termFreqMap - a Map of terms and their frequencies
Throws:
IOException
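
A minimal sketch of the counting step follows. It assumes tokens arrive as an Iterator of strings rather than a TokenStream, leaves out the field parameter, and applies the maxNumTokensParsed limit as a simple cap; the class name TermFrequencySketch is hypothetical.

```java
import java.util.*;

/** Hypothetical sketch; not the XTF implementation. */
final class TermFrequencySketch {
    /** Adds one count per token to termFreqMap, reading at most maxNumTokensParsed tokens. */
    static void addTermFrequencies(Iterator<String> tokens, int maxNumTokensParsed,
                                   Map<String, Integer> termFreqMap) {
        int parsed = 0;
        while (tokens.hasNext() && parsed < maxNumTokensParsed) {
            String word = tokens.next();
            parsed++;
            // Start at 1 for a new word, otherwise increment the existing count.
            termFreqMap.merge(word, 1, Integer::sum);
        }
    }
}
```

Because the map is passed in and updated in place, the same map can accumulate counts across several calls.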

isNoiseWord

protected boolean isNoiseWord(String term)
Determines whether the passed term is a noise word, that is, unlikely to be of interest in "more like" comparisons.

Parameters:
term - The word being considered
Returns:
true if the term should be ignored, false if it should be used in further analysis
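
One plausible reading of this check, based on the minWordLen, maxWordLen and stop-word settings, is sketched below. The class NoiseWordSketch is hypothetical, and treating a length bound of 0 as "no limit" is an assumption rather than documented behavior.

```java
import java.util.*;

/** Hypothetical sketch; not the XTF implementation. */
final class NoiseWordSketch {
    private final int minWordLen;
    private final int maxWordLen;
    private final Set<String> stopSet;

    NoiseWordSketch(int minWordLen, int maxWordLen, Set<String> stopSet) {
        this.minWordLen = minWordLen;
        this.maxWordLen = maxWordLen;
        this.stopSet = stopSet;
    }

    /** Returns true if the term should be ignored, false if it should be analyzed further. */
    boolean isNoiseWord(String term) {
        int len = term.length();
        // Assumption: a bound of 0 disables that length check.
        if (minWordLen > 0 && len < minWordLen) return true;
        if (maxWordLen > 0 && len > maxWordLen) return true;
        // Stop words are noise regardless of length.
        return stopSet != null && stopSet.contains(term);
    }
}
```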

toString

public String toString(String field)
Prints a user-readable version of this query.

Specified by:
toString in class Query