Hit Scoring
XTF uses and extends Lucene's built-in scoring mechanism to provide a relevance score for each hit and to return hits in ranked order (i.e., highest score first). This section briefly describes how the Text Engine determines the score for hits in the full text and hits on meta-data fields.
You can observe the scoring engine in action by enabling the explainScores attribute on the <query> element produced by your Query Parser stylesheet. See the Tag Reference for more information on how to enable this.
The following sections break down XTF's scoring calculation: first the aspects shared by meta-data and text chunk scoring, then the differences between them, and finally how the combined score for a document is computed.
Individual Hit Scoring
Whether a hit (another name for a single match) is in a meta-data field or within the full text of a document, the scoring for that particular hit is the same. How the scores are combined differs, and those differences are covered in later sections.
For those intimately familiar with Lucene, it will be helpful to know that XTF makes extensive use of Lucene's "span" queries, to enable the exact identification of particular matches within a large document. XTF's implementation of spans includes enhancements that calculate the score of each span in addition to its "slop".
Queries on the contents of an XTF index are scored using an enhanced version of Lucene's standard formula. The structure of the scoring formula is fixed, but one can override the calculation of the various factors by providing a Java implementation of the Similarity interface.
Plain English
- If a term appears many times in a field or chunk of a document, the match will rank higher; if it only appears a few times, the match will rank lower.
- Rare terms are given more weight than common terms.
- Any field or section of text in the document can be boosted at index time; hits in boosted sections will rank higher.
- For AND and NEAR queries, more exact matches will rank higher than sloppy matches. That is, a hit where the terms appear in order without intervening words will rank higher than an out-of-order match with many other words interspersed.
- Matches in short fields are ranked higher than those in long fields.
- In an OR query, hits that match many of the terms will rank higher than those that only match a few.
Mathematical Details
For a given query q, the score for a matching span s consisting of terms t, in field (or text chunk) f of document d, is calculated as follows:
- spanScore(q,s) = sloppyFreq(s) * boost(f,d) * lengthNorm(f,d) * coord(q,s) * (sum for t in s: idf(t))
where
- sloppyFreq(s) is a factor that decreases as the sloppiness of the matching span increases. The effect is to favor more exact matches. Default implementation: 1 / (slop + 1). In the special case of OR queries, or AND queries where proximity has been disabled, slop is ignored and 1.0 is used instead.
- boost(f,d) is the boost factor (if any) applied to the field (or section of text) containing the match.
- lengthNorm(f,d) is a factor that decreases as the amount of text in the field or text chunk increases, since matches in longer fields are generally less precise. Default implementation: 1 / square root of the number of terms in the field/chunk.
- coord(q,s) is a factor based on the fraction of all query terms that are matched by the span. Spans with a higher ratio of matching terms will be ranked higher. Default implementation: number of terms matched / number of terms in query.
- idf(t) is a factor based on the number of documents or text chunks containing t. Terms that occur in fewer documents/chunks are better indicators of topic, so idf is high when t appears seldom in the index, and low when t appears often in the index. Default implementation: log(number of docs or chunks in index / number of docs or chunks containing t)
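To make the formula concrete, here is a minimal Python sketch of the span score using the default factors listed above. It is an illustration only, not XTF's Java code: the function and parameter names are hypothetical, idf uses the simplified form given in this section, and lengthNorm is assumed to be the reciprocal square root (as in Lucene's classic default similarity) so that it decreases with field length as described.

```python
import math

def sloppy_freq(slop, proximity=True):
    # 1 / (slop + 1); slop is ignored (factor 1.0) for OR queries
    # or AND queries with proximity disabled.
    return 1.0 / (slop + 1) if proximity else 1.0

def length_norm(num_terms):
    # Assumption: reciprocal square root, so longer fields/chunks score lower.
    return 1.0 / math.sqrt(num_terms)

def coord(matched_terms, query_terms):
    # Fraction of the query's terms that the span matches.
    return matched_terms / query_terms

def idf(num_docs, doc_freq):
    # Simplified form from this section: log(N / n).
    return math.log(num_docs / doc_freq)

def span_score(slop, boost, num_terms, matched_terms, query_terms,
               num_docs, doc_freqs, proximity=True):
    # doc_freqs holds one document/chunk frequency per term in the span.
    return (sloppy_freq(slop, proximity)
            * boost
            * length_norm(num_terms)
            * coord(matched_terms, query_terms)
            * sum(idf(num_docs, df) for df in doc_freqs))

# Same two-term match in a 100-term chunk: once exact, once with slop 3.
exact = span_score(0, 1.0, 100, 2, 2, 1000, [10, 100])
sloppy = span_score(3, 1.0, 100, 2, 2, 1000, [10, 100])
```

With these inputs the sloppy match scores one quarter of the exact match, because sloppyFreq drops from 1 to 1 / (3 + 1) while every other factor is unchanged.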
Text Hit Scoring
The full text of a document might contain thousands of individual matching spans, each of which will be scored according to the method above. How are these scores combined into a single score for the text?
Plain English
- Documents with more matches will score higher than those with few matches.
- Matches in short texts are ranked higher than those in long texts.
Mathematical Details
For a given query q, the score for all matching spans s in all text chunks of document d is calculated as follows:
- textScore(q,d) = lengthNorm(d,text) * tf( sum for s in d: spanScore(q,s) )
where
- lengthNorm(d,text) is a factor that decreases as the amount of text in the document increases, since matches in longer texts are generally less precise. Default implementation: 1 / square root of the number of chunks in the document.
- tf(...) is a score factor that helps to equalize the scoring between documents with many matches and those with few matches. Default implementation: square root of the total score of the matches in d.
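The equalizing effect of tf is easier to see with numbers. The following Python sketch is illustrative only: the names are hypothetical, the span scores are made-up inputs, and lengthNorm is assumed to be the reciprocal square root of the chunk count so that it decreases as the document grows.

```python
import math

def tf(total_span_score):
    # Square root dampens the advantage of documents with many matches.
    return math.sqrt(total_span_score)

def text_length_norm(num_chunks):
    # Assumption: reciprocal square root of the number of chunks.
    return 1.0 / math.sqrt(num_chunks)

def text_score(span_scores, num_chunks):
    return text_length_norm(num_chunks) * tf(sum(span_scores))

few = text_score([0.5] * 4, num_chunks=4)         # 4 matches
many = text_score([0.5] * 16, num_chunks=4)       # 16 matches, same length
longer = text_score([0.5] * 4, num_chunks=16)     # 4 matches, longer text
```

Under these assumptions, four times as many matches only doubles the text score, and the same matches in a document with four times as many chunks score half as much.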
Meta-data Hit Scoring
The scores for multiple hits within a single meta-data field are combined in a similar manner to text hits, above.
Plain English
- Fields with more matches will score higher than those with few matches.
- Matches in short fields are ranked higher than those in long fields.
Mathematical Details
For a given query q, the score for all matching spans s in field f of document d is calculated as follows:
- metaScore(q,d,f) = lengthNorm(d,f) * tf( sum for s in d.f: spanScore(q,s) )
where
- lengthNorm(d,f) is a factor that decreases as the length of the field increases, since matches in longer fields are generally less precise. Default implementation: 1 / square root of the number of terms in the field.
- tf(...) is a score factor that helps to equalize the scoring between fields with many matches and those with few matches. Default implementation: square root of the total score of the matches in d.f.
Combined Document Score
The final type of scoring XTF performs is to combine the scores of all text hits within a document with that document's meta-data scores, to form the final score for that document. Again, the structure of this computation is fixed, but the calculations can be overridden by providing a Java implementation of the Similarity interface.
Plain English
- A document's score is based on the sum of all the text hits within it, plus its meta-data field scores.
Mathematical Details
For a given query q consisting of meta-data queries qmf on fields f, and a text query qt, the score for a specific document d is as follows:
- docScore(q,d) = textScore(qt,d) + (sum for f in d: metaScore(qmf,d,f))
where metaScore and textScore are computed as outlined in the previous two sections.
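As a rough end-to-end illustration, the Python sketch below combines one text score with per-field meta-data scores. It is not XTF's implementation: the names and input values are hypothetical, the span scores are taken as given, and both length norms are assumed to be reciprocal square roots so that they decrease with length.

```python
import math

def combine(span_scores, length):
    # Shared shape of textScore and metaScore:
    # lengthNorm * tf(sum of span scores). No hits means no contribution.
    if not span_scores:
        return 0.0
    return (1.0 / math.sqrt(length)) * math.sqrt(sum(span_scores))

def doc_score(text_spans, num_chunks, meta_hits):
    # meta_hits maps field name -> (span scores in that field, terms in field).
    text = combine(text_spans, num_chunks)
    meta = sum(combine(spans, num_terms)
               for spans, num_terms in meta_hits.values())
    return text + meta

score = doc_score(
    text_spans=[0.5, 0.5],              # two text hits across 4 chunks
    num_chunks=4,
    meta_hits={
        "title": ([1.0], 4),            # one hit in a 4-term title
        "subject": ([], 10),            # no hits in this field
    },
)
```

Here the text and the title field each contribute 0.5 and the subject field contributes nothing, so the document score is 1.0.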