Query Operations and What They Do
This section explains how queries are interpreted and specifies how the various query operators work. Meta-data and text queries are treated somewhat differently, because meta-data fields are assumed to be short while the full text of a document is assumed to be very large.
Interpreting User Queries
The job of translating queries from an input URL to a form that XTF can understand is undertaken by the Query Parser stylesheet. The query parser included in the base distribution is relatively simple: by default, it forms an AND query consisting of all the input terms. All non-word terms (such as the "++" in "C++") are ignored. Optionally, the operation can be changed to OR or NEAR. In addition, terms can be excluded if desired.
Here are some sample URLs, and how the default query parser interprets each one:
http://yourserver:8080/xtf/search?title=apartheid+mind
- Interpreted as: title:(apartheid AND mind)
- (Note that "+" is the URL coding for a space.)
http://yourserver:8080/search?title=apartheid+mind;title-exclude=mandela
- Interpreted as: title:((apartheid AND mind) NOT (mandela))
http://yourserver:8080/search?title=apartheid+mind;title-join=or
- Interpreted as: title:(apartheid OR mind)
http://yourserver:8080/search?title=apartheid;text=mandela
- Interpreted as: (title:apartheid) AND (text:mandela)
http://yourserver:8080/search?text=%22Nelson+Mandela%22;subject=africa
- Interpreted as: (text:PHRASE "nelson mandela") AND (subject:africa)
- (Note that "%22" is the URL coding for a quote character.)
http://yourserver:8080/search?text=Mandela+Apartheid;text-join=5
- Interpreted as: (text:Mandela NEAR{proximity=5} Apartheid)
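The interpretations above can be mimicked in a few lines of Python. This is only an illustrative sketch, not the Query Parser stylesheet itself: the function name interpret, the regular expressions, and the way several excluded terms are combined are all assumptions, quoted phrases are not handled, and the rendering of NEAR differs slightly from the sample above.

```python
import re
from urllib.parse import urlsplit, unquote_plus


def interpret(url):
    """Render a search URL roughly the way the sample interpretations read."""
    # Parameters are separated by ";" in the sample URLs ("&" is accepted too).
    params = {}
    for pair in re.split(r"[;&]", urlsplit(url).query):
        name, _, value = pair.partition("=")
        if name:
            params[name] = unquote_plus(value)  # "+" decodes to a space

    clauses = []
    for field, value in params.items():
        if field.endswith(("-join", "-exclude")):
            continue  # modifiers, handled together with their field
        terms = re.findall(r"\w+", value)  # non-word text such as "++" is dropped
        join = params.get(field + "-join", "and")
        if join.isdigit():
            operator = " NEAR{proximity=%s} " % join
        else:
            operator = " %s " % join.upper()
        body = operator.join(terms)
        excluded = re.findall(r"\w+", params.get(field + "-exclude", ""))
        if excluded:
            # Assumption: several excluded terms are OR-ed together.
            body = "(%s) NOT (%s)" % (body, " OR ".join(excluded))
        if len(terms) > 1 or excluded:
            body = "(%s)" % body
        clauses.append("%s:%s" % (field, body))

    if len(clauses) == 1:
        return clauses[0]
    return " AND ".join("(%s)" % clause for clause in clauses)
```

For instance, calling interpret on the third sample URL (the one with title-join=or) returns title:(apartheid OR mind).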
Of course, much more complex query parsing is possible, since the Text Engine can handle arbitrarily complex queries consisting of any combination of boolean query operators. Creating such a system is, however, left to the system designer setting up XTF, as it intermeshes closely with whatever HTML form or other mechanism is used to input the query, and is highly dependent on the skill level and needs of the final users of the system.
Text Query Operations
XTF implements a full complement of "boolean" operators used to form complex queries: AND, OR, NEAR, PHRASE, RANGE, and NOT, and supports wildcard search characters. This section covers the details of how these queries are interpreted within the very large documents XTF can handle.
TERM and Wildcard Queries on Text
A TERM query matches every occurrence of the specified term in the document. Upper-case vs. lower-case distinctions are ignored. Additionally, the term may contain special wildcard characters:
? | The question-mark character matches terms with any single character at that position. For example, lo?e would match love or lose, but not loe.
* | The asterisk character matches terms with any number (including zero) of characters at that position. For example, dog* would match any of the following terms: dog, dogs, doggie, doggerel, etc.
Depending on the particular wildcard, hundreds or even thousands of terms might match, so care should be taken when using these. To avoid allowing such queries to occupy the engine for long periods of time, XTF allows queries to specify a limit on the maximum number of terms to match (controlled by the workLimit attribute). Queries that exceed the limit produce an error.
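To make the wildcard rules concrete, here is a minimal sketch of expanding a pattern against a list of indexed terms and failing once too many terms match. It is not XTF's implementation: the names expand_wildcard, TooManyTermsError, and max_terms are hypothetical, and the list of terms stands in for the real index.

```python
import re


class TooManyTermsError(Exception):
    """Raised when a wildcard expands to more terms than the caller allows."""


def wildcard_to_regex(pattern):
    # "?" matches exactly one character, "*" matches zero or more characters;
    # every other character is matched literally, ignoring case.
    parts = []
    for ch in pattern.lower():
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def expand_wildcard(pattern, index_terms, max_terms):
    regex = wildcard_to_regex(pattern)
    matches = []
    for term in index_terms:
        if regex.fullmatch(term.lower()):
            matches.append(term)
            if len(matches) > max_terms:
                raise TooManyTermsError(pattern)
    return matches
```

With the terms love, lose, loe, and lower, the pattern lo?e selects only love and lose, which mirrors the example given for the question mark.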
AND Query on Text
What does it mean to search for "man AND war" in the full text of all documents? Perhaps the most obvious answer would be to search for any document containing both words. But consider a document where "man" appeared in Chapter 1 and "war" appeared in Chapter 7. Would that be a document the user really wanted to find? Probably not. More likely they'd be interested in a document where "man" and "war" appear close together.
Thus XTF interprets AND queries on the full text as NEAR queries instead, with the slop factor set to the maximum for that index.
More formally, when used with terms, the AND query will match any section of text that contains all of the terms, in any order, as long as they are close together (that is, within the maximum proximity defined for the index, or 20 words in the default configuration).
When used to group sub-queries together, AND will match text where all of the sub-queries match, in any order, as long as the matches are close together (i.e. within the maximum proximity for the index).
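The "man AND war" behavior can be sketched as a search for spans of text that contain every term within a bounded distance. The code below is an illustration under stated assumptions, not the engine's algorithm: distance is measured as the difference between word positions, the function name near_spans is hypothetical, and 20 is the default maximum proximity mentioned above.

```python
def near_spans(words, terms, max_distance=20):
    """Return (start, end) word positions of spans that contain every term."""
    wanted = {t.lower() for t in terms}
    # Keep only the positions where one of the query terms occurs.
    hits = [(i, w.lower()) for i, w in enumerate(words) if w.lower() in wanted]
    spans = []
    for first, (start, _) in enumerate(hits):
        seen = set()
        for pos, word in hits[first:]:
            if pos - start > max_distance:
                break  # too far from the start of this candidate span
            seen.add(word)
            if seen == wanted:
                spans.append((start, pos))
                break
    return spans
```

A text where "man" and "war" are six words apart yields one span, while a text where they are fifty words apart yields none, even though both words occur in it.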
OR Query on Text
When used to group terms, an OR query matches each occurrence of every term contained within it. If used to group sub-queries together, the OR query matches each occurrence of every sub-query.
PHRASE Query on Text
A PHRASE query generally contains two or more terms, and it matches any span of text where all the terms appear together, in order, with no other terms between them.
Less frequently, it can be used to group sub-queries. It matches a span of text where all of the sub-queries match, in order, without any intervening non-matching terms.
Note that a PHRASE query is equivalent to a NEAR query with a slop factor of zero.
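For plain terms, the PHRASE rule amounts to finding the query terms as a contiguous, ordered run of words. The following sketch shows that idea only; the function name phrase_spans is hypothetical, and sub-query grouping is not covered.

```python
def phrase_spans(words, phrase):
    """Return (start, end) word positions where the phrase occurs exactly."""
    words = [w.lower() for w in words]
    phrase = [t.lower() for t in phrase]
    n = len(phrase)
    return [
        (i, i + n - 1)
        for i in range(len(words) - n + 1)
        if words[i:i + n] == phrase  # same terms, same order, nothing between
    ]
```

Searching "the man of war sailed" for the phrase "man of war" yields one span, whereas "the man in war" yields none because a different word sits between the terms.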
NEAR Query on Text
Each NEAR query requires a "slop" factor. In rough terms, this factor can be thought of as limiting the amount of sloppiness when matching. A slop of zero indicates very tight control; in fact, a NEAR query with zero slop is equivalent to a PHRASE query. A large slop value indicates that terms may appear far apart, or out of order, or both. Note, however, that the slop value is silently constrained to the maximum proximity defined by the chunk overlap of an index.
For more details on how slop is computed, see the section on Proximity and Slop. For information on chunk overlap and how it relates to proximity searching, see the section on Chunking.
The NEAR query, when used with terms, matches a span of text containing all of the terms, where the match's slop is less than or equal to the slop factor specified for the query.
When used to group sub-queries, it matches a span of text where all of the sub-queries match, and the complete match's slop is less than or equal to the slop factor specified for the NEAR query.
NOT Clause on Text
A NOT clause may be specified as a sub-query of any boolean query (OR, AND, PHRASE, or NEAR). Any matches in the NOT clause will suppress outer matches within the maximum proximity of the index. This can be thought of as a "kill zone": each match within the NOT clause kills off nearby matches.
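The "kill zone" can be pictured with word positions. In this sketch, which assumes that distance is the difference between word positions and that the boundary is inclusive, each NOT match removes any outer match within the maximum proximity; the function name suppress_near_not is hypothetical.

```python
def suppress_near_not(match_positions, not_positions, max_proximity=20):
    """Drop outer matches that fall inside the kill zone of any NOT match."""
    return [
        m for m in match_positions
        if all(abs(m - n) > max_proximity for n in not_positions)
    ]
```

With outer matches at word positions 5, 100, and 300 and a NOT match at position 110, only the match at position 100 is suppressed; the other two lie outside the zone and survive.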
Meta-data Query Operations
The following query operations can be applied to any meta-data field. Queries are applied to meta-data and full text in like fashion, with a few exceptions: AND queries are not proximity-based in meta-data fields, NOT clauses on meta-data can eliminate whole documents, and a new operator, RANGE, is available on meta-data fields.
TERM and Wildcard Queries on Meta-data
A TERM query matches every occurrence of the specified term in the meta-data field. Upper-case vs. lower-case distinctions are ignored. Additionally, the term may contain special wildcard characters:
? | The question-mark character matches terms with any single character at that position. For example, lo?e would match love or lose, but not loe.
* | The asterisk character matches terms with any number (including zero) of characters at that position. For example, dog* would match any of the following terms: dog, dogs, doggie, doggerel, etc.
Depending on the particular wildcard, hundreds or even thousands of terms might match, so care should be taken when using these. To avoid allowing such queries to occupy the engine for long periods of time, XTF allows queries to specify a limit on the maximum number of terms to match (controlled by the workLimit attribute). Queries that exceed the limit produce an error.
RANGE Queries on Meta-data
A RANGE query is similar to a wildcard term query in that it matches a (possibly large) number of terms. Lower and upper bounding terms are specified, and every term that appears in the index lexicographically between the two bounds is matched.
For example, if the lower bound were "1895" and the upper bound were "1900", a range query would match any of the terms 1895, 1896, 1897, 1898, 1899, and 1900. Optionally, the query can exclude the bounds, in which case it would match neither 1895 nor 1900.
As in the case of wildcard queries, care must be taken to avoid searching a huge number of terms. This can happen easily: in the case of the example above, if dates were encoded in the index in the form YYYY-MM-DD, then all the days from 1895 to 1900 would match... potentially 2,190 of them. And of course a range query from A to Z would match practically every term in the index. Again, each query can specify a limit on the maximum number of terms to match, to avoid bogging down the engine.
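Lexicographic range matching over a sorted term list can be sketched with binary search. This is an illustration, not XTF code: the function name range_terms and the inclusive flag are hypothetical, and a sorted Python list stands in for the index's term dictionary.

```python
from bisect import bisect_left, bisect_right


def range_terms(sorted_terms, lower, upper, inclusive=True):
    """Return the terms lying lexicographically between lower and upper."""
    if inclusive:
        start = bisect_left(sorted_terms, lower)   # first term >= lower
        end = bisect_right(sorted_terms, upper)    # just past last term <= upper
    else:
        start = bisect_right(sorted_terms, lower)  # first term > lower
        end = bisect_left(sorted_terms, upper)     # just past last term < upper
    return sorted_terms[start:end]
```

Given the year terms 1890 through 1905, the bounds 1895 and 1900 select six terms when the bounds are included and four when they are excluded, matching the example above.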
However, when searching numeric data (for example, file time and date stamps) the term-matching approach described above simply is not sufficient. To handle this, XTF provides a special numeric range searching capability. This is specified as an attribute to the normal <range> query operator, but it tells XTF that the data is numeric, and in a rigid format (such as YYYY-MM-DD:HH-MM-SS; any rigid format is acceptable). When the first such query is made, XTF loads a table of all the data values and converts them to 64-bit integers. This table is then cached in memory, and range queries on that field are processed very quickly, avoiding any wildcard-like expansion.
AND Query on Meta-data
Unlike in full-text queries, an AND query on meta-data implies no proximity restrictions. When used with terms, it matches documents where every term appears somewhere in the field, in any order.
When used to group sub-queries, it matches documents where all of the sub-queries match (note that the sub-queries may be on several different fields).
OR Query on Meta-data
When used to group terms, an OR query matches a document where any of the terms occurs within the meta-data field.
If used to group sub-queries together, the OR query matches documents that match any of the sub-queries (note that the sub-queries may involve several different fields).
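On meta-data, AND and OR over terms reduce to simple per-document membership tests, with no proximity involved. The sketch below assumes a toy in-memory collection (a dictionary of document ids to field values) and whitespace tokenization; the function name match_documents is hypothetical.

```python
def match_documents(docs, field, terms, operator):
    """Return ids of documents whose field contains all (AND) or any (OR) terms."""
    wanted = {t.lower() for t in terms}
    combine = all if operator == "AND" else any
    matches = []
    for doc_id, fields in docs.items():
        # Assumption: a field is tokenized by lower-casing and splitting on spaces.
        field_terms = set(fields.get(field, "").lower().split())
        if combine(term in field_terms for term in wanted):
            matches.append(doc_id)
    return matches
```

For titles "Apartheid and the Mind", "The Mind at Work", and "Cooking", an AND query on apartheid and mind matches only the first document, while the OR query matches the first two.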
PHRASE Query on Meta-data
A PHRASE query generally contains two or more terms, and it matches any document where the terms appear together in the field, in order, with no other terms between them.
Less frequently, it can be used to group sub-queries. It matches any document where all of the sub-queries match, in order, without any intervening non-matching terms.
Note that a PHRASE query is equivalent to a NEAR query with a slop factor of zero.
NEAR Query on Meta-data
Each NEAR query requires a "slop" factor. In rough terms, this factor can be thought of as limiting the amount of sloppiness when matching. A slop of zero indicates very tight control; in fact, a NEAR query with zero slop is equivalent to a PHRASE query. A large slop value indicates that terms (or sub-queries) may appear far apart, or out of order, or both. There is no upper bound on the slop factor. For more details on how slop is computed, see the section on Proximity and Slop.
The NEAR query, when used with terms, matches any document where all of the terms appear in the field and their group slop is less than or equal to the slop factor specified for the query.
When used to group sub-queries, it matches any document where all of the sub-queries match, and the complete match's slop is less than or equal to the slop factor specified for the NEAR query.
NOT Clause on Meta-data
A NOT clause may be specified as a sub-query of any boolean query (OR, AND, PHRASE, or NEAR). Any documents matching the NOT clause will be removed from the outer set of matches.
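In contrast to the proximity-based "kill zone" used on full text, a NOT clause on meta-data behaves like a set difference over whole documents. A minimal sketch, with a hypothetical function name and document ids as plain strings:

```python
def exclude_documents(outer_matches, not_matches):
    """Remove every document matched by the NOT clause from the outer matches."""
    excluded = set(not_matches)
    # Preserve the original order of the surviving documents.
    return [doc for doc in outer_matches if doc not in excluded]
```

If the outer query matches d1, d2, and d3 and the NOT clause matches d2, the result is d1 and d3.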
Stop Words in Queries
If a query contains one or more stop words, the query will be internally rewritten to work properly with the bi-gram system. Recall from the section on Stop Words and Bi-grams that using bi-grams allows XTF to support queries containing stop words while avoiding the usual severe impact on performance that they might have.
Here are some details on how stop words are handled in various query situations:
- In the absence of any grouping operator, querying for a single stop word in a TERM query will produce an empty result set.
- Wildcard queries will skip stop words; for example, searching for th? would not match "the" (which is a stop word), but would still match "thy".
- Stop words are stripped out of OR queries and NOT clauses.
- By contrast, PHRASE queries retain all stop words, because users would be dismayed if a query on "man of war" returned matches on "man in war".
- In AND and NEAR queries, stop words are joined with adjacent words to form bi-grams. The resulting query will effectively prefer matches containing the stop words in the correct places, but will allow matches where they don't appear.
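The last rule, on AND and NEAR queries, can be illustrated with a rough sketch. Everything in it is an assumption made for illustration rather than XTF's actual rewriting: the stop-word list, the "~" separator used to write a bi-gram, the function name rewrite_and_query, and the split into required terms and preferred bi-grams.

```python
# Illustrative stop-word list; a real index defines its own.
STOP_WORDS = {"a", "and", "in", "of", "the"}


def rewrite_and_query(terms, separator="~"):
    """Split query terms into required terms and preferred bi-grams."""
    lowered = [t.lower() for t in terms]
    # Ordinary terms must still match.
    required = [t for t in lowered if t not in STOP_WORDS]
    # Each adjacent pair involving a stop word becomes a bi-gram that is
    # preferred when present but not demanded.
    preferred = [
        first + separator + second
        for first, second in zip(lowered, lowered[1:])
        if first in STOP_WORDS or second in STOP_WORDS
    ]
    return required, preferred
```

For the terms man, of, war this produces the required terms man and war plus the preferred bi-grams man~of and of~war, so text containing "man of war" would rank ahead of text containing "man in war" while both could still match.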