org.cdlib.xtf.textIndexer
Class IndexInfo
Object
IndexInfo
public class IndexInfo
- extends Object
This class maintains configuration information about the current index that
the TextIndexer program is processing.
Information stored by this class includes:
- The name of the current index being processed.
- The path where the Lucene index database is (to be) stored.
- The path where the source text for this index can be found.
- The path where any XSLT input filters for this index can be found.
- A specification for source text files to ignore.
- The text chunk size and overlap attributes for the current index.
- Specifications for stop word removal.
Field Summary |
String |
accentMapPath
Path to a mapping from accented characters to their corresponding
characters with the diacritics removed. |
int[] |
chunkAtt
Text chunk attribute array. |
static int |
chunkOvlp
Index into Chunk Attribute Array for the chunk overlap attribute. |
static int |
chunkSize
Index into Chunk Attribute Array for the chunk size attribute. |
boolean |
createSpellcheckDict
Whether to create a spellcheck dictionary for this index |
static int |
defaultChunkOvlp
Constant defining the default overlap (in words) of two adjacent text
chunks. |
static int |
defaultChunkSize
Constant defining the default size (in words) of a text chunk. |
static String |
defaultStopWords
Constant defining the default list of stop words. |
String |
docSelectorPath
Path to stylesheet used to determine which documents to index |
String |
indexName
Name of the current index being processed (as specified in the index
configuration file.) |
String |
indexPath
Name of the path to the current index's Lucene database. |
static int |
minChunkSize
Constant defining the minimum size (in words) of a text chunk. |
String |
pluralMapPath
Path to a mapping from plural words to their corresponding singular
forms that the textIndexer should fold together. |
String |
sourcePath
Path to the source text for the current index. |
String |
stopWords
Set of stop words to remove. |
boolean |
stripWhitespace
Whether to strip whitespace between elements in lazy tree files. |
String |
subDir
Name of a sub-directory to index, or null to index everything |
Constructor Summary |
IndexInfo()
Default constructor. |
IndexInfo(String indexName,
String indexPath)
Alternate constructor. |
Method Summary |
int |
getChunkOvlp()
Return the overlap of two adjacent text chunks for the current index. |
String |
getChunkOvlpStr()
Return the overlap (in words) for two adjacent text chunks in the
current index as a string. |
int |
getChunkSize()
Return the size of a text chunk for the current index. |
String |
getChunkSizeStr()
Return the size of a text chunk (in words) for the current index
as a string. |
int |
setChunkOvlp(int newChunkOverlap)
Sets the adjacent chunk overlap attribute for the current index. |
int |
setChunkSize(int newChunkSize)
Sets the text chunk size attribute for the current index. |
Methods inherited from class Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
indexName
public String indexName
- Name of the current index being processed (as specified in the index
configuration file.)
subDir
public String subDir
- Name of a sub-directory to index, or null to index everything
indexPath
public String indexPath
- Name of the path to the current index's Lucene database.
sourcePath
public String sourcePath
- Path to the source text for the current index.
docSelectorPath
public String docSelectorPath
- Path to stylesheet used to determine which documents to index
stopWords
public String stopWords
- Set of stop words to remove. Stop words are common words such as "the",
"and", etc. which are so ubiquitous as to add little value to queries.
Rather than removing them entirely, however, we take an approach suggested
by Doug Cutting (inventor of Lucene).
Basically, stop words are joined to surrounding normal words. This speeds
queries while still producing good results for requests that contain
a mixture of stop words and normal words (which is by far the most common
case for queries.)
For example, the string "man of war" would be indexed like this:
"man man-of of-war war". This way, searching for "man war" will pull up a
hit, but a search for "man of war" will score higher, as long as the same
stop-word approach is applied to the query.
You might ask what happens in this case: "joke of the year" (two stop
words in a row.) We could index it as "joke joke-of of-the the-year", or
as the longer but more complete "joke joke-of joke-of-the of-the
of-the-year the-year". The second form doesn't offer much improvement
in searching and would make the index bigger and logic more complex.
So we always combine a stop word with at most one neighboring word.
The words in this list may be separated by spaces, commas, and/or
semicolons.
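The joining rule described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not XTF's actual implementation: the class `StopWordBigramSketch` and its methods are hypothetical, and the sketch assumes simple whitespace tokenization and lowercasing. It emits every normal word on its own and joins each adjacent pair in which at least one word is a stop word.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of XTF.
final class StopWordBigramSketch {

    // Parse a stop word list separated by spaces, commas, and/or semicolons.
    static Set<String> parseStopWords(String list) {
        Set<String> stopWords = new HashSet<>();
        for (String word : list.split("[\\s,;]+")) {
            if (!word.isEmpty()) {
                stopWords.add(word.toLowerCase());
            }
        }
        return stopWords;
    }

    // Emit each normal word alone, and join every adjacent pair that
    // contains at least one stop word. Stop words are never emitted alone,
    // and a joined term never spans more than two words.
    static List<String> indexTerms(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return terms;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            boolean currentIsStop = stopWords.contains(words[i]);
            if (!currentIsStop) {
                terms.add(words[i]);
            }
            if (i + 1 < words.length) {
                boolean nextIsStop = stopWords.contains(words[i + 1]);
                if (currentIsStop || nextIsStop) {
                    terms.add(words[i] + "-" + words[i + 1]);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> stopWords = parseStopWords("of, the; and");
        // Prints the terms for the "man of war" example from the text.
        System.out.println(indexTerms("man of war", stopWords));
    }
}
```

Applying the same function to query text is what lets a query for "man of war" match the joined terms and score higher than "man war".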
pluralMapPath
public String pluralMapPath
- Path to a mapping from plural words to their corresponding singular
forms that the textIndexer should fold together. This can yield better
search results. For instance, if a user searches for "cat" they probably
also would like results for "cats."
The file should be a plain text file, with one word pair per line.
First is the plural form of a word, followed by a "|" character,
followed by the singular form. All should be lowercase, even in the
case of acronyms.
Optionally, the file may be compressed in GZIP format, in which case
it must end in the extension ".gz".
Non-ASCII characters should be encoded in UTF-8 format.
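As a rough illustration of the file format described above, the following sketch reads such a mapping into memory. The class `PluralMapSketch` and its methods are hypothetical and are not part of XTF; skipping malformed lines is an assumption, since the documentation does not say how they are handled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration only; not part of XTF.
final class PluralMapSketch {

    // Read the map file as UTF-8, decompressing it when the name ends in ".gz".
    static Map<String, String> load(Path file) throws IOException {
        boolean gzipped = file.getFileName().toString().endsWith(".gz");
        try (InputStream raw = Files.newInputStream(file);
             InputStream in = gzipped ? new GZIPInputStream(raw) : raw;
             BufferedReader reader = new BufferedReader(
                 new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return parse(reader.lines().collect(Collectors.toList()));
        }
    }

    // Each line holds "plural|singular". Lines without a "|" or with an
    // empty side are skipped (an assumption made for this sketch).
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> pluralToSingular = new HashMap<>();
        for (String line : lines) {
            int bar = line.indexOf('|');
            if (bar <= 0) {
                continue;
            }
            String plural = line.substring(0, bar).trim();
            String singular = line.substring(bar + 1).trim();
            if (!plural.isEmpty() && !singular.isEmpty()) {
                pluralToSingular.put(plural, singular);
            }
        }
        return pluralToSingular;
    }

    // Fold a lowercase word to its singular form when the map lists it.
    static String fold(String word, Map<String, String> pluralToSingular) {
        return pluralToSingular.getOrDefault(word, word);
    }
}
```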
accentMapPath
public String accentMapPath
- Path to a mapping from accented characters to their corresponding
characters with the diacritics removed. These characters will be folded
together, which can yield better search results. For instance, a German user
on an American keyboard might want to find "Hut" with an umlaut over the
"u", but can't type the umlaut. This way, if they type "hut" they'll still
get a match.
The file should be a plain text file, with one code pair per line.
First is the 4-digit hex Unicode point for the accented character,
followed by "|", then the 4-digit hex code for the unaccented form.
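To make the code-pair format concrete, here is a minimal sketch that parses such lines and folds a string with the result. The class `AccentMapSketch` is hypothetical and not part of XTF; it assumes every code is a 4-digit hex value, so each one fits in a single Java `char`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of XTF.
final class AccentMapSketch {

    // Each line holds "XXXX|YYYY": the hex code point of the accented
    // character, then the hex code point of its unaccented form.
    static Map<Character, Character> parse(List<String> lines) {
        Map<Character, Character> accentToPlain = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\|");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines (an assumption)
            }
            char accented = (char) Integer.parseInt(parts[0].trim(), 16);
            char plain = (char) Integer.parseInt(parts[1].trim(), 16);
            accentToPlain.put(accented, plain);
        }
        return accentToPlain;
    }

    // Replace every mapped character; leave all other characters unchanged.
    static String fold(String text, Map<Character, Character> accentToPlain) {
        StringBuilder folded = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char replacement = accentToPlain.getOrDefault(c, c);
            folded.append(replacement);
        }
        return folded.toString();
    }
}
```

For example, a line reading `00FC|0075` maps "u" with an umlaut (U+00FC) to a plain "u" (U+0075), so the accented and unaccented spellings of "Hut" fold to the same term.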
createSpellcheckDict
public boolean createSpellcheckDict
- Whether to create a spellcheck dictionary for this index
stripWhitespace
public boolean stripWhitespace
- Whether to strip whitespace between elements in lazy tree files. Not
strictly safe for all XML documents, but it can make lazy trees
somewhat smaller and faster.
chunkAtt
public int[] chunkAtt
- Text chunk attribute array. Currently this array consists of two entries:
- The size of the text chunk in words.
- The overlap in words of adjacent text chunks.
These array members should be addressed using the chunkSize
and chunkOvlp
constants defined by this class.
- Notes:
- For an explanation of the text chunk size and overlap attributes,
see
chunkSize
and chunkOvlp
.
chunkSize
public static final int chunkSize
- Index into Chunk Attribute Array for the chunk size attribute.
Indexed text stored in a Lucene index is broken up into small chunks
so that search result "summary blurbs" can be easily generated without
having to load the entire source text. The chunk size attribute reflects
the chunk size (in words) used by the current index.
- See Also:
- Constant Field Values
chunkOvlp
public static final int chunkOvlp
- Index into Chunk Attribute Array for the chunk overlap attribute.
Indexed text stored in a Lucene index is broken up into small chunks
that overlap with adjacent chunks so that "summary blurbs" for proximity
searches can be easily generated without having to load the entire source
text. The chunk overlap attribute reflects the overlap (in words) used by
the current index.
- See Also:
- Constant Field Values
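The relationship between chunk size and overlap can be pictured with a small sketch. This is an assumption about how overlapping chunks could be laid out, not a description of XTF's actual chunking code: the class `OverlappingChunksSketch` is hypothetical, and it simply starts each new chunk `size - overlap` words after the previous one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of XTF.
final class OverlappingChunksSketch {

    // Split words into chunks of at most `size` words, where each chunk
    // repeats the last `overlap` words of the chunk before it.
    static List<List<String>> chunk(List<String> words, int size, int overlap) {
        if (size < 1 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need 0 <= overlap < size");
        }
        int step = size - overlap;
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < words.size(); start += step) {
            int end = Math.min(start + size, words.size());
            chunks.add(new ArrayList<>(words.subList(start, end)));
            if (end == words.size()) {
                break; // the last word has been covered
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a b c d e f g h i j".split(" "));
        // Chunk size 4 with an overlap of 2 words.
        System.out.println(chunk(words, 4, 2));
    }
}
```

Because adjacent chunks share words, a short phrase that straddles one chunk boundary still appears whole in the neighboring chunk, which is what makes blurbs for proximity searches easy to build from a single chunk.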
minChunkSize
public static final int minChunkSize
- Constant defining the minimum size (in words) of a text chunk.
Value = 2.
- See Also:
- Constant Field Values
- Notes:
- For an explanation of the text chunk size and overlap attributes,
see
chunkSize
and chunkOvlp
.
defaultChunkSize
public static final int defaultChunkSize
- Constant defining the default size (in words) of a text chunk.
Value = 100.
- See Also:
- Constant Field Values
- Notes:
- For an explanation of the text chunk size and overlap attributes,
see
chunkSize
and chunkOvlp
.
defaultChunkOvlp
public static final int defaultChunkOvlp
- Constant defining the default overlap (in words) of two adjacent text
chunks. Value = 50.
- See Also:
- Constant Field Values
- Notes:
- For an explanation of the text chunk size and overlap attributes,
see
chunkSize
and chunkOvlp
.
defaultStopWords
public static final String defaultStopWords
- Constant defining the default list of stop words. These are common words
that are so ubiquitous as to be of little use in queries. Value = "a an and are as at be but by for if in into is it no not of on or s such t that the their then there these they this to was will with".
- See Also:
- Constant Field Values
- Notes:
- For an explanation of stop word handling,
see
stopWords
IndexInfo
public IndexInfo()
- Default constructor.
Creates the chunk attribute array, and initializes the
chunkSize
entry to
defaultChunkSize
,
and the chunkOvlp
entry to
defaultChunkOvlp
.
IndexInfo
public IndexInfo(String indexName,
String indexPath)
- Alternate constructor.
Initializes the fields needed to use InputStream-based indexing (that is,
all fields except subDir, sourcePath, and docSelectorPath.)
Uses default values for chunk size/overlap, and for the stop word list.
After construction, these may of course be altered if desired.
getChunkSize
public int getChunkSize()
- Return the size of a text chunk for the current index.
- Returns:
- The value of the
chunkSize
attribute.
- Notes:
- For an explanation of the text chunk size and overlap attributes,
see
chunkSize
and chunkOvlp
.
getChunkSizeStr
public String getChunkSizeStr()
- Return the size of a text chunk (in words) for the current index
as a string.
- Returns:
- The value of the
chunkSize
attribute converted
to a String.
- Notes:
- This method is intended as a convenience call for code that
creates Lucene fields, which are all stored as strings.
For an explanation of the text chunk size and overlap attributes,
see chunkSize
and chunkOvlp
.
getChunkOvlp
public int getChunkOvlp()
- Return the overlap of two adjacent text chunks for the current index.
- Returns:
- The value of the
chunkOvlp
attribute.
- Notes:
- For an explanation of the text chunk size and overlap attributes,
see
chunkSize
and chunkOvlp
.
getChunkOvlpStr
public String getChunkOvlpStr()
- Return the overlap (in words) for two adjacent text chunks in the
current index as a string.
- Returns:
- The value of the
chunkOvlp
attribute
converted to a String.
- Notes:
- This method is intended as a convenience call for code that
creates Lucene fields, which are all stored as strings.
For an explanation of the text chunk size and overlap attributes,
see chunkSize
and chunkOvlp
.
setChunkSize
public int setChunkSize(int newChunkSize)
- Sets the text chunk size attribute for the current index.
This method sets the value for the chunkSize
attribute, coercing its value to be greater than or equal to the
minChunkSize
value.
- Returns:
- The resulting coerced chunkSize value.
- Notes:
- This function also calls the
setChunkOvlp()
method to ensure that the overlap value is valid for the
chunk size set by this call.
For an explanation of the text chunk size and overlap attributes,
see chunkSize
and chunkOvlp
.
setChunkOvlp
public int setChunkOvlp(int newChunkOverlap)
- Sets the adjacent chunk overlap attribute for the current index.
This method sets the value for the
chunkOvlp
attribute,
coercing its value to be less than or equal to half the current chunk
size for the current index.
- Returns:
- The resulting coerced chunkOvlp value.
For an explanation of the text chunk size and overlap attributes,
see chunkSize
and chunkOvlp
.
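The coercion rules documented for setChunkSize and setChunkOvlp can be modeled as follows. This is a stand-alone sketch, not the IndexInfo source: the class `ChunkSettingsSketch` is hypothetical, the index values 0 and 1 are assumptions (the actual values are listed under Constant Field Values), and integer division is assumed for "half the current chunk size". The minimum and default values are the ones stated in this documentation.

```java
// Hypothetical model of the documented behavior; not the XTF source.
final class ChunkSettingsSketch {
    static final int chunkSize = 0;          // assumed index value
    static final int chunkOvlp = 1;          // assumed index value
    static final int minChunkSize = 2;       // documented value
    static final int defaultChunkSize = 100; // documented value
    static final int defaultChunkOvlp = 50;  // documented value

    final int[] chunkAtt = new int[2];

    ChunkSettingsSketch() {
        chunkAtt[chunkSize] = defaultChunkSize;
        chunkAtt[chunkOvlp] = defaultChunkOvlp;
    }

    int getChunkSize() { return chunkAtt[chunkSize]; }

    int getChunkOvlp() { return chunkAtt[chunkOvlp]; }

    // Coerce the size up to the minimum, then re-validate the overlap
    // against the new size.
    int setChunkSize(int newChunkSize) {
        chunkAtt[chunkSize] = Math.max(newChunkSize, minChunkSize);
        setChunkOvlp(chunkAtt[chunkOvlp]);
        return chunkAtt[chunkSize];
    }

    // Coerce the overlap down to at most half the current chunk size.
    int setChunkOvlp(int newChunkOverlap) {
        chunkAtt[chunkOvlp] = Math.min(newChunkOverlap, chunkAtt[chunkSize] / 2);
        return chunkAtt[chunkOvlp];
    }
}
```

Under these assumptions, calling `setChunkSize(1)` on a fresh instance returns 2 and, as a side effect, lowers the overlap from 50 to 1. Callers should therefore use the returned values rather than assume the requested ones were accepted.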