java.lang.Object
  org.xml.sax.helpers.DefaultHandler
    XMLTextProcessor

public class XMLTextProcessor
extends DefaultHandler
This class performs the actual parsing of the XML source text files and
generates index information for them.
The XMLTextProcessor class uses the configuration information recorded in
the IndexerConfig instance passed to it to add one or more source text
documents to the associated Lucene index. The process of indexing
an XML source document consists of breaking the document up into small
overlapping "chunks" of text, and indexing the individual words encountered
in each chunk.
The reason source documents are split into chunks during indexing is to
allow the search engine to load only small pieces of a document when
displaying summary "blurbs" for matched text. This significantly lowers the
memory requirements to display search results for multiple documents. The
reason chunks are overlapped is to allow proximity matches to be found
that span adjacent chunks. At this time, the maximum distance across which a
proximity match can be found using this approach is equal to or less than the
chunk size used when the text document was indexed. This is because
proximity search checks are currently only performed on two adjacent
chunks.
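To illustrate how overlapping chunks bound the proximity distance, here is a small standalone sketch (not part of the XMLTextProcessor API); the size and overlap parameters mirror the class's chunkWordSize and chunkWordOvlp fields:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkSketch {
    /** Split 'words' into chunks of 'chunkWordSize' words, each sharing
     *  'chunkWordOvlp' words with the previous chunk (hypothetical helper). */
    static List<List<String>> overlappingChunks(List<String> words,
                                                int chunkWordSize,
                                                int chunkWordOvlp) {
        List<List<String>> chunks = new ArrayList<>();
        int step = chunkWordSize - chunkWordOvlp; // where the next chunk starts
        for (int start = 0; start < words.size(); start += step) {
            int end = Math.min(start + chunkWordSize, words.size());
            chunks.add(words.subList(start, end));
            if (end == words.size())
                break;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
            "a", "b", "c", "d", "e", "f", "g", "h", "i", "j");
        // With size 5 and overlap 2: [a..e], [d..h], [g..j]
        System.out.println(overlappingChunks(words, 5, 2));
    }
}
```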
Within a chunk, adjacent words are considered to be one word apart. In
Lucene parlance, the word bump for adjacent words is one.
Larger word bump values can be set for sub-sections of a document. Doing
so makes proximity matches within a sub-section more relevant than ones that
span sections.
Word bump adjustments are made through the use of attributes added to nodes
in the XML source text file. The available word bump attributes are:
- xtf:sentencebump="xxx"
  Set the additional word distance implied by sentence breaks in the associated node. If not set explicitly, the default sentence bump value is 5.
- xtf:sectiontype="xxx"
  While this attribute's primary purpose is to assign names to a section of text, it also forces sections with different names to start in new, non-overlapping chunks. The net result is equivalent to placing an "infinite word bump" between differently named sections, causing proximity searches to never find a match that spans multiple sections.
- xtf:proximitybreak
  Forces its associated node in the source text to start in a new, non-overlapping chunk. As with the new sections described above, the net result is equivalent to placing an "infinite word bump" between adjacent sections, causing proximity searches to never find a match that spans the proximity break.

In addition to the word bump modifiers described above, there are two additional non-bump attributes that can be applied to nodes in a source text file:

- xtf:boost="xxx"
  Boosts the ranking of words found in the associated node by multiplying their base relevance by the number xxx. Normally, a boost value greater than 1.0 is used to emphasize the associated text, but values less than 1.0 can be used as an "inverse" boost to de-emphasize the relevance of text. Also, since Lucene only applies boost values to entire chunks, changing the boost value for a node causes the text to start in a new, non-overlapping chunk.
- xtf:noindex
  When added to a source text node, this attribute causes the contained text to not be indexed.

Normally, the above mentioned node attributes aren't actually in the source text nodes, but are embedded via the use of an XSL pre-filter before the node is indexed. The XSL pre-filter used is the one defined for the current index in the XML configuration file passed to the TextIndexer. Note that the uri defined by the xtfUri member must be specified on these attributes for the XMLTextProcessor to recognize and process them.
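As a rough sketch of how an indexer might drive this class (assuming the XTF classes are imported, and that xtfHome, idxInfo, and idxSrc come from the indexer's configuration; only calls documented on this page are used):

```java
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

public class IndexingSketch {
    // Hypothetical driver; xtfHome, idxInfo, and idxSrc would come from the
    // indexer's configuration in a real run.
    static void indexOne(String xtfHome, IndexInfo idxInfo, IndexSource idxSrc)
        throws IOException, SAXException, ParserConfigurationException {
        XMLTextProcessor processor = new XMLTextProcessor();
        try {
            // 'clean' = false adds to an existing index; 'ignoreFileTimes' = false
            // keeps the incremental file-time checks enabled.
            processor.open(xtfHome, idxInfo, false, false);

            // Queue the source (only if new or changed), then index everything
            // queued in one pass so the Lucene database is opened and closed once.
            processor.checkAndQueueText(idxSrc);
            processor.processQueuedTexts();

            // Optional (and potentially time-consuming) optimization pass.
            processor.optimizeIndex();
        }
        finally {
            processor.close();
        }
    }
}
```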
Nested Class Summary | |
---|---|
private class |
XMLTextProcessor.FileQueueEntry
|
private class |
XMLTextProcessor.MetaField
|
Field Summary | |
---|---|
private CharMap |
accentMap
The set of accented chars to remove diacritics from. |
private StringBuffer |
accumText
A buffer used to accumulate actual words from the source text, along with "virtual words" implied by any sectiontype and
proximitybreak attributes encountered, as well as various
special markers used to locate where in the XML source text the indexed
text is stored. |
private StringBuffer |
blurbedText
A buffer containing the "blurbified" text to be stored in the index. |
private static int |
bufStartSize
Initial size for various text accumulation buffers used by this class. |
private char[] |
charBuf
Character buffer for accumulating partial text blocks (possibly) passed in to the characters()
method from the SAX parser. |
private int |
charBufPos
Current end of the charBuf buffer. |
private int |
chunkCount
The number of chunks of XML source text that have been processed. |
private int |
chunkStartNode
The XML node in which the current chunk starts (may be different from the current node being processed, since chunks may span nodes.) |
private int |
chunkWordCount
Number of words accumulated so far for the current chunk. |
private int |
chunkWordOffset
The nodeWordCount
at which the current chunk begins. |
private int |
chunkWordOvlp
The number of words of overlap between two adjacent chunks. |
private int |
chunkWordOvlpStart
The start word offset at which the next overlapping chunk begins. |
private int |
chunkWordSize
The size in words of a chunk. |
private StringBuffer |
compactedAccumText
A version of the accumText member
where individual "virtual words" have been compacted down into special
offset markers. |
private IndexRecord |
curIdxRecord
The current record being indexed within curIdxSrc |
private IndexSource |
curIdxSrc
The location of the XML source text file currently being indexed. |
private int |
curNode
The XML node from which we are currently reading source text. |
private String |
curPrettyKey
Display name of the current file |
private int |
docWordCount
Number of words encountered so far for the current document. |
private LinkedList |
fileQueue
List of files to process. |
private boolean |
forcedChunk
Flag indicating that a new chunk needs to be created. |
private boolean |
ignoreFileTimes
Whether to ignore file modification times |
private IndexInfo |
indexInfo
A reference to the configuration information for the current index being updated. |
private String |
indexPath
The base directory for the current Lucene database. |
private IndexReader |
indexReader
A Lucene index reader object, used in conjunction with the indexSearcher to check if
the current document needs to be added to, updated in, or
removed from the index. |
private IndexSearcher |
indexSearcher
A Lucene index searcher object, used in conjunction with the indexReader to check if the
current document needs to be added to, updated in, or removed
from the index. |
private IndexWriter |
indexWriter
A Lucene index writer object, used to add documents to, or update documents in, the index currently opened for writing. |
private int |
inMeta
Flag indicating how deeply nested in a meta-data section the current text/tag being processed is. |
private LazyTreeBuilder |
lazyBuilder
Object used to construct a "lazy tree" representation of the XML source file being processed. |
private ReceivingContentHandler |
lazyHandler
Wrapper for the lazyReceiver object
that translates SAX events to Saxon's internal Receiver API. |
private Receiver |
lazyReceiver
SAX Handler object for processing XML nodes into a "lazy tree" representation of the source document. |
private StructuredStore |
lazyStore
Storage for the "lazy tree" |
private static int |
MAX_DELETION_BATCH
Maximum number of document deletions to do in a single batch |
private StringBuffer |
metaBuf
A buffer for accumulating meta-text from the source XML file. |
private XMLTextProcessor.MetaField |
metaField
The current meta-field data being processed. |
private int |
nextChunkStartIdx
The character index in the chunk text accumulation buffer where the next overlapping chunk begins. |
private int |
nextChunkStartNode
Start node of next chunk. |
private int |
nextChunkWordCount
Number of words encountered so far for the next chunk. |
private int |
nextChunkWordOffset
The nodeWordCount
at which the next chunk begins. |
private int |
nodeWordCount
Number of words encountered so far for the current XML node. |
private WordMap |
pluralMap
The set of plural words to de-pluralize while indexing. |
private SectionInfoStack |
section
Stack containing the nesting level of the current text being processed. |
private SpellWriter |
spellWriter
Queues words for spelling dictionary creator |
private Set |
stopSet
The set of stop words to remove while indexing. |
private HashSet<String> |
subDocsWritten
Sub-documents that have been written out. |
private HashSet |
tokenizedFields
Keeps track of fields we already know are tokenized |
private String |
xtfHomePath
The base directory from which to resolve relative paths (if any) |
private static String |
xtfUri
The namespace string used to identify attributes that must be processed by the XMLTextProcessor class. |
Constructor Summary | |
---|---|
XMLTextProcessor()
|
Method Summary | |
---|---|
private void |
addToTokenizedFieldsFile(String field)
Adds a field to the on-disk list of tokenized fields for an index. |
void |
batchDelete()
If the first entry in the file queue requires deletion, we start a batch delete of up to MAX_DELETION_BATCH deletions. |
private void |
blurbify(StringBuffer text,
boolean trim)
Convert the given source text into a "blurb." |
void |
characters(char[] ch,
int start,
int length)
Accumulate chunks of text encountered between element/node/tags. |
void |
checkAndQueueText(IndexSource idxSrc)
Check and conditionally queue a source text file for (re)indexing. |
private int |
checkFile(IndexSource srcInfo)
Check to see if the current XML source text file exists in the Lucene database, and if so, whether or not it needs to be updated. |
void |
close()
Close the Lucene index. |
private void |
compactVirtualWords()
Compacts multiple adjacent virtual words into a special "virtual word count" token. |
private void |
copyDependentFile(String filePath,
String fieldName,
Document doc)
|
private void |
createIndex(IndexInfo indexInfo)
Utility function to create a new Lucene index database for reading or searching. |
boolean |
docExists(String key)
Checks if a given document exists in the index. |
void |
endDocument()
Perform any final document processing when the end of the XML source text has been reached. |
void |
endElement(String uri,
String localName,
String qName)
Process the end of a new XML source text element/node/tag. |
void |
endPrefixMapping(String prefix)
|
void |
flushCharacters()
Process any accumulated source text, indexing completed chunks and writing them to the Lucene database as necessary. |
private void |
forceNewChunk(SectionInfo secInfo)
Forces subsequent text to start at the beginning of a new chunk. |
private String |
getIndexPath()
Returns a normalized version of the base path of the Lucene database for an index. |
int |
getQueueSize()
Find out how many texts have been queued up using queueText(IndexSource, boolean) but not yet processed by
processQueuedTexts() . |
private void |
incrementNode()
Increment the node tracking information. |
private void |
indexText(SectionInfo secInfo)
Add the current accumulated chunk of text to the Lucene database for the active index. |
private void |
insertVirtualWords(StringBuffer text)
Inserts "virtual words" into the specified text as needed. |
private void |
insertVirtualWords(String vWord,
int count,
StringBuffer text,
int pos)
Utility function used by the main insertVirtualWords()
method to insert a specified number of virtual word symbols. |
private static boolean |
isAllWhitespace(String str,
int start,
int end)
Utility function to check if a string or a portion of a string is entirely whitespace. |
private boolean |
isEndOfSentence(int idx,
int len,
StringBuffer text)
Utility function to determine if the current character marks the end of a sentence. |
private boolean |
isSentencePunctuationChar(char theChar)
Utility function to detect sentence punctuation characters. |
void |
open(String homePath,
IndexInfo idxInfo,
boolean clean)
Version retained for source-level backward compatibility, since this API is sometimes used externally. |
void |
open(String homePath,
IndexInfo idxInfo,
boolean clean,
boolean ignoreFileTimes)
Open a TextIndexer (Lucene) index for reading or writing. |
private void |
openIdxForReading()
Open the active Lucene index database for reading (and deleting, an oddity in Lucene). |
private void |
openIdxForWriting()
Open the active Lucene index database for writing. |
void |
optimizeIndex()
Runs an optimization pass (which can be quite time-consuming) on the currently open index. |
private int |
parseText()
Parse the XML source text file specified. |
private void |
precacheXSLKeys()
To speed accesses in dynaXML, the lazy tree is capable of storing pre-cached indexes to support each xsl:key declaration. |
void |
processingInstruction(String target,
String data)
|
private String |
processMetaAttribs(Attributes atts)
Build a string representing any non-XTF attributes in the given attribute list. |
private void |
processNodeAttributes(Attributes atts)
Process the attributes associated with an XML source text node. |
void |
processQueuedTexts()
Process the list of files queued for indexing or reindexing. |
private int |
processText(IndexSource file,
IndexRecord record,
int recordNum)
Add the specified XML source record to the active Lucene index. |
void |
queueText(IndexSource idxSrc)
Queue a source text file for indexing. |
void |
queueText(IndexSource srcInfo,
boolean deleteFirst)
Queue a source text file for (re)indexing. |
boolean |
removeSingleDoc(File srcFile,
String key)
Remove a single document from the index. |
private void |
saveDocInfo(SectionInfo secInfo)
Save document information associated with a collection of chunks. |
void |
startDocument()
Process the start of a new XML document. |
void |
startElement(String uri,
String localName,
String qName,
Attributes atts)
Process the start of a new XML source text element/node/tag. |
void |
startPrefixMapping(String prefix,
String uri)
|
private int |
trimAccumText(boolean oneEndSpace)
Utility method to trim trailing space characters from the end of the accumulated chunk text buffer. |
private boolean |
trueOrFalse(String value,
boolean defaultResult)
Utility function to check if a string contains the word true or false or the equivalent values yes or no. |
Methods inherited from class DefaultHandler |
---|
error, fatalError, ignorableWhitespace, notationDecl, resolveEntity, setDocumentLocator, skippedEntity, unparsedEntityDecl, warning |
Methods inherited from class Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private static final int bufStartSize
private int chunkCount
private int curNode
private int docWordCount
private HashSet<String> subDocsWritten
private int nodeWordCount
Number of words encountered so far for the current XML node. This value is copied to chunkWordOffset whenever a new text chunk is started.

private int chunkStartNode

private int chunkWordCount

private int chunkWordOffset
The nodeWordCount at which the current chunk begins. This value is stored with the chunk in the index so that the search engine knows where a chunk appears in the original source text.

private int nextChunkStartNode
Start node of the next chunk. Copied to chunkStartNode when processing for the current node is complete.

private int nextChunkWordCount
Number of words encountered so far for the next chunk. Copied to chunkWordCount when processing for the current node is complete.

private int nextChunkWordOffset
The nodeWordCount at which the next chunk begins. Used to track the offset of the next overlapping chunk while the current chunk is being processed. Copied to chunkWordOffset when processing for the current node is complete.

private int nextChunkStartIdx

private int chunkWordSize

private int chunkWordOvlp

private int chunkWordOvlpStart
The start word offset at which the next overlapping chunk begins, calculated from chunkWordSize and chunkWordOvlp.
private Set stopSet
The set of stop words to remove while indexing. See IndexInfo.stopWords for details.

private WordMap pluralMap
The set of plural words to de-pluralize while indexing. See IndexInfo.pluralMapPath for details.

private CharMap accentMap
The set of accented chars to remove diacritics from. See IndexInfo.accentMapPath for details.

private boolean forcedChunk
Flag indicating that a new chunk needs to be created. Set to true when a node's section name changes or a proximitybreak attribute is encountered.

private IndexInfo indexInfo
A reference to the configuration information for the current index being updated. See the IndexInfo class description for more details.
private boolean ignoreFileTimes
private String xtfHomePath
private LinkedList fileQueue
List of files to process. Queued files are indexed by the processQueuedTexts() method.

private IndexSource curIdxSrc
The location of the XML source text file currently being indexed. For more information, see the IndexSource class.

private IndexRecord curIdxRecord
The current record being indexed within curIdxSrc.
private String curPrettyKey
private String indexPath
private LazyTreeBuilder lazyBuilder
Object used to construct a "lazy tree" representation of the XML source file being processed. For more information, see the LazyTreeBuilder class.

private StructuredStore lazyStore
Storage for the "lazy tree".

private Receiver lazyReceiver
SAX Handler object for processing XML nodes into a "lazy tree" representation of the source document. See the lazyBuilder member.

private ReceivingContentHandler lazyHandler
Wrapper for the lazyReceiver object that translates SAX events to Saxon's internal Receiver API. See the lazyReceiver and lazyBuilder members for more details.

private char[] charBuf
Character buffer for accumulating partial text blocks (possibly) passed in to the characters() method from the SAX parser.

private int charBufPos
Current end of the charBuf buffer.
private int inMeta
private XMLTextProcessor.MetaField metaField
private StringBuffer metaBuf
private IndexReader indexReader
A Lucene index reader object, used in conjunction with the indexSearcher to check if the current document needs to be added to, updated in, or removed from the index.

private IndexSearcher indexSearcher
A Lucene index searcher object, used in conjunction with the indexReader to check if the current document needs to be added to, updated in, or removed from the index.
private IndexWriter indexWriter
private SpellWriter spellWriter
private HashSet tokenizedFields
private static final int MAX_DELETION_BATCH
private StringBuffer blurbedText
A buffer containing the "blurbified" text to be stored in the index. See the blurbify() method.

private StringBuffer accumText
A buffer used to accumulate actual words from the source text, along with "virtual words" implied by any sectiontype and proximitybreak attributes encountered, as well as various special markers used to locate where in the XML source text the indexed text is stored.

private StringBuffer compactedAccumText
A version of the accumText member where individual "virtual words" have been compacted down into special offset markers. To learn more about "virtual words", see the insertVirtualWords() and compactVirtualWords() methods.

private SectionInfoStack section
Stack containing the nesting level of the current text being processed. See the SectionInfoStack class for more about section nesting.

private static final String xtfUri
The namespace string used to identify attributes that must be processed by the XMLTextProcessor class. For attributes to be recognized by the XMLTextProcessor, this string ("http://cdlib.org/xtf") must be set as the attribute's uri. To learn more about pre-filter attributes, see the XMLTextProcessor class description.
Constructor Detail |
---|
public XMLTextProcessor()
Method Detail |
---|
public void open(String homePath, IndexInfo idxInfo, boolean clean) throws IOException
Version retained for source-level backward compatibility, since this API is sometimes used externally.
IOException

public void open(String homePath, IndexInfo idxInfo, boolean clean, boolean ignoreFileTimes) throws IOException
Open a TextIndexer (Lucene) index, specified by cfgInfo, for reading and searching. Index reading and searching operations are used to clean, cull, or optimize an index. Opening an index for writing is performed by the method openIdxForWriting() only when the index is being updated with new document information.
homePath - Path from which to resolve relative path names.
idxInfo - A config structure containing information about the index to open.
clean - true to truncate any existing index; false to add to it.
ignoreFileTimes - true to ignore file time checks (only applies during incremental indexing).
IOException - Any I/O exceptions that occurred during the opening, creation, or truncation of the Lucene index.
A new index is created if the clean flag is set in the cfgInfo structure. This method also records the passed configuration structure in the indexInfo member for use by other methods in this class.

public void close() throws IOException
Close the Lucene index.
IOException - Any I/O exceptions that occurred during the closing of the Lucene index.
This method closes any indexReader, indexWriter, or indexSearcher objects open for the current Lucene index.
private void createIndex(IndexInfo indexInfo) throws IOException
Utility function to create a new Lucene index database for reading or searching. This function is used by the open() method to create a new or clean index for reading and searching.
IOException - Any I/O exceptions that occurred during the deletion of a previous Lucene database or during the creation of the new index currently specified by the internal indexInfo structure.

private void copyDependentFile(String filePath, String fieldName, Document doc) throws IOException
IOException
public void checkAndQueueText(IndexSource idxSrc) throws ParserConfigurationException, SAXException, IOException
Check and conditionally queue a source text file for (re)indexing.
idxSrc - The source to add to the queue of sources to be indexed/reindexed.
ParserConfigurationException
SAXException
IOException
Queued sources are eventually indexed by a call to the processQueuedTexts() method.

public void queueText(IndexSource idxSrc)
Queue a source text file for indexing.
idxSrc - The data source to add to the queue of sources to be indexed/reindexed.
Queued sources are eventually indexed by a call to the processQueuedTexts() method.

public void queueText(IndexSource srcInfo, boolean deleteFirst)
Queue a source text file for (re)indexing.
srcInfo - The source XML text file to add to the queue of files to be indexed/reindexed.
Queued files are eventually indexed by a call to the processQueuedTexts() method.

public int getQueueSize()
Find out how many texts have been queued up using queueText(IndexSource, boolean) but not yet processed by processQueuedTexts().
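For example, a caller holding an open XMLTextProcessor (here called processor, as in the earlier sketch) might force reindexing of a batch of sources and then process the whole queue in one pass; the collection of sources is hypothetical:

```java
// Sketch: force reindexing of a batch of sources (deleteFirst = true removes
// any existing copy of each document before it is reindexed).
static void reindexAll(XMLTextProcessor processor, Iterable<IndexSource> sources)
    throws IOException {
    for (IndexSource src : sources)
        processor.queueText(src, true);

    if (processor.getQueueSize() > 0)
        processor.processQueuedTexts();  // one open/index/close cycle for all
}
```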
public boolean removeSingleDoc(File srcFile, String key) throws ParserConfigurationException, SAXException, IOException
Remove a single document from the index.
srcFile - The original XML source file, used to calculate the location of the corresponding *.lazy file to delete. If null, this step is skipped.
key - The key associated with the document in the index.
ParserConfigurationException
SAXException
IOException

public boolean docExists(String key) throws ParserConfigurationException, SAXException, IOException
Checks if a given document exists in the index.
key - The key associated with the document in the index.
ParserConfigurationException
SAXException
IOException
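A hedged sketch of using these two methods together (the meaning of removeSingleDoc's boolean return value isn't documented here, so the sketch ignores it; processor and key are assumed from context):

```java
// Sketch: drop a stale document before requeueing it for indexing.
static void removeIfPresent(XMLTextProcessor processor, String key)
    throws IOException, SAXException, ParserConfigurationException {
    if (processor.docExists(key)) {
        // Passing null for srcFile skips deletion of the corresponding *.lazy file.
        processor.removeSingleDoc(null, key);
    }
}
```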
public void batchDelete() throws IOException
If the first entry in the file queue requires deletion, we start a batch delete of up to MAX_DELETION_BATCH deletions. We batch these up because in Lucene, you can only delete with an IndexReader. It costs time to close our IndexWriter, open an IndexReader for the deletions, and then reopen the IndexWriter.
IOException - Any I/O exceptions encountered when reading the source text file or writing to the Lucene index.

public void processQueuedTexts() throws IOException
Process the list of files queued for indexing or reindexing.
IOException - Any I/O exceptions encountered when reading the source text file or writing to the Lucene index.
Originally, the XMLTextProcessor opened the Lucene database, (re)indexed the source file, and then closed the database for each XML file encountered in the source tree. Unfortunately, opening and closing the Lucene database is a fairly time-consuming operation, and doing so for each file made the time to index an entire source tree much higher than it had to be. So to minimize the open/close overhead, the XMLTextProcessor was changed to traverse the source tree first and collect all the XML filenames it found into a processing queue. Once the files were queued, the Lucene database could be opened, all the files in the queue could be (re)indexed, and the database could be closed. Doing so significantly reduced the time to index the entire source tree.
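The batching idea itself, independent of Lucene's API, can be sketched as follows; openReaderForDeleting(), deleteByKey(), and reopenWriter() are hypothetical stand-ins for switching the index from writing to reading (deleting) and back:

```java
import java.util.Queue;

public class BatchDeleteSketch {
    static final int MAX_DELETION_BATCH = 100;  // illustrative value only

    // Hypothetical stand-ins for closing the IndexWriter, opening an
    // IndexReader for deletions, and reopening the IndexWriter afterwards.
    static void openReaderForDeleting() { }
    static void deleteByKey(String key) { }
    static void reopenWriter()          { }

    /** Drain up to MAX_DELETION_BATCH pending deletions from the front of the
     *  queue so the costly reader/writer switch happens once per batch. */
    static void batchDelete(Queue<String> deletionQueue) {
        if (deletionQueue.isEmpty())
            return;
        openReaderForDeleting();
        for (int n = 0; n < MAX_DELETION_BATCH && !deletionQueue.isEmpty(); n++)
            deleteByKey(deletionQueue.remove());
        reopenWriter();
    }
}
```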

private int processText(IndexSource file, IndexRecord record, int recordNum) throws IOException
Add the specified XML source record to the active Lucene index, identified by the indexPath member.
file - The XML source text file to process.
record - Record within the XML file to process.
recordNum - Zero-based index of this record in the XML file.
IOException - Any I/O errors encountered opening or reading the XML source text file or the Lucene database.
For a description of the indexing process, see the XMLTextProcessor class description.

private int parseText()
Parse the XML source text file specified. Parsing causes the startDocument(), startElement(), endElement(), endDocument(), and characters() methods in this class to be called. These methods in turn process the actual text in the XML source document, "blurbifying" the text, breaking it up into overlapping chunks, and adding it to the Lucene index.
0 - XML source file successfully parsed and indexed.
-1 - One or more errors encountered processing the XML source file.
For more about "blurbified" text, see the blurbify() method. The attributes sectiontype and proximitybreak are assumed to be prefixed by the namespace xtf. If an XSL pre-filter is specified in the indexInfo member, the XML file will be prefiltered with the specified XSL filter before XML parsing begins. This allows node attributes to be inserted that modify the proximity of various text sections as well as boost or de-emphasize the relevance of sections of text. For a description of attributes handled by this XML parser, see the XMLTextProcessor class description.

private void precacheXSLKeys() throws Exception
To speed accesses in dynaXML, the lazy tree is capable of storing pre-cached indexes to support each xsl:key declaration.
Exception - If anything goes awry.

public void startDocument() throws SAXException
Process the start of a new XML document.
startDocument in interface ContentHandler
startDocument in class DefaultHandler
SAXException - Any exceptions encountered by the lazyBuilder during start of document processing.
Start-of-document processing is passed on to the lazyBuilder.

public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
Process the start of a new XML source text element/node/tag.
startElement in interface ContentHandler
startElement in class DefaultHandler
uri - Any namespace qualifier that applies to the current XML tag.
localName - The non-qualified name of the current XML tag.
qName - The qualified name of the current XML tag.
atts - Any attributes for the current tag. Note that only attributes that are in the namespace specified by the xtfUri member of this class are actually processed by this method.
SAXException - Any exceptions generated by calls to "lazy tree" or Lucene database access methods.
This method first processes any text accumulated for the previous node via the flushCharacters() method. It also calls the lazyHandler object to write the accumulated text to the "lazy tree" representation of the XML source file. Finally, it resets the node tracking information to match the new node, including any boost or bump attributes set for the new node.

private String processMetaAttribs(Attributes atts)
Build a string representing any non-XTF attributes in the given attribute list.
private void incrementNode()
public void endElement(String uri, String localName, String qName) throws SAXException
Process the end of an XML source text element/node/tag.
endElement in interface ContentHandler
endElement in class DefaultHandler
uri - Any namespace qualifier that applies to the current XML tag.
localName - The non-qualified name of the current XML tag.
qName - The qualified name of the current XML tag.
SAXException - Any exceptions generated by calls to "lazy tree" or Lucene database access methods.
This method first processes any text accumulated for the current node via the flushCharacters() method. It also calls the lazyHandler object to write the accumulated text to the "lazy tree" representation of the XML source file. Finally, it returns the node tracking information back to a state that matches the parent node, including any boost or bump attributes previously set for that node.

public void processingInstruction(String target, String data) throws SAXException
processingInstruction
in interface ContentHandler
processingInstruction
in class DefaultHandler
SAXException
public void startPrefixMapping(String prefix, String uri) throws SAXException
startPrefixMapping
in interface ContentHandler
startPrefixMapping
in class DefaultHandler
SAXException
public void endPrefixMapping(String prefix) throws SAXException
endPrefixMapping
in interface ContentHandler
endPrefixMapping
in class DefaultHandler
SAXException
public void endDocument() throws SAXException
endDocument
in interface ContentHandler
endDocument
in class DefaultHandler
SAXException
- Any exceptions generated during the final writing
of the Lucene database or the "lazy tree"
representation of the XML file. public void characters(char[] ch, int start, int length) throws SAXException
characters
in interface ContentHandler
characters
in class DefaultHandler
ch - A block of characters from which to accumulate text.
start - The starting offset in ch of the characters to accumulate.
length - The number of characters to accumulate from ch.
SAXException
For the XMLTextProcessor to correctly assemble overlapping chunks for the Lucene database, it needs to have all the characters between two tags available as a single chunk of text. Consequently, this method simply accumulates text; calls from the XML parser to startElement() and endElement() trigger the actual processing of accumulated text.

public void flushCharacters() throws SAXException
Process any accumulated source text, indexing completed chunks and writing them to the Lucene database as necessary.
SAXException - Any exceptions encountered during the processing of the accumulated chunks, or writing them to the Lucene index.
The chunking of accumulated text proceeds as follows:
1. First, the accumulated text is "blurbified." See the blurbify() method for more information about what this entails.
2. Next, a chunk is assembled a word at a time from the accumulated text until the required chunk size (in words) is reached. The completed chunk is then added to the Lucene database.
3. Step two is repeated until no more complete chunks can be assembled from the accumulated text. (Any partial chunk text is saved until the next call to this method.)
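The word-at-a-time assembly with overlap carry-over described in steps 2 and 3 can be sketched as a standalone illustration (not the actual implementation); the size and overlap parameters correspond to the chunkWordSize and chunkWordOvlp fields:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkAssemblySketch {
    /** Assemble complete chunks from 'words', carrying the trailing overlap
     *  words into the next chunk; any leftover partial chunk is returned so
     *  the caller can prepend it to the next batch of accumulated text. */
    static List<String> assembleChunks(List<String> words, int chunkWordSize,
                                       int chunkWordOvlp, List<List<String>> out) {
        List<String> current = new ArrayList<>();
        for (String word : words) {
            current.add(word);
            if (current.size() == chunkWordSize) {
                out.add(new ArrayList<>(current));            // completed chunk
                // Start the next chunk with the last 'chunkWordOvlp' words.
                current = new ArrayList<>(
                    current.subList(chunkWordSize - chunkWordOvlp, chunkWordSize));
            }
        }
        return current; // partial chunk, saved until more text arrives
    }
}
```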
private void forceNewChunk(SectionInfo secInfo)
private int trimAccumText(boolean oneEndSpace)
oneEndSpace
- Flag indicating whether the accumulated chunk text
buffer should be completely stripped of trailing
whitespace, or if one ending space should be left.
private void blurbify(StringBuffer text, boolean trim)
Convert the given source text into a "blurb."
text - Upon entry, the text to be converted into a "blurb." Upon return, the resulting "blurbed" text.
trim - A flag indicating whether or not leading and trailing whitespace should be trimmed from the resulting "blurb" text.
text. private void insertVirtualWords(StringBuffer text)
text
- The text into which virtual words should be inserted.
Luke Luck likes lakes.When virtual words are inserted into this text from Dr. Seuss' Fox in Sox, the resulting blurb text that is added to the index looks as follows:
Luke's duck likes lakes.
Luke Luck licks lakes.
Luke's duck licks lakes.
Duck takes licks in lakes Luck Luck likes.
And Luke Luck takes licks in lakes duck likes.
Luke Luck likes lakes. vw vw vw vw vwBecause of the virtual word insertion, the Luke at the beginning of the first sentence is considered to be two words away from the lakes at end, and the Luke at the beginning of the second sentence is considered to be five words away from the lakes at the end of the first sentence. The result is that in these sentences, the Lukes are considered closer to the lakes in their respective sentences than the ones in the adjacent sentences.
Luke's duck likes lakes. vw vw vw vw vw
Luke Luck licks lakes. vw vw vw vw vw
Luke's duck licks lakes. vw vw vw vw vw
Duck takes licks in lakes Luck Luck likes. vw vw vw vw vw
And Luke Luck takes licks in lakes duck likes.
VIRTUAL_WORD
member of the Constants
class, and has been chosen
to be unlikely to appear in any actual western text. compactVirtualWords()
.
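The effect shown above can be sketched as a simple pass that appends the default sentence bump of five virtual-word tokens after each sentence (a standalone illustration; the real method also accounts for section and proximity bumps, and the actual token text comes from the VIRTUAL_WORD member of the Constants class):

```java
public class VirtualWordSketch {
    /** Append 'bump' copies of the virtual-word token after each
     *  sentence-ending character (illustration only). */
    static String insertVirtualWords(String text, String vw, int bump) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.append(c);
            if (c == '.' || c == '!' || c == '?') {       // end of sentence
                for (int j = 0; j < bump; j++)
                    out.append(' ').append(vw);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(insertVirtualWords(
            "Luke Luck likes lakes. Luke's duck likes lakes.", "vw", 5));
        // -> Luke Luck likes lakes. vw vw vw vw vw Luke's duck likes lakes. vw vw vw vw vw
    }
}
```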
private boolean isEndOfSentence(int idx, int len, StringBuffer text)
Utility function to determine if the current character marks the end of a sentence.
idx - The character offset in the accumulated chunk text buffer to check.
len - The total length of the text in the text buffer passed.
text - The text buffer to check.
true - The current character marks the end of a sentence.
false - The current character does not mark
the end of a sentence.

private boolean isSentencePunctuationChar(char theChar)
Utility function to detect sentence punctuation characters.
theChar - The character to check.
true - The specified character is a sentence punctuation character.
false - The specified character is not a
sentence punctuation character.

private void insertVirtualWords(String vWord, int count, StringBuffer text, int pos)
Utility function used by the main insertVirtualWords() method to insert a specified number of virtual word symbols.
vWord - The virtual word symbol to insert.
count - The number of virtual words to insert.
text - The text to insert the virtual words into.
pos - The character index in the text at which to insert the virtual words.
This function is called only by the main insertVirtualWords()
method.

private void indexText(SectionInfo secInfo)
Add the current accumulated chunk of text to the Lucene database for the active index.
secInfo - Info such as sectionType, wordBoost, etc.
The chunk is added to the index identified by the indexInfo configuration member. This includes compacting virtual words via the compactVirtualWords() method, and recording the unique document identifier (key) for the chunk, the section type for the chunk, the word boost for the chunk, the XML node in which the chunk begins, the word offset of the chunk, and the "blurbified" text for the chunk.

private static boolean isAllWhitespace(String str, int start, int end)
Utility function to check if a string or a portion of a string is entirely whitespace.
str - String to check for all whitespace.
start - First character in string to check.
end - One index past the last character to check.
true - The specified range of the string is all whitespace.
false - The specified range of the string is not all whitespace.

private void compactVirtualWords() throws IOException
Compacts multiple adjacent virtual words into a special "virtual word count" token.
IOException - Any exceptions generated by low level string operations.
Virtual words are inserted by the insertVirtualWords() method. The special count token is based on the BUMP_MARKER member of the Constants class.

private boolean trueOrFalse(String value, boolean defaultResult)
Utility function to check if a string contains the word true or false or the equivalent values yes or no.
value - The string to check the value of.
defaultResult - The boolean value to default to if the string doesn't contain true, false, yes or no.

private void processNodeAttributes(Attributes atts)
Process the attributes associated with an XML source text node.
atts - The attribute list to process.
See the section member for more details on section nesting. For a description of the attributes handled, see the XMLTextProcessor class description.

private void saveDocInfo(SectionInfo secInfo)
Save document information associated with a collection of chunks.
This information is used by the IdxTreeCleaner class to strip out any partially indexed documents.

private void addToTokenizedFieldsFile(String field)
Adds a field to the on-disk list of tokenized fields for an index.
private String getIndexPath() throws IOException
Returns a normalized version of the base path of the Lucene database for an index.
IOException - Any exceptions generated retrieving the path for a Lucene database.

private int checkFile(IndexSource srcInfo) throws IOException
Check to see if the current XML source text file exists in the Lucene database, and if so, whether or not it needs to be updated.
0 - Specified XML source document not found in the Lucene database.
1 - Specified XML source document found in the index, and the index is up-to-date.
2 - Specified XML source document is in the index, but the source text has changed since it was last indexed.
IOException
The file checked is the one identified by the curIdxSrc member.

public void optimizeIndex() throws IOException
Runs an optimization pass (which can be quite time-consuming) on the currently open index.
IOException
private void openIdxForReading() throws IOException
Open the active Lucene index database, identified by the indexPath member, for reading and/or deleting. It is strange that you delete things from a Lucene index by using an IndexReader, but hey, whatever floats your boat man.
IOException - Any exceptions generated during the creation of the Lucene database reader object.

private void openIdxForWriting() throws IOException
Open the active Lucene index database, identified by the indexPath member, for writing.
IOException - Any exceptions generated during the creation of the Lucene database writer object.