Interface Summary | |
---|---|
DocNumMap | Provides information on the chunk size and chunk overlap for a given index, and provides a mapping from main documents to the chunks they are made of. |
Class Summary | |
---|---|
Chunk | Keeps track of all the tokens in a given chunk of text, also maintaining a reference back to the source of the chunk. |
ChunkedWordIter | Iterates over words in a large document that has been broken up into many overlapping Chunks. |
ChunkMarkPos | Tracks the position of a ChunkedWordIter as it progresses through a document which has been broken into chunks. |
ChunkSource | Reads and caches chunks from an index. |
SpanChunkedNotQuery | Removes matches which overlap with another SpanQuery, taking into account overlap between adjacent chunks in a chunked index. |
SpanDechunkingQuery | Wraps a SpanQuery, converting chunk spans to look like they're all part of the main document. |
SparseStringComparator | |
SparseStringComparator.EntryComparator | Compares two entries for sorting purposes. |
This package handles very large documents that have been indexed in overlapping chunks.
Lucene deals readily with documents in the range of a few bytes to maybe a hundred kilobytes. But throw a 10-megabyte document at it, and things start to break down. For one thing, generating snippets for a number of these documents becomes very slow.
One technique for dealing with these large documents is to index them in small "chunks", for instance, breaking a document into 200-word chunks. In order for proximity queries to still be effective, these chunks need to overlap. For instance, if one queried for "bat man", one would expect to get a hit even if "bat" appears at the end of one chunk and "man" appears at the start of the next.
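The overlap idea can be illustrated with a minimal, self-contained sketch. It is not part of this package (which does not do the chunking itself), and the class and method names (`OverlapChunkSketch`, `chunk`, `mainDocPosition`) are hypothetical. It assumes fixed-size chunks where each chunk repeats the last `overlap` words of the previous one, which may differ from how a real indexer lays out its chunks.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not a class from this package.
class OverlapChunkSketch {

    /**
     * Splits words into chunks of at most chunkSize words, where each chunk
     * starts (chunkSize - overlap) words after the previous one, so adjacent
     * chunks share `overlap` words.
     */
    static List<List<String>> chunk(List<String> words, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("need 0 <= overlap < chunkSize");
        }
        int step = chunkSize - overlap;
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < words.size(); start += step) {
            int end = Math.min(start + chunkSize, words.size());
            chunks.add(new ArrayList<>(words.subList(start, end)));
            if (end == words.size()) {
                break; // this chunk already reaches the end of the document
            }
        }
        return chunks;
    }

    /** Maps a word position inside a chunk back to its position in the main document. */
    static int mainDocPosition(int chunkIndex, int posInChunk, int chunkSize, int overlap) {
        return chunkIndex * (chunkSize - overlap) + posInChunk;
    }

    public static void main(String[] args) {
        List<String> words =
            List.of("the", "dark", "knight", "is", "bat", "man", "of", "gotham");

        // Tiny chunks so the effect is visible; a real index might use 200 words.
        List<List<String>> chunks = chunk(words, 5, 2);
        for (int i = 0; i < chunks.size(); i++) {
            System.out.println("chunk " + i + ": " + chunks.get(i));
        }

        // "bat" is word 1 of chunk 1; map it back to the main document.
        System.out.println("main-doc position: " + mainDocPosition(1, 1, 5, 2));
    }
}
```

With a chunk size of 5 and an overlap of 2, the first chunk ends with "bat" and the second chunk is "is bat man of gotham", so the phrase "bat man" falls entirely inside one chunk. With no overlap, "bat" and "man" would land in different chunks and a proximity query would miss the hit.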
Breaking documents into chunks isn't addressed by this package, but once you've indexed your documents that way, the classes in this package will help you query the chunked index. Follow these steps: