org.apache.lucene.mark
Class ContextMarker

Object
  extended by ContextMarker

public class ContextMarker
extends Object

Workhorse class that handles marking hits, context surrounding hits, and search terms.

Created: Dec 26, 2004

Author:
Martin Haye

Field Summary
private  MarkCollector collector
          Client instance which receives the resulting marks
private  WordIter iter0
          Iterator used for locating the start of the hit/context
private  WordIter iter1
          Iterator used for locating the end of the hit/context
static int MARK_ALL_TERMS
          See MARK_NO_TERMS
static int MARK_CONTEXT_TERMS
          See MARK_NO_TERMS
static int MARK_NO_TERMS
          Term marking mode: terms are not marked. The other modes are MARK_SPAN_TERMS, MARK_CONTEXT_TERMS, and MARK_ALL_TERMS.
static int MARK_SPAN_TERMS
          See MARK_NO_TERMS
private  int maxContext
          Target size (in chars) of the context surrounding each hit
private  int prevEndWord
          End of the previous context
private  Set stopSet
          Set of stop-words to avoid marking outside of hits
private  int termMode
          Whether to mark terms inside/outside hits, context, etc.
private  Set terms
          Set of search terms to mark
private  int termsMarkedPos
          Word position up to which we've marked all terms
private  MarkPos tmpPos
          Used for temporary position storage
 
Constructor Summary
ContextMarker(int maxContext, int termMode, Set terms, Set stopSet, WordIter wordIter, MarkCollector collector)
          Construct a new marker
 
Method Summary
(package private)  void emitMarks(Span posSpan, MarkPos contextStart, MarkPos contextEnd)
          Emit all the marks for the given hit.
(package private)  void findContext(Span posSpan, Span nextSpan, MarkPos contextStart, MarkPos contextEnd)
          Locate the start and end of context for the given hit.
 void mark(Span[] posOrderSpans, int maxContext)
          Mark a series of spans.
static void markField(FieldSpans fieldSpans, String field, WordIter iter, int maxContext, int termMode, Set stopSet, MarkCollector collector)
          Mark context, spans, and terms within a field of data.
 void markField(String field, FieldSpans fieldSpans, MarkCollector collector)
          Mark context, spans, and terms within the given field of this document.
 void markField(String field, FieldSpans fieldSpans, WordIter iter, int maxContext, int termMode, Set stopSet, MarkCollector collector)
          Mark context, spans, and terms within the given field of this document.
private  void markTerms(WordIter iter, int fromPos, int toPos, boolean markStopWords)
          Mark terms from 'fromPos' up to (but not including) 'toPos'
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MARK_NO_TERMS

public static final int MARK_NO_TERMS
The following modes can be used for term marking:

MARK_NO_TERMS: Terms are not marked.

MARK_SPAN_TERMS: Search terms are marked only within span hits.

MARK_CONTEXT_TERMS: Search terms are marked within span hits and, if found, within the context surrounding those hits.

MARK_ALL_TERMS: Search terms are marked wherever they are found.

See Also:
Constant Field Values
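
The four modes form a progression from "mark nothing" to "mark everywhere". The sketch below restates the descriptions above as a small decision table. It is illustrative only: the TermMode and Region enums are hypothetical stand-ins (the real modes are int constants on ContextMarker), and it does not attempt to model stop-word handling.

```java
// Illustrative sketch only. TermMode and Region are hypothetical names,
// not part of ContextMarker, whose modes are public static final ints.
enum Region { IN_HIT, IN_CONTEXT, ELSEWHERE }

enum TermMode {
    NO_TERMS, SPAN_TERMS, CONTEXT_TERMS, ALL_TERMS;

    /** Returns true if a search term found in the given region is marked under this mode. */
    boolean marks(Region region) {
        switch (this) {
            case NO_TERMS:      return false;                       // terms are never marked
            case SPAN_TERMS:    return region == Region.IN_HIT;     // only inside span hits
            case CONTEXT_TERMS: return region != Region.ELSEWHERE;  // hits plus surrounding context
            case ALL_TERMS:     return true;                        // wherever the term is found
            default:            throw new AssertionError(this);
        }
    }
}
```

For example, TermMode.CONTEXT_TERMS.marks(Region.IN_CONTEXT) is true, while TermMode.SPAN_TERMS.marks(Region.IN_CONTEXT) is false.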

MARK_SPAN_TERMS

public static final int MARK_SPAN_TERMS
See MARK_NO_TERMS

See Also:
Constant Field Values

MARK_CONTEXT_TERMS

public static final int MARK_CONTEXT_TERMS
See MARK_NO_TERMS

See Also:
Constant Field Values

MARK_ALL_TERMS

public static final int MARK_ALL_TERMS
See MARK_NO_TERMS

See Also:
Constant Field Values

maxContext

private int maxContext
Target size (in chars) of the context surrounding each hit


iter0

private WordIter iter0
Iterator used for locating the start of the hit/context


iter1

private WordIter iter1
Iterator used for locating the end of the hit/context


collector

private MarkCollector collector
Client instance which receives the resulting marks


terms

private Set terms
Set of search terms to mark


stopSet

private Set stopSet
Set of stop-words to avoid marking outside of hits


termMode

private int termMode
Whether to mark terms inside/outside hits, context, etc. See MARK_NO_TERMS for the available modes.


termsMarkedPos

private int termsMarkedPos
Word position up to which we've marked all terms


tmpPos

private MarkPos tmpPos
Used for temporary position storage


prevEndWord

private int prevEndWord
End of the previous context

Constructor Detail

ContextMarker

public ContextMarker(int maxContext,
                     int termMode,
                     Set terms,
                     Set stopSet,
                     WordIter wordIter,
                     MarkCollector collector)
Construct a new marker

Method Detail

markField

public void markField(String field,
                      FieldSpans fieldSpans,
                      MarkCollector collector)
Mark context, spans, and terms within the given field of this document. Context around each hit will be up to 80 characters (including the text of the hit itself). Search terms will only be marked within hits. If you would like to override these defaults, use one of the other variations of this method.

Parameters:
field - field name to mark
fieldSpans - spans to mark with
collector - collector to receive the marks

markField

public void markField(String field,
                      FieldSpans fieldSpans,
                      WordIter iter,
                      int maxContext,
                      int termMode,
                      Set stopSet,
                      MarkCollector collector)
Mark context, spans, and terms within the given field of this document.

Parameters:
field - field name to mark
fieldSpans - spans to mark with
iter - iterator over the words in the field
maxContext - target number of characters for context around each hit (including the text of the hit itself). 80 is often a good choice. Specify zero to turn off context marking.
termMode - where search terms are marked; see MARK_NO_TERMS.
stopSet - set of stop words to avoid marking outside hits
collector - collector to receive the marks
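
Because maxContext counts the text of the hit itself, the room left for surrounding text shrinks as hits get longer. The arithmetic can be sketched as follows. ContextBudget and surroundingChars are hypothetical names used only for illustration, and how ContextMarker actually divides the remainder between the text before and after the hit is not documented here.

```java
// Illustrative sketch only. ContextBudget is a hypothetical helper, not part
// of the library; it shows just the arithmetic implied by the parameter text.
final class ContextBudget {
    private ContextBudget() {}

    /**
     * Characters left for text surrounding a hit once the hit's own length
     * is counted against maxContext. Returns 0 when context marking is
     * turned off (maxContext of zero) or the hit already fills the budget.
     */
    static int surroundingChars(int maxContext, int hitLength) {
        if (maxContext <= 0) {
            return 0; // zero turns context marking off
        }
        return Math.max(0, maxContext - hitLength);
    }
}
```

With the suggested value of 80, a 20-character hit leaves 60 characters for surrounding text under this reading.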

markField

public static void markField(FieldSpans fieldSpans,
                             String field,
                             WordIter iter,
                             int maxContext,
                             int termMode,
                             Set stopSet,
                             MarkCollector collector)
Mark context, spans, and terms within a field of data.

Parameters:
fieldSpans - spans to mark with
field - field name to mark
iter - iterator over the words in the field
maxContext - target number of characters for context around each hit (including the text of the hit itself). 80 is often a good choice. Specify zero to turn off context marking.
termMode - where search terms are marked; see MARK_NO_TERMS.
stopSet - set of stop words to avoid marking outside hits
collector - collector to receive the marks

mark

public void mark(Span[] posOrderSpans,
                 int maxContext)
Mark a series of spans.

Parameters:
posOrderSpans - Spans to mark, in ascending position order.
maxContext - Target number of characters for context around hits (0 for none)
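
Callers must supply the spans in ascending position order. A minimal sketch of establishing that order is shown below. DemoSpan is a hypothetical stand-in for the library's Span class, and its start and end fields are assumptions made for illustration; adapt the comparator to whatever position accessors the real class exposes.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch only. SpanOrdering and DemoSpan are hypothetical names.
final class SpanOrdering {
    private SpanOrdering() {}

    /** Hypothetical stand-in for Span: a hit covering word positions start to end. */
    static final class DemoSpan {
        final int start;
        final int end;

        DemoSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Returns a copy of the spans sorted by ascending start, ties broken by end. */
    static DemoSpan[] inPositionOrder(DemoSpan[] spans) {
        DemoSpan[] sorted = spans.clone(); // leave the caller's array untouched
        Arrays.sort(sorted,
                Comparator.comparingInt((DemoSpan s) -> s.start)
                          .thenComparingInt(s -> s.end));
        return sorted;
    }
}
```

Sorting a copy keeps any other ordering the caller relies on (for example, by score) intact.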

findContext

void findContext(Span posSpan,
                 Span nextSpan,
                 MarkPos contextStart,
                 MarkPos contextEnd)
Locate the start and end of context for the given hit.

Parameters:
posSpan - hit for which to find context
nextSpan - following hit (or null if none)
contextStart - OUT: start of context
contextEnd - OUT: end of context

emitMarks

void emitMarks(Span posSpan,
               MarkPos contextStart,
               MarkPos contextEnd)
Emit all the marks for the given hit.

Parameters:
posSpan - hit for which to emit marks
contextStart - start of context (or null if context disabled)
contextEnd - end of context (or null if context disabled)

markTerms

private void markTerms(WordIter iter,
                       int fromPos,
                       int toPos,
                       boolean markStopWords)
Mark terms from 'fromPos' up to (but not including) 'toPos'
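
Going by the parameter names, markTerms covers a half-open range of word positions: fromPos is included and toPos is excluded. The sketch below illustrates that convention over a plain array of words. TermRangeSketch and positionsToMark are hypothetical names, and the interaction between markStopWords and the stop set is an assumption inferred from the stopSet field description, not confirmed behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only. This is not the library's implementation; it
// operates on a String[] instead of a WordIter and returns positions
// instead of sending marks to a MarkCollector.
final class TermRangeSketch {
    private TermRangeSketch() {}

    /**
     * Returns the word positions in [fromPos, toPos) that hold a search term.
     * Assumption: when markStopWords is false, terms that are also in the
     * stop set are skipped.
     */
    static List<Integer> positionsToMark(String[] words, int fromPos, int toPos,
                                         Set<String> terms, Set<String> stopSet,
                                         boolean markStopWords) {
        List<Integer> marked = new ArrayList<>();
        int end = Math.min(toPos, words.length); // toPos itself is excluded
        for (int pos = Math.max(fromPos, 0); pos < end; pos++) {
            String word = words[pos];
            if (!terms.contains(word)) {
                continue; // not a search term
            }
            if (!markStopWords && stopSet.contains(word)) {
                continue; // stop word, and stop words are not being marked
            }
            marked.add(pos);
        }
        return marked;
    }
}
```

A half-open range lets consecutive calls share a boundary (the toPos of one call becomes the fromPos of the next) without marking any word twice, which fits the role of the termsMarkedPos field.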