public class BigramStopFilter
extends TokenFilter
Modifier and Type | Field and Description |
---|---|
private int | accumIncrement - Accumulates position increment of removed tokens |
private boolean | firstTime - true before next() called for the first time |
private int | inputPos - Tracks the position of input tokens, for debugging |
private int | MAX_POSITION - Limit on position values, for the extremely rare case of fields with > 2000 entries |
private Token | nextToken - The next token to process |
private int | outputPos - Tracks the position of output tokens, for debugging |
private Token | outputQueue - Queue of output tokens, only required in some cases |
private Set | stopSet - Set of stop-words (e.g. |
static Object | tester - Basic regression test |
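The accumIncrement field hints at a common stop-filter technique: when a token is dropped, its position increment is carried over to the next token that is emitted, so later positions stay correct. The following is only a minimal sketch of that idea, not the filter's actual code. The Tok class and the dropStopWords method are hypothetical stand-ins invented for illustration, and the real filter may treat stop words differently (for example, by forming bigrams rather than simply dropping them).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IncrementSketch {
    /** Hypothetical stand-in for a token: term text plus position increment. */
    static final class Tok {
        final String text;
        final int posIncrement;

        Tok(String text, int posIncrement) {
            this.text = text;
            this.posIncrement = posIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + posIncrement + ")";
        }
    }

    /**
     * Drops stop words and folds each removed token's increment into the
     * next token that is kept (assumed behavior, for illustration only).
     */
    static List<Tok> dropStopWords(List<Tok> input, Set<String> stopSet) {
        List<Tok> out = new ArrayList<>();
        int accumIncrement = 0; // increments of tokens removed since the last output
        for (Tok t : input) {
            if (stopSet.contains(t.text)) {
                accumIncrement += t.posIncrement;
            } else {
                out.add(new Tok(t.text, t.posIncrement + accumIncrement));
                accumIncrement = 0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> in = Arrays.asList(
            new Tok("man", 1), new Tok("of", 1), new Tok("the", 1), new Tok("hour", 1));
        Set<String> stops = new HashSet<>(Arrays.asList("of", "the"));
        System.out.println(dropStopWords(in, stops)); // prints [man(+1), hour(+3)]
    }
}
```

In this sketch "hour" ends up with an increment of 3, so a phrase query that counts positions still sees the two-word gap left by the removed tokens.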
Constructor and Description |
---|
BigramStopFilter(TokenStream input, Set stopSet) - Construct a token stream to filter 'stopWords' out of 'input'. |
Modifier and Type | Method and Description |
---|---|
private Token | glomToken(Token token1, Token token2, int increment) - Constructs a new token, drawing the start position, position increment, and end position from the specified tokens. |
protected boolean | isStopWord(String word) - Tells whether the token is a stop-word. |
static Set | makeStopSet(String stopWords) - Make a stop set given a space, comma, or semicolon delimited list of stop words. |
Token | next() - Retrieve the next token in the stream. |
private Token | nextInput() - Retrieves the next token from the input stream, properly tracking the input position. |
Token | nextInternal() - Retrieve the next token in the stream. |
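As an illustration of what makeStopSet(String) and isStopWord(String) describe, the sketch below splits a space, comma, or semicolon delimited string into a set and tests membership. It is a stand-alone approximation written for this page, not the class's real implementation. The StopSetSketch class name is hypothetical, and details such as case handling are not documented above and are left out.

```java
import java.util.HashSet;
import java.util.Set;

public class StopSetSketch {
    /** Builds a set from a space, comma, or semicolon delimited word list. */
    static Set<String> makeStopSet(String stopWords) {
        Set<String> set = new HashSet<>();
        // One or more delimiters in a row count as a single separator.
        for (String word : stopWords.split("[ ,;]+")) {
            if (!word.isEmpty()) { // skip the empty piece produced by a leading delimiter
                set.add(word);
            }
        }
        return set;
    }

    /** Membership test, assuming an exact (case-sensitive) match. */
    static boolean isStopWord(Set<String> stopSet, String word) {
        return stopSet.contains(word);
    }

    public static void main(String[] args) {
        Set<String> stops = makeStopSet("a an, the;of  and");
        System.out.println(stops.size());              // prints 5
        System.out.println(isStopWord(stops, "the"));  // prints true
        System.out.println(isStopWord(stops, "hour")); // prints false
    }
}
```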
private Set stopSet
private boolean firstTime
private Token nextToken
private Token outputQueue
private int accumIncrement
private int outputPos
private int inputPos
private final int MAX_POSITION
public static final Object tester
public BigramStopFilter(TokenStream input, Set stopSet)
input - Input stream of tokens to process
stopSet - Set of stop words to filter out. This can be most easily made by calling makeStopSet().
public static Set makeStopSet(String stopWords)
stopWords - String of words to make into a set
BigramStopFilter.
public Token next() throws IOException
next in class TokenStream
Throws: IOException
public Token nextInternal() throws IOException
Throws: IOException
private Token nextInput() throws IOException
Throws: IOException
protected boolean isStopWord(String word)
private Token glomToken(Token token1, Token token2, int increment)
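The summary for glomToken says the new token draws its start position, position increment, and end position from the specified tokens. One plausible reading is sketched below: the start comes from token1, the end from token2, and the increment from the explicit argument. This is an assumption, not confirmed by the documentation above. The Tok class is a hypothetical stand-in for Token, and joining the two terms with "~" is an illustrative choice only.

```java
public class GlomSketch {
    /** Hypothetical stand-in for a token with character offsets. */
    static final class Tok {
        final String text;
        final int start;        // start offset in the source text
        final int end;          // end offset in the source text
        final int posIncrement; // distance from the previous token

        Tok(String text, int start, int end, int posIncrement) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.posIncrement = posIncrement;
        }
    }

    /**
     * Combines two adjacent tokens into one. Assumed mapping: start from
     * token1, end from token2, increment from the argument. The "~"
     * separator is an illustrative assumption.
     */
    static Tok glomToken(Tok token1, Tok token2, int increment) {
        return new Tok(token1.text + "~" + token2.text,
                       token1.start, token2.end, increment);
    }

    public static void main(String[] args) {
        Tok of = new Tok("of", 4, 6, 1);
        Tok the = new Tok("the", 7, 10, 1);
        Tok bigram = glomToken(of, the, 1);
        // prints of~the [4,10) +1
        System.out.println(bigram.text + " [" + bigram.start + "," + bigram.end
            + ") +" + bigram.posIncrement);
    }
}
```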