org.cdlib.xtf.util
Class FastTokenizer

Object
  extended by TokenStream
      extended by Tokenizer
          extended by FastTokenizer

public class FastTokenizer
extends Tokenizer

Like Lucene's StandardTokenizer, but handles the easy cases very quickly. Hard cases are handed off to a real StandardTokenizer; these are rare enough that the overall speed increase is still very substantial. Chinese/Japanese/Korean text is not currently supported, though adding that support should be fairly easy.
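
The fast-path/fallback idea described above can be illustrated with a small, self-contained sketch. This is not the XTF implementation: the class name FastPathSketch, the helpers isEasy, isSeparator and slowTokenize, and the choice of which characters count as "easy" are all assumptions made for illustration, and slowTokenize is only a stand-in for a real StandardTokenizer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "fast path with fallback"; not the actual FastTokenizer code. */
public class FastPathSketch {

    /** Stand-in for the slower, more complete tokenizer (an assumption, not Lucene). */
    static List<String> slowTokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) out.add(piece);
        }
        return out;
    }

    /** Assumed definition of an "easy" character: plain ASCII letters and digits. */
    static boolean isEasy(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    /** Assumed token separators: basic ASCII whitespace. */
    static boolean isSeparator(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    static List<String> tokenize(String text) {
        char[] source = text.toCharArray();   // array of characters to read from
        int pos = 0;                          // position within the source array
        List<String> out = new ArrayList<>();

        while (pos < source.length) {
            // Skip separators between chunks.
            while (pos < source.length && isSeparator(source[pos])) pos++;
            if (pos >= source.length) break;

            // Scan one whitespace-delimited chunk, noting whether it is "easy".
            int start = pos;
            boolean hard = false;
            while (pos < source.length && !isSeparator(source[pos])) {
                if (!isEasy(source[pos])) hard = true;
                pos++;
            }
            String chunk = new String(source, start, pos - start);

            if (hard) {
                out.addAll(slowTokenize(chunk)); // rare case: defer to the slow tokenizer
            } else {
                out.add(chunk);                  // common case: emit directly
            }
        }
        return out;
    }
}
```

In this sketch only chunks containing a non-easy character pay the cost of the slower tokenizer, which is the property the class description relies on for its speedup.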

Author:
Martin Haye

Nested Class Summary
private  class FastTokenizer.DribbleReader
          Used when the fast tokenizer encounters a questionable situation: dribbles out characters to a standard tokenizer that can do a more complete job.
 
Field Summary
private static char[] charType
           
private  FastTokenizer.DribbleReader dribbleReader
          Used to dribble out characters to a standard tokenizer when we encounter a case that's hard to figure out.
(package private) static char fakeChar
          We use a special character to mark the end of a FastTokenizer.DribbleReader.
(package private) static String fakeWord
          This is the special word used by DribbleReader
private  int pos
          Position within the source array
private  char[] source
          Array of characters to read from
private  Tokenizer stdTokenizer
          Standard tokenizer, used for hard cases only
 
Fields inherited from class Tokenizer
input
 
Constructor Summary
FastTokenizer(FastStringReader reader)
          Create a tokenizer that will tokenize the stream of characters from the given reader.
 
Method Summary
 Token next()
          Retrieve the next token in the stream, or null if there are no more.
private static void setCharType(char type, char from, char to)
          Utility method used when setting up the character type table
 
Methods inherited from class Tokenizer
close
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

source

private char[] source
Array of characters to read from


pos

private int pos
Position within the source array


fakeChar

static final char fakeChar
We use a special character to mark the end of a FastTokenizer.DribbleReader.

See Also:
Constant Field Values

fakeWord

static final String fakeWord
This is the special word used by DribbleReader

See Also:
Constant Field Values

dribbleReader

private FastTokenizer.DribbleReader dribbleReader
Used to dribble out characters to a standard tokenizer when we encounter a case that's hard to figure out.


stdTokenizer

private Tokenizer stdTokenizer
Standard tokenizer, used for hard cases only


charType

private static final char[] charType


Constructor Detail

FastTokenizer

public FastTokenizer(FastStringReader reader)
Create a tokenizer that will tokenize the stream of characters from the given reader. Note that the reader must be an instance of FastStringReader, or else fast tokenization isn't possible.

Parameters:
reader - Reader to get data from.


Method Detail

setCharType

private static void setCharType(char type,
                                char from,
                                char to)
Utility method used when setting up the character type table
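
A range-filling helper of this shape might look like the following minimal sketch. The type codes (TYPE_OTHER, TYPE_LETTER, TYPE_DIGIT), the table size, and the ranges registered in the static initializer are assumptions for illustration only; the actual values used by FastTokenizer are not documented here.

```java
/** Hypothetical sketch of a character type lookup table; not the actual FastTokenizer code. */
public class CharTypeTableSketch {
    // Assumed type codes, chosen only for this example.
    static final char TYPE_OTHER  = 0;
    static final char TYPE_LETTER = 1;
    static final char TYPE_DIGIT  = 2;

    // One entry per UTF-16 code unit; entries default to TYPE_OTHER (0).
    static final char[] charType = new char[0x10000];

    /** Marks every character in the inclusive range [from, to] with the given type. */
    static void setCharType(char type, char from, char to) {
        // An int counter avoids wrapping around when 'to' is '\uffff'.
        for (int c = from; c <= to; c++) {
            charType[c] = type;
        }
    }

    static {
        setCharType(TYPE_LETTER, 'a', 'z');
        setCharType(TYPE_LETTER, 'A', 'Z');
        setCharType(TYPE_DIGIT,  '0', '9');
    }
}
```

With a table like this, classifying a character during tokenization is a single array lookup, for example `charType[source[pos]]`.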


next

public Token next()
           throws IOException
Retrieve the next token in the stream, or null if there are no more.

Specified by:
next in class TokenStream
Throws:
IOException
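
Because next() signals the end of the stream by returning null, callers typically loop until null is seen. The sketch below shows that pattern against a hypothetical stand-in interface (SimpleTokenStream, with String tokens) so that it runs without Lucene or XTF on the classpath; with the real class the loop variable would be a Token obtained from a FastTokenizer.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of draining a null-terminated token stream. */
public class NextLoopSketch {

    /** Stand-in for a token stream whose next() returns null when exhausted. */
    interface SimpleTokenStream {
        String next() throws IOException;
    }

    /** Builds a trivial stream over a fixed array of words (for illustration only). */
    static SimpleTokenStream overWords(final String[] words) {
        return new SimpleTokenStream() {
            private int pos = 0;

            public String next() {
                return pos < words.length ? words[pos++] : null;
            }
        };
    }

    /** Collects tokens until next() returns null. */
    static List<String> drain(SimpleTokenStream stream) throws IOException {
        List<String> out = new ArrayList<>();
        for (String tok = stream.next(); tok != null; tok = stream.next()) {
            out.add(tok);
        }
        return out;
    }
}
```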