org.cdlib.xtf.textIndexer
Class XMLIndexSource

Object
  extended by IndexSource
      extended by XMLIndexSource
Direct Known Subclasses:
HTMLIndexSource, MSWordIndexSource, PDFIndexSource, TextIndexSource

public class XMLIndexSource
extends IndexSource

Supplies a single file containing a single record to the XMLTextProcessor.

Author:
Martin Haye

Field Summary
private  Templates displayStyle
          Stylesheet from which to gather XSLT key definitions to be computed and cached on disk.
private  InputSource inSrc
          Source of XML data
private  boolean isDone
          Keep track of whether we've processed this file yet
private  String key
          Key used to identify this file in the index
private  StructuredStore lazyStore
          Empty storage in which to build the persistent version of the document (aka the "lazy tree"), or null to avoid building it.
private  File path
          Path to the file, or null if it's not a local file.
private  Templates[] preFilters
          XSLT pre-filters used to massage the XML document (null for none)
private  boolean removeDoctypeDecl
          Whether to remove the DOCTYPE declaration (this is kind of a kludge)
private static SAXParser saxParser
          A parser we can use to tell whether we need to apply the Crimson parser workaround
 
Constructor Summary
XMLIndexSource(InputSource inSrc, File path, String key, Templates[] preFilters, Templates displayStyle, StructuredStore lazyStore)
          Constructor -- initializes all the fields
XMLIndexSource(InputSource inSrc, String key)
          Simple constructor taking only the XML source and its index key
 
Method Summary
 Templates displayStyle()
          Stylesheet from which to gather XSLT key definitions to be computed and cached on disk.
protected  InputSource filterInput()
          Filter the input, if necessary, to remove DOCTYPE declarations, or work around a bug in the Crimson parser.
 String key()
          Obtain a unique key for this input file
 IndexRecord nextRecord()
          Obtain the next record from the file, or null if no more.
static String normalize(String s)
          Prepare a string for inclusion in an XML document.
 File path()
          Obtain the path to the file (or null if it's not a local file)
 Templates[] preFilters()
          Obtain the set of prefilters to be run, serially in order, on each input record.
 void removeDoctypeDecl(boolean flag)
          Set whether to remove the DOCTYPE declaration from the input.
 long totalSize()
          Obtain the total size of the source file (used to calculate overall % done).
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

inSrc

private InputSource inSrc
Source of XML data


path

private File path
Path to the file, or null if it's not a local file.


key

private String key
Key used to identify this file in the index


preFilters

private Templates[] preFilters
XSLT pre-filters used to massage the XML document (null for none)


displayStyle

private Templates displayStyle
Stylesheet from which to gather XSLT key definitions to be computed and cached on disk. Typically, one would use the actual display stylesheet for this purpose, guaranteeing that all of its keys will be pre-cached.

Background: stylesheet processing can be optimized by using XSLT 'keys', which are declared with an <xsl:key> tag. The first time a key is used in a given source document, it must be calculated and its values stored on disk. The text indexer can optionally pre-compute the keys so they need not be calculated later during the display process.
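
As an illustration of the background above, the following is a minimal sketch of a stylesheet that declares an <xsl:key> and is compiled into a Templates object with the standard JAXP API. The class name XslKeySketch, the key name item-by-id, and the sample element names are hypothetical and do not come from XTF. The on-disk caching of key values is XTF-specific and is not shown here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch: a stylesheet with an xsl:key, compiled to Templates. */
final class XslKeySketch {
    private XslKeySketch() {}

    // The key indexes every <item> element by its id attribute.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:key name='item-by-id' match='item' use='@id'/>"
        + "<xsl:template match='/'>"
        + "<xsl:value-of select=\"key('item-by-id', 'b')\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Compile the stylesheet once; a Templates object can be reused. */
    static Templates compile() {
        try {
            return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(STYLESHEET)));
        } catch (TransformerException e) {
            throw new IllegalStateException("stylesheet did not compile", e);
        }
    }

    /** Run the stylesheet; the key is computed on first use for this document. */
    static String lookup(String xml) {
        try {
            StringWriter out = new StringWriter();
            compile().newTransformer().transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transform failed", e);
        }
    }
}
```

In this sketch the key is recomputed in memory on every run; pre-computing keys at index time, as described above, is meant to avoid that cost at display time.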


lazyStore

private StructuredStore lazyStore
Empty storage in which to build the persistent version of the document (aka the "lazy tree"), or null to avoid building it.


removeDoctypeDecl

private boolean removeDoctypeDecl
Whether to remove the DOCTYPE declaration (this is kind of a kludge)


isDone

private boolean isDone
Keep track of whether we've processed this file yet


saxParser

private static SAXParser saxParser
A parser we can use to tell whether we need to apply the Crimson parser workaround

Constructor Detail

XMLIndexSource

public XMLIndexSource(InputSource inSrc,
                      String key)
Simple constructor taking only the XML source and its index key


XMLIndexSource

public XMLIndexSource(InputSource inSrc,
                      File path,
                      String key,
                      Templates[] preFilters,
                      Templates displayStyle,
                      StructuredStore lazyStore)
Constructor -- initializes all the fields

Method Detail

removeDoctypeDecl

public void removeDoctypeDecl(boolean flag)
Set whether to remove the DOCTYPE declaration from the input.

path

public File path()
Description copied from class: IndexSource
Obtain the path to the file (or null if it's not a local file)

Specified by:
path in class IndexSource

key

public String key()
Description copied from class: IndexSource
Obtain a unique key for this input file

Specified by:
key in class IndexSource

preFilters

public Templates[] preFilters()
Description copied from class: IndexSource
Obtain the set of prefilters to be run, serially in order, on each input record.

Specified by:
preFilters in class IndexSource
Returns:
Prefilter stylesheet(s) to run, or null for none.
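
To make "serially in order" concrete, here is a minimal sketch of how a caller might apply such an array, feeding each stylesheet's output to the next. It uses only the standard JAXP API. The class PreFilterChainSketch and its methods are hypothetical, and this is not necessarily how the XTF text indexer applies prefilters internally.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch: run an array of prefilter stylesheets one after another. */
final class PreFilterChainSketch {
    private PreFilterChainSketch() {}

    /** Compile XSLT source text into a reusable Templates object. */
    static Templates compile(String xslt) {
        try {
            return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xslt)));
        } catch (TransformerException e) {
            throw new IllegalStateException("stylesheet did not compile", e);
        }
    }

    /** Apply each prefilter in array order; a null array means no filtering. */
    static String applyAll(String xml, Templates[] preFilters) {
        if (preFilters == null) {
            return xml;
        }
        String current = xml;
        try {
            for (Templates filter : preFilters) {
                StringWriter out = new StringWriter();
                filter.newTransformer().transform(
                    new StreamSource(new StringReader(current)),
                    new StreamResult(out));
                current = out.toString(); // output of one filter feeds the next
            }
        } catch (TransformerException e) {
            throw new IllegalStateException("prefilter failed", e);
        }
        return current;
    }
}
```

Buffering each intermediate result as a string keeps the sketch short; a production pipeline would more likely chain the transformers as SAX handlers to avoid re-parsing.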

displayStyle

public Templates displayStyle()
Description copied from class: IndexSource
Stylesheet from which to gather XSLT key definitions to be computed and cached on disk. Typically, one would use the actual display stylesheet for this purpose, guaranteeing that all of its keys will be pre-cached.

Background: stylesheet processing can be optimized by using XSLT 'keys', which are declared with an <xsl:key> tag. The first time a key is used in a given source document, it must be calculated and its values stored on disk. The text indexer can optionally pre-compute the keys so they need not be calculated later during the display process.

Specified by:
displayStyle in class IndexSource

totalSize

public long totalSize()
Description copied from class: IndexSource
Obtain the total size of the source file (used to calculate overall % done). If you don't know, return 1.

Specified by:
totalSize in class IndexSource

nextRecord

public IndexRecord nextRecord()
                       throws SAXException,
                              IOException
Description copied from class: IndexSource
Obtain the next record from the file, or null if no more.

Specified by:
nextRecord in class IndexSource
Throws:
SAXException
IOException
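
Because this class supplies a single file containing a single record, the nextRecord contract can be pictured with the sketch below: the record is returned once, and every later call returns null. The class SingleRecordSourceSketch is hypothetical and uses a plain String in place of IndexRecord; the isDone flag mirrors the field of the same name listed above, but the real method body is not shown in this documentation.

```java
/** Hypothetical sketch of a source that yields exactly one record. */
final class SingleRecordSourceSketch {
    private final String record;   // stands in for an IndexRecord
    private boolean isDone = false;

    SingleRecordSourceSketch(String record) {
        this.record = record;
    }

    /** Returns the record on the first call, then null on every later call. */
    String nextRecord() {
        if (isDone) {
            return null;
        }
        isDone = true;
        return record;
    }
}
```

A caller would typically loop until null is returned, which for a source like this means the loop body runs once.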

filterInput

protected InputSource filterInput()
                           throws IOException
Filter the input, if necessary, to remove DOCTYPE declarations, or work around a bug in the Crimson parser.

Throws:
IOException
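
For illustration only, the sketch below shows one simple way a DOCTYPE declaration could be stripped from XML text before parsing. It is an assumption, not the XTF implementation: the class DoctypeStripSketch is hypothetical, it works on a String rather than an InputSource, and its regular expression does not handle a '>' or ']' inside quoted identifiers or the internal subset. The Crimson workaround is not covered.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: drop the first DOCTYPE declaration from XML text. */
final class DoctypeStripSketch {
    private DoctypeStripSketch() {}

    // Matches "<!DOCTYPE ...>" with an optional "[ ... ]" internal subset.
    private static final Pattern DOCTYPE =
        Pattern.compile("<!DOCTYPE[^>\\[]*(\\[[^\\]]*\\])?\\s*>");

    /** Returns the text with its first DOCTYPE declaration removed, if any. */
    static String strip(String xml) {
        return DOCTYPE.matcher(xml).replaceFirst("");
    }
}
```

A more robust filter would tokenize the prolog properly instead of relying on a regular expression.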

normalize

public static String normalize(String s)
Prepare a string for inclusion in an XML document. Unicode strings are normalized to their canonical equivalents, a few characters are escaped as entities, and invalid characters are removed.

Parameters:
s - string to normalize
Returns:
possibly changed version of the string
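
The three steps described above (canonical normalization, entity escaping, removal of invalid characters) could look roughly like the following sketch. It is not the XTF source: the class NormalizeSketch is hypothetical, the choice of NFC as the canonical form is an assumption, and the set of escaped characters (&, <, >) is a guess at "a few characters".

```java
import java.text.Normalizer;

/** Hypothetical sketch of preparing a string for inclusion in XML. */
final class NormalizeSketch {
    private NormalizeSketch() {}

    /** True if the code point is permitted by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String normalize(String s) {
        // Assumption: NFC (canonical composition) is the form wanted.
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(nfc.length());
        for (int i = 0; i < nfc.length(); ) {
            int cp = nfc.codePointAt(i);
            i += Character.charCount(cp);
            if (!isValidXmlChar(cp)) {
                continue; // drop characters that XML 1.0 does not allow
            }
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Iterating by code point rather than by char keeps supplementary characters intact and lets unpaired surrogates be dropped as invalid.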