Document Selector Programming
This section describes how to program an XTF
Document Selector stylesheet. If you want to skip the tutorial, you can check out the
Reference section or the
default docSelector.xsl code.
The primary purpose of the
textIndexer Document Selector is to select which files in the document library are to be indexed. Since the
Document Selector is an XSLT stylesheet, its input is in fact an XML fragment that identifies a single directory in the document library and the files that it contains. The
Document Selector stylesheet is invoked one time for each subdirectory encountered in the document library, and the input it receives looks as follows:
<directory dirPath="DirectoryPath">
    <file fileName="FileName1"/>
    <file fileName="FileName2"/>
    …
    <file fileName="FileNameN"/>
</directory>
The
<directory> tag identifies a single directory in the document library, and the
dirPath attribute (shown here with the placeholder value DirectoryPath) specifies its absolute file system path. Within the
<directory> tag, each of the
<file/> entries identifies one of the files found in the directory. Note that
FileName1 through
FileNameN do not contain any path information, since the absolute path that applies to all the file tags is already identified by
DirectoryPath.
It is the responsibility of the
Document Selector XSLT code to output an XML fragment that identifies which of the files in the directory should be indexed. This output XML fragment should take the following form:
<indexFiles>
    <indexFile fileName = "FileName"
               {format = "FileFormatID"}
               {preFilter = "PreFilterPath"}
               {displayStyle = "DocumentFormatterPath"}/>
    …
</indexFiles>
Note that the output XML consists of a single
<indexFiles> container tag and one
<indexFile/> tag for each document file that needs to be indexed. Within each of the
<indexFile/> tags, the following attributes are defined:
fileName | This attribute identifies the name of a file to be indexed, and should be one of the file names received in the input XML fragment. |
format | This is an optional attribute that defines the format of the file to be indexed. At this time, XML, PDF, HTML, Word, and Plain Text are supported by the textIndexer tool, and this attribute should be set to the strings XML, PDF, HTML, MSWord, or Text respectively, depending on the native format of the file. If this attribute is not specified, the textIndexer will try to infer the file type based on the extension for the file. |
preFilter | This is an optional attribute that defines the Pre-Filter stylesheet that the textIndexer should use on this document file. If not specified, the text for this file will not be filtered before indexing. See the textIndexer Pre-Filter Programming section for more details about document pre-filtering. |
displayStyle | This is an optional attribute that defines the Document Formatter stylesheet associated with the given file. If specified, the textIndexer will create a special cache that is used by the dynaXML servlet to display selected documents more quickly. If not specified, the cache for the current file is not created. For more details, see the discussion of Lazy Document Handling in the XTF Under the Hood guide. |
Using the XML input and output specifications shown above, we can build up a document selector that handles all the types of files to index. We'll start simple and work our way up. A very simple document selector might look something like this:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:template match="directory">
        <indexFiles>
            <xsl:apply-templates/>
        </indexFiles>
    </xsl:template>
    <xsl:template match="file">
        <xsl:choose>
            <xsl:when test="ends-with(@fileName, '.pdf')">
                <indexFile fileName="{@fileName}" type="PDF"/>
            </xsl:when>
        </xsl:choose>
    </xsl:template>
</xsl:stylesheet>
In this simple
Document Selector example, the first line establishes the
xsl namespace used in the rest of the stylesheet. Next, the
<xsl:template match="directory"> tag looks for the
<directory> block in the input XML, and writes out a corresponding
<indexFiles> block to the output XML. Within that block, the <xsl:apply-templates/> instruction applies the
<xsl:template match="file"> template to each <file> tag found within the
<directory> block.
The
<xsl:template match="file"> block is the code that is actually responsible for selecting the files to be indexed. In this example, only files that end in
.pdf are passed on for indexing, and are assigned the format PDF. No
Pre-Filter or
Document Formatter stylesheets are defined, and so the
textIndexer will not pre-filter or pre-cache display information for PDF files.
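To make this behavior concrete, here is a sketch of what the simple selector above would receive and emit for a single directory. The directory path and file names are hypothetical, chosen only for illustration, and the output is shown with whitespace tidied.

```xml
<!-- Hypothetical input fragment passed to the Document Selector -->
<directory dirPath="/data/library/reports">
    <file fileName="annual.pdf"/>
    <file fileName="notes.txt"/>
    <file fileName="summary.pdf"/>
</directory>

<!-- Resulting output fragment: only the files ending in .pdf are selected -->
<indexFiles>
    <indexFile fileName="annual.pdf" type="PDF"/>
    <indexFile fileName="summary.pdf" type="PDF"/>
</indexFiles>
```

Note that ends-with() compares characters exactly under the default collation, so a file named ANNUAL.PDF would not be selected by this test.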
Selecting other file types for indexing is as simple as adding more
<xsl:when> clauses to the
<xsl:choose> block, like this:
…
    <xsl:template match="file">
        <xsl:choose>
            <!-- XML files -->
            <xsl:when test="ends-with(@fileName, '.xml')">
                <!-- More detailed work here, to determine if it's TEI, EAD, NLM, etc. -->
            </xsl:when>
            <!-- PDF files -->
            <xsl:when test="ends-with(@fileName, '.pdf')">
                <indexFile fileName="{@fileName}"
                           type="PDF"
                           preFilter="style/textIndexer/default/defaultPreFilter.xsl"/>
            </xsl:when>
            <!-- Plain text files -->
            <xsl:when test="ends-with(@fileName, '.txt')">
                <indexFile fileName="{@fileName}"
                           type="text"
                           preFilter="style/textIndexer/default/defaultPreFilter.xsl"/>
            </xsl:when>
        </xsl:choose>
    </xsl:template>
</xsl:stylesheet>
This revised
<xsl:choose> block looks for XML, PDF, and Text files. Note that the
<indexFile> tags also define a
Pre-Filter stylesheet for each type.
While this simple
Document Selector example works, its file selection rules are limited to checking for certain file extensions. The full power of XSLT can be used to construct more complicated selection criteria for files, including ignoring various directories, pulling in metadata from files or URLs, and so on.
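As one illustration of richer selection criteria, the following is a minimal sketch (not the default XTF selector) that ignores certain directories and matches file extensions without regard to case. The directory name drafts is a hypothetical choice, and the sketch assumes that an empty <indexFiles> container is an acceptable way to select nothing, as the simple selector already produces when no file in a directory matches.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <!-- Assumption: any directory with a path component named "drafts"
         (hypothetical) should be skipped. tokenize() splits dirPath on
         forward or backward slashes into its components. -->
    <xsl:template match="directory[tokenize(@dirPath, '[/\\]') = 'drafts']" priority="1">
        <indexFiles/>
    </xsl:template>

    <!-- All other directories: consider each file in turn -->
    <xsl:template match="directory">
        <indexFiles>
            <xsl:apply-templates select="file"/>
        </indexFiles>
    </xsl:template>

    <!-- Select PDF files regardless of extension case, skipping
         names that begin with a dot -->
    <xsl:template match="file">
        <xsl:if test="matches(@fileName, '\.pdf$', 'i') and not(starts-with(@fileName, '.'))">
            <indexFile fileName="{@fileName}" type="PDF"/>
        </xsl:if>
    </xsl:template>

</xsl:stylesheet>
```

Because the test looks at every component of the path, subdirectories nested beneath a drafts directory are skipped as well. The explicit priority attribute simply makes it obvious that the more specific directory template wins over the general one.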
Now you're equipped to understand the default
Document Selector provided by XTF. You can check out the
default docSelector.xsl at SourceForge, or you can edit it in your own directory:
style/textIndexer/docSelector.xsl.
Next, we'll learn how to
program the Pre-Filter stylesheet.