org.cdlib.xtf.textIndexer
Class HTMLToString

Object
  extended by HTMLToString

public class HTMLToString
extends Object

This class provides a static convert() method that converts an HTML file into an XML string, which can then be pre-filtered and added to a Lucene index by the XMLTextProcessor class.

Internally, the HTML-to-XML conversion is performed by the JTidy library, a Java port of the HTML Tidy converter.


Field Summary
private static HashMap htmlCodeMap
          HashMap built from the htmlCodes table.
(package private) static String[] htmlCodes
          Table of conversions from HTML ampersand codes to Unicode.
(package private) static Tidy tidy
          The JTidy Tidy object that performs the conversion.
 
Constructor Summary
HTMLToString()
           
 
Method Summary
static String convert(InputStream htmlInputStream)
          Convert an HTML file into an HTMLTidy-style XML string.
static String replaceHtmlCodes(String in)
          Convert any non-XML ampersand codes within a string to their Unicode equivalents.
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

tidy

static Tidy tidy
The JTidy Tidy object that performs the conversion.


htmlCodes

static final String[] htmlCodes
Table of conversions from HTML ampersand codes to Unicode. The few codes that need no conversion are listed at the start of the table.


htmlCodeMap

private static HashMap htmlCodeMap
HashMap built from the htmlCodes table.
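
As a rough illustration of how a lookup map might be built from a flat code table, here is a minimal sketch. The table layout (alternating code and replacement entries), the class name CodeTableSketch, and the sample entries are all assumptions made for this example; the actual contents and layout of htmlCodes are not shown in this documentation.

```java
import java.util.HashMap;
import java.util.Map;

class CodeTableSketch {

    // Hypothetical layout: alternating ampersand code / replacement entries.
    // The first three entries are XML's own codes, which map to themselves
    // because they need no conversion.
    static final String[] CODES = {
        "&amp;",    "&amp;",
        "&lt;",     "&lt;",
        "&gt;",     "&gt;",
        "&nbsp;",   "\u00A0",
        "&copy;",   "\u00A9",
        "&eacute;", "\u00E9",
    };

    // Walk the table two entries at a time and record each pair in a map.
    static Map<String, String> buildMap(String[] table) {
        Map<String, String> map = new HashMap<>();
        for (int i = 0; i + 1 < table.length; i += 2) {
            map.put(table[i], table[i + 1]);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> map = buildMap(CODES);
        System.out.println(map.get("&copy;"));
    }
}
```

Building the map once and keeping it in a static field keeps each later lookup at constant cost, which is presumably why the class holds both the table and the map.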

Constructor Detail

HTMLToString

public HTMLToString()

Method Detail

convert

public static String convert(InputStream htmlInputStream)
Convert an HTML file into an HTMLTidy-style XML string.

Parameters:
htmlInputStream - Stream of HTML text to convert to an XML string.
Returns:
If successful, a string containing the XML equivalent of the source HTML file. If an error occurs, this method returns null.
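
Because convert() reports failure by returning null rather than by throwing, callers need an explicit null check. The sketch below shows one way to do that. The class ConvertCallSketch and the helper convertOrThrow are hypothetical names introduced here, and the converter is passed in as a Function so the example compiles without XTF on the classpath; with XTF available, the argument would be HTMLToString::convert.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

class ConvertCallSketch {

    // Calls a converter that signals failure with a null result, as convert()
    // is documented to do, and turns that null into an exception.
    static String convertOrThrow(Function<InputStream, String> converter,
                                 InputStream html) {
        String xml = converter.apply(html);
        if (xml == null) {
            throw new IllegalStateException("HTML to XML conversion failed");
        }
        return xml;
    }

    public static void main(String[] args) {
        InputStream html = new ByteArrayInputStream(
            "<html><body><p>Hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8));

        // Stand-in converter for illustration only. It ignores its input and
        // returns a fixed string; it does not perform any real conversion.
        Function<InputStream, String> standIn = in -> "<html/>";

        System.out.println(convertOrThrow(standIn, html));
    }
}
```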

replaceHtmlCodes

public static String replaceHtmlCodes(String in)
Convert any non-XML ampersand codes within a string to their Unicode equivalents.

Parameters:
in - The string within which to convert codes.
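
To make the idea of replacing only non-XML ampersand codes concrete, here is a minimal sketch of one possible approach using a regular expression and a lookup map. It is not the XTF implementation, which is not shown in this documentation. The class ReplaceCodesSketch, the method replaceCodes, and the three sample codes are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceCodesSketch {

    // Hypothetical subset of named HTML codes that XML does not define.
    static final Map<String, String> NON_XML_CODES = new HashMap<>();
    static {
        NON_XML_CODES.put("nbsp", "\u00A0");
        NON_XML_CODES.put("copy", "\u00A9");
        NON_XML_CODES.put("eacute", "\u00E9");
    }

    // Matches named codes of the form &name; and captures the name.
    static final Pattern CODE = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String replaceCodes(String in) {
        Matcher m = CODE.matcher(in);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = NON_XML_CODES.get(m.group(1));
            // Names missing from the map, including XML's own codes such as
            // &amp; and &lt;, are copied through unchanged.
            String text = (replacement != null) ? replacement : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceCodes("caf&eacute; &amp; &copy; 2004"));
    }
}
```

Leaving the XML-defined codes untouched matters here: expanding &amp; or &lt; into literal characters would make the resulting XML string ill-formed.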