org.apache.lucene.util
Class FileSorter

Object
  extended by FileSorter

public class FileSorter
extends Object

Performs a disk-based sort of the lines of a text file, similar to the UNIX sort command, except that it is Unicode-aware.

Author:
Martin Haye

Nested Class Summary
private static class FileSorter.BlockReader
          Reads a block of compressed lines from the temporary disk file, and feeds them out one at a time.
static class FileSorter.FileOutput
          Advanced API class: write output to a file
static interface FileSorter.Output
          Advanced API interface for writing lines from the sorter
 
Field Summary
private  ArrayList blockOffsets
          Offsets of blocks already written to the temp file
private  ArrayList curBlockLines
          Buffer of lines in the current block
private  int curBlockMem
          Approximate amount of memory consumed by the current block of lines
static int DEFAULT_MEM_LIMIT
          Default memory limit if none specified
private  int memLimit
          Approximate limit on the amount of memory to consume during sort
private  int nLinesAdded
          Count of how many lines were read in
private static String SENTINEL
          Sentinel string used to mark end of blocks
private  File tmpFile
          File to use for temporary disk storage (automatically deleted)
 
Constructor Summary
protected FileSorter()
          Protected constructor -- do not construct directly; rather, use one of the simple, intermediate, or advanced API methods below.
 
Method Summary
 void addLine(String line)
          Add a line to be sorted.
private static void clearFile(File f)
          Delete, or at least truncate, the given file (if it exists)
 void finish(FileSorter.Output out)
          Perform the main work of sorting, sending the results to the specified output.
private  void flushBlock()
          Flush currently buffered lines to the temporary file.
static void main(String[] args)
          Simple command-line interface
private static int memSize(String s)
          Give a rough estimate of how much memory a given string takes
 int nLinesAdded()
          Find out how many lines were added
static void sort(File inFile, File outFile)
          Simple API: Sort from an input file to an output file
static void sort(File inFile, File outFile, File tmpDir, int memLimit)
          Intermediate API: sort from a file, to a file, using a specified temporary directory and memory limit.
static FileSorter start(File tmpDir, int memLimit)
          Advanced API, independent of input and output format.
 
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_MEM_LIMIT

public static final int DEFAULT_MEM_LIMIT
Default memory limit if none specified

See Also:
Constant Field Values

tmpFile

private File tmpFile
File to use for temporary disk storage (automatically deleted)


memLimit

private int memLimit
Approximate limit on the amount of memory to consume during sort


nLinesAdded

private int nLinesAdded
Count of how many lines were read in


curBlockMem

private int curBlockMem
Approximate amount of memory consumed by the current block of lines


curBlockLines

private ArrayList curBlockLines
Buffer of lines in the current block


blockOffsets

private ArrayList blockOffsets
Offsets of blocks already written to the temp file


SENTINEL

private static String SENTINEL
Sentinel string used to mark end of blocks

Constructor Detail

FileSorter

protected FileSorter()
Protected constructor -- do not construct directly; rather, use one of the simple, intermediate, or advanced API methods below.

Method Detail

main

public static void main(String[] args)
Simple command-line interface


sort

public static void sort(File inFile,
                        File outFile)
                 throws IOException
Simple API: Sort from an input file to an output file

Throws:
IOException

sort

public static void sort(File inFile,
                        File outFile,
                        File tmpDir,
                        int memLimit)
                 throws IOException
Intermediate API: sort from a file, to a file, using a specified temporary directory and memory limit.

Parameters:
inFile - source of input lines, in UTF-8 encoding
outFile - destination of output lines
tmpDir - filesystem directory for temporary storage during sort. If null, then the system default temp directory will be used.
memLimit - approximate max amount of RAM to use during sort
Throws:
IOException

start

public static FileSorter start(File tmpDir,
                               int memLimit)
                        throws IOException
Advanced API, independent of input and output format. Uses a "push" model: first call start() to obtain a FileSorter object, then call addLine() once for each line to be sorted, and finally call finish() to complete the sort and send the results to an output.

Parameters:
tmpDir - a filesystem directory to store temporary data during sort.
memLimit - approximate limit on the amount of RAM to use during sort.
Throws:
IOException
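
The call order described above (start, then addLine for each line, then finish) can be illustrated with a small stand-in. This is only a sketch: PushSortSketch and its LineOutput interface are hypothetical names, not part of FileSorter. It sorts entirely in memory and takes no tmpDir or memLimit, and the methods of the real FileSorter.Output interface are not listed on this page.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in that mirrors the documented call sequence only.
// It is NOT FileSorter: everything is kept in memory and nothing is written to disk.
class PushSortSketch {
    // Hypothetical output callback (assumption: the real Output interface may differ).
    interface LineOutput {
        void writeLine(String line);
    }

    private final List<String> lines = new ArrayList<>();

    private PushSortSketch() {}

    // Step 1: obtain a sorter object from a static factory method.
    static PushSortSketch start() {
        return new PushSortSketch();
    }

    // Step 2: push lines one at a time, in any order.
    void addLine(String line) {
        lines.add(line);
    }

    // Number of lines currently buffered.
    int nLinesAdded() {
        return lines.size();
    }

    // Step 3: sort, then hand every line to the output in order.
    void finish(LineOutput out) {
        Collections.sort(lines);
        for (String line : lines) {
            out.writeLine(line);
        }
        lines.clear();
    }

    // Small demonstration: collects the sorted lines into a list.
    static List<String> demo() {
        PushSortSketch sorter = PushSortSketch.start();
        sorter.addLine("pear");
        sorter.addLine("apple");
        sorter.addLine("fig");
        List<String> result = new ArrayList<>();
        sorter.finish(result::add);
        return result;
    }
}
```

Because the output is an interface, the same sequence works whether the lines end up in a file, a list, or some other destination, which is presumably what "independent of input and output format" refers to.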

addLine

public void addLine(String line)
             throws IOException
Add a line to be sorted.

Parameters:
line - one line of data to be sorted
Throws:
IOException

nLinesAdded

public int nLinesAdded()
Find out how many lines were added


finish

public void finish(FileSorter.Output out)
            throws IOException
Perform the main work of sorting, sending the results to the specified output.

Throws:
IOException
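
This page does not say how finish() combines the blocks that were flushed to the temporary file. A common approach in disk-based sorts is a k-way merge driven by a priority queue, sketched below under that assumption. MergeSketch and Cursor are hypothetical names, and the blocks are in-memory lists here rather than compressed blocks read from disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of a k-way merge over already-sorted blocks (assumed technique).
class MergeSketch {
    // One cursor per block: the current head line plus the remainder of the block.
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    static List<String> merge(List<List<String>> sortedBlocks) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>();
        for (List<String> block : sortedBlocks) {
            if (!block.isEmpty()) {
                queue.add(new Cursor(block.iterator()));
            }
        }
        List<String> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            // The smallest head among all blocks is the next output line.
            Cursor c = queue.poll();
            out.add(c.head);
            if (c.rest.hasNext()) {
                // Advance this block and put it back into the queue.
                c.head = c.rest.next();
                queue.add(c);
            }
        }
        return out;
    }

    // Small demonstration with three blocks, one of them empty.
    static List<String> demo() {
        List<List<String>> blocks = new ArrayList<>();
        blocks.add(Arrays.asList("apple", "pear"));
        blocks.add(Arrays.asList("fig", "zebra"));
        blocks.add(new ArrayList<String>());
        return merge(blocks);
    }
}
```

Only one line per block needs to be held in the queue at a time, which is what keeps memory use bounded when the blocks live on disk.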

flushBlock

private void flushBlock()
                 throws IOException
Flush currently buffered lines to the temporary file. This involves sorting them, and writing them out as a compressed block.

Throws:
IOException
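
The sort-then-write-compressed-block step can be sketched as follows. Several details are assumptions, not taken from FileSorter: the class name BlockFlushSketch, the use of GZIP as the compression format, the sentinel value, and the readBlock helper. The sketch also uses String's natural ordering, which may differ from the ordering the real class applies.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class BlockFlushSketch {
    // Hypothetical end-of-block marker; the real SENTINEL value is not documented here.
    // A real implementation needs a marker that cannot occur as an input line.
    static final String SENTINEL = "<<END-OF-BLOCK>>";

    final File tmpFile;
    final List<Long> blockOffsets = new ArrayList<>();
    final List<String> curBlockLines = new ArrayList<>();

    BlockFlushSketch(File tmpFile) {
        this.tmpFile = tmpFile;
    }

    // Sort the buffered lines and append them to the temp file as one compressed block.
    void flushBlock() throws IOException {
        if (curBlockLines.isEmpty()) {
            return;
        }
        Collections.sort(curBlockLines);
        // The block starts where the file currently ends.
        blockOffsets.add(tmpFile.length());
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(tmpFile, true)),
                StandardCharsets.UTF_8)) {
            for (String line : curBlockLines) {
                w.write(line);
                w.write('\n');
            }
            w.write(SENTINEL);
            w.write('\n');
        }
        curBlockLines.clear();
    }

    // Read one block back, stopping at the sentinel.
    List<String> readBlock(int index) throws IOException {
        List<String> lines = new ArrayList<>();
        try (FileInputStream raw = new FileInputStream(tmpFile)) {
            raw.getChannel().position(blockOffsets.get(index));
            BufferedReader r = new BufferedReader(
                    new InputStreamReader(new GZIPInputStream(raw), StandardCharsets.UTF_8));
            String line;
            while ((line = r.readLine()) != null && !line.equals(SENTINEL)) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Small demonstration: flush two blocks, then read the second one back.
    static List<String> demo() {
        try {
            File tmp = File.createTempFile("sort-sketch", ".tmp");
            tmp.deleteOnExit();
            BlockFlushSketch s = new BlockFlushSketch(tmp);
            s.curBlockLines.addAll(Arrays.asList("pear", "apple"));
            s.flushBlock();
            s.curBlockLines.addAll(Arrays.asList("zebra", "fig"));
            s.flushBlock();
            return s.readBlock(1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The sentinel matters in this sketch because Java's GZIPInputStream keeps reading into the next concatenated GZIP member, so the reader needs an explicit signal for where its own block ends.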

clearFile

private static void clearFile(File f)
                       throws IOException
Delete, or at least truncate, the given file (if it exists)

Throws:
IOException
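
One way to get "delete, or at least truncate" behavior is sketched below. The fallback order (try File.delete() first, then truncate by reopening the file for writing) is an assumption about how such a method could work, and ClearFileSketch is a hypothetical name.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class ClearFileSketch {
    static void clearFile(File f) throws IOException {
        if (!f.exists()) {
            return; // nothing to do
        }
        if (f.delete()) {
            return; // deletion succeeded
        }
        // Deletion failed (for example, the file is held open on some platforms).
        // Opening it for writing without append mode truncates it to zero length.
        try (FileOutputStream out = new FileOutputStream(f)) {
            // intentionally empty: opening the stream is what truncates the file
        }
    }

    // Small demonstration: after clearing, the file is either gone or empty.
    static boolean demo() {
        try {
            File tmp = File.createTempFile("clear-sketch", ".tmp");
            tmp.deleteOnExit();
            Files.write(tmp.toPath(), "some content".getBytes(StandardCharsets.UTF_8));
            clearFile(tmp);
            return !tmp.exists() || tmp.length() == 0;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```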

memSize

private static int memSize(String s)
Give a rough estimate of how much memory a given string takes
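
The formula used by memSize is not given on this page. A rough estimate of this kind is often a fixed per-object overhead plus two bytes per UTF-16 character, as in the sketch below. The overhead constant is an assumed, JVM-dependent figure, and newer JVMs with compact strings may use one byte per character for Latin-1 text.

```java
class MemSizeSketch {
    // Assumed fixed cost per String in bytes (object headers, fields, array header).
    // The true figure depends on the JVM and its settings.
    static final int ASSUMED_OVERHEAD = 40;

    // Rough estimate: fixed overhead plus two bytes per UTF-16 code unit.
    static int memSize(String s) {
        return ASSUMED_OVERHEAD + 2 * s.length();
    }

    // Shows how such an estimate could drive a "flush when over the limit" decision.
    static boolean wouldExceed(int curBlockMem, String nextLine, int memLimit) {
        return curBlockMem + memSize(nextLine) > memLimit;
    }
}
```

An estimate like this only needs to be good enough to decide when the current block is large enough to flush, which is consistent with the fields above being described as approximate.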