uk.ac.gla.dcs.renaissance.mg4j.trec
Class TRECDocumentCollection

java.lang.Object
  extended by it.unimi.dsi.mg4j.document.AbstractDocumentSequence
      extended by it.unimi.dsi.mg4j.document.AbstractDocumentCollection
          extended by uk.ac.gla.dcs.renaissance.mg4j.AbstractDocumentCollection
              extended by uk.ac.gla.dcs.renaissance.mg4j.trec.TRECDocumentCollection
All Implemented Interfaces:
it.unimi.dsi.io.SafelyCloseable, it.unimi.dsi.lang.FlyweightPrototype<it.unimi.dsi.mg4j.document.DocumentCollection>, it.unimi.dsi.mg4j.document.DocumentCollection, it.unimi.dsi.mg4j.document.DocumentSequence, Closeable, Serializable
Direct Known Subclasses:
WARCDocumentCollection

public class TRECDocumentCollection
extends AbstractDocumentCollection
implements Serializable

A collection for the TREC data set.

Documents are stored as a list of descriptors, each recording the (possibly gzipped) file that contains the document and the start and stop positions within that file. The descriptors are later used, by means of SegmentedInputStream, to read each document's segments.
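
As an illustration of the descriptor idea, the following minimal sketch stores a file index with start and stop offsets and reads exactly that byte range from a stream. The class SegmentDescriptorSketch and its members are hypothetical names introduced here; this is not the actual TRECDocumentDescriptor or SegmentedInputStream code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a document descriptor: which file, and where in it. */
final class SegmentDescriptorSketch {
    final int fileIndex; // index into the array of file names
    final long start;    // offset of the first byte of the document
    final long stop;     // offset just past the last byte of the document

    SegmentDescriptorSketch(int fileIndex, long start, long stop) {
        this.fileIndex = fileIndex;
        this.start = start;
        this.stop = stop;
    }

    /** Reads the bytes in [start, stop) from a stream positioned at offset 0. */
    byte[] read(InputStream in) {
        try {
            long toSkip = start;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) {
                    // skip() may legally return 0; fall back to reading one byte.
                    if (in.read() < 0) throw new IOException("Unexpected end of stream");
                    skipped = 1;
                }
                toSkip -= skipped;
            }
            byte[] out = new byte[(int) (stop - start)];
            int off = 0;
            while (off < out.length) {
                int n = in.read(out, off, out.length - off);
                if (n < 0) throw new IOException("Unexpected end of stream");
                off += n;
            }
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] file = "<DOC>first</DOC>\n<DOC>second</DOC>\n".getBytes(StandardCharsets.US_ASCII);
        // The second document occupies bytes [17, 34) of the sample data.
        SegmentDescriptorSketch second = new SegmentDescriptorSketch(0, 17, 34);
        byte[] bytes = second.read(new ByteArrayInputStream(file));
        System.out.println(new String(bytes, StandardCharsets.US_ASCII));
    }
}
```

For a gzipped file, the same read method could be given a java.util.zip.GZIPInputStream, in which case the offsets would refer to the decompressed data; whether the real descriptors use decompressed offsets is an assumption.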

To interpret a file, we read up to <DOC> and place a start marker there; we then advance to the header and store the URI. An intermediate marker is placed at the end of the document header tag, and a stop marker just before </DOC>.
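
The marker placement described above can be sketched on an in-memory byte array as follows. This is only an illustration: the class TrecMarkerScanSketch is hypothetical, the </DOCHDR> tag name is an assumption based on the usual TREC GOV2 layout, the start marker is assumed to sit on the first byte of <DOC>, and URI extraction is omitted.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: locates start, intermediate and stop markers in TREC-style data. */
final class TrecMarkerScanSketch {
    // Tag names are assumptions based on the usual TREC GOV2 layout.
    static final byte[] DOC_OPEN = "<DOC>".getBytes(StandardCharsets.US_ASCII);
    static final byte[] HDR_CLOSE = "</DOCHDR>".getBytes(StandardCharsets.US_ASCII);
    static final byte[] DOC_CLOSE = "</DOC>".getBytes(StandardCharsets.US_ASCII);

    /** Returns one {start, intermediate, stop} triple per complete document. */
    static List<int[]> scan(byte[] data) {
        List<int[]> markers = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOf(data, DOC_OPEN, pos);   // start marker at <DOC>
            if (start < 0) break;
            int stop = indexOf(data, DOC_CLOSE, start); // stop marker just before </DOC>
            if (stop < 0) break;                        // truncated document: give up
            int hdr = indexOf(data, HDR_CLOSE, start);
            // Intermediate marker right after the header's closing tag, or -1 if absent.
            int intermediate = (hdr >= 0 && hdr < stop) ? hdr + HDR_CLOSE.length : -1;
            markers.add(new int[] { start, intermediate, stop });
            pos = stop + DOC_CLOSE.length;
        }
        return markers;
    }

    /** Naive search for pattern in data, starting at index from; returns -1 if not found. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = ("<DOC>\n<DOCNO>D1</DOCNO>\n<DOCHDR>\nhttp://example.org/\n"
                + "</DOCHDR>\nbody\n</DOC>\n").getBytes(StandardCharsets.US_ASCII);
        for (int[] m : scan(data)) {
            System.out.println("start=" + m[0] + " intermediate=" + m[1] + " stop=" + m[2]);
        }
    }
}
```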

The collection provides both sequential access to all documents via the iterator and random access to a given document via document(int). The two operations are implemented very differently: sequential iteration is much faster than calling document(int) repeatedly.
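
The source does not say why sequential access is faster. One plausible contributing factor, assumed here for illustration only, is that a gzip stream cannot seek, so reaching a document through a fresh stream means decompressing everything before it. The hypothetical AccessCostSketch below counts decompressed bytes for both access patterns; it does not use or model the actual TRECDocumentCollection code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch comparing bytes decompressed by random and sequential access. */
final class AccessCostSketch {
    /** Compresses a byte array with gzip. */
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Each document is fetched through a fresh stream, decompressing up to its stop offset. */
    static long randomAccessCost(byte[] gz, long[][] spans) {
        long total = 0;
        for (long[] span : spans) {
            try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
                total += consume(in, span[1]);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return total;
    }

    /** All documents are fetched in order through a single stream. */
    static long sequentialCost(byte[] gz, long[][] spans) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            long pos = 0;
            for (long[] span : spans) {
                pos += consume(in, span[1] - pos);
            }
            return pos;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads and discards up to n bytes, returning the number actually read. */
    static long consume(InputStream in, long n) throws IOException {
        byte[] buf = new byte[4096];
        long done = 0;
        while (done < n) {
            int r = in.read(buf, 0, (int) Math.min(buf.length, n - done));
            if (r < 0) break;
            done += r;
        }
        return done;
    }

    public static void main(String[] args) {
        byte[] gz = gzip(new byte[30]);                      // three 10-byte "documents"
        long[][] spans = { { 0, 10 }, { 10, 20 }, { 20, 30 } };
        System.out.println("random:     " + randomAccessCost(gz, spans));
        System.out.println("sequential: " + sequentialCost(gz, spans));
    }
}
```

With n documents of similar size, the random-access total in this model grows roughly quadratically with n, while the sequential total grows linearly.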

Author:
Alessio Orlandi, Luca Natali, Benjamin Piwowarski
See Also:
Serialized Form

Nested Class Summary
static class TRECDocumentCollection.Compression
           
static class TRECDocumentCollection.Match
          Useful to match a series of bytes
static class TRECDocumentCollection.Options
           
protected static class TRECDocumentCollection.TRECDocumentDescriptor
          A compact description of the location and of the internal segmentation of a TREC document inside a file.
 
Nested classes/interfaces inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentCollection
it.unimi.dsi.mg4j.document.AbstractDocumentCollection.PropertyKeys
 
Field Summary
static int DEFAULT_BUFFER_SIZE
          Default buffer size, set up after some experiments.
protected  it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors
          The list of document descriptors.
protected  it.unimi.dsi.mg4j.document.DocumentFactory factory
          The document factory.
 
Fields inherited from interface it.unimi.dsi.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
 
Constructor Summary
  TRECDocumentCollection(String[] file, it.unimi.dsi.mg4j.document.DocumentFactory factory, int bufferSize, TRECDocumentCollection.Compression compression)
          Creates a new TREC collection by parsing the given files.
protected TRECDocumentCollection(String[] file, it.unimi.dsi.mg4j.document.DocumentFactory factory, it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors, int bufferSize, TRECDocumentCollection.Compression compression)
          Copy constructor (that is, the one used by copy()).
 
Method Summary
 void close()
           
 TRECDocumentCollection copy()
           
 it.unimi.dsi.mg4j.document.Document document(int n)
          Returns a document
protected static boolean equals(byte[] a, int len, byte[] b)
           
 it.unimi.dsi.mg4j.document.DocumentFactory factory()
          Returns the underlying document factory
 it.unimi.dsi.mg4j.document.DocumentIterator iterator()
          Returns the iterator over the documents.
static void main(String[] arg)
           
 void merge(TRECDocumentCollection other)
          Merges another collection into this one: the gzFile array is rebuilt with the other collection's files appended, and the descriptors of both collections are concatenated.
 it.unimi.dsi.fastutil.objects.Reference2ObjectMap<Enum<?>,Object> metadata(int index)
           
protected  void parseContent(int fileIndex, InputStream is)
          Processes one of the files in order to find the blocks.
static void run(TRECDocumentCollection.Options options, String[] file)
          Parses the document collection and finally stores the TRECDocumentCollection in a file
 int size()
           
 InputStream stream(int n)
           
 
Methods inherited from class uk.ac.gla.dcs.renaissance.mg4j.AbstractDocumentCollection
getReverseIDMap, getReverseIDMap
 
Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, printAllDocuments, toString
 
Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentSequence
filename, finalize, load
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface it.unimi.dsi.mg4j.document.DocumentSequence
filename
 

Field Detail

DEFAULT_BUFFER_SIZE

public static final transient int DEFAULT_BUFFER_SIZE
Default buffer size, set up after some experiments.

See Also:
Constant Field Values

factory

protected it.unimi.dsi.mg4j.document.DocumentFactory factory
The document factory.


descriptors

protected transient it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors
The list of document descriptors. Descriptors within the same file are assumed to be contiguous. Descriptors are saved separately, which is why this field is transient.
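
One common way to save a transient field separately is to write it by hand in custom writeObject and readObject methods. The sketch below shows that general Java technique under the assumption that something similar happens here; the class TransientDescriptorsSketch and the long[] descriptor layout are hypothetical, and the actual storage format of this class is not documented on this page.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a transient descriptor list written by hand during serialization. */
final class TransientDescriptorsSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    final String[] files;
    // Skipped by default serialization; each entry is {fileIndex, start, stop}.
    transient List<long[]> descriptors = new ArrayList<>();

    TransientDescriptorsSketch(String[] files) {
        this.files = files;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();            // writes the non-transient fields
        out.writeInt(descriptors.size());    // then the descriptors, in a compact form
        for (long[] d : descriptors) {
            out.writeLong(d[0]);
            out.writeLong(d[1]);
            out.writeLong(d[2]);
        }
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        int n = in.readInt();
        descriptors = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            descriptors.add(new long[] { in.readLong(), in.readLong(), in.readLong() });
        }
    }

    /** Serializes and deserializes an instance in memory. */
    static TransientDescriptorsSketch roundTrip(TransientDescriptorsSketch c) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(c);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TransientDescriptorsSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TransientDescriptorsSketch c = new TransientDescriptorsSketch(new String[] { "a.gz" });
        c.descriptors.add(new long[] { 0, 0, 10 });
        System.out.println(roundTrip(c).descriptors.size());
    }
}
```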

Constructor Detail

TRECDocumentCollection

public TRECDocumentCollection(String[] file,
                              it.unimi.dsi.mg4j.document.DocumentFactory factory,
                              int bufferSize,
                              TRECDocumentCollection.Compression compression)
                       throws IOException
Creates a new TREC collection by parsing the given files.

Parameters:
file - an array of file names containing documents in TREC GOV2 format.
factory - the document factory (usually, a composite one).
bufferSize - the buffer size.
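compression - the compression scheme of the input files, as a TRECDocumentCollection.Compression value (for example, whether the files are gzipped).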
Throws:
IOException

TRECDocumentCollection

protected TRECDocumentCollection(String[] file,
                                 it.unimi.dsi.mg4j.document.DocumentFactory factory,
                                 it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors,
                                 int bufferSize,
                                 TRECDocumentCollection.Compression compression)
Copy constructor (that is, the one used by copy()). It just initializes final fields.

Method Detail

equals

protected static boolean equals(byte[] a,
                                int len,
                                byte[] b)

main

public static void main(String[] arg)
                 throws IOException,
                        InstantiationException,
                        IllegalAccessException,
                        InvocationTargetException,
                        NoSuchMethodException,
                        bpiwowar.argparser.InvalidHolderException,
                        org.apache.commons.configuration.ConfigurationException,
                        bpiwowar.argparser.ArgParserException
Throws:
IOException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
bpiwowar.argparser.InvalidHolderException
org.apache.commons.configuration.ConfigurationException
bpiwowar.argparser.ArgParserException

run

public static void run(TRECDocumentCollection.Options options,
                       String[] file)
                throws IOException,
                       org.apache.commons.configuration.ConfigurationException
Parses the document collection and finally stores the TRECDocumentCollection in a file

Parameters:
options - the set of options
file - the list of document files to parse
Throws:
IOException
org.apache.commons.configuration.ConfigurationException

close

public void close()
           throws IOException
Specified by:
close in interface it.unimi.dsi.mg4j.document.DocumentSequence
Specified by:
close in interface Closeable
Overrides:
close in class it.unimi.dsi.mg4j.document.AbstractDocumentSequence
Throws:
IOException

copy

public TRECDocumentCollection copy()
Specified by:
copy in interface it.unimi.dsi.lang.FlyweightPrototype<it.unimi.dsi.mg4j.document.DocumentCollection>
Specified by:
copy in interface it.unimi.dsi.mg4j.document.DocumentCollection

document

public it.unimi.dsi.mg4j.document.Document document(int n)
                                             throws IOException
Returns a document

Specified by:
document in interface it.unimi.dsi.mg4j.document.DocumentCollection
Parameters:
n - number of the document to return
Returns:
the document
Throws:
IOException

factory

public it.unimi.dsi.mg4j.document.DocumentFactory factory()
Returns the underlying document factory

Specified by:
factory in interface it.unimi.dsi.mg4j.document.DocumentSequence
Returns:
the document factory

iterator

public it.unimi.dsi.mg4j.document.DocumentIterator iterator()
                                                     throws IOException
Returns the iterator over the documents. Use this method if you want sequential access to the documents.

Specified by:
iterator in interface it.unimi.dsi.mg4j.document.DocumentSequence
Overrides:
iterator in class it.unimi.dsi.mg4j.document.AbstractDocumentCollection
Returns:
the document iterator
Throws:
IOException

merge

public void merge(TRECDocumentCollection other)
Merges another collection into this one: the gzFile array is rebuilt with the other collection's files appended, and the descriptors of both collections are concatenated.

The other collection is assumed to contain no duplicates of documents already in this collection.
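
A minimal sketch of such a merge is given below, under the assumption that each descriptor refers to its file by index, so that the indices of the appended descriptors must be shifted by the number of files already present. The class MergeSketch and the long[] descriptor layout are hypothetical and do not reproduce the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of merging two file-plus-descriptor collections. */
final class MergeSketch {
    String[] files;
    // Each entry is {fileIndex, start, stop}.
    final List<long[]> descriptors = new ArrayList<>();

    MergeSketch(String... files) {
        this.files = files;
    }

    /** Appends the other collection's files and descriptors to this one. */
    void merge(MergeSketch other) {
        int shift = files.length;
        // Rebuild the file array with the other collection's files appended.
        String[] merged = Arrays.copyOf(files, files.length + other.files.length);
        System.arraycopy(other.files, 0, merged, shift, other.files.length);
        files = merged;
        // Concatenate the descriptors, re-pointing them at the shifted file indices.
        for (long[] d : other.descriptors) {
            descriptors.add(new long[] { d[0] + shift, d[1], d[2] });
        }
    }

    public static void main(String[] args) {
        MergeSketch a = new MergeSketch("a.gz");
        a.descriptors.add(new long[] { 0, 0, 100 });
        MergeSketch b = new MergeSketch("b.gz");
        b.descriptors.add(new long[] { 0, 0, 50 });
        a.merge(b);
        System.out.println(Arrays.toString(a.files) + " " + a.descriptors.size());
    }
}
```

The sketch performs no duplicate detection, in line with the assumption stated above that the other collection contains no duplicates.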


metadata

public it.unimi.dsi.fastutil.objects.Reference2ObjectMap<Enum<?>,Object> metadata(int index)
Specified by:
metadata in interface it.unimi.dsi.mg4j.document.DocumentCollection

parseContent

protected void parseContent(int fileIndex,
                            InputStream is)
                     throws IOException
Processes one of the files in order to find the blocks. For each document, this identifies its exact position and length within the set of files (see also TRECDocumentCollection.TRECDocumentDescriptor).

Parameters:
fileIndex - The index in the file array
is - The input stream for this file
Throws:
IOException

size

public int size()
Specified by:
size in interface it.unimi.dsi.mg4j.document.DocumentCollection

stream

public InputStream stream(int n)
                   throws IOException
Specified by:
stream in interface it.unimi.dsi.mg4j.document.DocumentCollection
Throws:
IOException


Copyright © 2011. All Rights Reserved.