uk.ac.gla.dcs.renaissance.mg4j.trec
Class WARCDocumentCollection

java.lang.Object
  extended by it.unimi.dsi.mg4j.document.AbstractDocumentSequence
      extended by it.unimi.dsi.mg4j.document.AbstractDocumentCollection
          extended by uk.ac.gla.dcs.renaissance.mg4j.AbstractDocumentCollection
              extended by uk.ac.gla.dcs.renaissance.mg4j.trec.TRECDocumentCollection
                  extended by uk.ac.gla.dcs.renaissance.mg4j.trec.WARCDocumentCollection
All Implemented Interfaces:
it.unimi.dsi.io.SafelyCloseable, it.unimi.dsi.lang.FlyweightPrototype<it.unimi.dsi.mg4j.document.DocumentCollection>, it.unimi.dsi.mg4j.document.DocumentCollection, it.unimi.dsi.mg4j.document.DocumentSequence, Closeable, Serializable

public class WARCDocumentCollection
extends TRECDocumentCollection

Manages TREC collections provided in WARC format, as used for instance by the TREC Session Track. A document collection essentially consists of a set of descriptors pointing to the locations of the documents in the (possibly zipped) document archive. This set of descriptors is called a sequence.

Author:
Ingo Frommholz
See Also:
BuildWARCFileSequence, TRECDocumentCollection, DocumentFactory, Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class uk.ac.gla.dcs.renaissance.mg4j.trec.TRECDocumentCollection
TRECDocumentCollection.Compression, TRECDocumentCollection.Match, TRECDocumentCollection.Options, TRECDocumentCollection.TRECDocumentDescriptor
 
Nested classes/interfaces inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentCollection
it.unimi.dsi.mg4j.document.AbstractDocumentCollection.PropertyKeys
 
Field Summary
 
Fields inherited from class uk.ac.gla.dcs.renaissance.mg4j.trec.TRECDocumentCollection
DEFAULT_BUFFER_SIZE, descriptors, factory
 
Fields inherited from interface it.unimi.dsi.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
 
Constructor Summary
WARCDocumentCollection(String[] file, it.unimi.dsi.mg4j.document.DocumentFactory factory, int bufferSize, TRECDocumentCollection.Compression compression)
          Creates a new TREC WARC collection by parsing the given files.
WARCDocumentCollection(String[] file, it.unimi.dsi.mg4j.document.DocumentFactory factory, it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors, int bufferSize, TRECDocumentCollection.Compression compression)
          Copy constructor (that is, the one used by TRECDocumentCollection.copy()).
 
Method Summary
protected  void parseContent(int fileIndex, InputStream is)
          Processes one of the files in order to find the blocks.
static void run(TRECDocumentCollection.Options options, String[] file)
          Parses the document collection and finally stores the created WARCDocumentCollection in a file.
 
Methods inherited from class uk.ac.gla.dcs.renaissance.mg4j.trec.TRECDocumentCollection
close, copy, document, equals, factory, iterator, main, merge, metadata, size, stream
 
Methods inherited from class uk.ac.gla.dcs.renaissance.mg4j.AbstractDocumentCollection
getReverseIDMap, getReverseIDMap
 
Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, printAllDocuments, toString
 
Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentSequence
filename, finalize, load
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface it.unimi.dsi.mg4j.document.DocumentSequence
filename
 

Constructor Detail

WARCDocumentCollection

public WARCDocumentCollection(String[] file,
                              it.unimi.dsi.mg4j.document.DocumentFactory factory,
                              int bufferSize,
                              TRECDocumentCollection.Compression compression)
                       throws IOException
Creates a new TREC WARC collection by parsing the given files.

Parameters:
file - an array of file names containing documents in TREC WARC format.
factory - the document factory (usually, a composite one).
bufferSize - the buffer size.
compression - the compression applied to the files (a TRECDocumentCollection.Compression value, for example for gzipped files).
Throws:
IOException

WARCDocumentCollection

public WARCDocumentCollection(String[] file,
                              it.unimi.dsi.mg4j.document.DocumentFactory factory,
                              it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors,
                              int bufferSize,
                              TRECDocumentCollection.Compression compression)
Copy constructor (that is, the one used by TRECDocumentCollection.copy()). Just initializes the final fields.

Method Detail

parseContent

protected void parseContent(int fileIndex,
                            InputStream is)
                     throws IOException
Description copied from class: TRECDocumentCollection
Processes one of the files in order to find the blocks. This identifies, for each document, its exact position and length in the set of files (see also TRECDocumentCollection.TRECDocumentDescriptor).

Overrides:
parseContent in class TRECDocumentCollection
Parameters:
fileIndex - The index in the file array
is - The input stream for this file
Throws:
IOException
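
The idea of locating each document's position and length in a file can be illustrated with a minimal, self-contained sketch. It is not the implementation of parseContent: the class WarcOffsetSketch, its Descriptor type and the scan method are hypothetical names introduced here, and the sketch assumes uncompressed input in which every record starts with a "WARC/" version line and declares its payload size in a Content-Length header.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the documented API.
final class WarcOffsetSketch {

    /** Hypothetical stand-in for a descriptor: file index, start offset, record length in bytes. */
    static final class Descriptor {
        final int fileIndex;
        final long offset;
        final long length;

        Descriptor(int fileIndex, long offset, long length) {
            this.fileIndex = fileIndex;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Number of bytes consumed from the stream so far. */
    private long position;

    /** Reads one line ending in LF (CR is dropped); returns null at end of stream. */
    private String readLine(InputStream is) throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean readAnything = false;
        int b;
        while ((b = is.read()) != -1) {
            position++;
            readAnything = true;
            if (b == '\n') break;
            if (b != '\r') sb.append((char) b);
        }
        return readAnything ? sb.toString() : null;
    }

    /** Scans one stream and returns one descriptor per WARC record found. */
    List<Descriptor> scan(int fileIndex, InputStream is) throws IOException {
        List<Descriptor> descriptors = new ArrayList<>();
        position = 0;
        while (true) {
            long start = position;
            String line = readLine(is);
            if (line == null) break;                  // end of stream
            if (!line.startsWith("WARC/")) continue;  // separator lines between records
            long contentLength = 0;
            // The header block ends at the first empty line.
            while ((line = readLine(is)) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase("Content-Length")) {
                    contentLength = Long.parseLong(line.substring(colon + 1).trim());
                }
            }
            // Skip the payload byte by byte so the position stays exact.
            for (long i = 0; i < contentLength && is.read() != -1; i++) position++;
            descriptors.add(new Descriptor(fileIndex, start, position - start));
        }
        return descriptors;
    }

    public static void main(String[] args) throws IOException {
        String archive = "WARC/1.0\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
                       + "WARC/1.0\r\nContent-Length: 3\r\n\r\nabc\r\n\r\n";
        InputStream is = new ByteArrayInputStream(archive.getBytes(StandardCharsets.ISO_8859_1));
        for (Descriptor d : new WarcOffsetSketch().scan(0, is)) {
            System.out.println("file=" + d.fileIndex + " offset=" + d.offset + " length=" + d.length);
        }
    }
}
```

In this sketch the recorded length covers the header block and the payload but not the blank lines that separate records. Storing only offsets and lengths keeps the descriptors small and lets a document be fetched later by seeking into the file instead of re-parsing it.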

run

public static void run(TRECDocumentCollection.Options options,
                       String[] file)
                throws IOException,
                       org.apache.commons.configuration.ConfigurationException
Parses the document collection and finally stores the created WARCDocumentCollection in a file.

Parameters:
options - the set of options
file - the list of document files to parse. If the list is empty, the files are read from STDIN.
Throws:
IOException
org.apache.commons.configuration.ConfigurationException
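
The fallback to STDIN described for the file parameter could look roughly like the following sketch. It assumes that "read from STDIN" means file names are supplied on standard input, one per line. This page does not confirm that reading, and FileListSketch and resolveFiles are hypothetical names that are not part of this class.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the documented API.
final class FileListSketch {

    /**
     * Returns the given file names unchanged if there are any; otherwise reads
     * one file name per line from the supplied stream (assumed behaviour).
     */
    static String[] resolveFiles(String[] file, InputStream in) throws IOException {
        if (file.length > 0) return file;
        List<String> names = new ArrayList<>();
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (!line.isEmpty()) names.add(line);  // ignore blank lines
        }
        return names.toArray(new String[0]);
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for System.in so the demo never blocks.
        InputStream fakeStdin = new ByteArrayInputStream(
            "part-00.warc.gz\npart-01.warc.gz\n".getBytes(StandardCharsets.UTF_8));
        for (String name : resolveFiles(new String[0], fakeStdin)) {
            System.out.println(name);
        }
    }
}
```

In real use the second argument would be System.in, which allows a long list of archive files to be piped in instead of being passed on the command line.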


Copyright © 2011. All Rights Reserved.