uk.ac.gla.dcs.renaissance.mg4j.trec
Class WARCDocumentCollection
java.lang.Object
it.unimi.dsi.mg4j.document.AbstractDocumentSequence
it.unimi.dsi.mg4j.document.AbstractDocumentCollection
uk.ac.gla.dcs.renaissance.mg4j.AbstractDocumentCollection
uk.ac.gla.dcs.renaissance.mg4j.trec.TRECDocumentCollection
uk.ac.gla.dcs.renaissance.mg4j.trec.WARCDocumentCollection
- All Implemented Interfaces:
- it.unimi.dsi.io.SafelyCloseable, it.unimi.dsi.lang.FlyweightPrototype<it.unimi.dsi.mg4j.document.DocumentCollection>, it.unimi.dsi.mg4j.document.DocumentCollection, it.unimi.dsi.mg4j.document.DocumentSequence, Closeable, Serializable
public class WARCDocumentCollection
- extends TRECDocumentCollection
Manages TREC collections provided in WARC format, as used for instance
by the TREC Session Track. A document collection essentially consists of a set
of descriptors pointing to the relevant locations in the (possibly zipped)
document archive. This set of descriptors is called a sequence.
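The idea of a descriptor pointing into an archive can be illustrated with a minimal sketch. The `Descriptor` class and `read` method below are hypothetical and are not the fields or methods of `TRECDocumentCollection.TRECDocumentDescriptor`. The sketch assumes an uncompressed archive file, because seeking to a byte offset is not directly possible in a gzipped stream.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration only; not part of the documented API. */
class DescriptorLookupSketch {

    /** Assumed descriptor layout: which file, where the document starts, how long it is. */
    static final class Descriptor {
        final int fileIndex;
        final long offset;
        final int length;

        Descriptor(int fileIndex, long offset, int length) {
            this.fileIndex = fileIndex;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Fetches the bytes of one document by seeking to the position the descriptor records. */
    static byte[] read(String[] files, Descriptor d) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(files[d.fileIndex], "r")) {
            raf.seek(d.offset);
            byte[] buf = new byte[d.length];
            raf.readFully(buf);
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        // A toy "archive": two documents separated by filler bytes.
        Path tmp = Files.createTempFile("archive", ".txt");
        Files.write(tmp, "AAAAfirst docBBBBsecond".getBytes(StandardCharsets.US_ASCII));
        String[] files = { tmp.toString() };

        Descriptor second = new Descriptor(0, 17, 6);
        System.out.println(new String(read(files, second), StandardCharsets.US_ASCII));
        Files.delete(tmp);
    }
}
```

Once the descriptors exist, a document can be retrieved without rescanning the archive, which is why the collection only needs to keep the descriptors rather than the documents themselves.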
- Author:
- Ingo Frommholz
- See Also:
- BuildWARCFileSequence, TRECDocumentCollection, DocumentFactory, Serialized Form
Nested classes/interfaces inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentCollection
it.unimi.dsi.mg4j.document.AbstractDocumentCollection.PropertyKeys
Fields inherited from interface it.unimi.dsi.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
Methods inherited from class uk.ac.gla.dcs.renaissance.mg4j.trec.TRECDocumentCollection
close, copy, document, equals, factory, iterator, main, merge, metadata, size, stream
Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, printAllDocuments, toString
Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentSequence
filename, finalize, load
Methods inherited from interface it.unimi.dsi.mg4j.document.DocumentSequence
filename
WARCDocumentCollection
public WARCDocumentCollection(String[] file,
it.unimi.dsi.mg4j.document.DocumentFactory factory,
int bufferSize,
TRECDocumentCollection.Compression compression)
throws IOException
- Creates a new TREC WARC collection by parsing the given files.
- Parameters:
file - an array of file names containing documents in TREC WARC format.
factory - the document factory (usually, a composite one).
bufferSize - the buffer size.
compression - the compression applied to the files (for example, gzip).
- Throws:
IOException
WARCDocumentCollection
public WARCDocumentCollection(String[] file,
it.unimi.dsi.mg4j.document.DocumentFactory factory,
it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors,
int bufferSize,
TRECDocumentCollection.Compression compression)
- Copy constructor (that is, the one used by TRECDocumentCollection.copy()). Just initializes final fields.
parseContent
protected void parseContent(int fileIndex,
InputStream is)
throws IOException
- Description copied from class:
TRECDocumentCollection
- Processes one of the files in order to find the blocks. This identifies,
for each document, its exact position and length in the set of files (see
also TRECDocumentCollection.TRECDocumentDescriptor in this class).
- Overrides:
parseContent
in class TRECDocumentCollection
- Parameters:
fileIndex - The index in the file array
is - The input stream for this file
- Throws:
IOException
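As an illustration of what "finding the blocks" can look like for WARC input, the following is a minimal sketch and not the actual implementation of this method. It assumes that each record starts with a `WARC/` version line, that header lines end at the first empty line, and that a `Content-Length` header gives the size of the content block. The class `WarcOffsetScanSketch` and its `Descriptor` type are hypothetical names introduced only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of the documented API. */
class WarcOffsetScanSketch {

    /** Assumed descriptor layout: file index, start of the content block, content length. */
    static final class Descriptor {
        final int fileIndex;
        final long start;
        final long length;

        Descriptor(int fileIndex, long start, long length) {
            this.fileIndex = fileIndex;
            this.start = start;
            this.length = length;
        }
    }

    /** Reads one line, strips the line terminator, and adds the bytes consumed to pos[0]. */
    static String readLine(InputStream is, long[] pos) throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean any = false;
        int b;
        while ((b = is.read()) != -1) {
            any = true;
            pos[0]++;
            if (b == '\n') break;
            sb.append((char) b);
        }
        if (!any) return null; // end of stream
        int n = sb.length();
        if (n > 0 && sb.charAt(n - 1) == '\r') sb.setLength(n - 1);
        return sb.toString();
    }

    /** Scans one input stream and records position and length of every content block. */
    static List<Descriptor> scan(int fileIndex, InputStream is) throws IOException {
        List<Descriptor> out = new ArrayList<>();
        long[] pos = { 0 };
        String line;
        while ((line = readLine(is, pos)) != null) {
            if (!line.startsWith("WARC/")) continue; // separator lines between records
            long contentLength = -1;
            while ((line = readLine(is, pos)) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase("Content-Length")) {
                    contentLength = Long.parseLong(line.substring(colon + 1).trim());
                }
            }
            if (contentLength < 0) continue; // no length header: skip this record
            out.add(new Descriptor(fileIndex, pos[0], contentLength));
            long remaining = contentLength;
            while (remaining > 0) { // jump over the content block without parsing it
                long skipped = is.skip(remaining);
                if (skipped <= 0) {
                    if (is.read() == -1) break;
                    skipped = 1;
                }
                remaining -= skipped;
                pos[0] += skipped;
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "WARC/1.0\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
        for (Descriptor d : scan(0, new ByteArrayInputStream(data))) {
            System.out.println(d.fileIndex + " " + d.start + " " + d.length);
        }
    }
}
```

Skipping each content block by its declared length, rather than searching it for the next record marker, keeps the scan from being confused by document text that happens to contain `WARC/`.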
run
public static void run(TRECDocumentCollection.Options options,
String[] file)
throws IOException,
org.apache.commons.configuration.ConfigurationException
- Parses the document collection and stores the created
WARCDocumentCollection in a file.
- Parameters:
options - the set of options
file - the list of document files to parse. If the list is
empty, the files are read from STDIN
- Throws:
IOException
org.apache.commons.configuration.ConfigurationException
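The fallback to STDIN described for the file parameter might be sketched as follows. This is an assumption about the behaviour, not the code of this method: the sketch supposes that file names arrive one per line on standard input and that blank lines are ignored. `FileListSketch` and `resolveFiles` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of the documented API. */
class FileListSketch {

    /**
     * Returns the given file list unchanged if it is non-empty; otherwise reads
     * file names from the fallback reader, one per line (assumed input format).
     */
    static String[] resolveFiles(String[] file, Reader fallback) throws IOException {
        if (file.length > 0) return file;
        List<String> names = new ArrayList<>();
        BufferedReader reader = new BufferedReader(fallback);
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (!line.isEmpty()) names.add(line);
        }
        return names.toArray(new String[0]);
    }

    public static void main(String[] args) throws IOException {
        // With no arguments, file names are taken from standard input.
        String[] files = resolveFiles(args, new InputStreamReader(System.in));
        for (String f : files) {
            System.out.println(f);
        }
    }
}
```

Passing the fallback as a `Reader` rather than reading `System.in` directly keeps the sketch easy to exercise with in-memory input.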
Copyright © 2011. All Rights Reserved.