java.lang.Object
it.unimi.dsi.mg4j.document.AbstractDocumentSequence
it.unimi.dsi.mg4j.document.AbstractDocumentCollection
uk.ac.gla.dcs.renaissance.mg4j.AbstractDocumentCollection
uk.ac.gla.dcs.renaissance.mg4j.trec.TRECDocumentCollection
public class TRECDocumentCollection
A collection for the TREC data set.
The documents are stored as a set of descriptors, each recording the (possibly
gzipped) file that contains the document and the start and stop positions in
that file. Descriptors are later turned back into document streams by means of
SegmentedInputStream.
To interpret a file, we read up to <DOC> and place a start marker there, then advance to the header and store the URI. An intermediate marker is placed at the end of the document header tag, and a stop marker just before </DOC>.
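The marker placement described above can be sketched as follows. This is a minimal, self-contained illustration and not the class's actual parser: the names DocMarkerSketch, Span, scan and body are hypothetical, the whole file is assumed to fit in a byte array, and the header is assumed to be closed by a </DOCHDR> tag. URI extraction is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class DocMarkerSketch {
    // Tag names are assumptions made for this illustration.
    static final byte[] DOC_OPEN = "<DOC>".getBytes(StandardCharsets.US_ASCII);
    static final byte[] HDR_CLOSE = "</DOCHDR>".getBytes(StandardCharsets.US_ASCII);
    static final byte[] DOC_CLOSE = "</DOC>".getBytes(StandardCharsets.US_ASCII);

    /** Hypothetical descriptor: three byte offsets within one file. */
    static final class Span {
        final int start;     // offset of "<DOC>"
        final int headerEnd; // just past "</DOCHDR>", or just past "<DOC>" if no header
        final int stop;      // offset of "</DOC>"

        Span(int start, int headerEnd, int stop) {
            this.start = start;
            this.headerEnd = headerEnd;
            this.stop = stop;
        }
    }

    /** Returns the first offset >= from at which pattern occurs, or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    /** Records start, intermediate and stop markers for every complete document. */
    static List<Span> scan(byte[] data) {
        List<Span> spans = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOf(data, DOC_OPEN, pos);
            if (start < 0) break;
            int stop = indexOf(data, DOC_CLOSE, start + DOC_OPEN.length);
            if (stop < 0) break; // truncated last document: ignored in this sketch
            int hdrClose = indexOf(data, HDR_CLOSE, start);
            int headerEnd = (hdrClose >= 0 && hdrClose < stop)
                    ? hdrClose + HDR_CLOSE.length
                    : start + DOC_OPEN.length;
            spans.add(new Span(start, headerEnd, stop));
            pos = stop + DOC_CLOSE.length;
        }
        return spans;
    }

    /** Content between the intermediate marker and the stop marker, trimmed. */
    static String body(byte[] data, Span s) {
        return new String(data, s.headerEnd, s.stop - s.headerEnd,
                StandardCharsets.US_ASCII).trim();
    }
}
```

Storing only offsets keeps each descriptor small; the document bytes are read again only when a document is actually requested.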
The collection provides both sequential access to all documents, via the
iterator, and random access to a given document. The two operations are
implemented very differently: iterating sequentially is much more efficient
than calling document(int) repeatedly.
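The documentation does not say why the two access paths differ so much. One plausible contributor, when the files are gzipped, is that a gzip stream cannot be repositioned: reaching a document at a given offset means decompressing everything before it, whereas a sequential pass decompresses each file once. The sketch below illustrates that cost using only the JDK; GzipAccessSketch and its methods are hypothetical names and are not part of this class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipAccessSketch {
    /** Compresses raw bytes in memory (stands in for a gzipped TREC file). */
    static byte[] gzip(byte[] raw) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(raw);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Random access to [start, start + length) of the uncompressed data.
     * Every byte before start must be inflated and discarded first.
     */
    static String readSegment(byte[] gzipped, long start, int length) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            long remaining = start;
            while (remaining > 0) {
                long skipped = in.skip(remaining); // inflates and throws the bytes away
                if (skipped <= 0) {
                    if (in.read() < 0) throw new EOFException("offset past end of stream");
                    skipped = 1;
                }
                remaining -= skipped;
            }
            byte[] buf = new byte[length];
            int filled = 0;
            while (filled < length) {
                int r = in.read(buf, filled, length - filled);
                if (r < 0) throw new EOFException("segment past end of stream");
                filled += r;
            }
            return new String(buf, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling readSegment once per document repeats the skip from the start of the file each time, so the total work grows roughly quadratically with the number of documents in a file, while a single forward pass stays linear.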
| Nested Class Summary | |
|---|---|
| static class | TRECDocumentCollection.Compression |
| static class | TRECDocumentCollection.Match - Useful to match a series of bytes. |
| static class | TRECDocumentCollection.Options |
| protected static class | TRECDocumentCollection.TRECDocumentDescriptor - A compact description of the location and of the internal segmentation of a TREC document inside a file. |
| Nested classes/interfaces inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentCollection |
|---|
it.unimi.dsi.mg4j.document.AbstractDocumentCollection.PropertyKeys |
| Field Summary | |
|---|---|
| static int | DEFAULT_BUFFER_SIZE - Default buffer size, set up after some experiments. |
| protected it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> | descriptors - The list of document descriptors. |
| protected it.unimi.dsi.mg4j.document.DocumentFactory | factory - The document factory. |
| Fields inherited from interface it.unimi.dsi.mg4j.document.DocumentCollection |
|---|
DEFAULT_EXTENSION |
| Constructor Summary | |
|---|---|
| | TRECDocumentCollection(String[] file, it.unimi.dsi.mg4j.document.DocumentFactory factory, int bufferSize, TRECDocumentCollection.Compression compression) - Creates a new TREC collection by parsing the given files. |
| protected | TRECDocumentCollection(String[] file, it.unimi.dsi.mg4j.document.DocumentFactory factory, it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors, int bufferSize, TRECDocumentCollection.Compression compression) - Copy constructor (that is, the one used by copy()). |
| Method Summary | |
|---|---|
| void | close() |
| TRECDocumentCollection | copy() |
| it.unimi.dsi.mg4j.document.Document | document(int n) - Returns the document with the given number. |
| protected static boolean | equals(byte[] a, int len, byte[] b) |
| it.unimi.dsi.mg4j.document.DocumentFactory | factory() - Returns the underlying document factory. |
| it.unimi.dsi.mg4j.document.DocumentIterator | iterator() - Returns the iterator over the documents. |
| static void | main(String[] arg) |
| void | merge(TRECDocumentCollection other) - Merges another collection into this one: the gzFile array is rebuilt with the other collection's files appended, and the descriptors are concatenated and rebuilt. |
| it.unimi.dsi.fastutil.objects.Reference2ObjectMap<Enum<?>,Object> | metadata(int index) |
| protected void | parseContent(int fileIndex, InputStream is) - Processes one of the files in order to find the blocks. |
| static void | run(TRECDocumentCollection.Options options, String[] file) - Parses the document collection and finally stores the TRECDocumentCollection in a file. |
| int | size() |
| InputStream | stream(int n) |
| Methods inherited from class uk.ac.gla.dcs.renaissance.mg4j.AbstractDocumentCollection |
|---|
getReverseIDMap, getReverseIDMap |
| Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentCollection |
|---|
ensureDocumentIndex, printAllDocuments, toString |
| Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentSequence |
|---|
filename, finalize, load |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Methods inherited from interface it.unimi.dsi.mg4j.document.DocumentSequence |
|---|
filename |
| Field Detail |
|---|
public static final transient int DEFAULT_BUFFER_SIZE
protected it.unimi.dsi.mg4j.document.DocumentFactory factory
protected transient it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors
| Constructor Detail |
|---|
public TRECDocumentCollection(String[] file,
it.unimi.dsi.mg4j.document.DocumentFactory factory,
int bufferSize,
TRECDocumentCollection.Compression compression)
throws IOException
file - an array of file names containing documents in TREC GOV2 format.
factory - the document factory (usually, a composite one).
bufferSize - the buffer size.
compression - the compression used by the files, if any (for example, gzip).
IOException
protected TRECDocumentCollection(String[] file,
it.unimi.dsi.mg4j.document.DocumentFactory factory,
it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors,
int bufferSize,
TRECDocumentCollection.Compression compression)
Copy constructor (that is, the one used by copy()). Just initializes final fields.
| Method Detail |
|---|
protected static boolean equals(byte[] a,
int len,
byte[] b)
public static void main(String[] arg)
throws IOException,
InstantiationException,
IllegalAccessException,
InvocationTargetException,
NoSuchMethodException,
bpiwowar.argparser.InvalidHolderException,
org.apache.commons.configuration.ConfigurationException,
bpiwowar.argparser.ArgParserException
IOException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
bpiwowar.argparser.InvalidHolderException
org.apache.commons.configuration.ConfigurationException
bpiwowar.argparser.ArgParserException
public static void run(TRECDocumentCollection.Options options,
String[] file)
throws IOException,
org.apache.commons.configuration.ConfigurationException
options - the set of options
file - the list of document files to parse
IOException
org.apache.commons.configuration.ConfigurationException
public void close()
throws IOException
close in interface it.unimi.dsi.mg4j.document.DocumentSequence
close in interface Closeable
close in class it.unimi.dsi.mg4j.document.AbstractDocumentSequence
IOException
public TRECDocumentCollection copy()
copy in interface it.unimi.dsi.lang.FlyweightPrototype<it.unimi.dsi.mg4j.document.DocumentCollection>
copy in interface it.unimi.dsi.mg4j.document.DocumentCollection
public it.unimi.dsi.mg4j.document.Document document(int n)
throws IOException
document in interface it.unimi.dsi.mg4j.document.DocumentCollection
n - number of the document to return
IOException
public it.unimi.dsi.mg4j.document.DocumentFactory factory()
factory in interface it.unimi.dsi.mg4j.document.DocumentSequence
public it.unimi.dsi.mg4j.document.DocumentIterator iterator()
throws IOException
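iterator in interface it.unimi.dsi.mg4j.document.DocumentSequence
iterator in class it.unimi.dsi.mg4j.document.AbstractDocumentCollection
IOException
public void merge(TRECDocumentCollection other)
Merges another collection into this one: the gzFile array is rebuilt with the other collection's files appended, and the descriptors are concatenated and rebuilt. The passed collection is assumed to contain no documents that duplicate those already in this collection.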
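A merge of this kind can be sketched as below, assuming each descriptor refers to its file by index into the file array (as the fileIndex parameter of parseContent suggests). The MergeSketch and Descriptor types are hypothetical stand-ins and do not reproduce the real fields of this class; the point is only that descriptors taken from the other collection need their file index shifted by the number of files already present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeSketch {
    /** Hypothetical descriptor: file index plus start/stop offsets in that file. */
    static final class Descriptor {
        final int fileIndex;
        final long start;
        final long stop;

        Descriptor(int fileIndex, long start, long stop) {
            this.fileIndex = fileIndex;
            this.start = start;
            this.stop = stop;
        }
    }

    String[] files;
    final List<Descriptor> descriptors;

    MergeSketch(String[] files, List<Descriptor> descriptors) {
        this.files = files.clone();
        this.descriptors = new ArrayList<>(descriptors);
    }

    /** Appends the other collection's files and re-bases its descriptors. */
    void merge(MergeSketch other) {
        int shift = files.length; // index of the first appended file
        String[] merged = Arrays.copyOf(files, files.length + other.files.length);
        System.arraycopy(other.files, 0, merged, shift, other.files.length);
        files = merged;
        for (Descriptor d : other.descriptors) {
            descriptors.add(new Descriptor(d.fileIndex + shift, d.start, d.stop));
        }
    }
}
```

No deduplication is attempted here, which matches the stated precondition that the two collections do not overlap.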
public it.unimi.dsi.fastutil.objects.Reference2ObjectMap<Enum<?>,Object> metadata(int index)
metadata in interface it.unimi.dsi.mg4j.document.DocumentCollection
protected void parseContent(int fileIndex,
InputStream is)
throws IOException
Processes one of the files in order to find the blocks (see TRECDocumentCollection.TRECDocumentDescriptor in this class).
fileIndex - The index in the file array
is - The input stream for this file
IOException
public int size()
size in interface it.unimi.dsi.mg4j.document.DocumentCollection
public InputStream stream(int n)
throws IOException
stream in interface it.unimi.dsi.mg4j.document.DocumentCollection
IOException