java.lang.Object
it.unimi.dsi.mg4j.document.AbstractDocumentSequence
it.unimi.dsi.mg4j.document.AbstractDocumentCollection
uk.ac.gla.dcs.renaissance.mg4j.AbstractDocumentCollection
uk.ac.gla.dcs.renaissance.mg4j.trec.TRECDocumentCollection
public class TRECDocumentCollection
A collection for the TREC data set.
The documents are stored as a set of descriptors, representing the (possibly
gzipped) file they are contained in and the start and stop position in that
file. To manage descriptors later we rely on SegmentedInputStream.
To interpret a file, we read up to <DOC> and place a start marker there, then advance to the header and store the URI. An intermediate marker is placed at the end of the document header tag, and a stop marker just before </DOC>.
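The marker placement described above can be sketched with plain Java as follows. This is only an illustration under stated assumptions, not the parser used by this class: the names `TrecMarkerScan`, `Markers`, `indexOf` and `scan` are hypothetical, the whole file is assumed to be in memory as a byte array, and the header is assumed to be a `<DOCHDR>` ... `</DOCHDR>` block whose first line is the URI.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the parser used by TRECDocumentCollection.
class TrecMarkerScan {

    /** Marker offsets for one document, plus the URI found in its header. */
    static final class Markers {
        final int start;      // offset of "<DOC>" (start marker)
        final int headerEnd;  // offset just past "</DOCHDR>" (intermediate marker)
        final int stop;       // offset of "</DOC>" (content ends just before it)
        final String uri;     // first non-blank line of the header

        Markers(int start, int headerEnd, int stop, String uri) {
            this.start = start;
            this.headerEnd = headerEnd;
            this.stop = stop;
            this.uri = uri;
        }
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // The tag names are an assumption about the input format.
    private static final byte[] DOC_OPEN = ascii("<DOC>");
    private static final byte[] DOC_CLOSE = ascii("</DOC>");
    private static final byte[] HDR_OPEN = ascii("<DOCHDR>");
    private static final byte[] HDR_CLOSE = ascii("</DOCHDR>");

    /** Returns the first index >= from at which pattern occurs in data, or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    /** Scans data and returns one Markers entry per complete document. */
    static List<Markers> scan(byte[] data) {
        List<Markers> result = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOf(data, DOC_OPEN, pos);
            if (start < 0) break;
            int hdrOpen = indexOf(data, HDR_OPEN, start);
            if (hdrOpen < 0) break;
            int hdrStart = hdrOpen + HDR_OPEN.length;
            int hdrClose = indexOf(data, HDR_CLOSE, hdrStart);
            if (hdrClose < 0) break;
            int headerEnd = hdrClose + HDR_CLOSE.length;
            int stop = indexOf(data, DOC_CLOSE, headerEnd);
            if (stop < 0) break; // truncated document: stop scanning
            String header = new String(data, hdrStart, hdrClose - hdrStart,
                    StandardCharsets.ISO_8859_1);
            String uri = header.trim().split("\\R", 2)[0].trim();
            result.add(new Markers(start, headerEnd, stop, uri));
            pos = stop + DOC_CLOSE.length;
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = ascii("<DOC>\n<DOCNO>D1</DOCNO>\n<DOCHDR>\n"
                + "http://example.org/a\n</DOCHDR>\nhello\n</DOC>\n");
        for (Markers m : scan(data)) {
            System.out.println(m.uri + " " + m.start + " " + m.headerEnd + " " + m.stop);
        }
    }
}
```

The sketch does not handle a document without a header; a real parser would also work on a stream rather than on a fully loaded array.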
The collection provides both sequential access to all documents via the
iterator and random access to a given document. However, the two operations
are implemented very differently, and sequential access is much faster than
calling document(int) repeatedly.
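The page does not say why random access is slower. One contributing factor for gzipped files is that a gzip stream cannot be seeked, so reaching a given uncompressed offset means decompressing everything before it. The sketch below shows such a random-access read using only `java.util.zip`; the class `SegmentRead` and the method `readSegment` are hypothetical and are not part of this class or of SegmentedInputStream.

```java
import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: random access to one segment of a (possibly gzipped) file.
class SegmentRead {

    /** Reads the uncompressed bytes [start, stop) of file, gunzipping if gzipped is true. */
    static byte[] readSegment(Path file, boolean gzipped, long start, long stop)
            throws IOException {
        try (InputStream raw = new BufferedInputStream(Files.newInputStream(file));
             InputStream in = gzipped ? new GZIPInputStream(raw) : raw) {
            // Skipping a gzip stream still inflates every skipped byte.
            long toSkip = start;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) {
                    // skip() may return 0; fall back to reading one byte.
                    if (in.read() < 0) throw new EOFException("start is past the end");
                    skipped = 1;
                }
                toSkip -= skipped;
            }
            byte[] out = new byte[(int) (stop - start)];
            int off = 0;
            while (off < out.length) {
                int n = in.read(out, off, out.length - off);
                if (n < 0) throw new EOFException("stop is past the end");
                off += n;
            }
            return out;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("segment", ".gz");
        try (OutputStream o = new GZIPOutputStream(Files.newOutputStream(tmp))) {
            o.write("first document|second document".getBytes(StandardCharsets.US_ASCII));
        }
        byte[] segment = readSegment(tmp, true, 15, 30);
        System.out.println(new String(segment, StandardCharsets.US_ASCII));
        Files.delete(tmp);
    }
}
```

An iterator that walks the files in order pays the decompression cost once per file, whereas each random-access read in this sketch pays it again from the beginning of the file.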
Nested Class Summary | |
---|---|
static class | TRECDocumentCollection.Compression |
static class | TRECDocumentCollection.Match - Useful to match a series of bytes. |
static class | TRECDocumentCollection.Options |
protected static class | TRECDocumentCollection.TRECDocumentDescriptor - A compact description of the location and of the internal segmentation of a TREC document inside a file. |
Nested classes/interfaces inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentCollection |
---|
it.unimi.dsi.mg4j.document.AbstractDocumentCollection.PropertyKeys |
Field Summary | |
---|---|
static int | DEFAULT_BUFFER_SIZE - Default buffer size, set up after some experiments. |
protected it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> | descriptors - The list of document descriptors. |
protected it.unimi.dsi.mg4j.document.DocumentFactory | factory - The document factory. |
Fields inherited from interface it.unimi.dsi.mg4j.document.DocumentCollection |
---|
DEFAULT_EXTENSION |
Constructor Summary | |
---|---|
public | TRECDocumentCollection(String[] file, it.unimi.dsi.mg4j.document.DocumentFactory factory, int bufferSize, TRECDocumentCollection.Compression compression) - Creates a new TREC collection by parsing the given files. |
protected | TRECDocumentCollection(String[] file, it.unimi.dsi.mg4j.document.DocumentFactory factory, it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors, int bufferSize, TRECDocumentCollection.Compression compression) - Copy constructor (that is, the one used by copy()). |
Method Summary | |
---|---|
void | close() |
TRECDocumentCollection | copy() |
it.unimi.dsi.mg4j.document.Document | document(int n) - Returns the document with the given number. |
protected static boolean | equals(byte[] a, int len, byte[] b) |
it.unimi.dsi.mg4j.document.DocumentFactory | factory() - Returns the underlying document factory. |
it.unimi.dsi.mg4j.document.DocumentIterator | iterator() - Returns the iterator over the documents. |
static void | main(String[] arg) |
void | merge(TRECDocumentCollection other) - Merges another collection into this one by rebuilding the gzFile array with the other collection's array appended, and concatenating the descriptors while rebuilding them all. |
it.unimi.dsi.fastutil.objects.Reference2ObjectMap<Enum<?>,Object> | metadata(int index) |
protected void | parseContent(int fileIndex, InputStream is) - Processes one of the files in order to find the blocks. |
static void | run(TRECDocumentCollection.Options options, String[] file) - Parses the document collection and finally stores the TRECDocumentCollection in a file. |
int | size() |
InputStream | stream(int n) |
Methods inherited from class uk.ac.gla.dcs.renaissance.mg4j.AbstractDocumentCollection |
---|
getReverseIDMap, getReverseIDMap |
Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentCollection |
---|
ensureDocumentIndex, printAllDocuments, toString |
Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentSequence |
---|
filename, finalize, load |
Methods inherited from class java.lang.Object |
---|
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Methods inherited from interface it.unimi.dsi.mg4j.document.DocumentSequence |
---|
filename |
Field Detail |
---|
public static final transient int DEFAULT_BUFFER_SIZE
protected it.unimi.dsi.mg4j.document.DocumentFactory factory
protected transient it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors
Constructor Detail |
---|
public TRECDocumentCollection(String[] file, it.unimi.dsi.mg4j.document.DocumentFactory factory, int bufferSize, TRECDocumentCollection.Compression compression) throws IOException
Parameters:
file - an array of file names containing documents in TREC GOV2 format.
factory - the document factory (usually, a composite one).
bufferSize - the buffer size.
compression - the compression used for the files (for example, gzipped).
Throws:
IOException
protected TRECDocumentCollection(String[] file, it.unimi.dsi.mg4j.document.DocumentFactory factory, it.unimi.dsi.fastutil.objects.ObjectArrayList<TRECDocumentCollection.TRECDocumentDescriptor> descriptors, int bufferSize, TRECDocumentCollection.Compression compression)
Copy constructor (that is, the one used by copy()). Just initializes final fields.
Method Detail |
---|
protected static boolean equals(byte[] a, int len, byte[] b)
public static void main(String[] arg) throws IOException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException, bpiwowar.argparser.InvalidHolderException, org.apache.commons.configuration.ConfigurationException, bpiwowar.argparser.ArgParserException
Throws:
IOException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
bpiwowar.argparser.InvalidHolderException
org.apache.commons.configuration.ConfigurationException
bpiwowar.argparser.ArgParserException
public static void run(TRECDocumentCollection.Options options, String[] file) throws IOException, org.apache.commons.configuration.ConfigurationException
Parameters:
options - the set of options
file - the list of document files to parse
Throws:
IOException
org.apache.commons.configuration.ConfigurationException
public void close() throws IOException
Specified by:
close in interface it.unimi.dsi.mg4j.document.DocumentSequence
Specified by:
close in interface Closeable
Overrides:
close in class it.unimi.dsi.mg4j.document.AbstractDocumentSequence
Throws:
IOException
public TRECDocumentCollection copy()
Specified by:
copy in interface it.unimi.dsi.lang.FlyweightPrototype<it.unimi.dsi.mg4j.document.DocumentCollection>
Specified by:
copy in interface it.unimi.dsi.mg4j.document.DocumentCollection
public it.unimi.dsi.mg4j.document.Document document(int n) throws IOException
Specified by:
document in interface it.unimi.dsi.mg4j.document.DocumentCollection
Parameters:
n - number of the document to return
Throws:
IOException
public it.unimi.dsi.mg4j.document.DocumentFactory factory()
Specified by:
factory in interface it.unimi.dsi.mg4j.document.DocumentSequence
public it.unimi.dsi.mg4j.document.DocumentIterator iterator() throws IOException
Specified by:
iterator in interface it.unimi.dsi.mg4j.document.DocumentSequence
Overrides:
iterator in class it.unimi.dsi.mg4j.document.AbstractDocumentCollection
Throws:
IOException
public void merge(TRECDocumentCollection other)
The passed collection is assumed to contain no documents that duplicate those of this collection.
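The following is a minimal sketch of a merge along those lines, assuming that each descriptor refers to its file by an index into the file array (as the fileIndex parameter of parseContent suggests), so that the indices of the appended descriptors must be shifted by the length of the local array. The names `MergeSketch` and `Descriptor` are hypothetical, and the actual implementation may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: merging two collections of (file index, start, stop) descriptors.
class MergeSketch {

    /** Illustrative descriptor: which file a document is in and where. */
    static final class Descriptor {
        final int fileIndex;
        final long start;
        final long stop;

        Descriptor(int fileIndex, long start, long stop) {
            this.fileIndex = fileIndex;
            this.start = start;
            this.stop = stop;
        }
    }

    String[] files;
    final List<Descriptor> descriptors;

    MergeSketch(String[] files, List<Descriptor> descriptors) {
        this.files = files.clone();
        this.descriptors = new ArrayList<>(descriptors);
    }

    /** Appends the files and descriptors of other; no duplicate check is made. */
    void merge(MergeSketch other) {
        int offset = files.length;
        // Rebuild the file array with the other collection's files appended.
        String[] merged = Arrays.copyOf(files, offset + other.files.length);
        System.arraycopy(other.files, 0, merged, offset, other.files.length);
        files = merged;
        // Re-create each appended descriptor so that it points into the new array.
        for (Descriptor d : other.descriptors) {
            descriptors.add(new Descriptor(d.fileIndex + offset, d.start, d.stop));
        }
    }

    public static void main(String[] args) {
        MergeSketch a = new MergeSketch(new String[] {"a.gz"},
                Arrays.asList(new Descriptor(0, 0L, 100L)));
        MergeSketch b = new MergeSketch(new String[] {"b.gz"},
                Arrays.asList(new Descriptor(0, 0L, 50L)));
        a.merge(b);
        System.out.println(a.files.length + " files, " + a.descriptors.size() + " descriptors");
    }
}
```

Because nothing is deduplicated, a caller of such a merge is responsible for ensuring that the two collections are disjoint.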
public it.unimi.dsi.fastutil.objects.Reference2ObjectMap<Enum<?>,Object> metadata(int index)
Specified by:
metadata in interface it.unimi.dsi.mg4j.document.DocumentCollection
protected void parseContent(int fileIndex, InputStream is) throws IOException
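Processes one of the files in order to find the blocks, which are recorded as descriptors (TRECDocumentCollection.TRECDocumentDescriptor in this class).
Parameters:
fileIndex - The index in the file array
is - The input stream for this file
Throws:
IOException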
public int size()
Specified by:
size in interface it.unimi.dsi.mg4j.document.DocumentCollection
public InputStream stream(int n) throws IOException
Specified by:
stream in interface it.unimi.dsi.mg4j.document.DocumentCollection
Throws:
IOException