uk.ac.gla.dcs.renaissance.mg4j.trec
Class TRECDocumentFactory

java.lang.Object
  extended by it.unimi.dsi.mg4j.document.AbstractDocumentFactory
      extended by it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
          extended by uk.ac.gla.dcs.renaissance.mg4j.trec.TRECDocumentFactory
All Implemented Interfaces:
it.unimi.dsi.lang.FlyweightPrototype<it.unimi.dsi.mg4j.document.DocumentFactory>, it.unimi.dsi.mg4j.document.DocumentFactory, Serializable

public class TRECDocumentFactory
extends it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory

A factory that provides fields for the body and title of HTML documents. It uses a BulletParser internally. A default encoding can be provided using the property PropertyBasedDocumentFactory.MetadataKeys.ENCODING.
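
The role of a default encoding can be illustrated with a minimal, self-contained sketch. It does not use the MG4J classes: the enum Key, the class EncodingFallbackSketch and the methods resolve and encodingFor are hypothetical stand-ins. The sketch assumes that a per-document metadata value takes precedence over the factory default, and that UTF-8 is used when neither is set; both are assumptions of the sketch, not documented behaviour of this class.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.EnumMap;
import java.util.Map;

final class EncodingFallbackSketch {

    /** Hypothetical stand-in for a metadata key enum such as MetadataKeys. */
    enum Key { ENCODING, TITLE }

    /** Factory-wide defaults, copied so later changes by the caller have no effect. */
    private final Map<Key, Object> defaultMetadata = new EnumMap<>(Key.class);

    EncodingFallbackSketch(Map<Key, Object> defaults) {
        defaultMetadata.putAll(defaults);
    }

    /** Returns the per-document value if present, otherwise the factory default (may be null). */
    Object resolve(Key key, Map<Key, Object> metadata) {
        Object value = metadata.get(key);
        return value != null ? value : defaultMetadata.get(key);
    }

    /** Picks the charset used to decode a document; the UTF-8 fallback is an assumption. */
    Charset encodingFor(Map<Key, Object> metadata) {
        Object name = resolve(Key.ENCODING, metadata);
        return name != null ? Charset.forName(name.toString()) : StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        Map<Key, Object> defaults = new EnumMap<>(Key.class);
        defaults.put(Key.ENCODING, "ISO-8859-1");
        EncodingFallbackSketch factory = new EncodingFallbackSketch(defaults);

        Map<Key, Object> perDocument = new EnumMap<>(Key.class);
        System.out.println(factory.encodingFor(perDocument)); // factory default applies

        perDocument.put(Key.ENCODING, "UTF-16");
        System.out.println(factory.encodingFor(perDocument)); // per-document value wins
    }
}
```

With the real class, the default would presumably be supplied through one of the constructors listed below; the exact property syntax is not stated on this page.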

See Also:
Serialized Form

Nested Class Summary
static class TRECDocumentFactory.CollectionType
           
static class TRECDocumentFactory.Fields
           
 class TRECDocumentFactory.TRECSegmentedDocument
          A TREC document.
 
Nested classes/interfaces inherited from class it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory.MetadataKeys
 
Nested classes/interfaces inherited from interface it.unimi.dsi.mg4j.document.DocumentFactory
it.unimi.dsi.mg4j.document.DocumentFactory.FieldType
 
Field Summary
 
Fields inherited from class it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
defaultMetadata
 
Constructor Summary
TRECDocumentFactory()
           
TRECDocumentFactory(it.unimi.dsi.util.Properties properties)
           
TRECDocumentFactory(it.unimi.dsi.fastutil.objects.Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
           
TRECDocumentFactory(String[] property)
           
 
Method Summary
 TRECDocumentFactory copy()
          Returns a copy of this document factory.
 int fieldIndex(String fieldName)
           
 String fieldName(int field)
           
 it.unimi.dsi.mg4j.document.DocumentFactory.FieldType fieldType(int field)
           
 TRECDocumentFactory.TRECSegmentedDocument getDocument(InputStream rawContent, it.unimi.dsi.fastutil.objects.Reference2ObjectMap<Enum<?>,Object> metadata)
           
 it.unimi.dsi.io.WordReader getWordReader()
           
 int numberOfFields()
           
protected  boolean parseProperty(String key, String[] values, it.unimi.dsi.fastutil.objects.Reference2ObjectMap<Enum<?>,Object> metadata)
           
 void setCollectionType(TRECDocumentFactory.CollectionType t)
          Sets the type of the underlying collection (e.g. standard TREC collection, WARC collection).
 
Methods inherited from class it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey
 
Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentFactory
ensureFieldIndex, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

TRECDocumentFactory

public TRECDocumentFactory(it.unimi.dsi.util.Properties properties)
                    throws org.apache.commons.configuration.ConfigurationException
Throws:
org.apache.commons.configuration.ConfigurationException

TRECDocumentFactory

public TRECDocumentFactory(it.unimi.dsi.fastutil.objects.Reference2ObjectMap<Enum<?>,Object> defaultMetadata)

TRECDocumentFactory

public TRECDocumentFactory(String[] property)
                    throws org.apache.commons.configuration.ConfigurationException
Throws:
org.apache.commons.configuration.ConfigurationException

TRECDocumentFactory

public TRECDocumentFactory()

Method Detail

parseProperty

protected boolean parseProperty(String key,
                                String[] values,
                                it.unimi.dsi.fastutil.objects.Reference2ObjectMap<Enum<?>,Object> metadata)
                         throws org.apache.commons.configuration.ConfigurationException
Overrides:
parseProperty in class it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
Throws:
org.apache.commons.configuration.ConfigurationException

copy

public TRECDocumentFactory copy()
Returns a copy of this document factory. A new parser is allocated for the copy.
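
The reason for allocating a new parser per copy can be sketched as follows. The classes Parser and Factory below are hypothetical stand-ins, not the MG4J or TREC classes. The sketch assumes the parser keeps mutable scratch state, so each thread should work on its own copy of the factory while immutable configuration is carried over.

```java
final class FlyweightCopySketch {

    /** Hypothetical stand-in for a stateful parser that is not safe to share across threads. */
    static final class Parser {
        private final StringBuilder buffer = new StringBuilder(); // reused scratch state

        String parse(String raw) {
            buffer.setLength(0);
            buffer.append(raw.trim());
            return buffer.toString();
        }
    }

    /** Hypothetical stand-in for a document factory following the flyweight-prototype pattern. */
    static final class Factory {
        private final String encoding;              // immutable configuration, carried over to copies
        private final Parser parser = new Parser(); // mutable state, allocated once per instance

        Factory(String encoding) {
            this.encoding = encoding;
        }

        /** Returns a copy with the same configuration and a freshly allocated parser. */
        Factory copy() {
            return new Factory(encoding);
        }

        String encoding() {
            return encoding;
        }

        Parser parser() {
            return parser;
        }
    }

    public static void main(String[] args) {
        Factory prototype = new Factory("UTF-8");
        Factory perThread = prototype.copy();
        // Same configuration, distinct parser instances.
        System.out.println(perThread.encoding().equals(prototype.encoding()));
        System.out.println(perThread.parser() != prototype.parser());
    }
}
```

A typical use under these assumptions is to keep one prototype factory and call copy() once per worker thread, so that no two threads ever share a parser.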


numberOfFields

public int numberOfFields()

fieldName

public String fieldName(int field)

fieldIndex

public int fieldIndex(String fieldName)

fieldType

public it.unimi.dsi.mg4j.document.DocumentFactory.FieldType fieldType(int field)
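
The four field methods above are left undocumented on this page. As a rough illustration of how such methods usually relate to one another, here is a self-contained sketch. The class FieldLookupSketch, the field names "title" and "body", their order, the single TEXT type, the -1 result for unknown names and the IndexOutOfBoundsException for bad indices are all assumptions of the sketch, not confirmed behaviour of TRECDocumentFactory.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class FieldLookupSketch {

    /** Hypothetical stand-in for a field type enum; only one constant is needed here. */
    enum FieldType { TEXT }

    /** Assumed field names and order; the real names are not given in this documentation. */
    private static final List<String> FIELD_NAMES =
            Collections.unmodifiableList(Arrays.asList("title", "body"));

    int numberOfFields() {
        return FIELD_NAMES.size();
    }

    /** Maps an index to its symbolic name, rejecting out-of-range indices. */
    String fieldName(int field) {
        ensureFieldIndex(field);
        return FIELD_NAMES.get(field);
    }

    /** Maps a symbolic name back to its index, or -1 if the name is unknown. */
    int fieldIndex(String fieldName) {
        return FIELD_NAMES.indexOf(fieldName);
    }

    /** Every field in this sketch is plain text. */
    FieldType fieldType(int field) {
        ensureFieldIndex(field);
        return FieldType.TEXT;
    }

    private void ensureFieldIndex(int field) {
        if (field < 0 || field >= numberOfFields()) {
            throw new IndexOutOfBoundsException("No such field: " + field);
        }
    }

    public static void main(String[] args) {
        FieldLookupSketch factory = new FieldLookupSketch();
        // Enumerate all fields; name and index lookups are inverses of each other.
        for (int i = 0; i < factory.numberOfFields(); i++) {
            String name = factory.fieldName(i);
            System.out.println(i + " " + name + " " + factory.fieldType(i)
                    + " " + (factory.fieldIndex(name) == i));
        }
    }
}
```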

getDocument

public TRECDocumentFactory.TRECSegmentedDocument getDocument(InputStream rawContent,
                                                             it.unimi.dsi.fastutil.objects.Reference2ObjectMap<Enum<?>,Object> metadata)
                                                      throws IOException
Throws:
IOException

getWordReader

public it.unimi.dsi.io.WordReader getWordReader()

setCollectionType

public void setCollectionType(TRECDocumentFactory.CollectionType t)
Sets the type of the underlying collection (e.g. standard TREC collection, WARC collection).

Parameters:
t - the collection type


Copyright © 2011. All Rights Reserved.