uk.ac.gla.dcs.renaissance.mg4j.trec
Class TRECSegmentedTextExtractor

java.lang.Object
  extended by uk.ac.gla.dcs.renaissance.mg4j.trec.TRECSegmentedTextExtractor
All Implemented Interfaces:
it.unimi.dsi.parser.callback.Callback

public class TRECSegmentedTextExtractor
extends Object
implements it.unimi.dsi.parser.callback.Callback

A callback extracting text and titles for TREC documents. This implementation also keeps track of the different segments (paragraphs) within the text. It also stores the positions of elements and tags.

This callback extracts all text in the page, as well as the title. The resulting text is available through text, and the title through title. Furthermore, the segments of the resulting text are preserved.

To get access to all segments, use #getSegmentIterator(). On the resulting #SegmentIterator, invoke #getSentenceIterator() to access all sentences of the segment.
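The segment bookkeeping described above can be illustrated with a minimal, self-contained sketch. It uses only the JDK and does not call any MG4J or dsiutils class: SegmentSketch, its fields, and its text() and segments() methods are hypothetical names chosen for illustration. Only the parameter list of characters(...) is taken from the method summary of this class. The rule that a broken flow starts a new segment is an assumption, not a statement about how TRECSegmentedTextExtractor is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in (NOT the MG4J class): collects text and remembers where segments start. */
class SegmentSketch {
    private final StringBuilder text = new StringBuilder();
    // Start offset (into text) of each segment seen so far.
    private final List<Integer> starts = new ArrayList<>();
    private boolean newSegment = true;

    /** Same parameter list as characters(...) in the method summary; returns true to keep parsing. */
    boolean characters(char[] chars, int offset, int length, boolean flowBroken) {
        if (length == 0) return true;
        // Assumption: a break in the text flow marks a paragraph (segment) boundary.
        if (flowBroken) newSegment = true;
        if (newSegment) {
            if (text.length() > 0) text.append(' '); // separator between segments
            starts.add(text.length());
            newSegment = false;
        }
        text.append(chars, offset, length);
        return true;
    }

    /** The accumulated text; like the real extractor's text, it is not trimmed here. */
    String text() {
        return text.toString();
    }

    /** Cuts the accumulated text at the recorded start offsets. */
    List<String> segments() {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < starts.size(); i++) {
            int from = starts.get(i);
            int to = i + 1 < starts.size() ? starts.get(i + 1) : text.length();
            out.add(text.substring(from, to).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        SegmentSketch s = new SegmentSketch();
        char[] a = "First paragraph.".toCharArray();
        char[] b = "Second paragraph.".toCharArray();
        s.characters(a, 0, a.length, false);
        s.characters(b, 0, b.length, true); // flow broken: starts a second segment
        System.out.println(s.segments()); // prints [First paragraph., Second paragraph.]
    }
}
```

Storing offsets rather than copies of each segment keeps the accumulated text as the single source of truth, which is one plausible way to keep segments and stored positions consistent with each other.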

Author:
Ingo Frommholz <ingo@dcs.gla.ac.uk>

Field Summary
 it.unimi.dsi.lang.MutableString text
          The text resulting from the parsing process.
 it.unimi.dsi.lang.MutableString title
          The title resulting from the parsing process.
 
Fields inherited from interface it.unimi.dsi.parser.callback.Callback
EMPTY_CALLBACK_ARRAY
 
Constructor Summary
TRECSegmentedTextExtractor()
           
 
Method Summary
 boolean cdata(it.unimi.dsi.parser.Element element, char[] text, int offset, int length)
           
 boolean characters(char[] characters, int offset, int length, boolean flowBroken)
           
 void configure(it.unimi.dsi.parser.BulletParser parser)
           
 void endDocument()
           
 boolean endElement(it.unimi.dsi.parser.Element element)
           
 List<TagPointer> getPointers()
           
 it.unimi.dsi.lang.MutableString getText()
          Returns the text.
 void startDocument()
           
 boolean startElement(it.unimi.dsi.parser.Element element, Map<it.unimi.dsi.parser.Attribute,it.unimi.dsi.lang.MutableString> attrMapUnused)
           
 Iterator<TagPointer> tagPointer()
          Returns the tag pointers.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

text

public final it.unimi.dsi.lang.MutableString text
The text resulting from the parsing process.


title

public final it.unimi.dsi.lang.MutableString title
The title resulting from the parsing process.

Constructor Detail

TRECSegmentedTextExtractor

public TRECSegmentedTextExtractor()
Method Detail

tagPointer

public Iterator<TagPointer> tagPointer()
Returns the tag pointers.

Returns:
Iterator over tag pointers

startDocument

public void startDocument()
Specified by:
startDocument in interface it.unimi.dsi.parser.callback.Callback

getPointers

public List<TagPointer> getPointers()

characters

public boolean characters(char[] characters,
                          int offset,
                          int length,
                          boolean flowBroken)
Specified by:
characters in interface it.unimi.dsi.parser.callback.Callback

endElement

public boolean endElement(it.unimi.dsi.parser.Element element)
Specified by:
endElement in interface it.unimi.dsi.parser.callback.Callback

startElement

public boolean startElement(it.unimi.dsi.parser.Element element,
                            Map<it.unimi.dsi.parser.Attribute,it.unimi.dsi.lang.MutableString> attrMapUnused)
Specified by:
startElement in interface it.unimi.dsi.parser.callback.Callback

getText

public it.unimi.dsi.lang.MutableString getText()
Returns the text. Note that the text may not be trimmed.

Returns:
the text

cdata

public boolean cdata(it.unimi.dsi.parser.Element element,
                     char[] text,
                     int offset,
                     int length)
Specified by:
cdata in interface it.unimi.dsi.parser.callback.Callback

endDocument

public void endDocument()
Specified by:
endDocument in interface it.unimi.dsi.parser.callback.Callback

configure

public void configure(it.unimi.dsi.parser.BulletParser parser)
Specified by:
configure in interface it.unimi.dsi.parser.callback.Callback


Copyright © 2011. All Rights Reserved.