java.lang.Object
  uk.ac.gla.dcs.renaissance.mg4j.trec.TRECSegmentedTextExtractor
public class TRECSegmentedTextExtractor
A callback extracting text and titles for TREC documents. This implementation also keeps track of the different segments (paragraphs) within the text. It also stores the positions of elements and tags.
This callback extracts all text in the page, together with the title. The resulting text is available through text, and the title through title. Furthermore, the segments of the resulting text are preserved.
To access all segments, use #getSegmentIterator(). Within #SegmentIterator, invoke #getSentenceIterator() to access all sentences of the segment.
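The behaviour described above can be illustrated with a small, self-contained sketch. It does not use the MG4J or dsiutils classes: the class SegmentedExtractionSketch, its String-based event methods, the segmentStarts list, and the choice of the P element as the segment boundary are all assumptions made for illustration, and StringBuilder stands in for MutableString. It only shows one plausible way a callback can collect text, title, and segment boundaries.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in only; this is NOT the real TRECSegmentedTextExtractor. */
final class SegmentedExtractionSketch {
    /** Accumulated body text (StringBuilder stands in for MutableString). */
    final StringBuilder text = new StringBuilder();
    /** Accumulated title text. */
    final StringBuilder title = new StringBuilder();
    /** Start offset into text of each segment (assumption: one segment per P element). */
    final List<Integer> segmentStarts = new ArrayList<>();

    private boolean inTitle;

    /** Hypothetical event for an opening tag; returns true to keep parsing. */
    boolean startElement(String name) {
        if (name.equalsIgnoreCase("TITLE")) {
            inTitle = true;
        } else if (name.equalsIgnoreCase("P")) {
            // Separate consecutive segments with a single space.
            if (text.length() > 0 && text.charAt(text.length() - 1) != ' ') text.append(' ');
            segmentStarts.add(text.length());
        }
        return true;
    }

    /** Hypothetical event for a closing tag; returns true to keep parsing. */
    boolean endElement(String name) {
        if (name.equalsIgnoreCase("TITLE")) inTitle = false;
        return true;
    }

    /** Routes character data to the title or to the body text. */
    boolean characters(char[] chars, int offset, int length) {
        (inTitle ? title : text).append(chars, offset, length);
        return true;
    }

    /** Cuts the text at the recorded start offsets and returns the segments. */
    List<String> segments() {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < segmentStarts.size(); i++) {
            int from = segmentStarts.get(i);
            int to = i + 1 < segmentStarts.size() ? segmentStarts.get(i + 1) : text.length();
            out.add(text.substring(from, to).trim());
        }
        return out;
    }

    private void feed(String s) {
        char[] c = s.toCharArray();
        characters(c, 0, c.length);
    }

    /** Simulates the events a parser might emit for a tiny document. */
    static SegmentedExtractionSketch demo() {
        SegmentedExtractionSketch s = new SegmentedExtractionSketch();
        s.startElement("TITLE"); s.feed("Sample"); s.endElement("TITLE");
        s.startElement("P"); s.feed("First paragraph."); s.endElement("P");
        s.startElement("P"); s.feed("Second paragraph."); s.endElement("P");
        return s;
    }

    public static void main(String[] args) {
        SegmentedExtractionSketch s = demo();
        System.out.println("title: " + s.title);
        for (String segment : s.segments()) System.out.println("segment: " + segment);
    }
}
```

Storing only start offsets keeps the full text contiguous, so the whole text and the individual segments are both available from one buffer. Whether the real class represents segments this way is not stated in this page.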
Field Summary | |
---|---|
it.unimi.dsi.lang.MutableString | text: The text resulting from the parsing process. |
it.unimi.dsi.lang.MutableString | title: The title resulting from the parsing process. |
Fields inherited from interface it.unimi.dsi.parser.callback.Callback |
---|
EMPTY_CALLBACK_ARRAY |
Constructor Summary | |
---|---|
TRECSegmentedTextExtractor() |
Method Summary | |
---|---|
boolean | cdata(it.unimi.dsi.parser.Element element, char[] text, int offset, int length) |
boolean | characters(char[] characters, int offset, int length, boolean flowBroken) |
void | configure(it.unimi.dsi.parser.BulletParser parser) |
void | endDocument() |
boolean | endElement(it.unimi.dsi.parser.Element element) |
List<TagPointer> | getPointers() |
it.unimi.dsi.lang.MutableString | getText(): Returns the text. |
void | startDocument() |
boolean | startElement(it.unimi.dsi.parser.Element element, Map<it.unimi.dsi.parser.Attribute,it.unimi.dsi.lang.MutableString> attrMapUnused) |
Iterator<TagPointer> | tagPointer(): Returns the tag pointers. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public final it.unimi.dsi.lang.MutableString text
public final it.unimi.dsi.lang.MutableString title
Constructor Detail |
---|
public TRECSegmentedTextExtractor()
Method Detail |
---|
public Iterator<TagPointer> tagPointer()
public void startDocument()
Specified by: startDocument in interface it.unimi.dsi.parser.callback.Callback
public List<TagPointer> getPointers()
public boolean characters(char[] characters, int offset, int length, boolean flowBroken)
Specified by: characters in interface it.unimi.dsi.parser.callback.Callback
public boolean endElement(it.unimi.dsi.parser.Element element)
Specified by: endElement in interface it.unimi.dsi.parser.callback.Callback
public boolean startElement(it.unimi.dsi.parser.Element element, Map<it.unimi.dsi.parser.Attribute,it.unimi.dsi.lang.MutableString> attrMapUnused)
Specified by: startElement in interface it.unimi.dsi.parser.callback.Callback
public it.unimi.dsi.lang.MutableString getText()
Specified by: cdata in interface it.unimi.dsi.parser.callback.Callback
public boolean cdata(it.unimi.dsi.parser.Element element, char[] text, int offset, int length)
public void endDocument()
Specified by: endDocument in interface it.unimi.dsi.parser.callback.Callback
public void configure(it.unimi.dsi.parser.BulletParser parser)
Specified by: configure in interface it.unimi.dsi.parser.callback.Callback
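The methods above follow an event-driven life cycle: a document starts, element and character events arrive, and the document ends. The sketch below illustrates that order and one way tag positions could be recorded as offsets into the extracted text. It is self-contained and does not use the MG4J or dsiutils types: TagPositionSketch and TagMark are hypothetical names, TagMark is only a stand-in for TagPointer (whose fields are not documented on this page), and resetting state in startDocument() is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for the callback life cycle; NOT the MG4J/dsiutils API. */
final class TagPositionSketch {
    /** Hypothetical stand-in for TagPointer: a tag name plus an offset into the text. */
    static final class TagMark {
        final String name;
        final boolean opening;
        final int offset;

        TagMark(String name, boolean opening, int offset) {
            this.name = name;
            this.opening = opening;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return (opening ? "<" : "</") + name + ">@" + offset;
        }
    }

    final StringBuilder text = new StringBuilder();
    private final List<TagMark> marks = new ArrayList<>();

    /** Assumption: state is cleared here so one instance can serve many documents. */
    void startDocument() {
        text.setLength(0);
        marks.clear();
    }

    /** Records where in the text the opening tag occurred; true keeps parsing. */
    boolean startElement(String name) {
        marks.add(new TagMark(name, true, text.length()));
        return true;
    }

    /** Records where in the text the closing tag occurred; true keeps parsing. */
    boolean endElement(String name) {
        marks.add(new TagMark(name, false, text.length()));
        return true;
    }

    /** Appends character data to the extracted text. */
    boolean characters(char[] chars, int offset, int length) {
        text.append(chars, offset, length);
        return true;
    }

    /** Nothing to finalize in this sketch. */
    void endDocument() { }

    /** Returns a copy of the recorded tag positions. */
    List<TagMark> getPointers() {
        return new ArrayList<>(marks);
    }

    /** Simulates the event order a parser might produce for one document. */
    static TagPositionSketch demo() {
        TagPositionSketch cb = new TagPositionSketch();
        cb.startDocument();
        cb.startElement("TEXT");
        char[] body = "Hello TREC".toCharArray();
        cb.characters(body, 0, body.length);
        cb.endElement("TEXT");
        cb.endDocument();
        return cb;
    }

    public static void main(String[] args) {
        TagPositionSketch cb = demo();
        System.out.println(cb.text + " " + cb.getPointers());
    }
}
```

Recording offsets relative to the extracted text, rather than to the raw input, lets a consumer map each tag back onto the text returned by the extractor. The real TagPointer may use a different convention.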