java.lang.Object
  uk.ac.gla.dcs.renaissance.mg4j.trec.TRECSegmentedTextExtractor
public class TRECSegmentedTextExtractor
A callback that extracts text and titles from TREC documents. This implementation also keeps track of the different segments (paragraphs) within the text, and stores the positions of elements and tags.
This callback extracts all text in the page, along with the title. The resulting
text is available through text, and the title through title.
The segments of the resulting text are also preserved.
To access all segments, use #getSegmentIterator(). Within
#SegmentIterator, invoke #getSentenceIterator() to access all
sentences of the segment.
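The callback pattern described above can be illustrated with a minimal, self-contained sketch. It does not use the MG4J or dsiutils classes: SegmentedTextSketch, its segments() method, the use of plain String element names, and the choice of "p" as the segment delimiter are all assumptions made for illustration, not the behaviour of TRECSegmentedTextExtractor itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a segment-tracking text extractor; not the real class. */
class SegmentedTextSketch {
    /** Body text accumulated during parsing. */
    public final StringBuilder text = new StringBuilder();
    /** Title text accumulated during parsing. */
    public final StringBuilder title = new StringBuilder();
    /** One {start, end} offset pair into text per closed segment. */
    private final List<int[]> segments = new ArrayList<>();
    private boolean inTitle;
    private int segmentStart = -1;

    /** Resets all state so the instance can be reused for another document. */
    public void startDocument() {
        text.setLength(0);
        title.setLength(0);
        segments.clear();
        inTitle = false;
        segmentStart = -1;
    }

    /** Assumption: "title" switches the target buffer, "p" opens a segment. */
    public boolean startElement(String name) {
        if (name.equals("title")) inTitle = true;
        else if (name.equals("p")) segmentStart = text.length();
        return true; // true means "keep parsing"
    }

    /** Appends character data to the title or to the body text. */
    public boolean characters(char[] chars, int offset, int length) {
        (inTitle ? title : text).append(chars, offset, length);
        return true;
    }

    /** Closes the title or records the offsets of the segment just finished. */
    public boolean endElement(String name) {
        if (name.equals("title")) inTitle = false;
        else if (name.equals("p") && segmentStart >= 0) {
            segments.add(new int[] { segmentStart, text.length() });
            segmentStart = -1;
        }
        return true;
    }

    /** Returns the text of each recorded segment, in document order. */
    public List<String> segments() {
        List<String> result = new ArrayList<>();
        for (int[] s : segments) result.add(text.substring(s[0], s[1]));
        return result;
    }

    private static void feed(SegmentedTextSketch cb, String s) {
        char[] c = s.toCharArray();
        cb.characters(c, 0, c.length);
    }

    /** Replays the events a parser might emit for a tiny two-paragraph document. */
    public static SegmentedTextSketch demo() {
        SegmentedTextSketch cb = new SegmentedTextSketch();
        cb.startDocument();
        cb.startElement("title"); feed(cb, "Doc"); cb.endElement("title");
        cb.startElement("p"); feed(cb, "First."); cb.endElement("p");
        cb.startElement("p"); feed(cb, "Second one."); cb.endElement("p");
        return cb;
    }

    public static void main(String[] args) {
        SegmentedTextSketch cb = demo();
        System.out.println("title: " + cb.title);
        System.out.println("segments: " + cb.segments());
    }
}
```

Storing segments as offset pairs into a single text buffer, rather than as separate strings, keeps the full text available as one sequence while still allowing each segment to be recovered; whether the real class uses the same representation is not stated in this page.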
| Field Summary | |
|---|---|
| it.unimi.dsi.lang.MutableString | text: The text resulting from the parsing process. |
| it.unimi.dsi.lang.MutableString | title: The title resulting from the parsing process. |
| Fields inherited from interface it.unimi.dsi.parser.callback.Callback |
|---|
| EMPTY_CALLBACK_ARRAY |
| Constructor Summary | |
|---|---|
| TRECSegmentedTextExtractor() | |
| Method Summary | |
|---|---|
| boolean | cdata(it.unimi.dsi.parser.Element element, char[] text, int offset, int length) |
| boolean | characters(char[] characters, int offset, int length, boolean flowBroken) |
| void | configure(it.unimi.dsi.parser.BulletParser parser) |
| void | endDocument() |
| boolean | endElement(it.unimi.dsi.parser.Element element) |
| List<TagPointer> | getPointers() |
| it.unimi.dsi.lang.MutableString | getText(): Returns the text. |
| void | startDocument() |
| boolean | startElement(it.unimi.dsi.parser.Element element, Map<it.unimi.dsi.parser.Attribute,it.unimi.dsi.lang.MutableString> attrMapUnused) |
| Iterator<TagPointer> | tagPointer(): Returns the tag pointers. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public final it.unimi.dsi.lang.MutableString text
public final it.unimi.dsi.lang.MutableString title
| Constructor Detail |
|---|
public TRECSegmentedTextExtractor()
| Method Detail |
|---|
public Iterator<TagPointer> tagPointer()
public void startDocument()
Specified by: startDocument in interface it.unimi.dsi.parser.callback.Callback
public List<TagPointer> getPointers()
public boolean characters(char[] characters, int offset, int length, boolean flowBroken)
Specified by: characters in interface it.unimi.dsi.parser.callback.Callback
public boolean endElement(it.unimi.dsi.parser.Element element)
Specified by: endElement in interface it.unimi.dsi.parser.callback.Callback
public boolean startElement(it.unimi.dsi.parser.Element element, Map<it.unimi.dsi.parser.Attribute,it.unimi.dsi.lang.MutableString> attrMapUnused)
Specified by: startElement in interface it.unimi.dsi.parser.callback.Callback
public it.unimi.dsi.lang.MutableString getText()
public boolean cdata(it.unimi.dsi.parser.Element element, char[] text, int offset, int length)
Specified by: cdata in interface it.unimi.dsi.parser.callback.Callback
public void endDocument()
Specified by: endDocument in interface it.unimi.dsi.parser.callback.Callback
public void configure(it.unimi.dsi.parser.BulletParser parser)
Specified by: configure in interface it.unimi.dsi.parser.callback.Callback