java.lang.Object
  it.unimi.dsi.parser.callback.DefaultCallback
    it.unimi.dsi.parser.callback.TextExtractor

public class TextExtractor extends DefaultCallback
A callback extracting text and titles.
This callback extracts all the text in the page, together with the title.
The resulting text is available through text, and the title through title.
Note that text and title are never trimmed.
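The description above can be illustrated with a small self-contained sketch. It is a hypothetical stand-in, not the dsiutils implementation: SketchExtractor, SketchExtractorDemo, the use of plain strings for element names, and the assumption that title text is also included in text are all illustrative choices. It shows text and title accumulating separately, and that trimming is left to the caller.

```java
// Hypothetical stand-in for illustration; this is NOT the dsiutils TextExtractor.
// Element names are plain strings here, an assumption made to keep the sketch self-contained.
final class SketchExtractor {
    final StringBuilder text = new StringBuilder();   // every character run seen so far
    final StringBuilder title = new StringBuilder();  // character runs seen inside <title>
    private boolean inTitle;

    void startDocument() {
        text.setLength(0);
        title.setLength(0);
        inTitle = false;
    }

    void startElement(String name) {
        if ("title".equals(name)) inTitle = true;
    }

    void endElement(String name) {
        if ("title".equals(name)) inTitle = false;
    }

    void characters(char[] data, int offset, int length) {
        text.append(data, offset, length);
        if (inTitle) title.append(data, offset, length);
    }
}

final class SketchExtractorDemo {
    // Feeds the events a parser might emit for: <title> Hello </title><p>World</p>
    static SketchExtractor run() {
        SketchExtractor e = new SketchExtractor();
        e.startDocument();
        e.startElement("title");
        char[] t = " Hello ".toCharArray();
        e.characters(t, 0, t.length);
        e.endElement("title");
        e.startElement("p");
        char[] p = "World".toCharArray();
        e.characters(p, 0, p.length);
        e.endElement("p");
        return e;
    }

    public static void main(String[] args) {
        SketchExtractor e = run();
        System.out.println("[" + e.title + "]");                   // untrimmed title
        System.out.println("[" + e.title.toString().trim() + "]"); // trimmed by the caller
    }
}
```

Because nothing is trimmed during extraction, leading and trailing whitespace from the page survives in both buffers; a caller that needs clean values trims them afterwards, as the second print statement does.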
| Field Summary | |
|---|---|
| MutableString | text: The text resulting from the parsing process. |
| MutableString | title: The title resulting from the parsing process. |
| Fields inherited from interface it.unimi.dsi.parser.callback.Callback |
|---|
| EMPTY_CALLBACK_ARRAY |
| Constructor Summary |
|---|
| TextExtractor() |
| Method Summary | |
|---|---|
| boolean | characters(char[] characters, int offset, int length, boolean flowBroken): Receive notification of character data inside an element. |
| void | configure(BulletParser parser): Configure the parser to parse text. |
| boolean | endElement(Element element): Receive notification of the end of an element. |
| void | startDocument(): Receive notification of the beginning of the document. |
| boolean | startElement(Element element, Map<Attribute,MutableString> attrMapUnused): Receive notification of the start of an element. |
| Methods inherited from class it.unimi.dsi.parser.callback.DefaultCallback |
|---|
| cdata, endDocument, getInstance |

| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|

public final MutableString text
The text resulting from the parsing process.

public final MutableString title
The title resulting from the parsing process.
| Constructor Detail |
|---|
public TextExtractor()
| Method Detail |
|---|
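public void configure(BulletParser parser)
Configure the parser to parse text.
Specified by: configure in interface Callback
Overrides: configure in class DefaultCallback

public void startDocument()
Description copied from interface: Callback
Receive notification of the beginning of the document.
The callback must use this method to reset its internal state so that it can be reused. It must be safe to invoke this method several times.
Specified by: startDocument in interface Callback
Overrides: startDocument in class DefaultCallback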
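
The reset contract above can be sketched in isolation. The class below is a hypothetical callback-like object (ResettableCollector and its members are illustrative names, not part of dsiutils): startDocument() clears everything accumulated for the previous document, and calling it repeatedly leaves the same empty state.

```java
// Hypothetical callback-like class; the names are illustrative and not part of dsiutils.
final class ResettableCollector {
    final StringBuilder collected = new StringBuilder();
    private int runs; // number of character runs seen in the current document

    // Clears everything accumulated for the previous document.
    // Idempotent: calling it twice in a row leaves the same empty state.
    void startDocument() {
        collected.setLength(0);
        runs = 0;
    }

    void characters(char[] data, int offset, int length) {
        collected.append(data, offset, length);
        runs++;
    }

    int runs() {
        return runs;
    }
}
```

Resetting in place, rather than allocating new buffers, is one way to let a single instance be reused across many documents; whether that matters depends on how many documents are parsed.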

public boolean characters(char[] characters, int offset, int length, boolean flowBroken)
Description copied from interface: Callback
Receive notification of character data inside an element.
You must not write into the characters array, as it could be passed around to many callbacks.
flowBroken will be true iff the flow was broken before the text. This feature makes it possible to quickly extract the text in a document without looking at the elements.
Specified by: characters in interface Callback
Overrides: characters in class DefaultCallback
Parameters:
characters - an array containing the character data.
offset - the start position in the array.
length - the number of characters to read from the array.
flowBroken - whether the flow is broken at the start of the text.
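
As a sketch of how the parameters above fit together, the helper below copies the slice starting at offset and spanning length characters into a separate buffer, treating the array as read-only. SliceAppender is a hypothetical name, and inserting a space when flowBroken is true is an illustrative choice, not documented behavior of TextExtractor.

```java
// Hypothetical helper; not part of dsiutils.
final class SliceAppender {
    // Appends data[offset .. offset+length) to out, reading the array without modifying it.
    // Inserting a space on a broken flow is an illustrative choice, not documented behavior.
    static void append(StringBuilder out, char[] data, int offset, int length, boolean flowBroken) {
        if (flowBroken && out.length() > 0) out.append(' ');
        out.append(data, offset, length);
    }
}
```

Copying the slice instead of keeping a reference to the array matters because the same array may be handed to other callbacks.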

public boolean endElement(Element element)
Description copied from interface: Callback
Receive notification of the end of an element.
This method will never be called for elements without closing tags, even if such a tag is found.
Specified by: endElement in interface Callback
Overrides: endElement in class DefaultCallback
Parameters:
element - the element whose closing tag was found.

public boolean startElement(Element element, Map<Attribute,MutableString> attrMapUnused)
Description copied from interface: Callback
Receive notification of the start of an element.
For simple elements, this is the only notification that the callback will ever receive.
Specified by: startElement in interface Callback
Overrides: startElement in class DefaultCallback
Parameters:
element - the element whose opening tag was found.
attrMapUnused - a map from Attributes to MutableStrings.