Package de.l3s.boilerpipe.sax
Class HTMLHighlighter
- java.lang.Object
-
- de.l3s.boilerpipe.sax.HTMLHighlighter
-
public final class HTMLHighlighter extends java.lang.Object
Highlights text blocks in an HTML document that have been marked as "content" in the corresponding TextDocument.
-
-
Method Summary
Modifier and Type, Method - Description
java.lang.String getExtraStyleSheet() - Returns the extra stylesheet definition that will be inserted in the HEAD element.
java.lang.String getPostHighlight() - Returns the string that will be inserted after any highlighted HTML block.
java.lang.String getPreHighlight() - Returns the string that will be inserted before any highlighted HTML block.
boolean isOutputHighlightOnly() - If true, only HTML enclosed within highlighted content will be returned.
static HTMLHighlighter newExtractingInstance() - Creates a new HTMLHighlighter, which is set up to return only the extracted HTML text, including enclosed markup.
static HTMLHighlighter newHighlightingInstance() - Creates a new HTMLHighlighter, which is set up to return the full HTML text, with the extracted text portion highlighted.
java.lang.String process(TextDocument doc, java.lang.String origHTML) - Processes the given TextDocument and the original HTML text (as a String).
java.lang.String process(TextDocument doc, org.xml.sax.InputSource is) - Processes the given TextDocument and the original HTML text (as an InputSource).
java.lang.String process(java.net.URL url, BoilerpipeExtractor extractor)
void setExtraStyleSheet(java.lang.String extraStyleSheet) - Sets the extra stylesheet definition that will be inserted in the HEAD element.
void setOutputHighlightOnly(boolean outputHighlightOnly) - Sets whether only HTML enclosed within highlighted content will be returned, or the whole HTML document.
void setPostHighlight(java.lang.String postHighlight) - Sets the string that will be inserted after any highlighted HTML block.
void setPreHighlight(java.lang.String preHighlight) - Sets the string that will be inserted prior to any highlighted HTML block.
-
-
-
Method Detail
-
newHighlightingInstance
public static HTMLHighlighter newHighlightingInstance()
Creates a new HTMLHighlighter, which is set up to return the full HTML text, with the extracted text portion highlighted.
-
newExtractingInstance
public static HTMLHighlighter newExtractingInstance()
Creates a new HTMLHighlighter, which is set up to return only the extracted HTML text, including enclosed markup.
-
process
public java.lang.String process(TextDocument doc, java.lang.String origHTML) throws BoilerpipeProcessingException
Processes the given TextDocument and the original HTML text (as a String).
Parameters:
doc - The processed TextDocument.
origHTML - The original HTML document.
Throws:
BoilerpipeProcessingException
-
process
public java.lang.String process(TextDocument doc, org.xml.sax.InputSource is) throws BoilerpipeProcessingException
Processes the given TextDocument and the original HTML text (as an InputSource).
Parameters:
doc - The processed TextDocument.
is - The original HTML document.
Throws:
BoilerpipeProcessingException
-
process
public java.lang.String process(java.net.URL url, BoilerpipeExtractor extractor) throws java.io.IOException, BoilerpipeProcessingException, org.xml.sax.SAXException
Throws:
java.io.IOException
BoilerpipeProcessingException
org.xml.sax.SAXException
-
isOutputHighlightOnly
public boolean isOutputHighlightOnly()
If true, only HTML enclosed within highlighted content will be returned.
-
setOutputHighlightOnly
public void setOutputHighlightOnly(boolean outputHighlightOnly)
Sets whether only HTML enclosed within highlighted content will be returned, or the whole HTML document.
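The difference between the two output modes can be sketched without the library. The following is a minimal, self-contained model, not boilerpipe code: the class OutputModeSketch, the Segment type, and the render helper are hypothetical names introduced for illustration, and passing empty markers in the content-only case is a choice of this sketch rather than documented behaviour.

```java
import java.util.Arrays;
import java.util.List;

public class OutputModeSketch {

    // Hypothetical stand-in for one piece of the page; not a boilerpipe type.
    static final class Segment {
        final String html;
        final boolean isContent;

        Segment(String html, boolean isContent) {
            this.html = html;
            this.isContent = isContent;
        }
    }

    // Models the documented effect of outputHighlightOnly:
    //   true  -> emit only the segments marked as content
    //   false -> emit every segment
    // In both cases content segments are wrapped in the pre/post strings.
    static String render(List<Segment> segments, boolean outputHighlightOnly,
                         String pre, String post) {
        StringBuilder out = new StringBuilder();
        for (Segment s : segments) {
            if (s.isContent) {
                out.append(pre).append(s.html).append(post);
            } else if (!outputHighlightOnly) {
                out.append(s.html);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Segment> page = Arrays.asList(
                new Segment("<div>Menu</div>", false),
                new Segment("<p>Article text</p>", true),
                new Segment("<div>Footer</div>", false));

        // Whole document, with the content segment wrapped in illustrative markers.
        System.out.println(render(page, false, "<mark>", "</mark>"));
        // Content only; the empty markers are an assumption of this sketch.
        System.out.println(render(page, true, "", ""));
    }
}
```

In this model the first call keeps the menu and footer around the marked paragraph, while the second call returns the paragraph alone.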
-
getExtraStyleSheet
public java.lang.String getExtraStyleSheet()
Returns the extra stylesheet definition that will be inserted in the HEAD element. By default, this corresponds to a simple definition that marks text in class "x-boilerpipe-mark1" as inline text with yellow background.
-
setExtraStyleSheet
public void setExtraStyleSheet(java.lang.String extraStyleSheet)
Sets the extra stylesheet definition that will be inserted in the HEAD element. To disable, set it to the empty string: ""
Parameters:
extraStyleSheet - Plain HTML
-
getPreHighlight
public java.lang.String getPreHighlight()
Returns the string that will be inserted before any highlighted HTML block. By default, this corresponds to <span class="x-boilerpipe-mark1">
-
setPreHighlight
public void setPreHighlight(java.lang.String preHighlight)
Sets the string that will be inserted prior to any highlighted HTML block. To disable, set it to the empty string: ""
-
getPostHighlight
public java.lang.String getPostHighlight()
Returns the string that will be inserted after any highlighted HTML block. By default, this corresponds to </span>
-
setPostHighlight
public void setPostHighlight(java.lang.String postHighlight)
Sets the string that will be inserted after any highlighted HTML block. To disable, set it to the empty string: ""
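How the pre- and post-highlight strings surround a block can be illustrated with plain string handling. This is only a sketch under stated assumptions: HighlightMarkersSketch and its wrap helper are hypothetical and not part of boilerpipe; the two default strings are taken from the getter descriptions above.

```java
public class HighlightMarkersSketch {

    // Defaults as quoted in the getPreHighlight/getPostHighlight descriptions.
    static final String DEFAULT_PRE = "<span class=\"x-boilerpipe-mark1\">";
    static final String DEFAULT_POST = "</span>";

    // Hypothetical helper (not a boilerpipe method): surrounds one
    // highlighted HTML block with the given pre/post strings.
    static String wrap(String block, String pre, String post) {
        return pre + block + post;
    }

    public static void main(String[] args) {
        String block = "<p>Article text</p>";

        // With the default markers the block ends up inside the span.
        System.out.println(wrap(block, DEFAULT_PRE, DEFAULT_POST));
        // Empty strings disable the markers, leaving the block unchanged.
        System.out.println(wrap(block, "", ""));
    }
}
```

Any matching pair of strings could be passed instead of the defaults, provided the opening and closing markup stay balanced.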
-
-