Package de.l3s.boilerpipe.extractors
Class ExtractorBase
- java.lang.Object
- de.l3s.boilerpipe.extractors.ExtractorBase
- All Implemented Interfaces:
BoilerpipeExtractor, BoilerpipeFilter
- Direct Known Subclasses:
ArticleExtractor, ArticleSentencesExtractor, CanolaExtractor, DefaultExtractor, KeepEverythingExtractor, KeepEverythingWithMinKWordsExtractor, LargestContentExtractor, NumWordsRulesExtractor
public abstract class ExtractorBase extends java.lang.Object implements BoilerpipeExtractor
The base class of Extractors. Also provides some helper methods to quickly retrieve the text that remained after processing.
Constructor Summary
- ExtractorBase()
Method Summary
- java.lang.String getText(TextDocument doc): Extracts text from the given TextDocument object.
- java.lang.String getText(java.io.Reader r): Extracts text from the HTML code available from the given Reader.
- java.lang.String getText(java.lang.String html): Extracts text from the HTML code given as a String.
- java.lang.String getText(java.net.URL url): Extracts text from the HTML code available from the given URL.
- java.lang.String getText(org.xml.sax.InputSource is): Extracts text from the HTML code available from the given InputSource.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface de.l3s.boilerpipe.BoilerpipeFilter
process
Method Detail
getText
public java.lang.String getText(java.lang.String html) throws BoilerpipeProcessingException
Extracts text from the HTML code given as a String.
- Specified by:
getText in interface BoilerpipeExtractor
- Parameters:
html - The HTML code as a String.
- Returns:
The extracted text.
- Throws:
BoilerpipeProcessingException
getText
public java.lang.String getText(org.xml.sax.InputSource is) throws BoilerpipeProcessingException
Extracts text from the HTML code available from the given InputSource.
- Specified by:
getText in interface BoilerpipeExtractor
- Parameters:
is - The InputSource containing the HTML.
- Returns:
The extracted text.
- Throws:
BoilerpipeProcessingException
getText
public java.lang.String getText(java.net.URL url) throws BoilerpipeProcessingException
Extracts text from the HTML code available from the given URL.
NOTE: This method is intended mainly for showcase purposes. If you are going to crawl the Web, consider using getText(InputSource) instead.
- Parameters:
url - The URL pointing to the HTML code.
- Returns:
The extracted text.
- Throws:
BoilerpipeProcessingException
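The note on getText(URL) recommends getText(InputSource) for crawling. One way to follow that advice is sketched below, assuming the crawler has already downloaded the page body as bytes and determined its character encoding. The class name InputSourceSketch and the helper toInputSource are hypothetical and not part of Boilerpipe; only the JDK class org.xml.sax.InputSource is used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

public class InputSourceSketch {

    // Hypothetical helper (not part of Boilerpipe): wraps an already-downloaded
    // page body in an InputSource and records the charset the caller determined.
    static InputSource toInputSource(byte[] body, String charsetName) {
        InputSource is = new InputSource(new ByteArrayInputStream(body));
        is.setEncoding(charsetName);
        return is;
    }

    public static void main(String[] args) {
        // Stand-in for a page body fetched by the crawler's own HTTP client.
        byte[] body = "<html><body><p>Hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        InputSource is = toInputSource(body, "UTF-8");
        System.out.println(is.getEncoding()); // prints "UTF-8"
    }
}
```

The resulting InputSource can then be passed to getText(InputSource) on a concrete subclass such as ArticleExtractor. Doing so should leave fetching and encoding decisions with the crawler rather than with the extractor.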
getText
public java.lang.String getText(java.io.Reader r) throws BoilerpipeProcessingException
Extracts text from the HTML code available from the given Reader.
- Specified by:
getText in interface BoilerpipeExtractor
- Parameters:
r - The Reader containing the HTML.
- Returns:
The extracted text.
- Throws:
BoilerpipeProcessingException
getText
public java.lang.String getText(TextDocument doc) throws BoilerpipeProcessingException
Extracts text from the given TextDocument object.
- Specified by:
getText in interface BoilerpipeExtractor
- Parameters:
doc - The TextDocument.
- Returns:
The extracted text.
- Throws:
BoilerpipeProcessingException