Package de.jungblut.crawl.extraction
Class ArticleContentExtrator
- java.lang.Object
  - de.jungblut.crawl.extraction.ArticleContentExtrator
- All Implemented Interfaces:
Extractor<ArticleContentExtrator.ContentFetchResult>
public final class ArticleContentExtrator extends java.lang.Object implements Extractor<ArticleContentExtrator.ContentFetchResult>
Extractor for news articles. Uses Boilerpipe's ArticleExtractor to extract the largest block of text and the article title.
- Author:
- thomas.jungblut
Nested Class Summary
- static class ArticleContentExtrator.ContentFetchResult: Article content fetch result.
-
Constructor Summary
- ArticleContentExtrator()
-
Method Summary
- ArticleContentExtrator.ContentFetchResult extract(java.lang.String site): Extracts all the needed content from a given URL and returns it.
- static java.lang.String extractTitle(java.lang.String html): Extracts the title from the given HTML.
- static void main(java.lang.String[] args)
Method Detail
-
extract
public ArticleContentExtrator.ContentFetchResult extract(java.lang.String site)
Description copied from interface: Extractor
Extracts all the needed content from a given URL and returns it. Returns null if nothing should be returned or nothing could be parsed.
- Specified by:
extract in interface Extractor<ArticleContentExtrator.ContentFetchResult>
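Because extract may return null, callers need a null check before using the result. The following is a minimal sketch of one way to handle that contract. The nested Extractor interface here is a hypothetical stand-in that only mirrors the documented method shape, and tryExtract and the toy extractor are illustrative names, not part of the library.

```java
import java.util.Optional;

public class ExtractorContractSketch {

  // Hypothetical stand-in that mirrors the documented contract:
  // extract(site) returns a result, or null if nothing could be parsed.
  interface Extractor<T> {
    T extract(String site);
  }

  // Wraps the nullable result so the caller has to handle the empty case.
  static <T> Optional<T> tryExtract(Extractor<T> extractor, String site) {
    return Optional.ofNullable(extractor.extract(site));
  }

  public static void main(String[] args) {
    // Toy extractor (assumption): only "parses" strings that start with "http".
    Extractor<String> toy =
        site -> site.startsWith("http") ? "content of " + site : null;

    System.out.println(
        tryExtract(toy, "http://example.com").orElse("<nothing extracted>"));
    System.out.println(
        tryExtract(toy, "not a url").orElse("<nothing extracted>"));
  }
}
```

With the real class, the same pattern would presumably wrap a call to ArticleContentExtrator.extract, so that a page that cannot be parsed is skipped instead of causing a NullPointerException.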
-
extractTitle
public static java.lang.String extractTitle(java.lang.String html) throws org.htmlparser.util.ParserException
Extracts the title from the given HTML.
- Returns:
- never null, just an empty string if not parsable.
- Throws:
org.htmlparser.util.ParserException
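The "never null, empty string if not parsable" return contract can be illustrated with a small self-contained sketch. Note the assumptions: this version uses a plain regular expression instead of the htmlparser library that the real method relies on, so it does not throw ParserException, and the class name TitleSketch is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleSketch {

  // Matches the first <title>...</title> element, case-insensitively,
  // allowing the title text to span several lines.
  private static final Pattern TITLE = Pattern.compile(
      "<title[^>]*>(.*?)</title>",
      Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  // Returns the trimmed title text, or an empty string (never null)
  // when the input is null or contains no title element.
  static String extractTitle(String html) {
    if (html == null) {
      return "";
    }
    Matcher m = TITLE.matcher(html);
    return m.find() ? m.group(1).trim() : "";
  }

  public static void main(String[] args) {
    String page = "<html><head><title> Hello </title></head><body/></html>";
    System.out.println("[" + extractTitle(page) + "]");
    System.out.println("[" + extractTitle("<p>no title here</p>") + "]");
  }
}
```

Returning an empty string instead of null lets callers concatenate or store the title without an extra null check. A regular expression is only a rough approximation of real HTML parsing and is used here purely to keep the example free of dependencies.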
-
main
public static void main(java.lang.String[] args) throws java.io.IOException, java.lang.InterruptedException, java.util.concurrent.ExecutionException
- Throws:
java.io.IOException
java.lang.InterruptedException
java.util.concurrent.ExecutionException