| AddPrecedingLabelsFilter |
Adds the labels of the preceding block to the current block, optionally adding a prefix.
|
| ArticleExtractor |
A full-text extractor which is tuned towards news articles.
|
| ArticleMetadataFilter |
|
| ArticleSentencesExtractor |
A full-text extractor which is tuned towards extracting sentences from news articles.
|
| BlockProximityFusion |
Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit.
|
| BoilerpipeDocumentSource |
|
| BoilerpipeExtractor |
Describes a complete filter pipeline.
|
| BoilerpipeFilter |
|
| BoilerpipeHTMLContentHandler |
|
| BoilerpipeHTMLParser |
|
| BoilerpipeInput |
|
| BoilerpipeProcessingException |
Exception for signaling failure in the processing pipeline.
|
| BoilerpipeSAXInput |
|
| BoilerplateBlockFilter |
Removes TextBlocks which have explicitly been marked as "not content".
|
| CanolaExtractor |
|
| CommonExtractors |
|
| CommonTagActions |
Defines an action that is to be performed whenever a particular tag occurs during HTML parsing.
|
| CommonTagActions.BlockTagLabelAction |
|
| CommonTagActions.Chained |
|
| CommonTagActions.InlineTagLabelAction |
|
| ConditionalLabelAction |
Adds labels to a TextBlock if the given criteria are met.
|
| ContentFusion |
|
| DefaultExtractor |
A quite generic full-text extractor.
|
| DefaultLabels |
|
| DefaultTagActionMap |
|
| DensityRulesClassifier |
Classifies TextBlocks as content/not-content through rules that have
been determined using the C4.8 machine learning algorithm, as described in the
paper "Boilerplate Detection using Shallow Text Features", particularly using
text densities and link densities.
|
| DocumentTitleMatchClassifier |
Marks TextBlocks which contain parts of the HTML
<TITLE> tag, using some heuristics which are quite
specific to the news domain.
|
| ExpandTitleToContentFilter |
|
| ExtractorBase |
The base class of Extractors.
|
| Grabber |
|
| GrabberRecord |
|
| HTMLDocument |
|
| HTMLElements |
Collection of HTML element information.
|
| HTMLElements.Element |
Element information.
|
| HTMLElements.ElementList |
Unsynchronized list of elements.
|
| HTMLFetcher |
A very simple HTTP/HTML fetcher, really just for demo purposes.
|
| HTMLHighlighter |
Highlights text blocks in an HTML document that have been marked as "content"
in the corresponding TextDocument.
|
| HTMLTagBalancer |
Balances tags in an HTML document.
|
| HTMLTagBalancer.Info |
Element info for each start element.
|
| HTMLTagBalancer.InfoStack |
Unsynchronized stack of element information.
|
| IgnoreBlocksAfterContentFilter |
|
| IgnoreBlocksAfterContentFromEndFilter |
|
| InputSourceable |
An InputSourceable can return an arbitrary number of new InputSources for a given document.
|
| InvertedFilter |
Inverts the "isContent" flag for all TextBlocks.
|
| KeepEverythingExtractor |
Marks everything as content.
|
| KeepEverythingWithMinKWordsExtractor |
A full-text extractor which marks everything as content, keeping only blocks that contain at least k words.
|
| KeepLargestBlockFilter |
Keeps the largest TextBlock only (by the number of words).
|
| KeepLargestFulltextBlockFilter |
Keeps the largest TextBlock only (by the number of words).
|
| LabelAction |
|
| LabelFusion |
Fuses adjacent blocks if their labels are equal.
|
| LabelToBoilerplateFilter |
Marks all blocks that contain a given label as "boilerplate".
|
| LabelToContentFilter |
Marks all blocks that contain a given label as "content".
|
| LargestContentExtractor |
A full-text extractor which extracts the largest text component of a page.
|
| MarkEverythingContentFilter |
Marks all blocks as content.
|
| MarkupTagAction |
Assigns labels for element CSS classes and ids to the corresponding
TextBlock.
|
| MinClauseWordsFilter |
Keeps only blocks that have at least one segment fragment ("clause") with at
least k words (default: 5).
|
| MinFulltextWordsFilter |
Keeps only those content blocks which contain at least k full-text words
(measured by HeuristicFilterBase.getNumFullTextWords(TextBlock)).
|
| MinWordsFilter |
Keeps only those content blocks which contain at least k words.
|
| NumWordsRulesClassifier |
Classifies TextBlocks as content/not-content through rules that have
been determined using the C4.8 machine learning algorithm, as described in
the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010),
particularly using number of words per block and link density per block.
|
| NumWordsRulesExtractor |
A quite generic full-text extractor solely based upon the number of words per
block (the current, the previous and the next block).
|
| SimpleBlockFusionProcessor |
Merges two subsequent blocks if their text densities are equal.
|
| SimpleEstimator |
|
| SplitParagraphBlocksFilter |
Splits TextBlocks at paragraph boundaries.
|
| SurroundingToContentFilter |
|
| TagAction |
Defines an action that is to be performed whenever a particular tag occurs
during HTML parsing.
|
| TagActionMap |
Base class for defining a set of TagActions that are to be used for the
HTML parsing process.
|
| TerminatingBlocksFinder |
|
| TextBlock |
Describes a block of text.
|
| TextBlockCondition |
Evaluates whether a given TextBlock meets a certain condition.
|
| TextDocument |
A text document, consisting of one or more TextBlocks.
|
| TextDocumentStatistics |
Provides shallow statistics on a given TextDocument.
|
| UnicodeTokenizer |
Tokenizes text according to Unicode word boundaries and strips off non-word
characters.
|
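
Many of the classes above are filters that set or clear a per-block "content" flag, and an extractor chains several of them into a pipeline. The following is a minimal, self-contained sketch of that idea. It does not use the library: `FilterSketch`, `Block`, `BlockFilter`, `minWords`, `keepLargest`, `invert`, `runPipeline` and `contentText` are hypothetical names introduced for illustration, loosely modelled on the table's descriptions of MinWordsFilter, KeepLargestBlockFilter and InvertedFilter. The whitespace-based word count is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the library's classes.
final class FilterSketch {

    // A block of text with a content flag (loosely modelled on TextBlock).
    static final class Block {
        final String text;
        boolean isContent;

        Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }

        // Assumption: words are whitespace-separated tokens.
        int numWords() {
            String t = text.trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;
        }
    }

    // Assumed filter contract: returns true if any block was changed.
    interface BlockFilter {
        boolean process(List<Block> blocks);
    }

    // Clears the content flag on content blocks with fewer than k words.
    static BlockFilter minWords(int k) {
        return blocks -> {
            boolean changed = false;
            for (Block b : blocks) {
                if (b.isContent && b.numWords() < k) {
                    b.isContent = false;
                    changed = true;
                }
            }
            return changed;
        };
    }

    // Keeps only the content block with the most words.
    static BlockFilter keepLargest() {
        return blocks -> {
            Block largest = null;
            for (Block b : blocks) {
                if (b.isContent && (largest == null || b.numWords() > largest.numWords())) {
                    largest = b;
                }
            }
            boolean changed = false;
            for (Block b : blocks) {
                if (b.isContent && b != largest) {
                    b.isContent = false;
                    changed = true;
                }
            }
            return changed;
        };
    }

    // Flips the content flag on every block.
    static BlockFilter invert() {
        return blocks -> {
            for (Block b : blocks) {
                b.isContent = !b.isContent;
            }
            return !blocks.isEmpty();
        };
    }

    // Runs the filters in order and reports whether any of them changed a block.
    static boolean runPipeline(List<Block> blocks, BlockFilter... filters) {
        boolean changed = false;
        for (BlockFilter f : filters) {
            changed |= f.process(blocks);
        }
        return changed;
    }

    // Concatenates the text of all blocks still flagged as content.
    static String contentText(List<Block> blocks) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            if (b.isContent) {
                if (sb.length() > 0) sb.append('\n');
                sb.append(b.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("Home About", true));
        blocks.add(new Block("This is the main article text with many words", true));
        blocks.add(new Block("Copyright notice", true));
        runPipeline(blocks, minWords(3), keepLargest());
        System.out.println(contentText(blocks));
    }
}
```

Having each filter return a "changed" flag lets a pipeline report whether it did anything at all, which is one plausible reason for a boolean-returning filter contract. The real classes may differ in detail.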