| DensityRulesClassifier |
Classifies TextBlocks as content or non-content using rules derived with the
C4.8 machine learning algorithm, as described in the paper "Boilerplate
Detection using Shallow Text Features" (WSDM 2010). The rules rely primarily
on text density and link density.
|
| IgnoreBlocksAfterContentFilter |
|
| IgnoreBlocksAfterContentFromEndFilter |
|
| KeepLargestFulltextBlockFilter |
Keeps only the largest TextBlock, measured by number of words.
|
| MinFulltextWordsFilter |
Keeps only those content blocks that contain at least k full-text words
(as measured by HeuristicFilterBase.getNumFullTextWords(TextBlock)).
|
| NumWordsRulesClassifier |
Classifies TextBlocks as content or non-content using rules derived with the
C4.8 machine learning algorithm, as described in the paper "Boilerplate
Detection using Shallow Text Features" (WSDM 2010). The rules rely primarily
on the number of words per block and the link density per block.
|
| TerminatingBlocksFinder |
|
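
The behaviour described for KeepLargestFulltextBlockFilter and MinFulltextWordsFilter can be illustrated with a minimal, library-independent sketch. Everything here is an assumption made for illustration: `BlockFilterSketch`, `Block`, `keepLargest`, and `keepMinWords` are hypothetical names, not part of the library, and the word count is a plain whitespace split rather than the "full-text words" measure computed by HeuristicFilterBase.getNumFullTextWords(TextBlock). The sketch also assumes that only blocks already flagged as content compete, and that ties go to the first block.

```java
import java.util.Arrays;
import java.util.List;

public class BlockFilterSketch {

    /** Hypothetical stand-in for a TextBlock: its text plus a content flag. */
    static final class Block {
        final String text;
        boolean isContent;

        Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }

        /** Counts whitespace-separated tokens (a simplification, not the library's measure). */
        int numWords() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
    }

    /** Keeps only the largest content block by word count; returns true if any flag changed. */
    static boolean keepLargest(List<Block> blocks) {
        Block largest = null;
        for (Block b : blocks) {
            // Strict comparison: on a tie, the earlier block stays the largest.
            if (b.isContent && (largest == null || b.numWords() > largest.numWords())) {
                largest = b;
            }
        }
        boolean changed = false;
        for (Block b : blocks) {
            if (b.isContent && b != largest) {
                b.isContent = false;
                changed = true;
            }
        }
        return changed;
    }

    /** Demotes content blocks with fewer than minWords words; returns true if any flag changed. */
    static boolean keepMinWords(List<Block> blocks, int minWords) {
        boolean changed = false;
        for (Block b : blocks) {
            if (b.isContent && b.numWords() < minWords) {
                b.isContent = false;
                changed = true;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block("Home About Contact", true),
                new Block("This is the long main article text of the page", true),
                new Block("Copyright notice", false));
        keepLargest(blocks);
        for (Block b : blocks) {
            System.out.println(b.isContent + "\t" + b.text);
        }
    }
}
```

Both methods return whether any block changed, so a caller chaining several such steps can tell whether a step had an effect. Neither method removes blocks; they only clear the content flag, which is one plausible reading of "keeps" in the descriptions above.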