Class KeepLargestFulltextBlockFilter
java.lang.Object
    de.l3s.boilerpipe.filters.english.KeepLargestFulltextBlockFilter
All Implemented Interfaces:
    BoilerpipeFilter
public final class KeepLargestFulltextBlockFilter extends java.lang.Object implements BoilerpipeFilter
Keeps only the largest TextBlock (by number of words). If more than one block has the same number of words, the first block is chosen. All discarded blocks are marked "not content" and flagged as DefaultLabels.MIGHT_BE_CONTENT. As opposed to KeepLargestBlockFilter, the number of words is computed using HeuristicFilterBase.getNumFullTextWords(TextBlock), which only counts words that occur in text elements with at least 9 words and are thus believed to be full text. NOTE: Without language-specific fine-tuning (i.e., running the default instance), this filter may lead to suboptimal results. Consider using KeepLargestBlockFilter instead, which works at the level of number-of-words instead of text densities.
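The selection rule described above can be illustrated with a minimal, self-contained sketch. It does not use the library: `LargestBlockSketch`, `Block`, and `keepLargest` are hypothetical names introduced only for illustration, the per-block full-text word count is assumed to be precomputed, and the two boolean fields merely model the "content" flag and the MIGHT_BE_CONTENT label.

```java
import java.util.Arrays;
import java.util.List;

public class LargestBlockSketch {

    /** Hypothetical stand-in for a text block; NOT the library's TextBlock. */
    static final class Block {
        final String text;
        final int fullTextWords;        // assumed to be precomputed
        boolean content = true;         // models the "content" flag
        boolean mightBeContent = false; // models the MIGHT_BE_CONTENT label

        Block(String text, int fullTextWords) {
            this.text = text;
            this.fullTextWords = fullTextWords;
        }
    }

    /**
     * Keeps only the block with the most full-text words.
     * On ties the first block wins. Returns true if any content flag changed.
     */
    static boolean keepLargest(List<Block> blocks) {
        Block largest = null;
        int max = -1;
        for (Block b : blocks) {
            // Strict '>' means a later block with an equal count never replaces an earlier one.
            if (b.fullTextWords > max) {
                max = b.fullTextWords;
                largest = b;
            }
        }
        if (largest == null) {
            return false; // empty input, nothing to do
        }
        boolean changed = false;
        for (Block b : blocks) {
            boolean keep = (b == largest);
            if (b.content != keep) {
                changed = true;
            }
            b.content = keep;
            if (!keep) {
                b.mightBeContent = true; // discarded blocks are only flagged, not deleted
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block("first", 12), new Block("second", 12), new Block("short", 3));
        boolean changed = keepLargest(blocks);
        for (Block b : blocks) {
            System.out.println(b.text + " content=" + b.content
                    + " mightBeContent=" + b.mightBeContent);
        }
        System.out.println("changed=" + changed);
    }
}
```

With the real library, the equivalent step would presumably be a call to process(TextDocument) on the INSTANCE field documented below.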
Field Summary
Fields
Modifier and Type                         Field       Description
static KeepLargestFulltextBlockFilter     INSTANCE
Constructor Summary
Constructors
Constructor                          Description
KeepLargestFulltextBlockFilter()
Method Summary
Modifier and Type       Method                                                     Description
protected static int    getNumFullTextWords(TextBlock tb)
protected static int    getNumFullTextWords(TextBlock tb, float minTextDensity)
boolean                 process(TextDocument doc)                                  Processes the given document doc.
Field Detail
INSTANCE
public static final KeepLargestFulltextBlockFilter INSTANCE
Method Detail
process
public boolean process(TextDocument doc) throws BoilerpipeProcessingException
Description copied from interface: BoilerpipeFilter
Processes the given document doc.
Specified by:
    process in interface BoilerpipeFilter
Parameters:
    doc - The TextDocument that is to be processed.
Returns:
    true if changes have been made to the TextDocument.
Throws:
    BoilerpipeProcessingException
getNumFullTextWords
protected static int getNumFullTextWords(TextBlock tb)
getNumFullTextWords
protected static int getNumFullTextWords(TextBlock tb, float minTextDensity)
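This page does not document what the two getNumFullTextWords overloads do, so the following is only a sketch of one plausible reading. It assumes that the one-argument overload delegates to the two-argument one with a default threshold of 9 (taken from the "at least 9" wording in the class description), and that a block whose text density falls below minTextDensity contributes zero words. `FullTextWordsSketch` and its `Block` type are hypothetical stand-ins, not the library's classes.

```java
public class FullTextWordsSketch {

    /** Assumed default threshold; the value 9 comes from the class description. */
    static final float DEFAULT_MIN_TEXT_DENSITY = 9f;

    /** Hypothetical stand-in for TextBlock with a word count and a text density. */
    static final class Block {
        final int numWords;
        final float textDensity;

        Block(int numWords, float textDensity) {
            this.numWords = numWords;
            this.textDensity = textDensity;
        }
    }

    /** Assumption: the short overload uses the default threshold. */
    static int getNumFullTextWords(Block tb) {
        return getNumFullTextWords(tb, DEFAULT_MIN_TEXT_DENSITY);
    }

    /** Assumption: a block counts in full or not at all, depending on its density. */
    static int getNumFullTextWords(Block tb, float minTextDensity) {
        return tb.textDensity >= minTextDensity ? tb.numWords : 0;
    }

    public static void main(String[] args) {
        Block article = new Block(40, 10.5f);
        Block navigation = new Block(6, 2.0f);
        System.out.println(getNumFullTextWords(article));
        System.out.println(getNumFullTextWords(navigation));
    }
}
```

Under these assumptions, short navigation-style blocks count as zero words, so they cannot win the "largest block" comparison against a full-text block.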