All Classes
Class: Description
AddPrecedingLabelsFilter: Adds the labels of the preceding block to the current block, optionally adding a prefix.
ArticleExtractor: A full-text extractor which is tuned towards news articles.
ArticleMetadataFilter
ArticleSentencesExtractor: A full-text extractor which is tuned towards extracting sentences from news articles.
BlockProximityFusion: Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit.
BoilerpipeDocumentSource: Something that can be represented as a TextDocument.
BoilerpipeExtractor: Describes a complete filter pipeline.
BoilerpipeFilter: A generic BoilerpipeFilter.
BoilerpipeHTMLContentHandler: A simple SAX ContentHandler, used by BoilerpipeSAXInput.
BoilerpipeHTMLParser: A simple SAX Parser, used by BoilerpipeSAXInput.
BoilerpipeInput: A source that returns TextDocuments.
BoilerpipeProcessingException: Exception for signaling failure in the processing pipeline.
BoilerpipeSAXInput: Parses an InputSource using SAX and returns a TextDocument.
BoilerplateBlockFilter: Removes TextBlocks which have explicitly been marked as "not content".
CanolaExtractor
CommonExtractors: Provides quick access to common BoilerpipeExtractors.
CommonTagActions: Defines an action that is to be performed whenever a particular tag occurs during HTML parsing.
CommonTagActions.BlockTagLabelAction: CommonTagActions for block-level elements, which triggers some LabelAction on the generated TextBlock.
CommonTagActions.Chained
CommonTagActions.InlineTagLabelAction
ConditionalLabelAction: Adds labels to a TextBlock if the given criteria are met.
ContentFusion
DefaultExtractor: A quite generic full-text extractor.
DefaultLabels: Some pre-defined labels which can be used in conjunction with TextBlock.addLabel(String) and TextBlock.hasLabel(String).
DefaultTagActionMap: Default TagActions.
DensityRulesClassifier: Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features", particularly using text densities and link densities.
DocumentTitleMatchClassifier: Marks TextBlocks which contain parts of the HTML <TITLE> tag, using some heuristics which are quite specific to the news domain.
ExpandTitleToContentFilter: Marks as "content" all TextBlocks which are between the headline and the part that has already been marked content, if they are marked DefaultLabels.MIGHT_BE_CONTENT.
ExtractorBase: The base class of Extractors.
Grabber
GrabberRecord
HTMLDocument: An InputSourceable for HTMLFetcher.
HTMLElements: Collection of HTML element information.
HTMLElements.Element: Element information.
HTMLElements.ElementList: Unsynchronized list of elements.
HTMLFetcher: A very simple HTTP/HTML fetcher, really just for demo purposes.
HTMLHighlighter: Highlights text blocks in an HTML document that have been marked as "content" in the corresponding TextDocument.
HTMLTagBalancer: Balances tags in an HTML document.
HTMLTagBalancer.Info: Element info for each start element.
HTMLTagBalancer.InfoStack: Unsynchronized stack of element information.
IgnoreBlocksAfterContentFilter: Marks all blocks as "non-content" that occur after blocks that have been marked DefaultLabels.INDICATES_END_OF_TEXT.
IgnoreBlocksAfterContentFromEndFilter: Marks all blocks as "non-content" that occur after blocks that have been marked DefaultLabels.INDICATES_END_OF_TEXT, and after any content block.
InputSourceable: An InputSourceable can return an arbitrary number of new InputSources for a given document.
InvertedFilter: Inverts the "isContent" flag for all TextBlocks.
KeepEverythingExtractor: Marks everything as content.
KeepEverythingWithMinKWordsExtractor: A full-text extractor which extracts the largest text component of a page.
KeepLargestBlockFilter: Keeps the largest TextBlock only (by the number of words).
KeepLargestFulltextBlockFilter: Keeps the largest TextBlock only (by the number of words).
LabelAction: Helps add labels to TextBlocks.
LabelFusion: Fuses adjacent blocks if their labels are equal.
LabelToBoilerplateFilter: Marks all blocks that contain a given label as "boilerplate".
LabelToContentFilter: Marks all blocks that contain a given label as "content".
LargestContentExtractor: A full-text extractor which extracts the largest text component of a page.
MarkEverythingContentFilter: Marks all blocks as content.
MarkupTagAction: Assigns labels for element CSS classes and ids to the corresponding TextBlock.
MinClauseWordsFilter: Keeps only blocks that have at least one segment fragment ("clause") with at least k words (default: 5).
MinFulltextWordsFilter: Keeps only those content blocks which contain at least k full-text words (measured by HeuristicFilterBase.getNumFullTextWords(TextBlock)).
MinWordsFilter: Keeps only those content blocks which contain at least k words.
NumWordsRulesClassifier: Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using number of words per block and link density per block.
NumWordsRulesExtractor: A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
SimpleBlockFusionProcessor: Merges two subsequent blocks if their text densities are equal.
SimpleEstimator: Estimates the "goodness" of a BoilerpipeExtractor on a given document.
SplitParagraphBlocksFilter: Splits TextBlocks at paragraph boundaries.
SurroundingToContentFilter
TagAction: Defines an action that is to be performed whenever a particular tag occurs during HTML parsing.
TagActionMap: Base class for defining a set of TagActions that are to be used for the HTML parsing process.
TerminatingBlocksFinder: Finds blocks which potentially indicate the end of an article text and marks them with DefaultLabels.INDICATES_END_OF_TEXT.
TextBlock: Describes a block of text.
TextBlockCondition: Evaluates whether a given TextBlock meets a certain condition.
TextDocument: A text document, consisting of one or more TextBlocks.
TextDocumentStatistics: Provides shallow statistics on a given TextDocument.
UnicodeTokenizer: Tokenizes text according to Unicode word boundaries and strips off non-word characters.
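
Several entries in this index describe filters that flip a per-block "content" flag, for example MinWordsFilter ("at least k words") and KeepLargestBlockFilter ("largest TextBlock only"). The following is a minimal, self-contained sketch of that idea, not the library's implementation. The names FilterSketch, Block, minWords, keepLargest and contentText are hypothetical stand-ins invented for illustration, and the whitespace-based word count is an assumption; the library's own tokenization may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FilterSketch {

    /** Hypothetical stand-in for a text block: its text plus a content flag. */
    static final class Block {
        final String text;
        boolean isContent;

        Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }

        /** Counts whitespace-separated tokens (an assumption made for this sketch). */
        int numWords() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
    }

    /** Marks content blocks with fewer than k words as non-content; returns true if anything changed. */
    static boolean minWords(List<Block> blocks, int k) {
        boolean changed = false;
        for (Block b : blocks) {
            if (b.isContent && b.numWords() < k) {
                b.isContent = false;
                changed = true;
            }
        }
        return changed;
    }

    /** Leaves only the content block with the most words marked as content; returns true if anything changed. */
    static boolean keepLargest(List<Block> blocks) {
        Block largest = null;
        for (Block b : blocks) {
            if (b.isContent && (largest == null || b.numWords() > largest.numWords())) {
                largest = b;
            }
        }
        boolean changed = false;
        for (Block b : blocks) {
            if (b.isContent && b != largest) {
                b.isContent = false;
                changed = true;
            }
        }
        return changed;
    }

    /** Joins the text of all blocks still marked as content, one block per line. */
    static String contentText(List<Block> blocks) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            if (b.isContent) {
                if (sb.length() > 0) {
                    sb.append('\n');
                }
                sb.append(b.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>(Arrays.asList(
                new Block("Home | About | Contact", true),
                new Block("The quick brown fox jumps over the lazy dog near the river bank", true),
                new Block("Copyright 2010", true)));
        minWords(blocks, 3);   // drops the two-word footer
        keepLargest(blocks);   // drops the navigation line
        System.out.println(contentText(blocks));
    }
}
```

Each filter here returns whether it changed anything, so a caller can chain filters and tell whether a pass had any effect. That return convention is a design choice of this sketch.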