A B C D E F G H I K L M N O P Q R S T U V W X
All Classes All Packages
A
- A - Static variable in class org.cyberneko.html.HTMLElements
- ABBR - Static variable in class org.cyberneko.html.HTMLElements
- ACRONYM - Static variable in class org.cyberneko.html.HTMLElements
- addElement(HTMLElements.Element) - Method in class org.cyberneko.html.HTMLElements.ElementList
-
Adds an element to list, resizing if necessary.
- addLabel(String) - Method in class de.l3s.boilerpipe.document.TextBlock
-
Adds an arbitrary String label to this TextBlock.
- addLabelAction(LabelAction) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- addLabels(String...) - Method in class de.l3s.boilerpipe.document.TextBlock
-
Adds a set of labels to this TextBlock.
- addLabels(Set<String>) - Method in class de.l3s.boilerpipe.document.TextBlock
-
Adds a set of labels to this TextBlock.
- addLabelsTo(TextBlock) - Method in class de.l3s.boilerpipe.labels.LabelAction
- AddPrecedingLabelsFilter - Class in de.l3s.boilerpipe.filters.heuristics
-
Adds the labels of the preceding block to the current block, optionally adding a prefix.
- AddPrecedingLabelsFilter(String) - Constructor for class de.l3s.boilerpipe.filters.heuristics.AddPrecedingLabelsFilter
-
Creates a new AddPrecedingLabelsFilter instance.
- ADDRESS - Static variable in class org.cyberneko.html.HTMLElements
- addTagAction(String, TagAction) - Method in class de.l3s.boilerpipe.sax.TagActionMap
-
Adds a particular TagAction for a given tag.
- addTextBlock(TextBlock) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- addTo(TextBlock) - Method in class de.l3s.boilerpipe.labels.ConditionalLabelAction
- addTo(TextBlock) - Method in class de.l3s.boilerpipe.labels.LabelAction
- addWhitespaceIfNecessary() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- APPLET - Static variable in class org.cyberneko.html.HTMLElements
- AREA - Static variable in class org.cyberneko.html.HTMLElements
- ARTICLE_EXTRACTOR - Static variable in class de.l3s.boilerpipe.extractors.CommonExtractors
-
Works very well for most types of Article-like HTML.
- ARTICLE_METADATA - Static variable in class de.l3s.boilerpipe.labels.DefaultLabels
- ArticleExtractor - Class in de.l3s.boilerpipe.extractors
-
A full-text extractor which is tuned towards news articles.
- ArticleExtractor() - Constructor for class de.l3s.boilerpipe.extractors.ArticleExtractor
- ArticleMetadataFilter - Class in de.l3s.boilerpipe.filters.heuristics
- ArticleSentencesExtractor - Class in de.l3s.boilerpipe.extractors
-
A full-text extractor which is tuned towards extracting sentences from news articles.
- ArticleSentencesExtractor() - Constructor for class de.l3s.boilerpipe.extractors.ArticleSentencesExtractor
- attributes - Variable in class org.cyberneko.html.HTMLTagBalancer.Info
-
The element attributes.
- AUGMENTATIONS - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
Include infoset augmentations.
- avgNumWords() - Method in class de.l3s.boilerpipe.document.TextDocumentStatistics
-
Returns the average number of words at block-level (= overall number of words divided by the number of blocks).
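The avgNumWords() entry above defines its result as the overall number of words divided by the number of blocks. A minimal, library-independent sketch of that arithmetic follows. The class name AvgWordsSketch, the helper's signature, and the choice to return 0 for an empty document are assumptions made for illustration; this is not the TextDocumentStatistics implementation.

```java
public class AvgWordsSketch {

    /**
     * Hypothetical helper: mean number of words per block.
     * Returning 0 for an empty document is an assumption of this sketch.
     */
    public static float avgNumWords(int[] wordsPerBlock) {
        if (wordsPerBlock.length == 0) {
            return 0f;
        }
        int total = 0;
        for (int words : wordsPerBlock) {
            total += words; // overall number of words in all blocks
        }
        // Cast before dividing so the result is not truncated to an int.
        return total / (float) wordsPerBlock.length;
    }

    public static void main(String[] args) {
        int[] wordsPerBlock = {12, 3, 45}; // three blocks, 60 words in total
        System.out.println(avgNumWords(wordsPerBlock)); // prints 20.0
    }
}
```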
B
- B - Static variable in class org.cyberneko.html.HTMLElements
- BASE - Static variable in class org.cyberneko.html.HTMLElements
- BASEFONT - Static variable in class org.cyberneko.html.HTMLElements
- BDO - Static variable in class org.cyberneko.html.HTMLElements
- BGSOUND - Static variable in class org.cyberneko.html.HTMLElements
- BIG - Static variable in class org.cyberneko.html.HTMLElements
- BLINK - Static variable in class org.cyberneko.html.HTMLElements
- BLOCK - Static variable in class org.cyberneko.html.HTMLElements.Element
-
Block element.
- BlockProximityFusion - Class in de.l3s.boilerpipe.filters.heuristics
-
Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit.
- BlockProximityFusion(int, boolean, boolean) - Constructor for class de.l3s.boilerpipe.filters.heuristics.BlockProximityFusion
-
Creates a new BlockProximityFusion instance.
- BLOCKQUOTE - Static variable in class org.cyberneko.html.HTMLElements
- BlockTagLabelAction(LabelAction) - Constructor for class de.l3s.boilerpipe.sax.CommonTagActions.BlockTagLabelAction
- BODY - Static variable in class org.cyberneko.html.HTMLElements
- BoilerpipeDocumentSource - Interface in de.l3s.boilerpipe
-
Something that can be represented as a TextDocument.
- BoilerpipeExtractor - Interface in de.l3s.boilerpipe
-
Describes a complete filter pipeline.
- BoilerpipeFilter - Interface in de.l3s.boilerpipe
-
A generic BoilerpipeFilter.
- BoilerpipeHTMLContentHandler - Class in de.l3s.boilerpipe.sax
-
A simple SAX ContentHandler, used by BoilerpipeSAXInput.
- BoilerpipeHTMLContentHandler() - Constructor for class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
-
Constructs a BoilerpipeHTMLContentHandler using the DefaultTagActionMap.
- BoilerpipeHTMLContentHandler(TagActionMap) - Constructor for class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
-
Constructs a BoilerpipeHTMLContentHandler using the given TagActionMap.
- BoilerpipeHTMLParser - Class in de.l3s.boilerpipe.sax
-
A simple SAX Parser, used by BoilerpipeSAXInput.
- BoilerpipeHTMLParser() - Constructor for class de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
-
Constructs a BoilerpipeHTMLParser using a default HTML content handler.
- BoilerpipeHTMLParser(boolean) - Constructor for class de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
- BoilerpipeHTMLParser(BoilerpipeHTMLContentHandler) - Constructor for class de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
-
Constructs a BoilerpipeHTMLParser using the given BoilerpipeHTMLContentHandler.
- BoilerpipeInput - Interface in de.l3s.boilerpipe
-
A source that returns TextDocuments.
- BoilerpipeProcessingException - Exception in de.l3s.boilerpipe
-
Exception for signaling failure in the processing pipeline.
- BoilerpipeProcessingException() - Constructor for exception de.l3s.boilerpipe.BoilerpipeProcessingException
- BoilerpipeProcessingException(String) - Constructor for exception de.l3s.boilerpipe.BoilerpipeProcessingException
- BoilerpipeProcessingException(String, Throwable) - Constructor for exception de.l3s.boilerpipe.BoilerpipeProcessingException
- BoilerpipeProcessingException(Throwable) - Constructor for exception de.l3s.boilerpipe.BoilerpipeProcessingException
- BoilerpipeSAXInput - Class in de.l3s.boilerpipe.sax
-
Parses an InputSource using SAX and returns a TextDocument.
- BoilerpipeSAXInput(InputSource) - Constructor for class de.l3s.boilerpipe.sax.BoilerpipeSAXInput
-
Creates a new instance of BoilerpipeSAXInput for the given InputSource.
- BoilerplateBlockFilter - Class in de.l3s.boilerpipe.filters.simple
-
Removes TextBlocks which have explicitly been marked as "not content".
- BoilerplateBlockFilter() - Constructor for class de.l3s.boilerpipe.filters.simple.BoilerplateBlockFilter
- bounds - Variable in class org.cyberneko.html.HTMLElements.Element
-
The bounding element code.
- BR - Static variable in class org.cyberneko.html.HTMLElements
- BUTTON - Static variable in class org.cyberneko.html.HTMLElements
C
- callEndElement(QName, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Call document handler end element.
- callStartElement(QName, XMLAttributes, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Call document handler start element.
- CANOLA_EXTRACTOR - Static variable in class de.l3s.boilerpipe.extractors.CommonExtractors
-
Trained on krdwrd Canola (different definition of "boilerplate").
- CanolaExtractor - Class in de.l3s.boilerpipe.extractors
- CanolaExtractor() - Constructor for class de.l3s.boilerpipe.extractors.CanolaExtractor
- CAPTION - Static variable in class org.cyberneko.html.HTMLElements
- CENTER - Static variable in class org.cyberneko.html.HTMLElements
- Chained(TagAction, TagAction) - Constructor for class de.l3s.boilerpipe.sax.CommonTagActions.Chained
- changesTagLevel() - Method in class de.l3s.boilerpipe.sax.CommonTagActions.BlockTagLabelAction
- changesTagLevel() - Method in class de.l3s.boilerpipe.sax.CommonTagActions.Chained
- changesTagLevel() - Method in class de.l3s.boilerpipe.sax.CommonTagActions.InlineTagLabelAction
- changesTagLevel() - Method in class de.l3s.boilerpipe.sax.MarkupTagAction
- changesTagLevel() - Method in interface de.l3s.boilerpipe.sax.TagAction
- characters(char[], int, int) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- characters(XMLString, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Characters.
- CITE - Static variable in class org.cyberneko.html.HTMLElements
- CLASSIFIER - Static variable in class de.l3s.boilerpipe.extractors.CanolaExtractor
-
The actual classifier, exposed.
- classify(TextBlock, TextBlock, TextBlock) - Method in class de.l3s.boilerpipe.filters.english.DensityRulesClassifier
- classify(TextBlock, TextBlock, TextBlock) - Method in class de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier
- clone() - Method in class de.l3s.boilerpipe.document.TextBlock
- closes - Variable in class org.cyberneko.html.HTMLElements.Element
-
List of elements this element can close.
- closes(short) - Method in class org.cyberneko.html.HTMLElements.Element
-
Returns true if this element can close the specified Element.
- code - Variable in class org.cyberneko.html.HTMLElements.Element
-
The element code.
- CODE - Static variable in class org.cyberneko.html.HTMLElements
- COL - Static variable in class org.cyberneko.html.HTMLElements
- COLGROUP - Static variable in class org.cyberneko.html.HTMLElements
- com.cloudburo.grab.webcontent - package com.cloudburo.grab.webcontent
- comment(XMLString, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Comment.
- COMMENT - Static variable in class org.cyberneko.html.HTMLElements
- CommonExtractors - Class in de.l3s.boilerpipe.extractors
-
Provides quick access to common BoilerpipeExtractors.
- CommonTagActions - Class in de.l3s.boilerpipe.sax
-
Defines an action that is to be performed whenever a particular tag occurs during HTML parsing.
- CommonTagActions.BlockTagLabelAction - Class in de.l3s.boilerpipe.sax
-
CommonTagActions for block-level elements, which triggers some LabelAction on the generated TextBlock.
- CommonTagActions.Chained - Class in de.l3s.boilerpipe.sax
- CommonTagActions.InlineTagLabelAction - Class in de.l3s.boilerpipe.sax
- ConditionalLabelAction - Class in de.l3s.boilerpipe.labels
-
Adds labels to a TextBlock if the given criteria are met.
- ConditionalLabelAction(TextBlockCondition, String...) - Constructor for class de.l3s.boilerpipe.labels.ConditionalLabelAction
- CONTAINER - Static variable in class org.cyberneko.html.HTMLElements.Element
-
Container element.
- content - Variable in class com.cloudburo.grab.webcontent.GrabberRecord
- ContentFusion - Class in de.l3s.boilerpipe.filters.heuristics
- ContentFusion() - Constructor for class de.l3s.boilerpipe.filters.heuristics.ContentFusion
-
Creates a new ContentFusion instance.
D
- data - Variable in class org.cyberneko.html.HTMLElements.ElementList
-
The data in the list.
- data - Variable in class org.cyberneko.html.HTMLTagBalancer.InfoStack
-
The stack data.
- DD - Static variable in class org.cyberneko.html.HTMLElements
- de.l3s.boilerpipe - package de.l3s.boilerpipe
- de.l3s.boilerpipe.conditions - package de.l3s.boilerpipe.conditions
- de.l3s.boilerpipe.document - package de.l3s.boilerpipe.document
- de.l3s.boilerpipe.estimators - package de.l3s.boilerpipe.estimators
- de.l3s.boilerpipe.extractors - package de.l3s.boilerpipe.extractors
- de.l3s.boilerpipe.filters.english - package de.l3s.boilerpipe.filters.english
- de.l3s.boilerpipe.filters.heuristics - package de.l3s.boilerpipe.filters.heuristics
- de.l3s.boilerpipe.filters.simple - package de.l3s.boilerpipe.filters.simple
- de.l3s.boilerpipe.labels - package de.l3s.boilerpipe.labels
- de.l3s.boilerpipe.sax - package de.l3s.boilerpipe.sax
- de.l3s.boilerpipe.util - package de.l3s.boilerpipe.util
- debugString() - Method in class de.l3s.boilerpipe.document.TextDocument
-
Returns detailed debugging information about the contained TextBlocks.
- DEFAULT_EXTRACTOR - Static variable in class de.l3s.boilerpipe.extractors.CommonExtractors
-
Usually worse than ArticleExtractor, but simpler/no heuristics.
- DEFAULT_INSTANCE - Static variable in class de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter
- DEFAULT_INSTANCE - Static variable in class de.l3s.boilerpipe.filters.english.MinFulltextWordsFilter
- DefaultExtractor - Class in de.l3s.boilerpipe.extractors
-
A quite generic full-text extractor.
- DefaultExtractor() - Constructor for class de.l3s.boilerpipe.extractors.DefaultExtractor
- DefaultLabels - Class in de.l3s.boilerpipe.labels
-
Some pre-defined labels which can be used in conjunction with TextBlock.addLabel(String) and TextBlock.hasLabel(String).
- DefaultLabels() - Constructor for class de.l3s.boilerpipe.labels.DefaultLabels
- DefaultTagActionMap - Class in de.l3s.boilerpipe.sax
-
Default TagActions.
- DefaultTagActionMap() - Constructor for class de.l3s.boilerpipe.sax.DefaultTagActionMap
- DEL - Static variable in class org.cyberneko.html.HTMLElements
- DensityRulesClassifier - Class in de.l3s.boilerpipe.filters.english
-
Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features", particularly using text densities and link densities.
- DensityRulesClassifier() - Constructor for class de.l3s.boilerpipe.filters.english.DensityRulesClassifier
- DFN - Static variable in class org.cyberneko.html.HTMLElements
- DIR - Static variable in class org.cyberneko.html.HTMLElements
- DIV - Static variable in class org.cyberneko.html.HTMLElements
- DL - Static variable in class org.cyberneko.html.HTMLElements
- doctypeDecl(String, String, String, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Doctype declaration.
- DOCUMENT_FRAGMENT - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
Document fragment balancing only.
- DOCUMENT_FRAGMENT_DEPRECATED - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
Document fragment balancing only (deprecated).
- DocumentTitleMatchClassifier - Class in de.l3s.boilerpipe.filters.heuristics
-
Marks TextBlocks which contain parts of the HTML <TITLE> tag, using some heuristics which are quite specific to the news domain.
- DocumentTitleMatchClassifier(String) - Constructor for class de.l3s.boilerpipe.filters.heuristics.DocumentTitleMatchClassifier
- DT - Static variable in class org.cyberneko.html.HTMLElements
E
- element - Variable in class org.cyberneko.html.HTMLTagBalancer.Info
-
The element.
- Element(short, String, int, short[], short[]) - Constructor for class org.cyberneko.html.HTMLElements.Element
-
Constructs an element object.
- Element(short, String, int, short[], short, short[]) - Constructor for class org.cyberneko.html.HTMLElements.Element
-
Constructs an element object.
- Element(short, String, int, short, short[]) - Constructor for class org.cyberneko.html.HTMLElements.Element
-
Constructs an element object.
- Element(short, String, int, short, short, short[]) - Constructor for class org.cyberneko.html.HTMLElements.Element
-
Constructs an element object.
- ElementList() - Constructor for class org.cyberneko.html.HTMLElements.ElementList
- ELEMENTS - Static variable in class org.cyberneko.html.HTMLElements
-
Element information as a contiguous list.
- ELEMENTS_ARRAY - Static variable in class org.cyberneko.html.HTMLElements
-
Element information organized by first letter.
- EM - Static variable in class org.cyberneko.html.HTMLElements
- EMBED - Static variable in class org.cyberneko.html.HTMLElements
- EMPTY - Static variable in class org.cyberneko.html.HTMLElements.Element
-
Empty element.
- EMPTY_END - Static variable in class de.l3s.boilerpipe.document.TextBlock
- EMPTY_START - Static variable in class de.l3s.boilerpipe.document.TextBlock
- emptyAttributes() - Method in class org.cyberneko.html.HTMLTagBalancer
-
Returns a set of empty attributes.
- emptyElement(QName, XMLAttributes, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Empty element.
- end(BoilerpipeHTMLContentHandler, String, String) - Method in class de.l3s.boilerpipe.sax.CommonTagActions.BlockTagLabelAction
- end(BoilerpipeHTMLContentHandler, String, String) - Method in class de.l3s.boilerpipe.sax.CommonTagActions.Chained
- end(BoilerpipeHTMLContentHandler, String, String) - Method in class de.l3s.boilerpipe.sax.CommonTagActions.InlineTagLabelAction
- end(BoilerpipeHTMLContentHandler, String, String) - Method in class de.l3s.boilerpipe.sax.MarkupTagAction
- end(BoilerpipeHTMLContentHandler, String, String) - Method in interface de.l3s.boilerpipe.sax.TagAction
- endCDATA(Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
End CDATA section.
- endDocument() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- endDocument(Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
End document.
- endElement(String, String, String) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- endElement(QName, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
End element.
- endGeneralEntity(String, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
End entity.
- endPrefixMapping(String) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- endPrefixMapping(String, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
End prefix mapping.
- equals(Object) - Method in class org.cyberneko.html.HTMLElements.Element
-
Returns true if the objects are equal.
- ERROR_REPORTER - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
Error reporter.
- ExpandTitleToContentFilter - Class in de.l3s.boilerpipe.filters.heuristics
-
Marks all TextBlocks "content" which are between the headline and the part that has already been marked content, if they are marked DefaultLabels.MIGHT_BE_CONTENT.
- ExpandTitleToContentFilter() - Constructor for class de.l3s.boilerpipe.filters.heuristics.ExpandTitleToContentFilter
- extractArticle(String, boolean) - Method in class com.cloudburo.grab.webcontent.Grabber
- extractCanloa(String) - Method in class com.cloudburo.grab.webcontent.Grabber
- extractDefault(String) - Method in class com.cloudburo.grab.webcontent.Grabber
- extractLargestContent(String) - Method in class com.cloudburo.grab.webcontent.Grabber
- ExtractorBase - Class in de.l3s.boilerpipe.extractors
-
The base class of Extractors.
- ExtractorBase() - Constructor for class de.l3s.boilerpipe.extractors.ExtractorBase
F
- fAugmentations - Variable in class org.cyberneko.html.HTMLTagBalancer
-
Include infoset augmentations.
- fDocumentFragment - Variable in class org.cyberneko.html.HTMLTagBalancer
-
Document fragment balancing only.
- fDocumentHandler - Variable in class org.cyberneko.html.HTMLTagBalancer
-
The document handler.
- fDocumentSource - Variable in class org.cyberneko.html.HTMLTagBalancer
-
The document source.
- fElementStack - Variable in class org.cyberneko.html.HTMLTagBalancer
-
The element stack.
- fErrorReporter - Variable in class org.cyberneko.html.HTMLTagBalancer
-
Error reporter.
- fetch(URL) - Static method in class de.l3s.boilerpipe.sax.HTMLFetcher
-
Fetches the document at the given URL, using URLConnection.
- FIELDSET - Static variable in class org.cyberneko.html.HTMLElements
- fIgnoreOutsideContent - Variable in class org.cyberneko.html.HTMLTagBalancer
-
Ignore outside content.
- fInlineStack - Variable in class org.cyberneko.html.HTMLTagBalancer
-
The inline stack.
- flags - Variable in class org.cyberneko.html.HTMLElements.Element
-
Informational flags.
- flushBlock() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- fNamesAttrs - Variable in class org.cyberneko.html.HTMLTagBalancer
-
Modify HTML attribute names.
- fNamesElems - Variable in class org.cyberneko.html.HTMLTagBalancer
-
Modify HTML element names.
- fNamespaces - Variable in class org.cyberneko.html.HTMLTagBalancer
-
Namespaces.
- FONT - Static variable in class org.cyberneko.html.HTMLElements
- fOpenedForm - Variable in class org.cyberneko.html.HTMLTagBalancer
-
True if a form is in the stack (allows the opening of nested forms to be discarded).
- FORM - Static variable in class org.cyberneko.html.HTMLElements
- FRAGMENT_CONTEXT_STACK - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
EXPERIMENTAL: may change in next release.
Name of the property holding the stack of elements in which context a document fragment should be parsed.
- FRAME - Static variable in class org.cyberneko.html.HTMLElements
- FRAMESET - Static variable in class org.cyberneko.html.HTMLElements
- fReportErrors - Variable in class org.cyberneko.html.HTMLTagBalancer
-
Report errors.
- fSeenAnything - Variable in class org.cyberneko.html.HTMLTagBalancer
-
True if seen anything.
- fSeenBodyElement - Variable in class org.cyberneko.html.HTMLTagBalancer
-
True if seen <body> element.
- fSeenDoctype - Variable in class org.cyberneko.html.HTMLTagBalancer
-
True if the doctype has been seen.
- fSeenHeadElement - Variable in class org.cyberneko.html.HTMLTagBalancer
-
True if seen <head> element.
- fSeenRootElement - Variable in class org.cyberneko.html.HTMLTagBalancer
-
True if root element has been seen.
- fSeenRootElementEnd - Variable in class org.cyberneko.html.HTMLTagBalancer
-
True if seen the end of the document element.
G
- getCharset() - Method in class de.l3s.boilerpipe.sax.HTMLDocument
- getContainedTextElements() - Method in class de.l3s.boilerpipe.document.TextBlock
-
Returns the containedTextElements BitSet, or null.
- getContent() - Method in class de.l3s.boilerpipe.document.TextDocument
-
Returns the TextDocument's content.
- getData() - Method in class de.l3s.boilerpipe.sax.HTMLDocument
- getDefaultInstance() - Static method in class de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter
-
Returns the singleton instance for IgnoreBlocksAfterContentFilter.
- getDefaultInstance() - Static method in class de.l3s.boilerpipe.filters.english.MinFulltextWordsFilter
- getDocumentHandler() - Method in class org.cyberneko.html.HTMLTagBalancer
-
Returns the document handler.
- getDocumentSource() - Method in class org.cyberneko.html.HTMLTagBalancer
-
Returns the document source.
- getElement(short) - Static method in class org.cyberneko.html.HTMLElements
-
Returns the element information for the specified element code.
- getElement(String) - Static method in class org.cyberneko.html.HTMLElements
-
Returns the element information for the specified element name.
- getElement(String, HTMLElements.Element) - Static method in class org.cyberneko.html.HTMLElements
-
Returns the element information for the specified element name.
- getElement(QName) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Returns an HTML element.
- getElementDepth(HTMLElements.Element) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Returns the depth of the open tag associated with the specified element name or -1 if no matching element is found.
- getExtraStyleSheet() - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
-
Returns the extra stylesheet definition that will be inserted in the HEAD element.
- getFeatureDefault(String) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Returns the default state for a feature.
- getInstance() - Static method in class de.l3s.boilerpipe.extractors.ArticleExtractor
-
Returns the singleton instance for ArticleExtractor.
- getInstance() - Static method in class de.l3s.boilerpipe.extractors.ArticleSentencesExtractor
-
Returns the singleton instance for ArticleSentencesExtractor.
- getInstance() - Static method in class de.l3s.boilerpipe.extractors.CanolaExtractor
-
Returns the singleton instance for CanolaExtractor.
- getInstance() - Static method in class de.l3s.boilerpipe.extractors.DefaultExtractor
-
Returns the singleton instance for DefaultExtractor.
- getInstance() - Static method in class de.l3s.boilerpipe.extractors.LargestContentExtractor
-
Returns the singleton instance for LargestContentExtractor.
- getInstance() - Static method in class de.l3s.boilerpipe.extractors.NumWordsRulesExtractor
-
Returns the singleton instance for NumWordsRulesExtractor.
- getInstance() - Static method in class de.l3s.boilerpipe.filters.english.DensityRulesClassifier
-
Returns the singleton instance for DensityRulesClassifier.
- getInstance() - Static method in class de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier
-
Returns the singleton instance for NumWordsRulesClassifier.
- getInstance() - Static method in class de.l3s.boilerpipe.filters.english.TerminatingBlocksFinder
-
Returns the singleton instance for TerminatingBlocksFinder.
- getInstance() - Static method in class de.l3s.boilerpipe.filters.heuristics.ExpandTitleToContentFilter
-
Returns the singleton instance for ExpandTitleToContentFilter.
- getInstance() - Static method in class de.l3s.boilerpipe.filters.heuristics.SimpleBlockFusionProcessor
-
Returns the singleton instance for SimpleBlockFusionProcessor.
- getInstance() - Static method in class de.l3s.boilerpipe.filters.simple.BoilerplateBlockFilter
-
Returns the singleton instance for BoilerplateBlockFilter.
- getInstance() - Static method in class de.l3s.boilerpipe.filters.simple.SplitParagraphBlocksFilter
-
Returns the singleton instance for SplitParagraphBlocksFilter.
- getLabels() - Method in class de.l3s.boilerpipe.document.TextBlock
-
Returns the labels associated to this TextBlock, or null if no such labels exist.
- getLinkDensity() - Method in class de.l3s.boilerpipe.document.TextBlock
- getNamesValue(String) - Static method in class org.cyberneko.html.HTMLTagBalancer
-
Converts HTML names string value to constant value.
- getNumWords() - Method in class de.l3s.boilerpipe.document.TextBlock
- getNumWords() - Method in class de.l3s.boilerpipe.document.TextDocumentStatistics
-
Returns the overall number of words in all blocks.
- getNumWordsInAnchorText() - Method in class de.l3s.boilerpipe.document.TextBlock
- getOffsetBlocksEnd() - Method in class de.l3s.boilerpipe.document.TextBlock
- getOffsetBlocksStart() - Method in class de.l3s.boilerpipe.document.TextBlock
- getParentDepth(HTMLElements.Element[], short) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Returns the depth of the open tag associated with the specified element parent names or -1 if no matching element is found.
- getPostHighlight() - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
-
Returns the string that will be inserted after any highlighted HTML block.
- getPotentialTitles() - Method in class de.l3s.boilerpipe.filters.heuristics.DocumentTitleMatchClassifier
- getPreHighlight() - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
-
Returns the string that will be inserted before any highlighted HTML block.
- getPropertyDefault(String) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Returns the default state for a property.
- getRecognizedFeatures() - Method in class org.cyberneko.html.HTMLTagBalancer
-
Returns recognized features.
- getRecognizedProperties() - Method in class org.cyberneko.html.HTMLTagBalancer
-
Returns recognized properties.
- getTagLevel() - Method in class de.l3s.boilerpipe.document.TextBlock
- getText() - Method in class de.l3s.boilerpipe.document.TextBlock
- getText(boolean, boolean) - Method in class de.l3s.boilerpipe.document.TextDocument
-
Returns the TextDocument's content, non-content or both.
- getText(TextDocument) - Method in interface de.l3s.boilerpipe.BoilerpipeExtractor
-
Extracts text from the given TextDocument object.
- getText(TextDocument) - Method in class de.l3s.boilerpipe.extractors.ExtractorBase
-
Extracts text from the given TextDocument object.
- getText(Reader) - Method in interface de.l3s.boilerpipe.BoilerpipeExtractor
-
Extracts text from the HTML code available from the given Reader.
- getText(Reader) - Method in class de.l3s.boilerpipe.extractors.ExtractorBase
-
Extracts text from the HTML code available from the given Reader.
- getText(String) - Method in interface de.l3s.boilerpipe.BoilerpipeExtractor
-
Extracts text from the HTML code given as a String.
- getText(String) - Method in class de.l3s.boilerpipe.extractors.ExtractorBase
-
Extracts text from the HTML code given as a String.
- getText(URL) - Method in class de.l3s.boilerpipe.extractors.ExtractorBase
-
Extracts text from the HTML code available from the given URL.
- getText(InputSource) - Method in interface de.l3s.boilerpipe.BoilerpipeExtractor
-
Extracts text from the HTML code available from the given InputSource.
- getText(InputSource) - Method in class de.l3s.boilerpipe.extractors.ExtractorBase
-
Extracts text from the HTML code available from the given InputSource.
- getTextBlocks() - Method in class de.l3s.boilerpipe.document.TextDocument
-
Returns the TextBlocks of this document.
- getTextDensity() - Method in class de.l3s.boilerpipe.document.TextBlock
- getTextDocument() - Method in interface de.l3s.boilerpipe.BoilerpipeInput
-
Returns (somehow) a TextDocument.
- getTextDocument() - Method in class de.l3s.boilerpipe.sax.BoilerpipeSAXInput
-
Retrieves the TextDocument using a default HTML parser.
- getTextDocument(BoilerpipeHTMLParser) - Method in class de.l3s.boilerpipe.sax.BoilerpipeSAXInput
-
Retrieves the TextDocument using the given HTML parser.
- getTitle() - Method in class de.l3s.boilerpipe.document.TextDocument
-
Returns the "main" title for this document, or null if no such title has been set.
- getTitle() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- Grabber - Class in com.cloudburo.grab.webcontent
- Grabber() - Constructor for class com.cloudburo.grab.webcontent.Grabber
- GrabberRecord - Class in com.cloudburo.grab.webcontent
- GrabberRecord() - Constructor for class com.cloudburo.grab.webcontent.GrabberRecord
H
- H1 - Static variable in class org.cyberneko.html.HTMLElements
- H2 - Static variable in class org.cyberneko.html.HTMLElements
- H3 - Static variable in class org.cyberneko.html.HTMLElements
- H4 - Static variable in class org.cyberneko.html.HTMLElements
- H5 - Static variable in class org.cyberneko.html.HTMLElements
- H6 - Static variable in class org.cyberneko.html.HTMLElements
- hashCode() - Method in class org.cyberneko.html.HTMLElements.Element
-
Returns a hash code for this object.
- hasLabel(String) - Method in class de.l3s.boilerpipe.document.TextBlock
-
Checks whether this TextBlock has the given label.
- HEAD - Static variable in class org.cyberneko.html.HTMLElements
- HR - Static variable in class de.l3s.boilerpipe.labels.DefaultLabels
- HR - Static variable in class org.cyberneko.html.HTMLElements
- HTML - Static variable in class org.cyberneko.html.HTMLElements
- HTMLDocument - Class in de.l3s.boilerpipe.sax
-
An InputSourceable for HTMLFetcher.
- HTMLDocument(byte[], Charset) - Constructor for class de.l3s.boilerpipe.sax.HTMLDocument
- HTMLDocument(String) - Constructor for class de.l3s.boilerpipe.sax.HTMLDocument
- HTMLElements - Class in org.cyberneko.html
-
Collection of HTML element information.
- HTMLElements() - Constructor for class org.cyberneko.html.HTMLElements
- HTMLElements.Element - Class in org.cyberneko.html
-
Element information.
- HTMLElements.ElementList - Class in org.cyberneko.html
-
Unsynchronized list of elements.
- HTMLFetcher - Class in de.l3s.boilerpipe.sax
-
A very simple HTTP/HTML fetcher, really just for demo purposes.
- HTMLHighlighter - Class in de.l3s.boilerpipe.sax
-
Highlights text blocks in an HTML document that have been marked as "content" in the corresponding TextDocument.
- HTMLTagBalancer - Class in org.cyberneko.html
-
Balances tags in an HTML document.
- HTMLTagBalancer() - Constructor for class org.cyberneko.html.HTMLTagBalancer
- HTMLTagBalancer.Info - Class in org.cyberneko.html
-
Element info for each start element.
- HTMLTagBalancer.InfoStack - Class in org.cyberneko.html
-
Unsynchronized stack of element information.
I
- I - Static variable in class org.cyberneko.html.HTMLElements
- IFRAME - Static variable in class org.cyberneko.html.HTMLElements
- ignorableWhitespace(char[], int, int) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- ignorableWhitespace(XMLString, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Ignorable whitespace.
- IGNORE_OUTSIDE_CONTENT - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
Ignore outside content.
- IgnoreBlocksAfterContentFilter - Class in de.l3s.boilerpipe.filters.english
-
Marks all blocks as "non-content" that occur after blocks that have been marked DefaultLabels.INDICATES_END_OF_TEXT.
- IgnoreBlocksAfterContentFilter(int) - Constructor for class de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter
- IgnoreBlocksAfterContentFromEndFilter - Class in de.l3s.boilerpipe.filters.english
-
Marks all blocks as "non-content" that occur after blocks that have been marked DefaultLabels.INDICATES_END_OF_TEXT, and after any content block.
- ILAYER - Static variable in class org.cyberneko.html.HTMLElements
- IMG - Static variable in class org.cyberneko.html.HTMLElements
- INDICATES_END_OF_TEXT - Static variable in class de.l3s.boilerpipe.labels.DefaultLabels
- Info(HTMLElements.Element, QName) - Constructor for class org.cyberneko.html.HTMLTagBalancer.Info
-
Creates an element information object.
- Info(HTMLElements.Element, QName, XMLAttributes) - Constructor for class org.cyberneko.html.HTMLTagBalancer.Info
-
Creates an element information object.
- InfoStack() - Constructor for class org.cyberneko.html.HTMLTagBalancer.InfoStack
- INLINE - Static variable in class org.cyberneko.html.HTMLElements.Element
-
Inline element.
- InlineTagLabelAction(LabelAction) - Constructor for class de.l3s.boilerpipe.sax.CommonTagActions.InlineTagLabelAction
- INPUT - Static variable in class org.cyberneko.html.HTMLElements
- InputSourceable - Interface in de.l3s.boilerpipe.sax
-
An InputSourceable can return an arbitrary number of new InputSources for a given document.
- INS - Static variable in class org.cyberneko.html.HTMLElements
- INSTANCE - Static variable in class de.l3s.boilerpipe.estimators.SimpleEstimator
-
Returns the singleton instance of SimpleEstimator.
- INSTANCE - Static variable in class de.l3s.boilerpipe.extractors.ArticleExtractor
- INSTANCE - Static variable in class de.l3s.boilerpipe.extractors.ArticleSentencesExtractor
- INSTANCE - Static variable in class de.l3s.boilerpipe.extractors.CanolaExtractor
- INSTANCE - Static variable in class de.l3s.boilerpipe.extractors.DefaultExtractor
- INSTANCE - Static variable in class de.l3s.boilerpipe.extractors.KeepEverythingExtractor
- INSTANCE - Static variable in class de.l3s.boilerpipe.extractors.LargestContentExtractor
- INSTANCE - Static variable in class de.l3s.boilerpipe.extractors.NumWordsRulesExtractor
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.english.DensityRulesClassifier
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFromEndFilter
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.english.KeepLargestFulltextBlockFilter
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.english.TerminatingBlocksFinder
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.heuristics.AddPrecedingLabelsFilter
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.heuristics.ArticleMetadataFilter
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.heuristics.ContentFusion
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.heuristics.ExpandTitleToContentFilter
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.heuristics.KeepLargestBlockFilter
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.heuristics.LabelFusion
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.heuristics.SimpleBlockFusionProcessor
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.simple.BoilerplateBlockFilter
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.simple.InvertedFilter
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.simple.MarkEverythingContentFilter
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.simple.MinClauseWordsFilter
- INSTANCE - Static variable in class de.l3s.boilerpipe.filters.simple.SplitParagraphBlocksFilter
- INSTANCE - Static variable in class de.l3s.boilerpipe.sax.DefaultTagActionMap
- INSTANCE_200 - Static variable in class de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter
- INSTANCE_EXPAND_TO_SAME_TAGLEVEL - Static variable in class de.l3s.boilerpipe.filters.heuristics.KeepLargestBlockFilter
- INSTANCE_PRE - Static variable in class de.l3s.boilerpipe.filters.heuristics.AddPrecedingLabelsFilter
- INSTANCE_STRICTLY_NOT_CONTENT - Static variable in class de.l3s.boilerpipe.filters.simple.LabelToBoilerplateFilter
- INSTANCE_TEXT - Static variable in class de.l3s.boilerpipe.filters.simple.SurroundingToContentFilter
- InvertedFilter - Class in de.l3s.boilerpipe.filters.simple
-
Reverts the "isContent" flag for all TextBlocks.
- isBlock() - Method in class org.cyberneko.html.HTMLElements.Element
-
Returns true if this element is a block element.
- isContainer() - Method in class org.cyberneko.html.HTMLElements.Element
-
Returns true if this element is a container element.
- isContent() - Method in class de.l3s.boilerpipe.document.TextBlock
- isEmpty() - Method in class org.cyberneko.html.HTMLElements.Element
-
Returns true if this element is an empty element.
- ISINDEX - Static variable in class org.cyberneko.html.HTMLElements
- isInline() - Method in class org.cyberneko.html.HTMLElements.Element
-
Returns true if this element is an inline element.
- isLowQuality(TextDocumentStatistics, TextDocumentStatistics) - Method in class de.l3s.boilerpipe.estimators.SimpleEstimator
-
Given the statistics of the document before and after applying the BoilerpipeExtractor, can we regard the extraction quality (too) low? Works well with DefaultExtractor, ArticleExtractor and others.
- isOutputHighlightOnly() - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
-
If true, only HTML enclosed within highlighted content will be returned
- isParent(HTMLElements.Element) - Method in class org.cyberneko.html.HTMLElements.Element
-
Indicates if the provided element is an accepted parent of current element
- isSpecial() - Method in class org.cyberneko.html.HTMLElements.Element
-
Returns true if this element is special -- if its content should be parsed ignoring markup.
K
- KBD - Static variable in class org.cyberneko.html.HTMLElements
- KEEP_EVERYTHING_EXTRACTOR - Static variable in class de.l3s.boilerpipe.extractors.CommonExtractors
-
Dummy Extractor; should return the input text.
- KeepEverythingExtractor - Class in de.l3s.boilerpipe.extractors
-
Marks everything as content.
- KeepEverythingWithMinKWordsExtractor - Class in de.l3s.boilerpipe.extractors
-
A full-text extractor which extracts the largest text component of a page.
- KeepEverythingWithMinKWordsExtractor(int) - Constructor for class de.l3s.boilerpipe.extractors.KeepEverythingWithMinKWordsExtractor
- KeepLargestBlockFilter - Class in de.l3s.boilerpipe.filters.heuristics
-
Keeps the largest TextBlock only (by the number of words).
- KeepLargestBlockFilter(boolean) - Constructor for class de.l3s.boilerpipe.filters.heuristics.KeepLargestBlockFilter
- KeepLargestFulltextBlockFilter - Class in de.l3s.boilerpipe.filters.english
-
Keeps the largest TextBlock only (by the number of words).
- KeepLargestFulltextBlockFilter() - Constructor for class de.l3s.boilerpipe.filters.english.KeepLargestFulltextBlockFilter
- KEYGEN - Static variable in class org.cyberneko.html.HTMLElements
L
- LABEL - Static variable in class org.cyberneko.html.HTMLElements
- LabelAction - Class in de.l3s.boilerpipe.labels
-
Helps adding labels to TextBlocks.
- LabelAction(String...) - Constructor for class de.l3s.boilerpipe.labels.LabelAction
- LabelFusion - Class in de.l3s.boilerpipe.filters.heuristics
-
Fuses adjacent blocks if their labels are equal.
- LabelFusion(String) - Constructor for class de.l3s.boilerpipe.filters.heuristics.LabelFusion
-
Creates a new LabelFusion instance.
- labels - Variable in class de.l3s.boilerpipe.labels.LabelAction
- LabelToBoilerplateFilter - Class in de.l3s.boilerpipe.filters.simple
-
Marks all blocks that contain a given label as "boilerplate".
- LabelToBoilerplateFilter(String...) - Constructor for class de.l3s.boilerpipe.filters.simple.LabelToBoilerplateFilter
- LabelToContentFilter - Class in de.l3s.boilerpipe.filters.simple
-
Marks all blocks that contain a given label as "content".
- LabelToContentFilter(String...) - Constructor for class de.l3s.boilerpipe.filters.simple.LabelToContentFilter
- LARGEST_CONTENT_EXTRACTOR - Static variable in class de.l3s.boilerpipe.extractors.CommonExtractors
-
Like DefaultExtractor, but keeps the largest text block only.
- LargestContentExtractor - Class in de.l3s.boilerpipe.extractors
-
A full-text extractor which extracts the largest text component of a page.
- LAYER - Static variable in class org.cyberneko.html.HTMLElements
- LEGEND - Static variable in class org.cyberneko.html.HTMLElements
- LI - Static variable in class org.cyberneko.html.HTMLElements
- LINK - Static variable in class org.cyberneko.html.HTMLElements
- LISTING - Static variable in class org.cyberneko.html.HTMLElements
M
- MAP - Static variable in class org.cyberneko.html.HTMLElements
- MarkEverythingContentFilter - Class in de.l3s.boilerpipe.filters.simple
-
Marks all blocks as content.
- MARKUP_PREFIX - Static variable in class de.l3s.boilerpipe.labels.DefaultLabels
- MarkupTagAction - Class in de.l3s.boilerpipe.sax
-
Assigns labels for element CSS classes and ids to the corresponding TextBlock.
- MarkupTagAction(boolean) - Constructor for class de.l3s.boilerpipe.sax.MarkupTagAction
- MARQUEE - Static variable in class org.cyberneko.html.HTMLElements
- MAX_DISTANCE_1 - Static variable in class de.l3s.boilerpipe.filters.heuristics.BlockProximityFusion
- MAX_DISTANCE_1_CONTENT_ONLY - Static variable in class de.l3s.boilerpipe.filters.heuristics.BlockProximityFusion
- MAX_DISTANCE_1_CONTENT_ONLY_SAME_TAGLEVEL - Static variable in class de.l3s.boilerpipe.filters.heuristics.BlockProximityFusion
- MAX_DISTANCE_1_SAME_TAGLEVEL - Static variable in class de.l3s.boilerpipe.filters.heuristics.BlockProximityFusion
- meetsCondition(TextBlock) - Method in interface de.l3s.boilerpipe.conditions.TextBlockCondition
-
Returns true iff the given TextBlock tb meets the defined condition.
- MENU - Static variable in class org.cyberneko.html.HTMLElements
- mergeNext(TextBlock) - Method in class de.l3s.boilerpipe.document.TextBlock
- META - Static variable in class org.cyberneko.html.HTMLElements
- MIGHT_BE_CONTENT - Static variable in class de.l3s.boilerpipe.labels.DefaultLabels
- MinClauseWordsFilter - Class in de.l3s.boilerpipe.filters.simple
-
Keeps only blocks that have at least one segment fragment ("clause") with at least k words (default: 5).
- MinClauseWordsFilter(int) - Constructor for class de.l3s.boilerpipe.filters.simple.MinClauseWordsFilter
- MinClauseWordsFilter(int, boolean) - Constructor for class de.l3s.boilerpipe.filters.simple.MinClauseWordsFilter
- MinFulltextWordsFilter - Class in de.l3s.boilerpipe.filters.english
-
Keeps only those content blocks which contain at least k full-text words (measured by HeuristicFilterBase.getNumFullTextWords(TextBlock)).
- MinFulltextWordsFilter(int) - Constructor for class de.l3s.boilerpipe.filters.english.MinFulltextWordsFilter
- MinWordsFilter - Class in de.l3s.boilerpipe.filters.simple
-
Keeps only those content blocks which contain at least k words.
- MinWordsFilter(int) - Constructor for class de.l3s.boilerpipe.filters.simple.MinWordsFilter
- modifyName(String, short) - Static method in class org.cyberneko.html.HTMLTagBalancer
-
Modifies the given name based on the specified mode.
- MULTICOL - Static variable in class org.cyberneko.html.HTMLElements
N
- name - Variable in class org.cyberneko.html.HTMLElements.Element
-
The element name.
- NAMES_ATTRS - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
Modify HTML attribute names: { "upper", "lower", "default" }.
- NAMES_ELEMS - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
Modify HTML element names: { "upper", "lower", "default" }.
- NAMES_LOWERCASE - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
Lowercase HTML names.
- NAMES_MATCH - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
Match HTML element names.
- NAMES_NO_CHANGE - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
Don't modify HTML names.
- NAMES_UPPERCASE - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
Uppercase HTML names.
- NAMESPACES - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
Namespaces.
- newExtractingInstance() - Static method in class de.l3s.boilerpipe.sax.HTMLHighlighter
-
Creates a new HTMLHighlighter, which is set up to return only the extracted HTML text, including enclosed markup.
- newHighlightingInstance() - Static method in class de.l3s.boilerpipe.sax.HTMLHighlighter
-
Creates a new HTMLHighlighter, which is set up to return the full HTML text, with the extracted text portion highlighted.
- NEXTID - Static variable in class org.cyberneko.html.HTMLElements
- NO_SUCH_ELEMENT - Static variable in class org.cyberneko.html.HTMLElements
-
No such element.
- NOBR - Static variable in class org.cyberneko.html.HTMLElements
- NOEMBED - Static variable in class org.cyberneko.html.HTMLElements
- NOFRAMES - Static variable in class org.cyberneko.html.HTMLElements
- NOLAYER - Static variable in class org.cyberneko.html.HTMLElements
- NOSCRIPT - Static variable in class org.cyberneko.html.HTMLElements
- NumWordsRulesClassifier - Class in de.l3s.boilerpipe.filters.english
-
Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using number of words per block and link density per block.
- NumWordsRulesClassifier() - Constructor for class de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier
- NumWordsRulesExtractor - Class in de.l3s.boilerpipe.extractors
-
A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
- NumWordsRulesExtractor() - Constructor for class de.l3s.boilerpipe.extractors.NumWordsRulesExtractor
O
- OBJECT - Static variable in class org.cyberneko.html.HTMLElements
- OL - Static variable in class org.cyberneko.html.HTMLElements
- OPTGROUP - Static variable in class org.cyberneko.html.HTMLElements
- OPTION - Static variable in class org.cyberneko.html.HTMLElements
- org.cyberneko.html - package org.cyberneko.html
P
- P - Static variable in class org.cyberneko.html.HTMLElements
- PARAM - Static variable in class org.cyberneko.html.HTMLElements
- parent - Variable in class org.cyberneko.html.HTMLElements.Element
-
Parent elements.
- parentCodes - Variable in class org.cyberneko.html.HTMLElements.Element
-
Parent elements.
- peek() - Method in class org.cyberneko.html.HTMLTagBalancer.InfoStack
-
Peeks at the top of the stack.
- PLAINTEXT - Static variable in class org.cyberneko.html.HTMLElements
- pop() - Method in class org.cyberneko.html.HTMLTagBalancer.InfoStack
-
Pops the top item off of the stack.
- PRE - Static variable in class org.cyberneko.html.HTMLElements
- process(TextDocument) - Method in interface de.l3s.boilerpipe.BoilerpipeFilter
-
Processes the given document doc.
- process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.ArticleExtractor
- process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.ArticleSentencesExtractor
- process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.CanolaExtractor
- process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.DefaultExtractor
- process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.KeepEverythingExtractor
- process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.KeepEverythingWithMinKWordsExtractor
- process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.LargestContentExtractor
- process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.NumWordsRulesExtractor
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.english.DensityRulesClassifier
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFromEndFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.english.KeepLargestFulltextBlockFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.english.MinFulltextWordsFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.english.TerminatingBlocksFinder
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.AddPrecedingLabelsFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.ArticleMetadataFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.BlockProximityFusion
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.ContentFusion
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.DocumentTitleMatchClassifier
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.ExpandTitleToContentFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.KeepLargestBlockFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.LabelFusion
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.SimpleBlockFusionProcessor
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.BoilerplateBlockFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.InvertedFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.LabelToBoilerplateFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.LabelToContentFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.MarkEverythingContentFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.MinClauseWordsFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.MinWordsFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.SplitParagraphBlocksFilter
- process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.SurroundingToContentFilter
- process(TextDocument, String) - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
-
Processes the given TextDocument and the original HTML text (as a String).
- process(TextDocument, InputSource) - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
-
Processes the given TextDocument and the original HTML text (as an InputSource).
- process(URL, BoilerpipeExtractor) - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
- processingInstruction(String, String) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- processingInstruction(String, XMLString, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Processing instruction.
- push(HTMLTagBalancer.Info) - Method in class org.cyberneko.html.HTMLTagBalancer.InfoStack
-
Pushes element information onto the stack.
Q
- Q - Static variable in class org.cyberneko.html.HTMLElements
- qname - Variable in class org.cyberneko.html.HTMLTagBalancer.Info
-
The element qualified name.
R
- RB - Static variable in class org.cyberneko.html.HTMLElements
- RBC - Static variable in class org.cyberneko.html.HTMLElements
- recycle() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
-
Recycles this instance.
- removeLabel(String) - Method in class de.l3s.boilerpipe.document.TextBlock
- REPORT_ERRORS - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
Report errors.
- reset(XMLComponentManager) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Resets the component.
- RP - Static variable in class org.cyberneko.html.HTMLElements
- RT - Static variable in class org.cyberneko.html.HTMLElements
- RTC - Static variable in class org.cyberneko.html.HTMLElements
- RUBY - Static variable in class org.cyberneko.html.HTMLElements
S
- S - Static variable in class org.cyberneko.html.HTMLElements
- SAMP - Static variable in class org.cyberneko.html.HTMLElements
- SCRIPT - Static variable in class org.cyberneko.html.HTMLElements
- SELECT - Static variable in class org.cyberneko.html.HTMLElements
- setContentHandler(BoilerpipeHTMLContentHandler) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
- setContentHandler(ContentHandler) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
- setDocumentHandler(XMLDocumentHandler) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Sets the document handler.
- setDocumentLocator(Locator) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- setDocumentSource(XMLDocumentSource) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Sets the document source.
- setExtraStyleSheet(String) - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
-
Sets the extra stylesheet definition that will be inserted in the HEAD element.
- setFeature(String, boolean) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Sets a feature.
- setIsContent(boolean) - Method in class de.l3s.boilerpipe.document.TextBlock
- setOutputHighlightOnly(boolean) - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
-
Sets whether only HTML enclosed within highlighted content will be returned, or the whole HTML document.
- setPostHighlight(String) - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
-
Sets the string that will be inserted after any highlighted HTML block.
- setPreHighlight(String) - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
-
Sets the string that will be inserted prior to any highlighted HTML block.
- setProperty(String, Object) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Sets a property.
- setTagAction(String, TagAction) - Method in class de.l3s.boilerpipe.sax.TagActionMap
-
Sets a particular TagAction for a given tag.
- setTagLevel(int) - Method in class de.l3s.boilerpipe.document.TextBlock
- setTitle(String) - Method in class de.l3s.boilerpipe.document.TextDocument
-
Updates the "main" title for this document.
- setTitle(String) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- SimpleBlockFusionProcessor - Class in de.l3s.boilerpipe.filters.heuristics
-
Merges two subsequent blocks if their text densities are equal.
- SimpleBlockFusionProcessor() - Constructor for class de.l3s.boilerpipe.filters.heuristics.SimpleBlockFusionProcessor
- SimpleEstimator - Class in de.l3s.boilerpipe.estimators
-
Estimates the "goodness" of a BoilerpipeExtractor on a given document.
- size - Variable in class org.cyberneko.html.HTMLElements.ElementList
-
The size of the list.
- skippedEntity(String) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- SMALL - Static variable in class org.cyberneko.html.HTMLElements
- SOUND - Static variable in class org.cyberneko.html.HTMLElements
- SPACER - Static variable in class org.cyberneko.html.HTMLElements
- SPAN - Static variable in class org.cyberneko.html.HTMLElements
- SPECIAL - Static variable in class org.cyberneko.html.HTMLElements.Element
-
Special element.
- SplitParagraphBlocksFilter - Class in de.l3s.boilerpipe.filters.simple
-
Splits TextBlocks at paragraph boundaries.
- SplitParagraphBlocksFilter() - Constructor for class de.l3s.boilerpipe.filters.simple.SplitParagraphBlocksFilter
- start(BoilerpipeHTMLContentHandler, String, String, Attributes) - Method in class de.l3s.boilerpipe.sax.CommonTagActions.BlockTagLabelAction
- start(BoilerpipeHTMLContentHandler, String, String, Attributes) - Method in class de.l3s.boilerpipe.sax.CommonTagActions.Chained
- start(BoilerpipeHTMLContentHandler, String, String, Attributes) - Method in class de.l3s.boilerpipe.sax.CommonTagActions.InlineTagLabelAction
- start(BoilerpipeHTMLContentHandler, String, String, Attributes) - Method in class de.l3s.boilerpipe.sax.MarkupTagAction
- start(BoilerpipeHTMLContentHandler, String, String, Attributes) - Method in interface de.l3s.boilerpipe.sax.TagAction
- startCDATA(Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Start CDATA section.
- startDocument() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- startDocument(XMLLocator, String, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Start document.
- startDocument(XMLLocator, String, NamespaceContext, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Start document.
- startElement(String, String, String, Attributes) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- startElement(QName, XMLAttributes, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Start element.
- startGeneralEntity(String, XMLResourceIdentifier, String, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Start entity.
- startPrefixMapping(String, String) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
- startPrefixMapping(String, String, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Start prefix mapping.
- STRICTLY_NOT_CONTENT - Static variable in class de.l3s.boilerpipe.labels.DefaultLabels
- STRIKE - Static variable in class org.cyberneko.html.HTMLElements
- STRONG - Static variable in class org.cyberneko.html.HTMLElements
- STYLE - Static variable in class org.cyberneko.html.HTMLElements
- SUB - Static variable in class org.cyberneko.html.HTMLElements
- SUP - Static variable in class org.cyberneko.html.HTMLElements
- SurroundingToContentFilter - Class in de.l3s.boilerpipe.filters.simple
- SurroundingToContentFilter(TextBlockCondition) - Constructor for class de.l3s.boilerpipe.filters.simple.SurroundingToContentFilter
- SYNTHESIZED_ITEM - Static variable in class org.cyberneko.html.HTMLTagBalancer
-
Synthesized event info item.
- synthesizedAugs() - Method in class org.cyberneko.html.HTMLTagBalancer
-
Returns an augmentations object with a synthesized item added.
T
- TA_ANCHOR_TEXT - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
-
Marks this tag as "anchor" (this should usually only be set for the <A> tag).
- TA_BLOCK_LEVEL - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
-
Explicitly marks this tag as a simple "block-level" element, which always generates whitespace.
- TA_BODY - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
-
Marks this tag as the body element (this should usually only be set for the <BODY> tag).
- TA_FONT - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
-
Special TagAction for the <FONT> tag, which keeps track of the absolute and relative font size.
- TA_IGNORABLE_ELEMENT - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
-
Marks this tag as "ignorable", i.e.
- TA_INLINE - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
-
Deprecated. Use CommonTagActions.TA_INLINE_WHITESPACE instead.
- TA_INLINE_NO_WHITESPACE - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
-
Marks this tag as a simple "inline" element, which generates neither whitespace nor a new block.
- TA_INLINE_WHITESPACE - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
-
Marks this tag as a simple "inline" element, which generates whitespace, but no new block.
- TABLE - Static variable in class org.cyberneko.html.HTMLElements
- TagAction - Interface in de.l3s.boilerpipe.sax
-
Defines an action that is to be performed whenever a particular tag occurs during HTML parsing.
- TagActionMap - Class in de.l3s.boilerpipe.sax
-
Base class for defining a set of TagActions that are to be used for the HTML parsing process.
- TagActionMap() - Constructor for class de.l3s.boilerpipe.sax.TagActionMap
- tagBalancingListener - Variable in class org.cyberneko.html.HTMLTagBalancer
- TBODY - Static variable in class org.cyberneko.html.HTMLElements
- TD - Static variable in class org.cyberneko.html.HTMLElements
- TerminatingBlocksFinder - Class in de.l3s.boilerpipe.filters.english
-
Finds blocks which are potentially indicating the end of an article text and marks them with DefaultLabels.INDICATES_END_OF_TEXT.
- TerminatingBlocksFinder() - Constructor for class de.l3s.boilerpipe.filters.english.TerminatingBlocksFinder
- TEXTAREA - Static variable in class org.cyberneko.html.HTMLElements
- TextBlock - Class in de.l3s.boilerpipe.document
-
Describes a block of text.
- TextBlock(String) - Constructor for class de.l3s.boilerpipe.document.TextBlock
- TextBlock(String, BitSet, int, int, int, int, int) - Constructor for class de.l3s.boilerpipe.document.TextBlock
- TextBlockCondition - Interface in de.l3s.boilerpipe.conditions
-
Evaluates whether a given TextBlock meets a certain condition.
- textDecl(String, String, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
Text declaration.
- TextDocument - Class in de.l3s.boilerpipe.document
-
A text document, consisting of one or more TextBlocks.
- TextDocument(String, List<TextBlock>) - Constructor for class de.l3s.boilerpipe.document.TextDocument
-
Creates a new TextDocument with given TextBlocks and given title.
- TextDocument(List<TextBlock>) - Constructor for class de.l3s.boilerpipe.document.TextDocument
-
Creates a new TextDocument with given TextBlocks, and no title.
- TextDocumentStatistics - Class in de.l3s.boilerpipe.document
-
Provides shallow statistics on a given TextDocument
- TextDocumentStatistics(TextDocument, boolean) - Constructor for class de.l3s.boilerpipe.document.TextDocumentStatistics
-
Computes statistics on a given TextDocument.
- TFOOT - Static variable in class org.cyberneko.html.HTMLElements
- TH - Static variable in class org.cyberneko.html.HTMLElements
- THEAD - Static variable in class org.cyberneko.html.HTMLElements
- TITLE - Static variable in class de.l3s.boilerpipe.labels.DefaultLabels
- TITLE - Static variable in class org.cyberneko.html.HTMLElements
- toInputSource() - Method in class de.l3s.boilerpipe.sax.HTMLDocument
- toInputSource() - Method in interface de.l3s.boilerpipe.sax.InputSourceable
- tokenize(CharSequence) - Static method in class de.l3s.boilerpipe.util.UnicodeTokenizer
-
Tokenizes the text and returns an array of tokens.
- top - Variable in class org.cyberneko.html.HTMLTagBalancer.InfoStack
-
The top of the stack.
- toString() - Method in class de.l3s.boilerpipe.document.TextBlock
- toString() - Method in class de.l3s.boilerpipe.labels.LabelAction
- toString() - Method in class org.cyberneko.html.HTMLElements.Element
-
Provides a simple representation to make debugging easier
- toString() - Method in class org.cyberneko.html.HTMLTagBalancer.Info
-
Simple representation to make debugging easier
- toString() - Method in class org.cyberneko.html.HTMLTagBalancer.InfoStack
-
Simple representation to make debugging easier
- toTextDocument() - Method in interface de.l3s.boilerpipe.BoilerpipeDocumentSource
- toTextDocument() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
-
Returns a TextDocument containing the extracted TextBlocks.
- toTextDocument() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
-
Returns a TextDocument containing the extracted TextBlocks.
- TR - Static variable in class org.cyberneko.html.HTMLElements
- TT - Static variable in class org.cyberneko.html.HTMLElements
U
- U - Static variable in class org.cyberneko.html.HTMLElements
- UL - Static variable in class org.cyberneko.html.HTMLElements
- UnicodeTokenizer - Class in de.l3s.boilerpipe.util
-
Tokenizes text according to Unicode word boundaries and strips off non-word characters.
- UnicodeTokenizer() - Constructor for class de.l3s.boilerpipe.util.UnicodeTokenizer
- UNKNOWN - Static variable in class org.cyberneko.html.HTMLElements
- url - Variable in class com.cloudburo.grab.webcontent.GrabberRecord
V
- VAR - Static variable in class org.cyberneko.html.HTMLElements
W
- WBR - Static variable in class org.cyberneko.html.HTMLElements
X
- XML - Static variable in class org.cyberneko.html.HTMLElements
- xmlDecl(String, String, String, Augmentations) - Method in class org.cyberneko.html.HTMLTagBalancer
-
XML declaration.
- XMP - Static variable in class org.cyberneko.html.HTMLElements