Class BoilerpipeHTMLParser

  • All Implemented Interfaces:
    BoilerpipeDocumentSource, org.apache.xerces.xni.XMLDocumentHandler, org.apache.xerces.xni.XMLDTDContentModelHandler, org.apache.xerces.xni.XMLDTDHandler, org.apache.xerces.xs.PSVIProvider, org.xml.sax.Parser, org.xml.sax.XMLReader

    public class BoilerpipeHTMLParser
    extends org.apache.xerces.parsers.AbstractSAXParser
    implements BoilerpipeDocumentSource
    A simple SAX Parser, used by BoilerpipeSAXInput. The parser uses CyberNeko to parse HTML content.
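The SAX pattern this class builds on can be sketched with the JDK alone: a content handler receives parse events and accumulates text, and the caller reads the result from the handler once parsing ends. This is a stand-in, not boilerpipe's implementation. It uses the JDK's built-in XML parser instead of CyberNeko, so it accepts only well-formed markup, and the names SaxTextSketch, TextCollector and extractText are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTextSketch {

    /** Hypothetical handler: collects character data from SAX events. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Separate the text of adjacent elements with a space.
            text.append(' ');
        }

        String collectedText() {
            // Collapse runs of whitespace and drop leading/trailing space.
            return text.toString().trim().replaceAll("\\s+", " ");
        }
    }

    /** Parses well-formed markup and returns its text content. */
    static String extractText(String xhtml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            TextCollector handler = new TextCollector();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
            return handler.collectedText();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // prints "Hello world"
        System.out.println(extractText("<html><body><p>Hello</p><p>world</p></body></html>"));
    }
}
```

With BoilerpipeHTMLParser the flow is presumably analogous: parse the input, then call toTextDocument() to obtain the result.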
    • Nested Class Summary

      • Nested classes/interfaces inherited from class org.apache.xerces.parsers.AbstractSAXParser

        org.apache.xerces.parsers.AbstractSAXParser.AttributesProxy, org.apache.xerces.parsers.AbstractSAXParser.LocatorProxy
    • Field Summary

      • Fields inherited from class org.apache.xerces.parsers.AbstractSAXParser

        ALLOW_UE_AND_NOTATION_EVENTS, DECLARATION_HANDLER, DOM_NODE, fContentHandler, fDeclaredAttrs, fDeclHandler, fDocumentHandler, fDTDHandler, fLexicalHandler, fLexicalHandlerParameterEntities, fNamespaceContext, fNamespacePrefixes, fNamespaces, fParseInProgress, fQName, fResolveDTDURIs, fStandalone, fUseEntityResolver2, fVersion, fXMLNSURIs, LEXICAL_HANDLER, NAMESPACES, STRING_INTERNING
      • Fields inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser

        fDocumentSource, fDTDContentModelSource, fDTDSource, fInDTD
      • Fields inherited from class org.apache.xerces.parsers.XMLParser

        ENTITY_RESOLVER, ERROR_HANDLER, fConfiguration
      • Fields inherited from interface org.apache.xerces.xni.XMLDTDContentModelHandler

        OCCURS_ONE_OR_MORE, OCCURS_ZERO_OR_MORE, OCCURS_ZERO_OR_ONE, SEPARATOR_CHOICE, SEPARATOR_SEQUENCE
      • Fields inherited from interface org.apache.xerces.xni.XMLDTDHandler

        CONDITIONAL_IGNORE, CONDITIONAL_INCLUDE
    • Method Summary

      Modifier and Type   Method and Description
      void                setContentHandler(BoilerpipeHTMLContentHandler contentHandler)
      void                setContentHandler(org.xml.sax.ContentHandler contentHandler)
      TextDocument        toTextDocument()
                          Returns a TextDocument containing the extracted TextBlocks.
      • Methods inherited from class org.apache.xerces.parsers.AbstractSAXParser

        attributeDecl, characters, comment, doctypeDecl, elementDecl, endCDATA, endDocument, endDTD, endElement, endExternalSubset, endGeneralEntity, endNamespaceMapping, endParameterEntity, externalEntityDecl, getAttributePSVI, getAttributePSVIByName, getContentHandler, getDeclHandler, getDTDHandler, getElementPSVI, getEntityResolver, getErrorHandler, getFeature, getLexicalHandler, getProperty, ignorableWhitespace, internalEntityDecl, notationDecl, parse, parse, processingInstruction, reset, setDeclHandler, setDocumentHandler, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setLexicalHandler, setLocale, setProperty, startCDATA, startDocument, startElement, startExternalSubset, startGeneralEntity, startNamespaceMapping, startParameterEntity, unparsedEntityDecl, xmlDecl
      • Methods inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser

        any, element, empty, emptyElement, endAttlist, endConditional, endContentModel, endGroup, getDocumentSource, getDTDContentModelSource, getDTDSource, ignoredCharacters, occurrence, pcdata, separator, setDocumentSource, setDTDContentModelSource, setDTDSource, startAttlist, startConditional, startContentModel, startDTD, startGroup, textDecl
      • Methods inherited from class org.apache.xerces.parsers.XMLParser

        parse
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • setContentHandler

        public void setContentHandler(org.xml.sax.ContentHandler contentHandler)
        Specified by:
        setContentHandler in interface org.xml.sax.XMLReader
        Overrides:
        setContentHandler in class org.apache.xerces.parsers.AbstractSAXParser
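Because setContentHandler is overloaded, Java picks the overload at compile time from the declared type of the argument, not from its runtime class. The sketch below illustrates that language rule with hypothetical stand-in types (OverloadSketch, Handler, ExtractingHandler, Parser). It does not reproduce boilerpipe's code or say what either overload does internally.

```java
public class OverloadSketch {

    /** Stand-in for a general handler interface such as org.xml.sax.ContentHandler. */
    interface Handler {}

    /** Stand-in for a specialized handler type. */
    static class ExtractingHandler implements Handler {}

    /** Hypothetical parser exposing the same pair of overloads. */
    static class Parser {
        private String lastOverload = "none";

        void setContentHandler(ExtractingHandler handler) {
            lastOverload = "specific";
        }

        void setContentHandler(Handler handler) {
            lastOverload = "general";
        }

        String lastOverload() {
            return lastOverload;
        }
    }

    /** Reports which overload is chosen for the same object under two declared types. */
    static String whichOverload(boolean declareAsInterface) {
        Parser parser = new Parser();
        ExtractingHandler handler = new ExtractingHandler();
        if (declareAsInterface) {
            Handler general = handler;          // same object, declared as the interface
            parser.setContentHandler(general);  // resolves to the Handler overload
        } else {
            parser.setContentHandler(handler);  // resolves to the ExtractingHandler overload
        }
        return parser.lastOverload();
    }

    public static void main(String[] args) {
        System.out.println(whichOverload(false)); // prints "specific"
        System.out.println(whichOverload(true));  // prints "general"
    }
}
```

In practice this means a handler held in a variable of type org.xml.sax.ContentHandler is routed to the general overload, even if the object is a BoilerpipeHTMLContentHandler.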