Package de.l3s.boilerpipe.sax
Class BoilerpipeHTMLParser
java.lang.Object
  org.apache.xerces.parsers.XMLParser
    org.apache.xerces.parsers.AbstractXMLDocumentParser
      org.apache.xerces.parsers.AbstractSAXParser
        de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
- All Implemented Interfaces:
BoilerpipeDocumentSource, org.apache.xerces.xni.XMLDocumentHandler, org.apache.xerces.xni.XMLDTDContentModelHandler, org.apache.xerces.xni.XMLDTDHandler, org.apache.xerces.xs.PSVIProvider, org.xml.sax.Parser, org.xml.sax.XMLReader
public class BoilerpipeHTMLParser extends org.apache.xerces.parsers.AbstractSAXParser implements BoilerpipeDocumentSource
A simple SAX Parser, used by BoilerpipeSAXInput. The parser uses CyberNeko to parse HTML content.
-
-
Field Summary
-
Fields inherited from class org.apache.xerces.parsers.AbstractSAXParser
ALLOW_UE_AND_NOTATION_EVENTS, DECLARATION_HANDLER, DOM_NODE, fContentHandler, fDeclaredAttrs, fDeclHandler, fDocumentHandler, fDTDHandler, fLexicalHandler, fLexicalHandlerParameterEntities, fNamespaceContext, fNamespacePrefixes, fNamespaces, fParseInProgress, fQName, fResolveDTDURIs, fStandalone, fUseEntityResolver2, fVersion, fXMLNSURIs, LEXICAL_HANDLER, NAMESPACES, STRING_INTERNING
-
Fields inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser
fDocumentSource, fDTDContentModelSource, fDTDSource, fInDTD
-
Fields inherited from class org.apache.xerces.parsers.XMLParser
ENTITY_RESOLVER, ERROR_HANDLER, fConfiguration
-
-
Constructor Summary
Constructors
Modifier | Constructor | Description
public | BoilerpipeHTMLParser() | Constructs a BoilerpipeHTMLParser using a default HTML content handler.
protected | BoilerpipeHTMLParser(boolean ignore) |
public | BoilerpipeHTMLParser(BoilerpipeHTMLContentHandler contentHandler) | Constructs a BoilerpipeHTMLParser using the given BoilerpipeHTMLContentHandler.
-
Method Summary
Modifier and Type | Method | Description
void | setContentHandler(BoilerpipeHTMLContentHandler contentHandler) |
void | setContentHandler(org.xml.sax.ContentHandler contentHandler) |
TextDocument | toTextDocument() | Returns a TextDocument containing the extracted TextBlocks.
-
Methods inherited from class org.apache.xerces.parsers.AbstractSAXParser
attributeDecl, characters, comment, doctypeDecl, elementDecl, endCDATA, endDocument, endDTD, endElement, endExternalSubset, endGeneralEntity, endNamespaceMapping, endParameterEntity, externalEntityDecl, getAttributePSVI, getAttributePSVIByName, getContentHandler, getDeclHandler, getDTDHandler, getElementPSVI, getEntityResolver, getErrorHandler, getFeature, getLexicalHandler, getProperty, ignorableWhitespace, internalEntityDecl, notationDecl, parse, parse, processingInstruction, reset, setDeclHandler, setDocumentHandler, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setLexicalHandler, setLocale, setProperty, startCDATA, startDocument, startElement, startExternalSubset, startGeneralEntity, startNamespaceMapping, startParameterEntity, unparsedEntityDecl, xmlDecl
-
Methods inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser
any, element, empty, emptyElement, endAttlist, endConditional, endContentModel, endGroup, getDocumentSource, getDTDContentModelSource, getDTDSource, ignoredCharacters, occurrence, pcdata, separator, setDocumentSource, setDTDContentModelSource, setDTDSource, startAttlist, startConditional, startContentModel, startDTD, startGroup, textDecl
-
-
-
-
Constructor Detail
-
BoilerpipeHTMLParser
public BoilerpipeHTMLParser()
Constructs a BoilerpipeHTMLParser using a default HTML content handler.
-
BoilerpipeHTMLParser
public BoilerpipeHTMLParser(BoilerpipeHTMLContentHandler contentHandler)
Constructs a BoilerpipeHTMLParser using the given BoilerpipeHTMLContentHandler.
- Parameters:
contentHandler - the BoilerpipeHTMLContentHandler to use
-
BoilerpipeHTMLParser
protected BoilerpipeHTMLParser(boolean ignore)
-
-
Method Detail
-
setContentHandler
public void setContentHandler(BoilerpipeHTMLContentHandler contentHandler)
-
setContentHandler
public void setContentHandler(org.xml.sax.ContentHandler contentHandler)
- Specified by:
setContentHandler in interface org.xml.sax.XMLReader
- Overrides:
setContentHandler in class org.apache.xerces.parsers.AbstractSAXParser
-
toTextDocument
public TextDocument toTextDocument()
Returns a TextDocument containing the extracted TextBlocks. NOTE: Only call this after AbstractSAXParser.parse(org.xml.sax.InputSource).
- Specified by:
toTextDocument in interface BoilerpipeDocumentSource
- Returns:
The TextDocument
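The call-order note above (parse first, then read the result) can be illustrated with a minimal sketch. It is a stand-in that uses only the JDK's SAX parser so it runs without boilerpipe or CyberNeko on the classpath, which means it accepts well-formed XHTML only. The names ParseThenCollect, BlockCollector, toBlocks and extract are hypothetical, and the IllegalStateException guard is an assumption made for illustration; this page does not say what toTextDocument() does when called before parsing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ParseThenCollect {

    /** Hypothetical stand-in for a content handler that gathers text blocks. */
    static class BlockCollector extends DefaultHandler {
        private final List<String> blocks = new ArrayList<>();
        private final StringBuilder current = new StringBuilder();
        private boolean finished = false;

        @Override
        public void startDocument() {
            // Reset state so the handler can be reused for another parse.
            blocks.clear();
            current.setLength(0);
            finished = false;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Each <p> element starts a new block in this simplified model.
            if ("p".equals(qName)) {
                current.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            current.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(qName)) {
                String text = current.toString().trim();
                if (!text.isEmpty()) {
                    blocks.add(text);
                }
            }
        }

        @Override
        public void endDocument() {
            finished = true;
        }

        /** Result accessor; only meaningful once parsing has completed. */
        List<String> toBlocks() {
            // Assumption for illustration: fail loudly if called too early.
            if (!finished) {
                throw new IllegalStateException("parse(...) has not completed");
            }
            return new ArrayList<>(blocks);
        }
    }

    /** Parses the input first, then reads the accumulated result. */
    static List<String> extract(String xhtml) throws Exception {
        BlockCollector collector = new BlockCollector();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(collector);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return collector.toBlocks();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>First block.</p><p>Second block.</p></body></html>";
        System.out.println(extract(page));
    }
}
```

With boilerpipe itself, the corresponding sequence would be a call to parse(org.xml.sax.InputSource) on a BoilerpipeHTMLParser followed by toTextDocument(), as the note on this method requires.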
-
-