Class TesseractOCRParser

  • All Implemented Interfaces:
    Serializable, org.apache.tika.config.Initializable, org.apache.tika.parser.Parser

    public class TesseractOCRParser
    extends org.apache.tika.parser.AbstractParser
    implements org.apache.tika.config.Initializable
    TesseractOCRParser is powered by the tesseract-ocr engine. To enable this parser, create a TesseractOCRConfig object and pass it through a ParseContext. Tesseract-ocr must be installed and on the system path, or the path to its root folder must be provided:

    TesseractOCRConfig config = new TesseractOCRConfig();
    // Needed only if tesseract is not on the system path
    config.setTesseractPath(tesseractFolder);
    parseContext.set(TesseractOCRConfig.class, config);
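    The installation requirement above can be checked up front. The following is a minimal sketch using only the JDK; it is not Tika's own lookup logic. It shows how a caller might verify that a tesseract executable is reachable, either in an explicitly supplied folder or on the system path. The class TesseractLocatorSketch, its methods, and the executable file names are illustrative assumptions.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Tika.
final class TesseractLocatorSketch {

    /**
     * Lists the places where the executable is expected.
     * If an explicit folder is given, only that folder is considered;
     * otherwise every entry of the PATH-style string is a candidate.
     */
    static List<Path> candidates(String tesseractFolder, String pathEnv, String executableName) {
        List<Path> result = new ArrayList<>();
        if (tesseractFolder != null && !tesseractFolder.isEmpty()) {
            result.add(Paths.get(tesseractFolder).resolve(executableName));
            return result;
        }
        if (pathEnv != null) {
            for (String dir : pathEnv.split(Pattern.quote(File.pathSeparator))) {
                if (!dir.isEmpty()) {
                    result.add(Paths.get(dir).resolve(executableName));
                }
            }
        }
        return result;
    }

    /** Returns the first candidate that exists and is executable, if any. */
    static Optional<Path> locate(String tesseractFolder) {
        boolean windows = System.getProperty("os.name", "")
                .toLowerCase(Locale.ROOT).startsWith("windows");
        // Assumed executable names; adjust for the local installation.
        String executableName = windows ? "tesseract.exe" : "tesseract";
        return candidates(tesseractFolder, System.getenv("PATH"), executableName).stream()
                .filter(Files::isExecutable)
                .findFirst();
    }

    public static void main(String[] args) {
        String folder = args.length > 0 ? args[0] : null;
        System.out.println(locate(folder).map(Path::toString).orElse("tesseract not found"));
    }
}
```

    If locate returns an empty Optional, OCR is unlikely to work until Tesseract is installed or the folder passed to setTesseractPath is corrected.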

    See Also:
    Serialized Form
    • Constructor Detail

      • TesseractOCRParser

        public TesseractOCRParser()
    • Method Detail

      • getSupportedTypes

        public Set<org.apache.tika.mime.MediaType> getSupportedTypes(org.apache.tika.parser.ParseContext context)
        Specified by:
        getSupportedTypes in interface org.apache.tika.parser.Parser
      • parse

        public void parse(InputStream stream,
                          ContentHandler handler,
                          org.apache.tika.metadata.Metadata metadata,
                          org.apache.tika.parser.ParseContext parseContext)
                   throws IOException,
                          SAXException,
                          org.apache.tika.exception.TikaException
        Specified by:
        parse in interface org.apache.tika.parser.Parser
        Throws:
        IOException
        SAXException
        org.apache.tika.exception.TikaException
      • parseInline

        public void parseInline(InputStream stream,
                                org.apache.tika.sax.XHTMLContentHandler xhtml,
                                org.apache.tika.parser.ParseContext parseContext,
                                TesseractOCRConfig config)
                         throws IOException,
                                SAXException,
                                org.apache.tika.exception.TikaException
        Use this to parse content without starting a new document. This appends SAX events to xhtml without re-adding the metadata, body start, etc.
        Parameters:
        stream - the input stream to parse
        xhtml - the handler that receives the SAX events
        parseContext - the parse context for this parse
        config - TesseractOCRConfig to use for this parse
        Throws:
        IOException
        SAXException
        org.apache.tika.exception.TikaException
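        To illustrate the difference between a full parse and an inline parse, here is a sketch that uses only the JDK's SAX classes and does not call Tika. The method appendInline is a hypothetical stand-in for what a method such as parseInline does to its handler: it emits element and character events but never startDocument or endDocument, so several fragments can be appended to one document that the caller has already opened. The div element and the class names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not part of Tika.
final class InlineSaxSketch {

    /** Records the SAX events it receives so they can be inspected. */
    static final class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override public void startDocument() { events.add("startDocument"); }
        @Override public void endDocument() { events.add("endDocument"); }
        @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("start:" + localName);
        }
        @Override public void endElement(String uri, String localName, String qName) {
            events.add("end:" + localName);
        }
        @Override public void characters(char[] ch, int start, int length) {
            events.add("text:" + new String(ch, start, length));
        }
    }

    /** Emits one fragment into an already open document; no document start or end. */
    static void appendInline(ContentHandler handler, String text) throws SAXException {
        String xhtmlNs = "http://www.w3.org/1999/xhtml";
        handler.startElement(xhtmlNs, "div", "div", new AttributesImpl());
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement(xhtmlNs, "div", "div");
    }

    /** The caller owns the document: it opens it once, appends fragments, closes it once. */
    static List<String> demo() {
        RecordingHandler handler = new RecordingHandler();
        try {
            handler.startDocument();
            appendInline(handler, "page 1");
            appendInline(handler, "page 2");
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

        In this sketch the document is started and ended exactly once, by the caller, no matter how many fragments are appended in between.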
      • initialize

        public void initialize(Map<String,org.apache.tika.config.Param> params)
                        throws org.apache.tika.exception.TikaConfigException
        No-op: this implementation ignores the supplied params.
        Specified by:
        initialize in interface org.apache.tika.config.Initializable
        Parameters:
        params - params to use for initialization
        Throws:
        org.apache.tika.exception.TikaConfigException
      • checkInitialization

        public void checkInitialization(org.apache.tika.config.InitializableProblemHandler problemHandler)
                                 throws org.apache.tika.exception.TikaConfigException
        Specified by:
        checkInitialization in interface org.apache.tika.config.Initializable
        Throws:
        org.apache.tika.exception.TikaConfigException
      • hasWarned

        protected boolean hasWarned()
      • warn

        protected void warn()
      • setTesseractPath

        @Field
        public void setTesseractPath(String tesseractPath)
      • setTessdataPath

        @Field
        public void setTessdataPath(String tessdataPath)
      • setLanguage

        @Field
        public void setLanguage(String language)
      • setPageSegMode

        @Field
        public void setPageSegMode(String pageSegMode)
      • setMaxFileSizeToOcr

        @Field
        public void setMaxFileSizeToOcr(long maxFileSizeToOcr)
      • setMinFileSizeToOcr

        @Field
        public void setMinFileSizeToOcr(long minFileSizeToOcr)
      • setTimeout

        @Field
        public void setTimeout(int timeout)
      • setOutputType

        @Field
        public void setOutputType(String outputType)
      • setPreserveInterwordSpacing

        @Field
        public void setPreserveInterwordSpacing(boolean preserveInterwordSpacing)
      • setEnableImageProcessing

        @Field
        public void setEnableImageProcessing(int enableImageProcessing)
      • setImageMagickPath

        @Field
        public void setImageMagickPath(String imageMagickPath)
      • setDensity

        @Field
        public void setDensity(int density)
      • setDepth

        @Field
        public void setDepth(int depth)
      • setColorspace

        @Field
        public void setColorspace(String colorspace)
      • setFilter

        @Field
        public void setFilter(String filter)
      • setResize

        @Field
        public void setResize(int resize)
      • setApplyRotation

        @Field
        public void setApplyRotation(boolean applyRotation)
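        The setMinFileSizeToOcr and setMaxFileSizeToOcr setters listed above suggest a size window outside of which OCR is skipped. This page does not state the units or whether the bounds are inclusive, so the following is only a sketch of such a gate, assuming sizes in bytes and inclusive bounds. The class OcrSizeGateSketch and its method names are hypothetical and are not part of Tika.

```java
// Hypothetical illustration of a min/max size gate; not part of Tika.
final class OcrSizeGateSketch {

    private final long minFileSizeToOcr;
    private final long maxFileSizeToOcr;

    OcrSizeGateSketch(long minFileSizeToOcr, long maxFileSizeToOcr) {
        if (minFileSizeToOcr < 0 || maxFileSizeToOcr < minFileSizeToOcr) {
            throw new IllegalArgumentException("expected 0 <= min <= max");
        }
        this.minFileSizeToOcr = minFileSizeToOcr;
        this.maxFileSizeToOcr = maxFileSizeToOcr;
    }

    /** Assumption: sizes are in bytes and both bounds are inclusive. */
    boolean shouldOcr(long sizeInBytes) {
        return sizeInBytes >= minFileSizeToOcr && sizeInBytes <= maxFileSizeToOcr;
    }

    public static void main(String[] args) {
        OcrSizeGateSketch gate = new OcrSizeGateSketch(1_024L, 10L * 1_024 * 1_024);
        System.out.println(gate.shouldOcr(500));        // below the window
        System.out.println(gate.shouldOcr(200_000));    // inside the window
        System.out.println(gate.shouldOcr(50_000_000)); // above the window
    }
}
```

        A gate like this lets a caller skip tiny images such as icons and very large scans before paying the cost of an external OCR process.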