Class OcrType

  • All Implemented Interfaces:
    ParameterInterface

    public class OcrType
    extends Object
    implements ParameterInterface
     The "OCR" web service runs character recognition on PDF documents or images.
     Images are converted to PDF documents: one page is generated for each image, containing the original image and a text layer with the recognized text.
     Character recognition on PDF documents only works with documents that do not already contain text. Typically, these are scanned documents that have a single image per page.
     

    Java class for OcrType complex type.

    The following schema fragment specifies the expected content contained within this class.

    
     <complexType name="OcrType">
       <complexContent>
         <restriction base="{http://www.w3.org/2001/XMLSchema}anyType">
           <all>
             <element name="page" type="{http://schema.webpdf.de/1.0/operation}OcrPageType" minOccurs="0"/>
             <element name="pdfa" type="{http://schema.webpdf.de/1.0/operation}PdfaType" minOccurs="0"/>
             <element name="optimization" type="{http://schema.webpdf.de/1.0/operation}ImageOptimizationType" minOccurs="0"/>
           </all>
           <attribute name="language" type="{http://schema.webpdf.de/1.0/operation}OcrLanguageType" default="eng" />
           <attribute name="outputFormat" default="pdf">
             <simpleType>
               <restriction base="{http://schema.webpdf.de/1.0/operation}OcrOutputType">
               </restriction>
             </simpleType>
           </attribute>
           <attribute name="checkResolution" type="{http://www.w3.org/2001/XMLSchema}boolean" default="true" />
           <attribute name="imageDpi" default="200">
             <simpleType>
               <restriction base="{http://schema.webpdf.de/1.0/operation}DpiType">
               </restriction>
             </simpleType>
           </attribute>
           <attribute name="forceEachPage" type="{http://www.w3.org/2001/XMLSchema}boolean" default="false" />
           <attribute name="normalizePageRotation" type="{http://www.w3.org/2001/XMLSchema}boolean" default="false" />
           <attribute name="failOnWarning" type="{http://www.w3.org/2001/XMLSchema}boolean" default="false" />
           <attribute name="jpegQuality" default="75">
             <simpleType>
               <restriction base="{http://www.w3.org/2001/XMLSchema}int">
                 <minInclusive value="0"/>
                 <maxInclusive value="100"/>
               </restriction>
             </simpleType>
           </attribute>
           <attribute name="ocrMode" type="{http://schema.webpdf.de/1.0/operation}OcrModeType" default="pageSegments" />
         </restriction>
       </complexContent>
     </complexType>
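
    As a rough illustration of how the attributes in the fragment above map onto an XML element, the following sketch builds one with the JDK's DOM API. The element name "ocr", the class name OcrFragmentSketch, and the chosen attribute values are assumptions made for this example; only the namespace and the attribute names come from the schema fragment.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class OcrFragmentSketch {
    // Namespace taken from the schema fragment above.
    static final String NS = "http://schema.webpdf.de/1.0/operation";

    // Builds an element carrying some of the documented attributes.
    // The element name "ocr" is an assumption for this sketch.
    static Element buildOcrElement(Document doc) {
        Element ocr = doc.createElementNS(NS, "ocr");
        ocr.setAttribute("language", "deu");        // schema default: eng
        ocr.setAttribute("outputFormat", "pdf");    // schema default: pdf
        ocr.setAttribute("jpegQuality", "90");      // allowed range: 0..100
        ocr.setAttribute("ocrMode", "column");      // schema default: pageSegments
        return ocr;
    }

    // Serializes the element to a string without an XML declaration.
    static String render() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            doc.appendChild(buildOcrElement(doc));
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build the XML fragment", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render());
    }
}
```

    Attributes that are left out fall back to the defaults declared in the schema, so a real request would normally set only the values that differ from them.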
     
    • Field Detail

      • language

        protected OcrLanguageType language
         Specifies the language for the output document (PDF/image). The language must be defined for the character recognition operation (OCR) so that language-specific special characters (e.g. "üäö" in German) are recognized more reliably. The following languages are currently supported:
           • eng = English
           • fra = French
           • spa = Spanish
           • deu = German
           • ita = Italian
         
      • outputFormat

        protected OcrOutputType outputFormat
         Character recognition can produce different output formats. By default, the result is a PDF document, but it can also be a plain-text document or an XML document (hOCR) if desired.
           • text = Text
           • hocr = XML (hOCR)
           • pdf = PDF
         
      • checkResolution

        protected Boolean checkResolution
         If "true", the DPI resolution of the output file is checked. Resolutions below 200 DPI are rejected by this check because, as a rule, they do not produce good character recognition results.
         
      • imageDpi

        protected Integer imageDpi
         Sets the minimum resolution at which images are embedded in the resulting PDF documents. If this parameter is set to 0, images are embedded with resolutions and dimensions as close as possible to those of the original source images.
         
      • forceEachPage

        protected Boolean forceEachPage
         If a PDF document contains text content on any page, the web service refuses to run character recognition again. If "true" is passed for this option, however, each page of the document is considered individually, and character recognition is run on every page that does not contain text (layers), so that a new text layer is generated for it.
         
      • normalizePageRotation

        protected Boolean normalizePageRotation
         If "true", the system attempts to rotate pages that contain rotated text so that the text appears upright in the document.
         
      • failOnWarning

        protected Boolean failOnWarning
         If "true", character recognition also fails on warnings that do not prevent recognition but make a meaningful result very unlikely.
         
      • jpegQuality

        protected Integer jpegQuality
         A percentage that sets the compression ratio, and thereby the quality, of JPEG images embedded in the resulting PDF documents. Higher values produce less compressed images of higher quality.
         
      • ocrMode

        protected OcrModeType ocrMode
         Specifies the mode used to find structured text on the pages. Each mode places different requirements on the text and makes different assumptions about it.
           • pageSegments = The text on the page is clearly structured and can be broken down into distinct paragraphs and layout segments. Text elements and lines do not overlap. Headings, and therefore text with differing sizes and fonts, may be present.
           • column = The text is arranged on the pages in several more or less uniform columns next to each other. Font and text size are mostly uniform.
           • unfiltered = No assumptions are made about the text. Any letters found are recognized as such, regardless of whether they can be assigned to a text column, a line, or even a word. Font size and typeface can vary freely, and texts are not necessarily arranged in clearly recognizable columns or according to a fixed layout. Texts and lines can overlap. This mode usually recognizes more text (especially with more complex layouts), but it also usually produces the most false detections, since no result is discarded for deviating from the norm.
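
    The value sets and ranges documented for the fields above can be checked on the caller's side before a request is built. The following is a minimal sketch under that assumption; the class OcrSettingsCheck and its check method are hypothetical helpers and not part of the documented API. The allowed values are copied from the field descriptions and the schema fragment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class OcrSettingsCheck {
    // Allowed values as listed in the field descriptions.
    static final Set<String> LANGUAGES = Set.of("eng", "fra", "spa", "deu", "ita");
    static final Set<String> OUTPUT_FORMATS = Set.of("text", "hocr", "pdf");
    static final Set<String> OCR_MODES = Set.of("pageSegments", "column", "unfiltered");

    // Null-safe membership test (Set.of(...) rejects null lookups).
    static boolean isOneOf(Set<String> allowed, String value) {
        return value != null && allowed.contains(value);
    }

    // Returns a list of human-readable problems; an empty list means all values are acceptable.
    static List<String> check(String language, String outputFormat, String ocrMode, int jpegQuality) {
        List<String> problems = new ArrayList<>();
        if (!isOneOf(LANGUAGES, language)) {
            problems.add("unsupported language: " + language);
        }
        if (!isOneOf(OUTPUT_FORMATS, outputFormat)) {
            problems.add("unsupported outputFormat: " + outputFormat);
        }
        if (!isOneOf(OCR_MODES, ocrMode)) {
            problems.add("unsupported ocrMode: " + ocrMode);
        }
        // The schema restricts jpegQuality to 0..100 inclusive.
        if (jpegQuality < 0 || jpegQuality > 100) {
            problems.add("jpegQuality out of range 0..100: " + jpegQuality);
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(check("eng", "pdf", "pageSegments", 75));
        System.out.println(check("xyz", "pdf", "column", 101));
    }
}
```

    Collecting all problems in a list, rather than failing on the first one, lets a caller report every invalid setting at once.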
         
    • Constructor Detail

      • OcrType

        public OcrType()
    • Method Detail

      • getPage

        public OcrPageType getPage()
        Gets the value of the page property.
        Returns:
        possible object is OcrPageType
      • setPage

        public void setPage​(OcrPageType value)
        Sets the value of the page property.
        Parameters:
        value - allowed object is OcrPageType
      • isSetPage

        public boolean isSetPage()
      • getPdfa

        public PdfaType getPdfa()
        Gets the value of the pdfa property.
        Returns:
        possible object is PdfaType
      • setPdfa

        public void setPdfa​(PdfaType value)
        Sets the value of the pdfa property.
        Parameters:
        value - allowed object is PdfaType
      • isSetPdfa

        public boolean isSetPdfa()
      • isSetOptimization

        public boolean isSetOptimization()
      • getLanguage

        public OcrLanguageType getLanguage()
         Specifies the language for the output document (PDF/image). The language must be defined for the character recognition operation (OCR) so that language-specific special characters (e.g. "üäö" in German) are recognized more reliably. The following languages are currently supported:
           • eng = English
           • fra = French
           • spa = Spanish
           • deu = German
           • ita = Italian
         
        Returns:
        possible object is OcrLanguageType
      • isSetLanguage

        public boolean isSetLanguage()
      • getOutputFormat

        public OcrOutputType getOutputFormat()
         Character recognition can produce different output formats. By default, the result is a PDF document, but it can also be a plain-text document or an XML document (hOCR) if desired.
           • text = Text
           • hocr = XML (hOCR)
           • pdf = PDF
         
        Returns:
        possible object is OcrOutputType
      • isSetOutputFormat

        public boolean isSetOutputFormat()
      • isCheckResolution

        public boolean isCheckResolution()
         If "true", the DPI resolution of the output file is checked. Resolutions below 200 DPI are rejected by this check because, as a rule, they do not produce good character recognition results.
         
        Returns:
        possible object is Boolean
      • setCheckResolution

        public void setCheckResolution​(boolean value)
        Sets the value of the checkResolution property.
        Parameters:
        value - allowed object is Boolean
        See Also:
        isCheckResolution()
      • isSetCheckResolution

        public boolean isSetCheckResolution()
      • unsetCheckResolution

        public void unsetCheckResolution()
      • getImageDpi

        public int getImageDpi()
         Sets the minimum resolution at which images are embedded in the resulting PDF documents. If this parameter is set to 0, images are embedded with resolutions and dimensions as close as possible to those of the original source images.
         
        Returns:
        possible object is Integer
      • setImageDpi

        public void setImageDpi​(int value)
        Sets the value of the imageDpi property.
        Parameters:
        value - allowed object is Integer
        See Also:
        getImageDpi()
      • isSetImageDpi

        public boolean isSetImageDpi()
      • unsetImageDpi

        public void unsetImageDpi()
      • isForceEachPage

        public boolean isForceEachPage()
         If a PDF document contains text content on any page, the web service refuses to run character recognition again. If "true" is passed for this option, however, each page of the document is considered individually, and character recognition is run on every page that does not contain text (layers), so that a new text layer is generated for it.
         
        Returns:
        possible object is Boolean
      • setForceEachPage

        public void setForceEachPage​(boolean value)
        Sets the value of the forceEachPage property.
        Parameters:
        value - allowed object is Boolean
        See Also:
        isForceEachPage()
      • isSetForceEachPage

        public boolean isSetForceEachPage()
      • unsetForceEachPage

        public void unsetForceEachPage()
      • isNormalizePageRotation

        public boolean isNormalizePageRotation()
         If "true", the system attempts to rotate pages that contain rotated text so that the text appears upright in the document.
         
        Returns:
        possible object is Boolean
      • setNormalizePageRotation

        public void setNormalizePageRotation​(boolean value)
        Sets the value of the normalizePageRotation property.
        Parameters:
        value - allowed object is Boolean
        See Also:
        isNormalizePageRotation()
      • isSetNormalizePageRotation

        public boolean isSetNormalizePageRotation()
      • unsetNormalizePageRotation

        public void unsetNormalizePageRotation()
      • isFailOnWarning

        public boolean isFailOnWarning()
         If "true", character recognition also fails on warnings that do not prevent recognition but make a meaningful result very unlikely.
         
        Returns:
        possible object is Boolean
      • setFailOnWarning

        public void setFailOnWarning​(boolean value)
        Sets the value of the failOnWarning property.
        Parameters:
        value - allowed object is Boolean
        See Also:
        isFailOnWarning()
      • isSetFailOnWarning

        public boolean isSetFailOnWarning()
      • unsetFailOnWarning

        public void unsetFailOnWarning()
      • getJpegQuality

        public int getJpegQuality()
         A percentage that sets the compression ratio, and thereby the quality, of JPEG images embedded in the resulting PDF documents. Higher values produce less compressed images of higher quality.
         
        Returns:
        possible object is Integer
      • setJpegQuality

        public void setJpegQuality​(int value)
        Sets the value of the jpegQuality property.
        Parameters:
        value - allowed object is Integer
        See Also:
        getJpegQuality()
      • isSetJpegQuality

        public boolean isSetJpegQuality()
      • unsetJpegQuality

        public void unsetJpegQuality()
      • getOcrMode

        public OcrModeType getOcrMode()
         Specifies the mode used to find structured text on the pages. Each mode places different requirements on the text and makes different assumptions about it.
           • pageSegments = The text on the page is clearly structured and can be broken down into distinct paragraphs and layout segments. Text elements and lines do not overlap. Headings, and therefore text with differing sizes and fonts, may be present.
           • column = The text is arranged on the pages in several more or less uniform columns next to each other. Font and text size are mostly uniform.
           • unfiltered = No assumptions are made about the text. Any letters found are recognized as such, regardless of whether they can be assigned to a text column, a line, or even a word. Font size and typeface can vary freely, and texts are not necessarily arranged in clearly recognizable columns or according to a fixed layout. Texts and lines can overlap. This mode usually recognizes more text (especially with more complex layouts), but it also usually produces the most false detections, since no result is discarded for deviating from the norm.
         
        Returns:
        possible object is OcrModeType
      • setOcrMode

        public void setOcrMode​(OcrModeType value)
        Sets the value of the ocrMode property.
        Parameters:
        value - allowed object is OcrModeType
        See Also:
        getOcrMode()
      • isSetOcrMode

        public boolean isSetOcrMode()
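
    The isSet… and unset… methods listed above pair a boxed field (for example, protected Boolean checkResolution) with a primitive getter and a schema default. The sketch below shows how such a pattern is commonly implemented in JAXB-generated classes; it is an assumption about typical generated code, and OptionalFlagSketch is a hypothetical stand-in, not the actual OcrType implementation.

```java
public class OptionalFlagSketch {
    // Boxed type so that null can represent "not set".
    protected Boolean checkResolution;

    // Falls back to the schema default ("true") while the value is not set.
    public boolean isCheckResolution() {
        return checkResolution == null ? true : checkResolution;
    }

    // Stores an explicit value; afterwards isSetCheckResolution() returns true.
    public void setCheckResolution(boolean value) {
        this.checkResolution = value;
    }

    // Reports whether an explicit value has been stored.
    public boolean isSetCheckResolution() {
        return checkResolution != null;
    }

    // Clears the explicit value so the default applies again.
    public void unsetCheckResolution() {
        this.checkResolution = null;
    }

    public static void main(String[] args) {
        OptionalFlagSketch sketch = new OptionalFlagSketch();
        System.out.println(sketch.isCheckResolution() + " / set: " + sketch.isSetCheckResolution());
        sketch.setCheckResolution(false);
        System.out.println(sketch.isCheckResolution() + " / set: " + sketch.isSetCheckResolution());
        sketch.unsetCheckResolution();
        System.out.println(sketch.isCheckResolution() + " / set: " + sketch.isSetCheckResolution());
    }
}
```

    Under this assumption, the getter alone cannot distinguish an explicit "true" from the default, which is why the separate isSet… check exists.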