Class Icu4jEncodingDetector

  • All Implemented Interfaces:
    Serializable, org.apache.tika.detect.EncodingDetector

    public class Icu4jEncodingDetector
    extends Object
    implements org.apache.tika.detect.EncodingDetector
    See Also:
    Serialized Form
    • Constructor Detail

      • Icu4jEncodingDetector

        public Icu4jEncodingDetector()
    • Method Detail

      • detect

        public Charset detect(InputStream input,
                              org.apache.tika.metadata.Metadata metadata)
                       throws IOException
        Specified by:
        detect in interface org.apache.tika.detect.EncodingDetector
        Throws:
        IOException
      • setStripMarkup

        @Field
        public void setStripMarkup(boolean stripMarkup)
        Whether to attempt to strip HTML-like markup from the stream before passing it to the underlying detector. If this is set to false, the underlying detector may still apply its own stripping.
        Parameters:
        stripMarkup - whether to attempt to strip markup before passing the stream to the underlying detector
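        The following is a minimal sketch of what stripping HTML-like markup before detection can look like. It is not this class's actual implementation: the class name MarkupStripSketch and the helper stripTags are hypothetical, and the approach (replace each <...> run with a single space) is an assumption chosen for illustration. The point is that tag bytes are mostly ASCII and can drown out the bytes that actually distinguish one charset from another.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Tika.
class MarkupStripSketch {

    // Drops every byte from '<' through the matching '>' and writes a single
    // space in place of each tag, so only the text content remains.
    static byte[] stripTags(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length);
        boolean inTag = false;
        for (byte b : raw) {
            if (b == '<') {
                inTag = true;               // start skipping tag bytes
            } else if (b == '>' && inTag) {
                inTag = false;              // tag closed
                out.write(' ');             // keep a separator between text runs
            } else if (!inTag) {
                out.write(b);               // ordinary content byte
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] html = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        byte[] text = stripTags(html);
        System.out.println(new String(text, StandardCharsets.ISO_8859_1));
    }
}
```

        A real stripper would also have to cope with comments, scripts, and malformed tags, which this sketch ignores.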
      • getStripMarkup

        public boolean getStripMarkup()
      • setMarkLimit

        @Field
        public void setMarkLimit(int markLimit)
        The maximum number of bytes to read from the start of the stream for charset detection. Default is 12000.
        Parameters:
        markLimit - the maximum number of bytes to read for charset detection
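        To illustrate what a mark limit means in practice, the sketch below uses only the standard InputStream mark/reset calls. It is an assumption about the general technique, not this class's code: MarkLimitSketch and peek are hypothetical names. The helper reads at most markLimit bytes for inspection and then rewinds, so a later consumer still sees the stream from its beginning.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of Tika.
class MarkLimitSketch {

    // Returns up to markLimit bytes from the current position, then resets
    // the stream to where it was before the call.
    static byte[] peek(InputStream in, int markLimit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(markLimit);                 // remember the current position
        try {
            byte[] buf = new byte[markLimit];
            int total = 0;
            while (total < markLimit) {
                int n = in.read(buf, total, markLimit - total);
                if (n < 0) {
                    break;                  // stream ended before the limit
                }
                total += n;
            }
            in.reset();                     // rewind to the marked position
            return Arrays.copyOf(buf, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "hello world".getBytes(StandardCharsets.US_ASCII));
        byte[] head = peek(in, 5);
        System.out.println(new String(head, StandardCharsets.US_ASCII));
    }
}
```

        A larger limit gives a detector more evidence at the cost of buffering more data; a smaller one is cheaper but may miss distinguishing bytes that appear later in the stream.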
      • getMarkLimit

        public int getMarkLimit()