Class OpenNLPDetector


  • public class OpenNLPDetector
    extends org.apache.tika.language.detect.LanguageDetector

    This is based on OpenNLP's language detector. However, we've built our own ProbingLanguageDetector and our own language models.

    To build our model, we followed OpenNLP's lead by using the Leipzig corpus as gathered and preprocessed in the big-data corpus. We removed azj, plt, sun and zsm because our models could not distinguish them reliably from related languages. We removed cmn in favor of the finer-grained zho-trad and zho-simp.

    We then added the following languages from cc-100: ben-rom (Bengali Romanized), ful, gla, gug, hau, hin-rom, ibo, lin, mya-zaw, nso, orm, quz, roh, srd, ssw, tam-rom, tel-rom, tsn, urd-rom, wol, yor.

    We ran our own train/devtest/test code because OpenNLP's code required more sentences/data than were available for some languages.

    Please open an issue on our JIRA if we made mistakes and/or had misunderstandings in our design choices or if you need to have other languages added.

    Citations for the cc-100 corpus:

    Unsupervised Cross-lingual Representation Learning at Scale, Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), p. 8440-8451, July 2020.

    CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave, Proceedings of the 12th Language Resources and Evaluation Conference (LREC), p. 4003-4012, May 2020.

    • Constructor Detail

      • OpenNLPDetector

        public OpenNLPDetector()
    • Method Detail

      • loadModels

        public org.apache.tika.language.detect.LanguageDetector loadModels()
                                                                    throws IOException
        No-op. Models are loaded statically.
        Specified by:
        loadModels in class org.apache.tika.language.detect.LanguageDetector
        Returns:
        Throws:
        IOException
      • loadModels

        public org.apache.tika.language.detect.LanguageDetector loadModels​(Set<String> languages)
                                                                    throws IOException
        NOT SUPPORTED. Throws UnsupportedOperationException.
        Specified by:
        loadModels in class org.apache.tika.language.detect.LanguageDetector
        Parameters:
        languages - set of target languages.
        Returns:
        Throws:
        IOException
      • hasModel

        public boolean hasModel​(String language)
        Specified by:
        hasModel in class org.apache.tika.language.detect.LanguageDetector
      • setPriors

        public org.apache.tika.language.detect.LanguageDetector setPriors​(Map<String,​Float> languageProbabilities)
                                                                   throws IOException
        NOT YET SUPPORTED. Throws UnsupportedOperationException.
        Specified by:
        setPriors in class org.apache.tika.language.detect.LanguageDetector
        Parameters:
        languageProbabilities - Map from language to probability
        Returns:
        Throws:
        IOException
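        Because setPriors (like loadModels(Set)) is documented as throwing UnsupportedOperationException, a caller that works with several detector implementations may want to treat it as an optional operation. The following is a minimal sketch of that idea only. PriorsSetter and tryApplyPriors are hypothetical names introduced for illustration and are not part of Tika; with the real class one could presumably pass a method reference such as detector::setPriors.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;

// Hypothetical helper, not part of Tika: treats setPriors as an optional operation.
final class OptionalPriorsExample {

    /** Hypothetical stand-in for a call such as detector::setPriors. */
    @FunctionalInterface
    interface PriorsSetter {
        void apply(Map<String, Float> languageProbabilities) throws IOException;
    }

    /**
     * Attempts to apply the priors.
     * Returns true if the call succeeded and false if the detector
     * reported the operation as unsupported.
     */
    static boolean tryApplyPriors(PriorsSetter setter, Map<String, Float> priors) {
        try {
            setter.apply(priors);
            return true;
        } catch (UnsupportedOperationException e) {
            // Documented behaviour for this detector: fall back to running without priors.
            return false;
        } catch (IOException e) {
            // Rethrow unchecked so callers of this sketch need no throws clause.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Simulates a detector whose setPriors is not supported.
        PriorsSetter unsupported = p -> {
            throw new UnsupportedOperationException("priors not supported");
        };
        boolean applied = tryApplyPriors(unsupported, Map.of("eng", 0.7f, "fra", 0.3f));
        System.out.println("priors applied: " + applied);
    }
}
```

        Returning a boolean rather than propagating the exception lets the caller continue with unweighted detection when priors are unavailable.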
      • reset

        public void reset()
        Specified by:
        reset in class org.apache.tika.language.detect.LanguageDetector
      • addText

        public void addText​(char[] cbuf,
                            int off,
                            int len)
        Buffers text up to the limit set via setMaxLength(int) and ignores any text beyond that limit.
        Specified by:
        addText in class org.apache.tika.language.detect.LanguageDetector
        Parameters:
        cbuf - Character buffer
        off - Offset into cbuf to first character in the run of text
        len - Number of characters in the run of text.
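
        The capped buffering described for addText can be sketched as follows. This is an illustration of the documented behaviour only, not Tika's actual implementation. BoundedTextBuffer and its contents() method are hypothetical names, and the sketch assumes the limit is counted in characters.

```java
// Hypothetical stand-in illustrating capped buffering; not Tika's code.
final class BoundedTextBuffer {
    private final StringBuilder buffer = new StringBuilder();
    private int maxLength;

    BoundedTextBuffer(int maxLength) {
        setMaxLength(maxLength);
    }

    /** Sets the maximum number of characters to keep (assumed unit: chars). */
    void setMaxLength(int maxLength) {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must be >= 0");
        }
        this.maxLength = maxLength;
    }

    /** Appends at most the remaining capacity; anything beyond it is dropped. */
    void addText(char[] cbuf, int off, int len) {
        int remaining = maxLength - buffer.length();
        if (remaining <= 0) {
            return; // buffer is already full, ignore the rest of the text
        }
        buffer.append(cbuf, off, Math.min(len, remaining));
    }

    /** Clears the buffered text so a new document can be processed. */
    void reset() {
        buffer.setLength(0);
    }

    String contents() {
        return buffer.toString();
    }

    public static void main(String[] args) {
        BoundedTextBuffer b = new BoundedTextBuffer(10);
        char[] first = "Hello, ".toCharArray();
        char[] second = "world and more".toCharArray();
        b.addText(first, 0, first.length);
        b.addText(second, 0, second.length);
        System.out.println(b.contents()); // prints "Hello, wor"
    }
}
```

        In this sketch, calling addText after the cap is reached is a silent no-op, so the text must be cleared with reset() before buffering a new document.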
      • detectAll

        public List<org.apache.tika.language.detect.LanguageResult> detectAll()
        Specified by:
        detectAll in class org.apache.tika.language.detect.LanguageDetector
      • setMaxLength

        public void setMaxLength​(int maxLength)
      • getSupportedLanguages

        public String[] getSupportedLanguages()