commit 61737994f9a26c9dda16ff10a476b3b217381d95
Author: Gianna Badiali
Date:   Wed Jul 13 16:36:12 2016 +0000

    [blender] Fix inconsistent lang operator behavior

    RB_ID=849492

 .../com/twitter/common/text/language/LocaleUtil.java | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

commit c185665e12edeaa480cd14f68dd69c4fcfde65d9
Author: Eden Golshani
Date:   Wed Dec 2 18:09:17 2015 +0000

    [Penguin] Removed deprecated LocaleUtil.getModifiedLanguageTag function

    RB_ID=750671

 .../twitter/common/text/language/LocaleUtil.java | 24 ----------------------
 1 file changed, 24 deletions(-)

commit 6b6effb39ba016dfaf674cd5b838cd04d4ef9f95
Author: Eden Golshani
Date:   Mon Sep 28 18:21:34 2015 +0000

    Add Basque, Catalan, and Czech support to Penguin

    RB_ID=744299
    TBR=true

 .../twitter/common/text/language/LocaleUtil.java | 42 ++++++----------------
 1 file changed, 11 insertions(+), 31 deletions(-)

commit c42ab7f519bd186119609de3082bd58d45fd9ac5
Author: Eden Golshani
Date:   Thu Sep 17 18:49:35 2015 +0000

    Refactors of Language Prior system, TwitterLanguageIdentifier, and Unknown Locales

    RB_ID=742642
    TBR=true

 .../com/twitter/common/text/language/LocaleUtil.java | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

commit 1f672db87bc4ec32922d437bbfc8658bd2ac3c13
Author: Gianna Badiali
Date:   Thu Sep 10 01:11:00 2015 +0000

    Update Japanese/Chinese Langid to use MarkovModelLanguageIdentifier

    RB_ID=720262
    TBR=true

 .../twitter/common/text/language/LocaleUtil.java | 38 +++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

commit 7a9b6a892d9e4aa19739a1a848ce696c9a28daa1
Author: Eden Golshani
Date:   Wed Sep 2 17:01:13 2015 +0000

    Add Latinized Hindi (hi-Latn) to Penguin Language Identifier

    RB_ID=731643

 .../java/com/twitter/common/text/language/LocaleUtil.java | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

commit 8cd9385ed5a8e438091fb72ffff2c9f01d0c4ba8
Author: Eden Golshani
Date:   Wed Aug 12 04:13:46 2015 +0000

    Migrate to Locale.toLanguageTag() inside Penguin

    RB_ID=727947

 science/src/java/com/twitter/common/text/language/LocaleUtil.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit 92642363d77184350a972745e7a6d2326080cd88
Author: Eden Golshani
Date:   Thu Aug 6 18:22:05 2015 +0000

    removed reuse of Matcher object in LocaleUtil

    RB_ID=726333

 science/src/java/com/twitter/common/text/language/LocaleUtil.java | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

commit 07fc7a85d0f3b453044b47306141362218d61738
Author: Eden Golshani
Date:   Wed Jul 29 06:18:41 2015 +0000

    Overhaul of LocaleUtil to remove confusing functions and prep for new Locales like Latinized Hindi (hi-Latn)

    RB_ID=720054

 .../twitter/common/text/language/LocaleUtil.java | 270 +++++++++++----------
 1 file changed, 136 insertions(+), 134 deletions(-)

commit 1714a20e94a424ccfd8a0b86ad9bd3c46832be7a
Author: Will Hohyon Ryu
Date:   Sat Dec 6 06:55:05 2014 +0000

    Added Marathi Language identification

    RB_ID=532615

 science/src/java/com/twitter/common/text/language/LocaleUtil.java | 2 ++
 1 file changed, 2 insertions(+)

commit 325cf8ec5ec2fc3d4864a6c45cf6e20f0e43aae2
Author: Hohyon Ryu
Date:   Wed Oct 29 21:07:13 2014 +0000

    Open Sourcing Penguin Language Identifier

    RB_ID=485219

 .../text/language/LocaleUtil.java | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

commit 3e1599d81ecf49aad60450808ea1d5bb293f1baf
Author: Hohyon Ryu
Date:   Thu May 29 17:42:47 2014 +0000

    Made CharClassLanguageIdentifier thread-safe.

    RB_ID=366115

 .../src/java/com/twitter/common_internal/text/language/LocaleUtil.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit 7be9e9853de8db75cd4eadd7f76ee31173972f1e
Author: Hohyon Ryu
Date:   Thu May 22 20:41:56 2014 +0000

    Revert "Resubmit CharacterClassLanguageIdentifier simplification"

    RB_ID=365641

 .../src/java/com/twitter/common_internal/text/language/LocaleUtil.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit 660512616a9e059886b2ba34b09cd13d7a5a873e
Author: Hohyon Ryu
Date:   Thu May 22 19:28:44 2014 +0000

    Resubmit CharacterClassLanguageIdentifier simplification

    RB_ID=364805

 .../src/java/com/twitter/common_internal/text/language/LocaleUtil.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit b1a247c53bb6715624bd796c200d902a4d3ae5ad
Author: Nik Johnson
Date:   Wed May 21 04:06:25 2014 +0000

    Revert "Simplify CharacterClassLanguageIdentifier's UniqueChar-based identification to support notable characters shared across several languages."

    RB_ID=363547

 .../src/java/com/twitter/common_internal/text/language/LocaleUtil.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit 53f242a6e20a3faebea0a113cc68385249e97df3
Author: Hohyon Ryu
Date:   Tue May 20 23:58:25 2014 +0000

    Automatically reformatted style to conform with Twitter Java style.

    RB_ID=361871

 .../common_internal/text/language/LocaleUtil.java | 87 +++++++++++-----------
 1 file changed, 43 insertions(+), 44 deletions(-)

commit 4a5d2c1be9811bbc07fd17203cd9c8fc10c8d476
Author: Hohyon Ryu
Date:   Mon May 19 22:32:31 2014 +0000

    Simplify CharacterClassLanguageIdentifier's UniqueChar-based identification to support notable characters shared across several languages.

    RB_ID=358439

 .../common_internal/text/language/LocaleUtil.java | 25 ++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

commit 457b4f8f11577c469c2b8549e94cc34c6952858c
Author: Hohyon Ryu
Date:   Tue May 13 20:15:09 2014 +0000

    Fixed StopWords to normalize country code of input locale.

    RB_ID=356353

 .../com/twitter/common_internal/text/language/LocaleUtil.java | 10 ++++++++++
 1 file changed, 10 insertions(+)

commit 7fa4d949224ecc50750fa21f8c1e3a3df4712cba
Author: Eden Golshani
Date:   Thu May 8 21:11:38 2014 +0000

    Penguin v3 changes: created Arabic normalization and tokenization, fixed Latin Tokenizer

    All Penguin behavior remains the same except where Penguin V3 is explicitly specified to these tools. I have summarized the changes below.

    RB_ID=339201

 .../common_internal/text/language/LocaleUtil.java | 24 +++++++++++-----------
 1 file changed, 12 insertions(+), 12 deletions(-)

commit 17a7a697b6b0d33e0f5b8845b331b7380784c559
Author: Hohyon Ryu
Date:   Mon May 5 20:55:53 2014 +0000

    Add Romanian Language to Language Identification

    RB_ID=350295

 .../src/java/com/twitter/common_internal/text/language/LocaleUtil.java | 1 +
 1 file changed, 1 insertion(+)

commit 07e7ac29d4b6f2c61f79cf7acfaa0f01252e0e50
Author: Hohyon Ryu
Date:   Fri May 2 18:28:35 2014 +0000

    [Penguin] Incorporate user language priors to improve language identification

    RB_ID=347907

 .../twitter/common_internal/text/language/LocaleUtil.java | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

commit 850b80624e6b5fd91815f1d8ba317737ec03fce7
Author: Hohyon Ryu
Date:   Fri Apr 25 22:52:19 2014 +0000

    Add sr, bs, hr, and cy for language identification

    RB_ID=339159

 .../twitter/common_internal/text/language/LocaleUtil.java | 12 ++++++++++++
 1 file changed, 12 insertions(+)

commit 3da52a748bdfab8046df9b83edca158a4e6adc21
Author: Juan Caicedo
Date:   Thu Mar 27 22:08:17 2014 +0000

    Added distinction between Traditional and Simplified Chinese in LanguageCount.

    RB_ID=318481

 .../common_internal/text/language/LocaleUtil.java | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

commit abbdf2c1209e79e322daa113743cb1eb27a3f1c8
Author: Eden Golshani
Date:   Thu Mar 20 17:35:46 2014 +0000

    Improvements to language identification for Arabic script languages

    - improved identification of Arabic script languages by detecting unique characters in Arabic-script languages, adding first Penguin support for detection of Sindhi (SN), Uyghur (UG), Sorani Kurdish (CKB), and Pashto (PS).
    - Urdu (UR) and Persian (FA) are also improved similarly
    - results from CharacterClassLanguageIdentifier for these languages are then hybridized in TwitterLanguageIdentifier with the MarkovIdentifier. Results from the MarkovIdentifier are preferred over a certain threshold, as the CharacterClassLanguageIdentifier isn't exhaustive for identification of these languages (i.e. text may contain non-standard characters for that language or short text may include only base-Arabic characters and not any language-specific characters).
    - tests added for all affected languages in CharacterClassLanguageIdentifierTest

    RB_ID=312629

 .../common_internal/text/language/LocaleUtil.java | 30 ++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

commit 77057605f0f2a7bac15773004378a23fd36bd7df
Author: Hohyon Ryu
Date:   Wed Mar 5 19:06:44 2014 +0000

    [Penguin] Adds Simplified/Traditional Chinese Language Identification

    RB_ID=287385

 .../common_internal/text/language/LocaleUtil.java | 53 +++++++++++++++++++---
 1 file changed, 47 insertions(+), 6 deletions(-)

commit e37d2788464c7906176a5945578d32b9aa98fc4d
Author: Lei Wang
Date:   Tue Apr 16 13:55:44 2013 -0700

    Fix problem caused by a typo of UKRAINIAN locale.

    RB_ID=140395

 .../src/java/com/twitter/common_internal/text/language/LocaleUtil.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit 54b32ed3926061404e5b5e0c585647c09515a525
Author: Philip Youssef
Date:   Fri Mar 29 12:16:34 2013 -0700

    New languages, tuned performance, improved accuracy for penguin language identification.

    RB_ID=134846

 .../java/com/twitter/common_internal/text/language/LocaleUtil.java | 5 +++++
 1 file changed, 5 insertions(+)

commit 20e05941427094720b6f2a97469fcd789bc5e3a6
Author: Philip Youssef
Date:   Wed Jan 30 08:32:00 2013 -0800

    Updated Langauge detection to use tweet trained models. Added support for Ukranian(uk).

    RB_ID=120512

 .../src/java/com/twitter/common_internal/text/language/LocaleUtil.java | 1 +
 1 file changed, 1 insertion(+)

commit e2ccaf9fadc9bac6e1939af5385823bbdcd9e99f
Author: Pradhuman Jhala
Date:   Tue Jan 8 10:46:58 2013 -0800

    adds MALAY and ESPERANTO to LocaleUtil

    RB_ID=117049

 .../java/com/twitter/common_internal/text/language/LocaleUtil.java | 4 ++++
 1 file changed, 4 insertions(+)

commit 19465c0046099ebefae3f4fe569155df418ff5f2
Author: Henna Kermani
Date:   Wed Sep 26 15:57:28 2012 -0700

    Added isRTL flag and test code.

    RB_ID=88366

 .../java/com/twitter/common_internal/text/language/LocaleUtil.java | 7 +++++++
 1 file changed, 7 insertions(+)

commit 4c9b206141dde58dc2eff752a01db804c93fd5ee
Author: keita
Date:   Wed Sep 19 16:23:59 2012 -0700

    Support Traditional/Simplified Chinese in LocaleUtil

    RB_ID=87051

 .../common_internal/text/language/LocaleUtil.java | 37 ++++++++++++++++------
 1 file changed, 27 insertions(+), 10 deletions(-)

commit 49dd6f82685e8da7969f9edd624d814cacda119c
Author: keita
Date:   Mon Sep 17 14:25:18 2012 -0700

    Introduce ThriftLanguageUtil (to deprecate LanguageCode)

    RB_ID=85961

 .../common_internal/text/language/LocaleUtil.java | 91 ++++++++++++----------
 1 file changed, 48 insertions(+), 43 deletions(-)

commit 26577d24ac24bdef79f3988cae39dd931952198e
Author: keita
Date:   Thu Aug 16 11:05:39 2012 -0700

    Add early termination option, and pre-compute BCP47 codes.

    RB_ID=80837

 .../twitter/common_internal/text/language/LocaleUtil.java | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

commit 03a8097dfa1d3fa66ab0533195988ccf8da5726e
Author: keita
Date:   Wed Aug 15 15:39:47 2012 -0700

    Implement HitHighlighter and CamelCaseTokenizer

    RB_ID=79039

 .../common_internal/text/language/LocaleUtil.java | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

commit 25c835e76c254061553e436b13dded962e1741a0
Author: Delip Rao
Date:   Fri Jul 6 16:06:13 2012 -0700

    Add stopword detection in 20+ languages in Penguin

    Background: Stopword (i.e. words like "a", "an", "the", "of", "him", etc) filtering is commonly used search, ads, and trends. We need Penguin text processing library to do that and also be able to do that in many languages, not just English. This change adds the required stop words and provides a helper class and methods to test if a word is a stop word. The stop words are loaded from resources since the total size of the related files is just 100K. The stop word lists themselves are derived from http://www.ranks.nl/resources/stopwords.html

    RB_ID=73523

 .../com/twitter/common_internal/text/language/LocaleUtil.java | 11 +++++++++++
 1 file changed, 11 insertions(+)

commit 4985e01a0888b12620cb531a8e923cab02132404
Author: keita
Date:   Wed Jun 13 10:34:11 2012 -0700

    Refactor Penguin's language identifier

    RB_ID=68259

 .../common_internal/text/language/LocaleUtil.java | 86 ++++++++++++++++++++++
 1 file changed, 86 insertions(+)