Class TokenizerUtils


  • public final class TokenizerUtils
    extends java.lang.Object
    Text utility, mainly for tokenizing tasks.
    Author:
    thomas.jungblut
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static java.lang.String END_TAG  
      static java.lang.String SEPARATORS  
      static java.lang.String START_TAG  
    • Method Summary

      All Methods Static Methods Concrete Methods 
      Modifier and Type Method Description
      static java.lang.String[] addStartAndEndTags​(java.lang.String[] unigram)
      Adds START_TAG to the beginning of the array and END_TAG to the end.
      static java.lang.String[] buildNGrams​(java.lang.String[] tokens, int size)
      This tokenizer uses the given tokens and then concatenates the words based on size.
      static java.lang.String[] buildNGramsRange​(java.lang.String[] tokens, int startSize, int endSize)
      Builds n-grams for a range of sizes by concatenating the results of buildNGrams(String[], int) for every size within the range.
      static java.lang.String concat​(java.lang.String[] tokens, java.lang.String delimiter)
      Concats the given tokens with the given delimiter.
      static java.lang.String[] deduplicateTokens​(java.lang.String[] tokens)
      Deduplicates the given tokens, but maintains the order.
      static java.lang.String[] internStrings​(java.lang.String[] strings)
      Interns the given strings in place.
      static java.lang.String[] internStrings​(java.lang.String[] strings, StringPool pool)
      Interns the given strings in place with the given pool.
      static java.lang.String normalizeString​(java.lang.String token)
      Normalizes the token:
      - lower cases
      - removes non-alphanumeric characters (the German characters äüöß are kept as well).
      static java.lang.String[] normalizeTokens​(java.lang.String[] tokens, boolean removeEmpty)
      Normalizes the tokens:
      - lower cases
      - removes non-alphanumeric characters (the German characters äüöß are kept as well).
      static java.lang.String[] nShinglesTokenize​(java.lang.String key, int size)
      N-shingles tokenizer.
      static java.lang.String[] numericsToHash​(java.lang.String[] tokens)
      Replaces all numerics with "#".
      static java.lang.String[] qGramTokenize​(java.lang.String key, int size)
      q-gram tokenizer; delegates to nShinglesTokenize(String, int).
      static java.lang.String[] removeEmpty​(java.lang.String[] arr)
      Removes empty tokens from given array.
      static java.lang.String[] removeMatchingRegex​(java.lang.String regex, java.lang.String replacement, java.lang.String[] tokens, boolean removeEmpty)
      Applies the given regex to each token, replacing matches with the replacement, and optionally removes tokens that become empty.
      static java.lang.String[] trim​(java.lang.String[] tokens)
      Trims the tokens using String.trim() and additionally removes non-breaking spaces.
      static java.lang.String[] whiteSpaceTokenize​(java.lang.String text)
      Tokenizes on whitespace, using the Java regex "\\s+".
      static java.lang.String[] whiteSpaceTokenizeNGrams​(java.lang.String text, int size)
      This tokenizer first splits on whitespaces and then concatenates the words based on size.
      static java.lang.String[] wordTokenize​(java.lang.String text)
      Tokenizes on several indicators of a word, regex is [ \r\n\t.,;:'\"()?!\\-/|]
      static java.lang.String[] wordTokenize​(java.lang.String text, boolean keepSeperators)
      Tokenizes like wordTokenize(String) does, but keeps the separators as their own tokens if the argument is true.
      static java.lang.String[] wordTokenize​(java.lang.String text, java.lang.String regex)
      Tokenizes on several indicators of a word, regex to detect these must be given.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • removeMatchingRegex

        public static java.lang.String[] removeMatchingRegex​(java.lang.String regex,
                                                             java.lang.String replacement,
                                                             java.lang.String[] tokens,
                                                             boolean removeEmpty)
        Applies the given regex to each token, replacing matches with the replacement, and optionally removes tokens that become empty.
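        The behaviour described above can be sketched with plain String.replaceAll. This is a stand-alone illustration, not the library's source: the class name RemoveMatchingRegexSketch is hypothetical, and it assumes the regex is applied with replaceAll semantics to every token.

```java
import java.util.ArrayList;
import java.util.List;

public class RemoveMatchingRegexSketch {

  // Assumption: every match of the regex inside a token is replaced, and
  // tokens that end up empty are dropped only when removeEmpty is true.
  public static String[] removeMatchingRegex(String regex, String replacement,
      String[] tokens, boolean removeEmpty) {
    List<String> result = new ArrayList<>();
    for (String token : tokens) {
      String replaced = token.replaceAll(regex, replacement);
      if (!removeEmpty || !replaced.isEmpty()) {
        result.add(replaced);
      }
    }
    return result.toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[] tokens = { "abc123", "456", "def" };
    String[] cleaned = removeMatchingRegex("\\d+", "", tokens, true);
    System.out.println(String.join(",", cleaned)); // prints "abc,def"
  }
}
```

        With removeEmpty set to false, the same call would keep an empty string in the second position.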
      • qGramTokenize

        public static java.lang.String[] qGramTokenize​(java.lang.String key,
                                                       int size)
        q-gram tokenizer; delegates to nShinglesTokenize(String, int). Q-grams are n-grams built from characters. For word-level tokens, use wordTokenize(String) to get unigrams, then call buildNGrams(String[], int) on the result to build bigrams or larger n-grams.
        Parameters:
        key - the string to tokenize.
        size - the number of characters per q-gram.
        Returns:
        the q-grams of the given string.
      • nShinglesTokenize

        public static java.lang.String[] nShinglesTokenize​(java.lang.String key,
                                                           int size)
        N-shingles tokenizer. N-shingles are n-grams built from characters. For word-level tokens, use wordTokenize(String) to get unigrams, then call buildNGrams(String[], int) on the result to build bigrams or larger n-grams.
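        To make "n-grams built from characters" concrete, here is a minimal sliding-window sketch. The class and method names are hypothetical, and the handling of short inputs (an empty result, no padding) is an assumption; the library may behave differently at the edges.

```java
public class CharacterNGramSketch {

  // Slides a window of `size` characters over the key, one step at a time.
  // Assumption: no padding; keys shorter than `size` yield an empty array.
  public static String[] characterNGrams(String key, int size) {
    if (size <= 0 || key.length() < size) {
      return new String[0];
    }
    String[] grams = new String[key.length() - size + 1];
    for (int i = 0; i < grams.length; i++) {
      grams[i] = key.substring(i, i + size);
    }
    return grams;
  }

  public static void main(String[] args) {
    // prints "he,el,ll,lo"
    System.out.println(String.join(",", characterNGrams("hello", 2)));
  }
}
```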
      • whiteSpaceTokenize

        public static java.lang.String[] whiteSpaceTokenize​(java.lang.String text)
        Tokenizes on whitespace, using the Java regex "\\s+".
      • deduplicateTokens

        public static java.lang.String[] deduplicateTokens​(java.lang.String[] tokens)
        Deduplicates the given tokens, but maintains the order.
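        Order-preserving deduplication is commonly done with a LinkedHashSet, which keeps the first occurrence of each element in insertion order. The sketch below shows that idea; it is an illustration under that assumption, not the library's code, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class DeduplicateSketch {

  // A LinkedHashSet ignores repeated elements and iterates in insertion
  // order, so the first occurrence of every token keeps its position.
  public static String[] deduplicate(String[] tokens) {
    Set<String> seen = new LinkedHashSet<>(Arrays.asList(tokens));
    return seen.toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[] tokens = { "b", "a", "b", "c", "a" };
    System.out.println(String.join(",", deduplicate(tokens))); // prints "b,a,c"
  }
}
```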
      • wordTokenize

        public static java.lang.String[] wordTokenize​(java.lang.String text)
        Tokenizes on several indicators of a word, regex is [ \r\n\t.,;:'\"()?!\\-/|]
      • wordTokenize

        public static java.lang.String[] wordTokenize​(java.lang.String text,
                                                      boolean keepSeperators)
        Tokenizes like wordTokenize(String) does, but keeps the separators as their own tokens if the argument is true.
      • wordTokenize

        public static java.lang.String[] wordTokenize​(java.lang.String text,
                                                      java.lang.String regex)
        Tokenizes on several indicators of a word, regex to detect these must be given.
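        As a rough illustration of splitting on a separator regex, the sketch below uses String.split with the character class quoted for wordTokenize(String). Dropping the empty tokens that arise between adjacent separators is an assumption made here for readability; whether the library does so is not stated. DEFAULT_REGEX and the class name are hypothetical.

```java
import java.util.Arrays;

public class WordTokenizeSketch {

  // The separator character class quoted in the documentation above.
  static final String DEFAULT_REGEX = "[ \r\n\t.,;:'\"()?!\\-/|]";

  // Splits on every single separator character. Adjacent separators such as
  // ", " produce empty strings, which this sketch filters out (assumption).
  public static String[] wordTokenize(String text, String regex) {
    return Arrays.stream(text.split(regex))
        .filter(token -> !token.isEmpty())
        .toArray(String[]::new);
  }

  public static void main(String[] args) {
    String[] tokens = wordTokenize("Hello, world! It's fine.", DEFAULT_REGEX);
    System.out.println(String.join("|", tokens)); // prints "Hello|world|It|s|fine"
  }
}
```

        Note that the apostrophe is part of the character class, so "It's" is split into two tokens.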
      • normalizeTokens

        public static java.lang.String[] normalizeTokens​(java.lang.String[] tokens,
                                                         boolean removeEmpty)
        Normalizes the tokens:
        - lower cases
        - removes non-alphanumeric characters (the German characters äüöß are kept as well).
      • normalizeString

        public static java.lang.String normalizeString​(java.lang.String token)
        Normalizes the token:
        - lower cases
        - removes non-alphanumeric characters (the German characters äüöß are kept as well).
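        The two normalization steps can be sketched as a lower-casing call followed by a negated character class. This is an assumption about how the rule could be expressed, not the library's actual pattern; in particular, the sketch also removes whitespace. The umlauts are written as Unicode escapes to keep the source ASCII-only.

```java
import java.util.Locale;

public class NormalizeSketch {

  // Everything except a-z, 0-9 and the characters ä ü ö ß is removed.
  private static final String NOT_ALLOWED = "[^a-z0-9\u00e4\u00fc\u00f6\u00df]";

  // Lower-cases first, so upper-case letters survive as their lower-case form.
  public static String normalizeString(String token) {
    return token.toLowerCase(Locale.ROOT).replaceAll(NOT_ALLOWED, "");
  }

  public static void main(String[] args) {
    System.out.println(normalizeString("Hello, World 42!")); // prints "helloworld42"
  }
}
```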
      • removeEmpty

        public static java.lang.String[] removeEmpty​(java.lang.String[] arr)
        Removes empty tokens from the given array. The remaining tokens keep their order and move forward to fill the gaps.
      • whiteSpaceTokenizeNGrams

        public static java.lang.String[] whiteSpaceTokenizeNGrams​(java.lang.String text,
                                                                  int size)
        This tokenizer first splits on whitespaces and then concatenates the words based on size.
      • buildNGrams

        public static java.lang.String[] buildNGrams​(java.lang.String[] tokens,
                                                     int size)
        This tokenizer uses the given tokens and then concatenates the words based on size.
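        A word n-gram here is a run of `size` consecutive tokens joined into one string. The sketch below joins them with a single space, which is an assumption; the delimiter the library uses is not stated on this page. Class and method names are hypothetical.

```java
import java.util.Arrays;

public class WordNGramSketch {

  // Joins each window of `size` consecutive tokens with a space (assumed
  // delimiter). Inputs with fewer than `size` tokens yield an empty array.
  public static String[] wordNGrams(String[] tokens, int size) {
    if (size <= 0 || tokens.length < size) {
      return new String[0];
    }
    String[] grams = new String[tokens.length - size + 1];
    for (int i = 0; i < grams.length; i++) {
      grams[i] = String.join(" ", Arrays.copyOfRange(tokens, i, i + size));
    }
    return grams;
  }

  public static void main(String[] args) {
    String[] tokens = "the quick brown fox".split("\\s+");
    // prints "the quick|quick brown|brown fox"
    System.out.println(String.join("|", wordNGrams(tokens, 2)));
  }
}
```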
      • buildNGramsRange

        public static java.lang.String[] buildNGramsRange​(java.lang.String[] tokens,
                                                          int startSize,
                                                          int endSize)
        Builds n-grams for a range of sizes by concatenating the results of buildNGrams(String[], int) for every size within the range. Both startSize and endSize are inclusive.
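        The inclusive range can be pictured as one n-gram pass per size, with the results appended to a single list. In this sketch the output is ordered by ascending size and tokens are joined with a space; both are assumptions, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGramRangeSketch {

  // Collects the n-grams of every size from startSize to endSize, both
  // inclusive. Sizes below 1 are skipped.
  public static String[] nGramsRange(String[] tokens, int startSize, int endSize) {
    List<String> all = new ArrayList<>();
    for (int size = Math.max(1, startSize); size <= endSize; size++) {
      for (int i = 0; i + size <= tokens.length; i++) {
        all.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + size)));
      }
    }
    return all.toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[] tokens = { "a", "b", "c" };
    // prints "a|b|c|a b|b c"
    System.out.println(String.join("|", nGramsRange(tokens, 1, 2)));
  }
}
```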
      • internStrings

        public static java.lang.String[] internStrings​(java.lang.String[] strings)
        Interns the given strings in place.
        Parameters:
        strings - the strings to intern.
        Returns:
        an interned string array.
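        Interning in place means each array slot is overwritten with the canonical instance of its string, so equal tokens end up sharing one object. The sketch below uses String.intern() for this; whether the library's single-argument overload uses the JVM pool or its own is an assumption, and the class name is hypothetical.

```java
public class InternSketch {

  // Replaces every element with its canonical instance from the JVM string
  // pool and returns the same array that was passed in.
  public static String[] internInPlace(String[] strings) {
    for (int i = 0; i < strings.length; i++) {
      strings[i] = strings[i].intern();
    }
    return strings;
  }

  public static void main(String[] args) {
    // new String(...) forces two distinct objects with equal content.
    String[] tokens = { new String("token"), new String("token") };
    System.out.println(tokens[0] == tokens[1]); // prints "false"
    internInPlace(tokens);
    System.out.println(tokens[0] == tokens[1]); // prints "true"
  }
}
```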
      • internStrings

        public static java.lang.String[] internStrings​(java.lang.String[] strings,
                                                       StringPool pool)
        Interns the given strings in place with the given pool.
        Parameters:
        strings - the strings to intern.
        pool - the string pool to use.
        Returns:
        an interned string array.
      • addStartAndEndTags

        public static java.lang.String[] addStartAndEndTags​(java.lang.String[] unigram)
        Adds START_TAG to the beginning of the array and END_TAG to the end.
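        A sketch of the padding step follows. The actual values of START_TAG and END_TAG are not shown on this page, so the two constants below are placeholders chosen for illustration only, and the class name is hypothetical.

```java
public class StartEndTagSketch {

  // Placeholder values; the real constants may differ.
  static final String START_TAG = "<START>";
  static final String END_TAG = "<END>";

  // Returns a new array that is two elements longer than the input, with the
  // start tag in the first slot and the end tag in the last.
  public static String[] addStartAndEndTags(String[] unigram) {
    String[] tagged = new String[unigram.length + 2];
    tagged[0] = START_TAG;
    System.arraycopy(unigram, 0, tagged, 1, unigram.length);
    tagged[tagged.length - 1] = END_TAG;
    return tagged;
  }

  public static void main(String[] args) {
    String[] tagged = addStartAndEndTags(new String[] { "hello", "world" });
    System.out.println(String.join(" ", tagged)); // prints "<START> hello world <END>"
  }
}
```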
      • concat

        public static java.lang.String concat​(java.lang.String[] tokens,
                                              java.lang.String delimiter)
        Concats the given tokens with the given delimiter.
      • numericsToHash

        public static java.lang.String[] numericsToHash​(java.lang.String[] tokens)
        Replaces all numerics with "#".
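        The description leaves open what counts as a numeric. The sketch below assumes the simplest reading: a token consisting only of digits is replaced by "#", and every other token is left untouched. It also returns a copy rather than modifying the input, which is another assumption. The class name is hypothetical.

```java
public class NumericsToHashSketch {

  // Assumption: only tokens made up entirely of digits count as numerics.
  public static String[] numericsToHash(String[] tokens) {
    String[] result = new String[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      result[i] = tokens[i].matches("\\d+") ? "#" : tokens[i];
    }
    return result;
  }

  public static void main(String[] args) {
    String[] tokens = { "room", "42", "b12" };
    System.out.println(String.join(",", numericsToHash(tokens))); // prints "room,#,b12"
  }
}
```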
      • trim

        public static java.lang.String[] trim​(java.lang.String[] tokens)
        Trims the tokens using String.trim() and additionally removes non-breaking spaces.
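        String.trim() only strips characters up to U+0020, so a non-breaking space (U+00A0) survives it and has to be handled separately. The sketch below removes every non-breaking space before trimming; removing them everywhere, rather than only at the ends, is an assumption, and the class name is hypothetical.

```java
public class TrimSketch {

  // Removes all non-breaking spaces (U+00A0), then applies String.trim().
  // Returns a new array; the input is left unchanged.
  public static String[] trim(String[] tokens) {
    String[] result = new String[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      result[i] = tokens[i].replace("\u00A0", "").trim();
    }
    return result;
  }

  public static void main(String[] args) {
    String[] tokens = { "\u00A0 foo \u00A0", " bar" };
    System.out.println(String.join(",", trim(tokens))); // prints "foo,bar"
  }
}
```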