Package de.jungblut.nlp
Class TokenizerUtils
- java.lang.Object
-
- de.jungblut.nlp.TokenizerUtils
-
public final class TokenizerUtils extends java.lang.Object
Nifty text utility, mainly for tokenizing tasks.
- Author:
- thomas.jungblut
-
-
Field Summary
Fields
Modifier and Type, Field
static java.lang.String END_TAG
static java.lang.String SEPARATORS
static java.lang.String START_TAG
-
Method Summary
Modifier and Type, Method, and Description
static java.lang.String[] addStartAndEndTags(java.lang.String[] unigram)
Adds START_TAG and END_TAG to the beginning and the end of the array.
static java.lang.String[] buildNGrams(java.lang.String[] tokens, int size)
This tokenizer uses the given tokens and then concatenates the words based on size.
static java.lang.String[] buildNGramsRange(java.lang.String[] tokens, int startSize, int endSize)
Builds n-grams for a range of sizes: basically a concatenation of all the buildNGrams(String[], int) results within the range.
static java.lang.String concat(java.lang.String[] tokens, java.lang.String delimiter)
Concatenates the given tokens with the given delimiter.
static java.lang.String[] deduplicateTokens(java.lang.String[] tokens)
Deduplicates the given tokens, but maintains the order.
static java.lang.String[] internStrings(java.lang.String[] strings)
Interns the given strings in place.
static java.lang.String[] internStrings(java.lang.String[] strings, StringPool pool)
Interns the given strings in place with the given pool.
static java.lang.String normalizeString(java.lang.String token)
Normalizes the token:
- lower cases
- removes non-alphanumeric characters (since I'm German, I have included äüöß as well).
static java.lang.String[] normalizeTokens(java.lang.String[] tokens, boolean removeEmpty)
Normalizes the tokens:
- lower cases
- removes non-alphanumeric characters (since I'm German, I have included äüöß as well).
static java.lang.String[] nShinglesTokenize(java.lang.String key, int size)
N-shingles tokenizer.
static java.lang.String[] numericsToHash(java.lang.String[] tokens)
Replaces all numerics with "#".
static java.lang.String[] qGramTokenize(java.lang.String key, int size)
q-gram tokenizer, which is basically a proxy to nShinglesTokenize(String, int).
static java.lang.String[] removeEmpty(java.lang.String[] arr)
Removes empty tokens from the given array.
static java.lang.String[] removeMatchingRegex(java.lang.String regex, java.lang.String replacement, java.lang.String[] tokens, boolean removeEmpty)
Applies the given regex to the tokens and optionally deletes a token when it becomes empty.
static java.lang.String[] trim(java.lang.String[] tokens)
Trims the tokens using String.trim() and additionally removes non-breaking spaces.
static java.lang.String[] whiteSpaceTokenize(java.lang.String text)
Tokenizes on whitespace, using the Java regex "\\s+".
static java.lang.String[] whiteSpaceTokenizeNGrams(java.lang.String text, int size)
This tokenizer first splits on whitespace and then concatenates the words based on size.
static java.lang.String[] wordTokenize(java.lang.String text)
Tokenizes on several indicators of a word, regex is [ \r\n\t.,;:'\"()?!\\-/|]
static java.lang.String[] wordTokenize(java.lang.String text, boolean keepSeperators)
Tokenizes like wordTokenize(String) does, but keeps the separators as their own tokens if the argument is true.
static java.lang.String[] wordTokenize(java.lang.String text, java.lang.String regex)
Tokenizes on several indicators of a word; the regex to detect these must be given.
-
-
-
Field Detail
-
END_TAG
public static final java.lang.String END_TAG
- See Also:
- Constant Field Values
-
START_TAG
public static final java.lang.String START_TAG
- See Also:
- Constant Field Values
-
SEPARATORS
public static final java.lang.String SEPARATORS
- See Also:
- Constant Field Values
-
-
Method Detail
-
removeMatchingRegex
public static java.lang.String[] removeMatchingRegex(java.lang.String regex, java.lang.String replacement, java.lang.String[] tokens, boolean removeEmpty)
Applies the given regex to the tokens and optionally deletes a token when it becomes empty.
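The behavior described here can be sketched roughly as follows. This is a standalone illustration and not the source of TokenizerUtils: the class name RegexReplaceSketch is hypothetical, and it assumes the regex is applied with String.replaceAll and that "empty" means a zero-length string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone sketch; not the source of TokenizerUtils.
final class RegexReplaceSketch {

  // Replaces every match of regex in each token with replacement and,
  // if removeEmpty is true, skips tokens that became empty.
  static String[] removeMatchingRegex(String regex, String replacement,
      String[] tokens, boolean removeEmpty) {
    List<String> result = new ArrayList<>(tokens.length);
    for (String token : tokens) {
      String replaced = token.replaceAll(regex, replacement);
      if (removeEmpty && replaced.isEmpty()) {
        continue;
      }
      result.add(replaced);
    }
    return result.toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[] tokens = { "abc123", "456", "def" };
    // Strip digits; "456" becomes empty and is dropped.
    System.out.println(String.join(",",
        removeMatchingRegex("\\d+", "", tokens, true))); // prints abc,def
  }
}
```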
-
qGramTokenize
public static java.lang.String[] qGramTokenize(java.lang.String key, int size)
q-gram tokenizer, which is basically a proxy to nShinglesTokenize(String, int). These are n-grams based on characters. If you want to use normal word tokenizers, use wordTokenize(String) for unigrams. To generate bigrams from those, call buildNGrams(String[], int).
- Parameters:
key -
size -
- Returns:
-
nShinglesTokenize
public static java.lang.String[] nShinglesTokenize(java.lang.String key, int size)
N-shingles tokenizer. N-shingles are n-grams based on characters. If you want to use normal word tokenizers, use wordTokenize(String) for unigrams. To generate bigrams from those, call buildNGrams(String[], int).
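To illustrate what character-based n-grams look like, here is a minimal sketch with a sliding window over the input. The class name ShingleSketch is hypothetical, size is assumed to be positive, and returning a too-short key as a single token is an assumption rather than documented behavior.

```java
// Hypothetical standalone sketch of character n-grams (n-shingles).
final class ShingleSketch {

  // Slides a window of the given size over the characters of key.
  // Assumption: a key shorter than size is returned as a single token.
  static String[] nShingles(String key, int size) {
    if (key.length() < size) {
      return new String[] { key };
    }
    String[] shingles = new String[key.length() - size + 1];
    for (int i = 0; i < shingles.length; i++) {
      shingles[i] = key.substring(i, i + size);
    }
    return shingles;
  }

  public static void main(String[] args) {
    System.out.println(String.join(",", nShingles("hello", 2))); // prints he,el,ll,lo
  }
}
```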
-
whiteSpaceTokenize
public static java.lang.String[] whiteSpaceTokenize(java.lang.String text)
Tokenizes on whitespace, using the Java regex "\\s+".
-
deduplicateTokens
public static java.lang.String[] deduplicateTokens(java.lang.String[] tokens)
Deduplicates the given tokens, but maintains the order.
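One common way to deduplicate while keeping first-occurrence order is a LinkedHashSet. The sketch below shows that technique under the hypothetical class name DeduplicateSketch; whether the library uses the same approach is not stated here.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical standalone sketch; not the source of TokenizerUtils.
final class DeduplicateSketch {

  // A LinkedHashSet drops repeats and iterates in insertion order,
  // so the first occurrence of every token keeps its position.
  static String[] deduplicate(String[] tokens) {
    Set<String> seen = new LinkedHashSet<>(Arrays.asList(tokens));
    return seen.toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[] tokens = { "a", "b", "a", "c", "b" };
    System.out.println(String.join(",", deduplicate(tokens))); // prints a,b,c
  }
}
```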
-
wordTokenize
public static java.lang.String[] wordTokenize(java.lang.String text)
Tokenizes on several indicators of a word, regex is [ \r\n\t.,;:'\"()?!\\-/|]
-
wordTokenize
public static java.lang.String[] wordTokenize(java.lang.String text, boolean keepSeperators)
Tokenizes like wordTokenize(String) does, but keeps the separators as their own tokens if the argument is true.
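Keeping separators as their own tokens can be sketched with java.util.StringTokenizer, whose third constructor argument controls whether delimiters are returned. The class name SeparatorSketch and the constant DELIMITERS are hypothetical; the delimiter set is assumed from the character class quoted for wordTokenize(String) and may differ from the actual SEPARATORS constant.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical standalone sketch; not the source of TokenizerUtils.
final class SeparatorSketch {

  // Assumed delimiter set, taken from the character class in the description.
  static final String DELIMITERS = " \r\n\t.,;:'\"()?!-/|";

  // With keepSeparators = true, every delimiter character is emitted
  // as its own one-character token.
  static String[] wordTokenize(String text, boolean keepSeparators) {
    StringTokenizer tokenizer =
        new StringTokenizer(text, DELIMITERS, keepSeparators);
    List<String> tokens = new ArrayList<>();
    while (tokenizer.hasMoreTokens()) {
      tokens.add(tokenizer.nextToken());
    }
    return tokens.toArray(new String[0]);
  }

  public static void main(String[] args) {
    System.out.println(wordTokenize("Hi, you!", false).length); // prints 2
    System.out.println(wordTokenize("Hi, you!", true).length);  // prints 5
  }
}
```

In this sketch the second call yields the tokens "Hi", ",", " ", "you" and "!", so whitespace counts as a separator token too.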
-
wordTokenize
public static java.lang.String[] wordTokenize(java.lang.String text, java.lang.String regex)
Tokenizes on several indicators of a word; the regex to detect these must be given.
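A regex-driven tokenizer can be approximated with String.split. The sketch below uses the hypothetical class name RegexSplitSketch and only demonstrates standard String.split semantics: a single-character class produces empty tokens between adjacent separators, while "\\s+" collapses runs of whitespace. Whether the library filters such empty tokens is not stated here.

```java
import java.util.Arrays;

// Hypothetical standalone sketch; not the source of TokenizerUtils.
final class RegexSplitSketch {

  // The character class from the wordTokenize(String) description,
  // written as a Java string literal.
  static final String WORD_REGEX = "[ \r\n\t.,;:'\"()?!\\-/|]";

  // Plain String.split: trailing empty strings are discarded, but empty
  // strings between two adjacent separators are kept.
  static String[] split(String text, String regex) {
    return text.split(regex);
  }

  public static void main(String[] args) {
    // "," and " " are adjacent, so an empty token appears between them.
    System.out.println(Arrays.toString(split("Hi, you!", WORD_REGEX))); // prints [Hi, , you]
    System.out.println(Arrays.toString(split("a  b\tc", "\\s+")));      // prints [a, b, c]
  }
}
```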
-
normalizeTokens
public static java.lang.String[] normalizeTokens(java.lang.String[] tokens, boolean removeEmpty)
Normalizes the tokens:
- lower cases
- removes non-alphanumeric characters (since I'm German, I have included äüöß as well).
-
normalizeString
public static java.lang.String normalizeString(java.lang.String token)
Normalizes the token:
- lower cases
- removes non-alphanumeric characters (since I'm German, I have included äüöß as well).
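The two normalization steps can be sketched as a lower-casing followed by a regex replacement. The class name NormalizeSketch is hypothetical, and the exact character class is an assumption: here only a-z, 0-9 and the German letters äüöß survive, so whitespace is removed as well. The library's real pattern may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical standalone sketch; not the source of TokenizerUtils.
final class NormalizeSketch {

  // Lower-cases the token, then removes everything that is not a-z, 0-9
  // or one of the German letters (written as Unicode escapes below).
  static String normalizeString(String token) {
    return token.toLowerCase(Locale.ROOT)
        .replaceAll("[^a-z0-9\u00e4\u00fc\u00f6\u00df]", "");
  }

  // Normalizes every token; optionally drops tokens that became empty.
  static String[] normalizeTokens(String[] tokens, boolean removeEmpty) {
    List<String> result = new ArrayList<>(tokens.length);
    for (String token : tokens) {
      String normalized = normalizeString(token);
      if (!(removeEmpty && normalized.isEmpty())) {
        result.add(normalized);
      }
    }
    return result.toArray(new String[0]);
  }

  public static void main(String[] args) {
    System.out.println(normalizeString("Hello, World 42")); // prints helloworld42
  }
}
```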
-
removeEmpty
public static java.lang.String[] removeEmpty(java.lang.String[] arr)
Removes empty tokens from the given array. The remaining tokens move forward to fill the emptied slots.
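A compacting filter of this kind can be written in a few lines with streams. The class name RemoveEmptySketch is hypothetical, and treating null entries like empty strings is an assumption made for this sketch only.

```java
import java.util.Arrays;

// Hypothetical standalone sketch; not the source of TokenizerUtils.
final class RemoveEmptySketch {

  // Keeps non-empty tokens in their original order.
  // Assumption: null entries are treated like empty tokens.
  static String[] removeEmpty(String[] arr) {
    return Arrays.stream(arr)
        .filter(s -> s != null && !s.isEmpty())
        .toArray(String[]::new);
  }

  public static void main(String[] args) {
    String[] tokens = { "a", "", "b", "" };
    System.out.println(String.join(",", removeEmpty(tokens))); // prints a,b
  }
}
```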
-
whiteSpaceTokenizeNGrams
public static java.lang.String[] whiteSpaceTokenizeNGrams(java.lang.String text, int size)
This tokenizer first splits on whitespace and then concatenates the words based on size.
-
buildNGrams
public static java.lang.String[] buildNGrams(java.lang.String[] tokens, int size)
This tokenizer uses the given tokens and then concatenates the words based on size.
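Word-level n-grams can be sketched as a sliding window over the token array. The class name NGramSketch is hypothetical. Two details are assumptions rather than documented behavior: the words are joined with a single space, and an empty array is returned when there are fewer tokens than size.

```java
import java.util.Arrays;

// Hypothetical standalone sketch; not the source of TokenizerUtils.
final class NGramSketch {

  // Joins each window of `size` consecutive tokens with a single space.
  // Assumption: fewer tokens than size yields an empty array.
  static String[] buildNGrams(String[] tokens, int size) {
    if (tokens.length < size) {
      return new String[0];
    }
    String[] grams = new String[tokens.length - size + 1];
    for (int i = 0; i < grams.length; i++) {
      grams[i] = String.join(" ", Arrays.copyOfRange(tokens, i, i + size));
    }
    return grams;
  }

  public static void main(String[] args) {
    String[] tokens = { "the", "quick", "brown", "fox" };
    // prints the quick|quick brown|brown fox
    System.out.println(String.join("|", buildNGrams(tokens, 2)));
  }
}
```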
-
buildNGramsRange
public static java.lang.String[] buildNGramsRange(java.lang.String[] tokens, int startSize, int endSize)
Builds n-grams for a range of sizes: basically a concatenation of all the buildNGrams(String[], int) results within the range. Both start and end are inclusive.
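The inclusive range can be sketched as a loop that appends the n-grams of every size from startSize to endSize. The class name NGramRangeSketch and its private helper are hypothetical. The space delimiter, the ordering by size, and the empty result for a size larger than the token count are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone sketch; not the source of TokenizerUtils.
final class NGramRangeSketch {

  // Helper: space-joined windows of `size` tokens (assumed delimiter);
  // yields nothing when there are fewer tokens than size.
  private static String[] nGrams(String[] tokens, int size) {
    if (tokens.length < size) {
      return new String[0];
    }
    String[] grams = new String[tokens.length - size + 1];
    for (int i = 0; i < grams.length; i++) {
      grams[i] = String.join(" ", Arrays.copyOfRange(tokens, i, i + size));
    }
    return grams;
  }

  // Concatenates the n-grams of every size in [startSize, endSize].
  static String[] buildNGramsRange(String[] tokens, int startSize, int endSize) {
    List<String> all = new ArrayList<>();
    for (int size = startSize; size <= endSize; size++) {
      all.addAll(Arrays.asList(nGrams(tokens, size)));
    }
    return all.toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[] tokens = { "a", "b", "c" };
    // prints a|b|c|a b|b c
    System.out.println(String.join("|", buildNGramsRange(tokens, 1, 2)));
  }
}
```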
-
internStrings
public static java.lang.String[] internStrings(java.lang.String[] strings)
Interns the given strings in place.
- Parameters:
strings - the strings to intern.
- Returns:
- an interned string array.
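In-place interning can be sketched with String.intern(), which returns the canonical instance from the JVM string pool. The class name InternSketch is hypothetical, and whether the library relies on String.intern() for this overload is an assumption.

```java
// Hypothetical standalone sketch; not the source of TokenizerUtils.
final class InternSketch {

  // Overwrites every slot with its canonical pooled instance and
  // returns the same array that was passed in.
  static String[] internStrings(String[] strings) {
    for (int i = 0; i < strings.length; i++) {
      strings[i] = strings[i].intern();
    }
    return strings;
  }

  public static void main(String[] args) {
    // Two equal but distinct String objects.
    String[] tokens = { new String("token"), new String("token") };
    System.out.println(tokens[0] == tokens[1]); // prints false
    internStrings(tokens);
    System.out.println(tokens[0] == tokens[1]); // prints true
  }
}
```

After interning, equal tokens share one instance, which can reduce memory use for large vocabularies with many repeats.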
-
internStrings
public static java.lang.String[] internStrings(java.lang.String[] strings, StringPool pool)
Interns the given strings in place with the given pool.
- Parameters:
strings - the strings to intern.
pool - the string pool to use.
- Returns:
- an interned string array.
-
addStartAndEndTags
public static java.lang.String[] addStartAndEndTags(java.lang.String[] unigram)
Adds START_TAG and END_TAG to the beginning and the end of the array.
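Wrapping a token array in boundary markers can be sketched with a single array copy. The class name TagSketch is hypothetical, and the marker values "<START>" and "<END>" are placeholders chosen for this illustration; the actual values of START_TAG and END_TAG are listed under Constant Field Values and are not reproduced here.

```java
// Hypothetical standalone sketch; not the source of TokenizerUtils.
final class TagSketch {

  // Placeholder marker values, assumed for illustration only.
  static final String START = "<START>";
  static final String END = "<END>";

  // Returns a new array that is two slots longer, with the markers
  // in the first and last position.
  static String[] addStartAndEndTags(String[] unigram) {
    String[] tagged = new String[unigram.length + 2];
    tagged[0] = START;
    System.arraycopy(unigram, 0, tagged, 1, unigram.length);
    tagged[tagged.length - 1] = END;
    return tagged;
  }

  public static void main(String[] args) {
    String[] tokens = { "hello", "world" };
    // prints <START> hello world <END>
    System.out.println(String.join(" ", addStartAndEndTags(tokens)));
  }
}
```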
-
concat
public static java.lang.String concat(java.lang.String[] tokens, java.lang.String delimiter)
Concatenates the given tokens with the given delimiter.
-
numericsToHash
public static java.lang.String[] numericsToHash(java.lang.String[] tokens)
Replaces all numerics with "#".
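The description leaves open whether whole numeric tokens or digit runs inside tokens are replaced. The sketch below assumes the latter: every run of digits becomes a single "#". The class name NumericsSketch is hypothetical, and it returns a new array, although the library may modify the input instead.

```java
// Hypothetical standalone sketch; not the source of TokenizerUtils.
final class NumericsSketch {

  // Assumption: each run of digits inside a token becomes one "#".
  static String[] numericsToHash(String[] tokens) {
    String[] result = new String[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      result[i] = tokens[i].replaceAll("\\d+", "#");
    }
    return result;
  }

  public static void main(String[] args) {
    String[] tokens = { "room", "42", "a1b22" };
    System.out.println(String.join(",", numericsToHash(tokens))); // prints room,#,a#b#
  }
}
```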
-
trim
public static java.lang.String[] trim(java.lang.String[] tokens)
Trims the tokens using String.trim() and additionally removes non-breaking spaces.
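String.trim() only strips characters up to U+0020, so a non-breaking space (U+00A0) survives it. A minimal sketch of the combined cleanup follows. The class name TrimSketch is hypothetical, and removing every non-breaking space before trimming is an assumption about the order of the two steps.

```java
// Hypothetical standalone sketch; not the source of TokenizerUtils.
final class TrimSketch {

  // Removes all non-breaking spaces (U+00A0) first, then trims ordinary
  // leading and trailing whitespace. Returns a new array.
  static String[] trim(String[] tokens) {
    String[] result = new String[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      result[i] = tokens[i].replace("\u00A0", "").trim();
    }
    return result;
  }

  public static void main(String[] args) {
    String[] tokens = { " a\u00A0", "\u00A0 b " };
    System.out.println(String.join(",", trim(tokens))); // prints a,b
  }
}
```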
-
-