public final class TokenizerUtils extends Object
| Modifier and Type | Field and Description |
|---|---|
| static String | END_TAG |
| static String | SEPARATORS |
| static String | START_TAG |
| Modifier and Type | Method and Description |
|---|---|
| static String[] | addStartAndEndTags(String[] unigram): Adds the start and end tags to the given unigrams. |
| static String[] | buildNGrams(String[] tokens, int size): Concatenates the given tokens into n-grams of the given size. |
| static String[] | buildNGramsRange(String[] tokens, int startSize, int endSize): Builds n-grams for a range of sizes, essentially a concatenation of all buildNGrams(String[], int) calls within the range. |
| static String | concat(String[] tokens, String delimiter): Concatenates the given tokens with the given delimiter. |
| static String[] | deduplicateTokens(String[] tokens): Deduplicates the given tokens while maintaining their order. |
| static String[] | internStrings(String[] strings): Interns the given strings in place. |
| static String[] | internStrings(String[] strings, StringPool pool): Interns the given strings in place using the given pool. |
| static String | normalizeString(String token): Normalizes the token: lower-cases it and removes non-alphanumeric characters (since I'm German, I have included äüöß as well). |
| static String[] | normalizeTokens(String[] tokens, boolean removeEmpty): Normalizes the tokens: lower-cases them and removes non-alphanumeric characters (since I'm German, I have included äüöß as well). |
| static String[] | nShinglesTokenize(String key, int size): N-shingle tokenizer. |
| static String[] | qGramTokenize(String key, int size): q-gram tokenizer, basically a proxy to nShinglesTokenize(String, int). |
| static String[] | removeEmpty(String[] arr): Removes empty tokens from the given array. |
| static String[] | removeMatchingRegex(String regex, String replacement, String[] tokens, boolean removeEmpty): Applies the given regex to the tokens and optionally removes tokens that become empty. |
| static String[] | whiteSpaceTokenize(String text): Tokenizes on normal whitespace ("\\s+" in Java regex). |
| static String[] | whiteSpaceTokenizeNGrams(String text, int size): Splits on whitespace first and then concatenates the words into n-grams of the given size. |
| static String[] | wordTokenize(String text): Tokenizes on several indicators of a word; the regex is [ \r\n\t.,;:'\"()?!\\-/\|]. |
| static String[] | wordTokenize(String text, boolean keepSeperators): Tokenizes like wordTokenize(String) does, but keeps the separators as their own tokens if the argument is true. |
| static String[] | wordTokenize(String text, String regex): Tokenizes on several indicators of a word; the regex to detect them must be given. |
public static final String END_TAG
public static final String START_TAG
public static final String SEPARATORS

public static String[] removeMatchingRegex(String regex, String replacement, String[] tokens, boolean removeEmpty)

public static String[] qGramTokenize(String key, int size)
q-gram tokenizer, basically a proxy to nShinglesTokenize(String, int). These are n-grams based on characters. If you want to use normal word tokenizers, use wordTokenize(String) for unigrams; to generate bigrams out of that, you need to call buildNGrams(String[], int).
Parameters: key, size

public static String[] nShinglesTokenize(String key, int size)
N-shingle tokenizer; these are n-grams based on characters. If you want to use normal word tokenizers, use wordTokenize(String) for unigrams; to generate bigrams out of that, you need to call buildNGrams(String[], int).

public static String[] whiteSpaceTokenize(String text)

public static String[] deduplicateTokens(String[] tokens)

public static String[] wordTokenize(String text)

public static String[] wordTokenize(String text, boolean keepSeperators)
Tokenizes like wordTokenize(String) does, but keeps the separators as their own tokens if the argument is true.

public static String[] wordTokenize(String text, String regex)

public static String[] normalizeTokens(String[] tokens, boolean removeEmpty)

public static String normalizeString(String token)

public static String[] removeEmpty(String[] arr)

public static String[] whiteSpaceTokenizeNGrams(String text, int size)

public static String[] buildNGrams(String[] tokens, int size)

public static String[] buildNGramsRange(String[] tokens, int startSize, int endSize)
A concatenation of all buildNGrams(String[], int) calls within the range. Both start and end are inclusive.

public static String[] internStrings(String[] strings)
Parameters: strings - the strings to intern.

public static String[] internStrings(String[] strings, StringPool pool)
Parameters: strings - the strings to intern. pool - the string pool to use.

public static String[] addStartAndEndTags(String[] unigram)
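The detail text above distinguishes character-based n-grams (shingles) from word n-grams: a shingle tokenizer slides a fixed-size window over the characters of the key rather than over word tokens. A hedged sketch of that idea follows; the handling of keys shorter than the window size is an assumption, not the library's verified behavior.

```java
public class ShingleSketch {

    // Slide a window of `size` characters across the key, one character at a time.
    // Keys shorter than the window are returned as a single shingle (an assumption).
    static String[] nShingles(String key, int size) {
        if (key.length() < size) {
            return new String[] { key };
        }
        String[] shingles = new String[key.length() - size + 1];
        for (int i = 0; i + size <= key.length(); i++) {
            shingles[i] = key.substring(i, i + size);
        }
        return shingles;
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", nShingles("token", 3)));
        // prints: tok, oke, ken
    }
}
```

Character shingles like these are common inputs for approximate string matching, which is why qGramTokenize can be a thin proxy over the same routine.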
Copyright © 2016. All rights reserved.