public class NGramTokenizerFactory extends Object implements TokenizerFactory, Serializable
An NGramTokenizerFactory creates n-gram tokenizers
of a specified minimum and maximum length.
An NGramTokenizer is a tokenizer that returns every
character n-gram of a specified sequence whose length lies between
a minimum and a maximum, inclusive. Whitespace takes the default
behavior from Tokenizer.nextWhitespace(), returning a string
consisting of a single space character.
For example, the result of
new NGramTokenizer("abcd".toCharArray(),0,4,2,3).tokenize()
is the string array:
{ "ab", "bc", "cd", "abc", "bcd" }
N-gram tokenizer factories are serializable.
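The order of the tokens in the example above (all n-grams of the shortest length from left to right, then those of the next length) can be reproduced with a short standalone sketch. The class NGramSketch and its ngrams method are hypothetical names used only for illustration; the code does not call or reproduce LingPipe's implementation, and it assumes the slice is described by (cs, start, length) as in tokenizer().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of LingPipe.
class NGramSketch {

    // Collects every n-gram of the slice cs[start .. start+length) whose
    // size is between minNGram and maxNGram inclusive, shortest sizes first.
    // Assumes 1 <= minNGram <= maxNGram and a slice that lies within cs.
    static List<String> ngrams(char[] cs, int start, int length,
                               int minNGram, int maxNGram) {
        List<String> result = new ArrayList<>();
        int end = start + length;
        for (int n = minNGram; n <= maxNGram; ++n) {
            // Slide a window of size n across the slice.
            for (int i = start; i + n <= end; ++i) {
                result.add(new String(cs, i, n));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [ab, bc, cd, abc, bcd]
        System.out.println(ngrams("abcd".toCharArray(), 0, 4, 2, 3));
    }
}
```

A slice of length L yields L - n + 1 n-grams of each size n that fits, so the example produces 3 bigrams and 2 trigrams.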
| Constructor and Description |
|---|
| NGramTokenizerFactory(int minNGram, int maxNGram)<br>Create an n-gram tokenizer factory with the specified minimum and maximum n-gram lengths. |
| Modifier and Type | Method and Description |
|---|---|
| int | maxNGram()<br>Returns the maximum n-gram length returned by this tokenizer factory. |
| int | minNGram()<br>Returns the minimum n-gram length returned by this tokenizer factory. |
| Tokenizer | tokenizer(char[] cs, int start, int length)<br>Returns an n-gram tokenizer for the specified characters with the minimum and maximum n-gram lengths as specified in the constructor. |
| String | toString()<br>Returns a description of this n-gram tokenizer factory, including minimum and maximum token lengths. |
public NGramTokenizerFactory(int minNGram, int maxNGram)
Parameters:
minNGram - Minimum n-gram length.
maxNGram - Maximum n-gram length.
Throws:
IllegalArgumentException - If the minimum is greater than the maximum or if the maximum is less than one.

public int minNGram()

public int maxNGram()
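The constructor contract and the two accessors can be sketched as a small standalone class. NGramBoundsSketch is a hypothetical name, and the exception messages are assumptions; only the two rejection conditions come from the documentation above, so this is an illustration of the stated contract rather than LingPipe's source.

```java
// Hypothetical stand-in for illustration; not LingPipe's source.
class NGramBoundsSketch {
    private final int minNGram;
    private final int maxNGram;

    NGramBoundsSketch(int minNGram, int maxNGram) {
        // Documented condition: the maximum must be at least one.
        if (maxNGram < 1) {
            throw new IllegalArgumentException(
                "Maximum n-gram length must be at least 1; found " + maxNGram);
        }
        // Documented condition: the minimum may not exceed the maximum.
        if (minNGram > maxNGram) {
            throw new IllegalArgumentException(
                "Minimum n-gram length " + minNGram
                + " is greater than maximum " + maxNGram);
        }
        // The documentation states no lower bound on minNGram,
        // so none is checked here.
        this.minNGram = minNGram;
        this.maxNGram = maxNGram;
    }

    int minNGram() { return minNGram; }

    int maxNGram() { return maxNGram; }

    public static void main(String[] args) {
        NGramBoundsSketch ok = new NGramBoundsSketch(2, 3);
        System.out.println(ok.minNGram() + ".." + ok.maxNGram());
        try {
            new NGramBoundsSketch(3, 2); // minimum greater than maximum
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```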
public Tokenizer tokenizer(char[] cs, int start, int length)
Specified by:
tokenizer in interface TokenizerFactory
Parameters:
cs - Underlying character array.
start - Index of first character in array to tokenize.
length - Number of characters to tokenize.

Copyright © 2019 Alias-i, Inc. All rights reserved.