public class TokenNGramTokenizerFactory extends Object implements TokenizerFactory, Serializable
A TokenNGramTokenizerFactory wraps a base tokenizer factory to
produce token n-grams whose lengths fall within a specified range.
For example, suppose we have a regex tokenizer factory that generates tokens from contiguous non-whitespace characters. We can use it to build a token n-gram tokenizer factory that generates token bigrams and trigrams made up of the tokens from the base tokenizer.
TokenizerFactory tf
    = new RegExTokenizerFactory("\\S+");
TokenizerFactory ntf
    = new TokenNGramTokenizerFactory(tf, 2, 3);
The sequences of tokens produced by ntf for some
inputs are as follows.

| String | Tokens |
|---|---|
| "a" | (none) |
| "a b" | "a b" |
| "a b c" | "a b", "b c", "a b c" |
| "a b c d" | "a b", "b c", "c d", "a b c", "b c d" |

The start and end positions of each n-gram are calculated from the positions of the base tokens provided by the base tokenizer.
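The behavior described above can be approximated without the library. The following is a minimal sketch, not the library's implementation: the class name TokenNGramSketch, the Gram holder, and the nGrams method are hypothetical names introduced for illustration. It assumes that base tokens are joined with a single space, that n-grams are emitted shortest first and left to right within each length (matching the table), and that end positions are exclusive. The argument check mirrors the documented constructor contract.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not part of the documented API.
final class TokenNGramSketch {

    /** A token n-gram with character offsets into the input (end is exclusive, by assumption). */
    public static final class Gram {
        public final String text;
        public final int start;
        public final int end;

        public Gram(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "\"" + text + "\" [" + start + "," + end + ")";
        }
    }

    /** Returns all token n-grams of input with min <= n <= max. */
    public static List<Gram> nGrams(String input, int min, int max) {
        // Same bounds as the documented constructor contract.
        if (min < 1 || max < min) {
            throw new IllegalArgumentException(
                "Require 1 <= min <= max; found min=" + min + " max=" + max);
        }

        // Base tokenization: maximal runs of non-whitespace, as with "\\S+".
        List<String> tokens = new ArrayList<>();
        List<Integer> starts = new ArrayList<>();
        List<Integer> ends = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\S+").matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
            starts.add(matcher.start());
            ends.add(matcher.end());
        }

        // Shorter n-grams first, then left to right within each length.
        List<Gram> grams = new ArrayList<>();
        for (int n = min; n <= max; ++n) {
            for (int i = 0; i + n <= tokens.size(); ++i) {
                String text = String.join(" ", tokens.subList(i, i + n));
                // An n-gram spans from its first base token's start
                // to its last base token's end.
                grams.add(new Gram(text, starts.get(i), ends.get(i + n - 1)));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        for (Gram gram : nGrams("a b c d", 2, 3)) {
            System.out.println(gram);
        }
    }
}
```

Because each n-gram takes its start from its first base token and its end from its last, the offsets refer to the original character sequence even when the base tokens are separated by more than one whitespace character.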
| Constructor and Description |
|---|
| TokenNGramTokenizerFactory(TokenizerFactory factory, int min, int max)<br>Construct a token n-gram tokenizer factory using the specified base factory that produces n-grams within the specified minimum and maximum length bounds. |
| Modifier and Type | Method and Description |
|---|---|
| TokenizerFactory | baseTokenizerFactory()<br>Return the base tokenizer factory used to generate the underlying tokens from which n-grams are generated. |
| int | maxNGram()<br>Return the maximum n-gram length. |
| int | minNGram()<br>Return the minimum n-gram length. |
| Tokenizer | tokenizer(char[] cs, int start, int len)<br>Returns a tokenizer for the specified subsequence of characters. |
| String | toString() |
public TokenNGramTokenizerFactory(TokenizerFactory factory, int min, int max)

Parameters:
factory - Base tokenizer factory.
min - Minimum n-gram length (inclusive).
max - Maximum n-gram length (inclusive).

Throws:
IllegalArgumentException - If the minimum is less than 1 or the maximum is less than the minimum.

public int minNGram()

public int maxNGram()

public TokenizerFactory baseTokenizerFactory()

public Tokenizer tokenizer(char[] cs, int start, int len)

Specified by:
tokenizer in interface TokenizerFactory

Parameters:
cs - Characters to tokenize.
start - Index of first character to tokenize.
len - Number of characters to tokenize.

Copyright © 2016 Alias-i, Inc. All rights reserved.