public abstract class TokenizedDistance extends Object implements Distance<CharSequence>, Proximity<CharSequence>
The TokenizedDistance class provides an underlying
implementation of string distance based on comparing sets of
tokens. It holds a tokenizer factory and provides convenience
methods for extracting tokens from the input.
The method tokenSet(CharSequence) provides the set of
tokens derived by tokenizing the specified character sequence. The
method termFrequencyVector(CharSequence) provides a
mapping from tokens extracted by a tokenizer to integer counts.
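The difference between the two views can be illustrated with a minimal, self-contained sketch. It does not use the LingPipe classes: the class name `TokenSketch` and the whitespace-splitting `tokenize` helper are hypothetical stand-ins for a tokenizer factory, and a plain `Map<String, Integer>` stands in for `ObjectToCounterMap<String>`.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class TokenSketch {

    // Hypothetical stand-in for a tokenizer: split on runs of whitespace.
    public static String[] tokenize(CharSequence cSeq) {
        String s = cSeq.toString().trim();
        return s.isEmpty() ? new String[0] : s.split("\\s+");
    }

    // Set view: each distinct token appears once, duplicates are dropped.
    public static Set<String> tokenSet(CharSequence cSeq) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String token : tokenize(cSeq)) {
            tokens.add(token);
        }
        return tokens;
    }

    // Count view: each distinct token maps to the number of times it occurs.
    public static Map<String, Integer> termFrequencyVector(CharSequence cSeq) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokenize(cSeq)) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "the cat saw the dog";
        System.out.println(tokenSet(text));            // [the, cat, saw, dog]
        System.out.println(termFrequencyVector(text)); // {the=2, cat=1, saw=1, dog=1}
    }
}
```

The set view discards repetition, while the count view keeps it. Which one a concrete distance needs depends on whether it compares token membership or token frequency.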
| Constructor and Description |
|---|
| TokenizedDistance(TokenizerFactory tokenizerFactory)<br>Construct a tokenized distance from the specified tokenizer factory. |
| Modifier and Type | Method and Description |
|---|---|
| ObjectToCounterMap<String> | termFrequencyVector(CharSequence cSeq)<br>Return the mapping from terms to their counts derived from the specified character sequence using the tokenizer factory in this class. |
| TokenizerFactory | tokenizerFactory()<br>Return the tokenizer factory for this tokenized distance. |
| Set<String> | tokenSet(char[] cs, int start, int length)<br>Return the set of tokens produced by the specified character slice using the tokenizer for this distance measure. |
| Set<String> | tokenSet(CharSequence cSeq)<br>Return the set of tokens produced by the specified character sequence using the tokenizer for this distance measure. |
public TokenizedDistance(TokenizerFactory tokenizerFactory)
tokenizerFactory - Tokenizer for this distance.

public TokenizerFactory tokenizerFactory()

public Set<String> tokenSet(CharSequence cSeq)
cSeq - Character sequence to tokenize.

public Set<String> tokenSet(char[] cs, int start, int length)
cs - Underlying array of characters.
start - Index of first character in slice.
length - Length of slice.
IndexOutOfBoundsException - If the start index is not within the underlying array, or if the start index plus the length minus one is not within the underlying array.

public ObjectToCounterMap<String> termFrequencyVector(CharSequence cSeq)
cSeq - Character sequence to tokenize.

Copyright © 2019 Alias-i, Inc. All rights reserved.