public class JaccardDistance extends TokenizedDistance
The JaccardDistance class implements a notion of
distance based on token overlap. The tokens are generated
from the character sequences being compared by a tokenizer
factory that is supplied at construction time. A distance of
zero (0) is a perfect match; a distance of
one (1) is a perfect mismatch.
Suppose termSet(cs) is the set of tokens extracted from
the character sequence cs. With these terms,
the proximity underlying Jaccard distance is defined
as the fraction of the tokens found in either character
sequence that appear in both:
proximity(cs1,cs2)
= size(termSet(cs1) INTERSECT termSet(cs2))
/ size(termSet(cs1) UNION termSet(cs2))
Proximities run between 0 and 1. A proximity of 0 means the
character sequences share no terms in common and a proximity of 1
means the character sequences share all of their terms.
Distance is then defined in terms of proximity by subtraction:
distance(cs1,cs2) = 1 - proximity(cs1,cs2)
Distances also run between 0 and 1. A distance of 0 means the
character sequences share all of their terms, whereas a distance of 1
means they have no terms in common.
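
The definitions above can be sketched in a few lines of plain Java. This is an illustrative sketch only, not the library's implementation: the class name JaccardSketch is hypothetical, whitespace splitting stands in for the tokenizer factory, and treating two empty token sets as a perfect match is an assumption made here.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the formulas above; not the JaccardDistance source.
final class JaccardSketch {

    // Stand-in for a tokenizer factory (assumption): split on whitespace.
    static Set<String> termSet(CharSequence cs) {
        Set<String> terms = new HashSet<>();
        for (String token : cs.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // size(termSet(cs1) INTERSECT termSet(cs2)) / size(termSet(cs1) UNION termSet(cs2))
    static double proximity(CharSequence cs1, CharSequence cs2) {
        Set<String> terms1 = termSet(cs1);
        Set<String> terms2 = termSet(cs2);

        Set<String> union = new HashSet<>(terms1);
        union.addAll(terms2);
        if (union.isEmpty()) {
            // Assumption: two sequences with no tokens count as a perfect match.
            return 1.0;
        }

        Set<String> intersection = new HashSet<>(terms1);
        intersection.retainAll(terms2);
        return (double) intersection.size() / union.size();
    }

    // distance(cs1,cs2) = 1 - proximity(cs1,cs2)
    static double distance(CharSequence cs1, CharSequence cs2) {
        return 1.0 - proximity(cs1, cs2);
    }

    public static void main(String[] args) {
        // Shared tokens {b, c}; all distinct tokens {a, b, c, d}.
        System.out.println(proximity("a b c", "b c d")); // prints 0.5
        System.out.println(distance("a b c", "b c d"));  // prints 0.5
    }
}
```

In this sketch, "a b c" and "b c d" share 2 of 4 distinct tokens, so both proximity and distance are 0.5. Results for the real class depend on the tokenizer factory supplied at construction time.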
| Constructor and Description |
|---|
| JaccardDistance(TokenizerFactory factory): Construct an instance of Jaccard string distance using the specified tokenizer factory. |
| Modifier and Type | Method and Description |
|---|---|
| double | distance(CharSequence cSeq1, CharSequence cSeq2): Returns the Jaccard distance between the specified character sequences. |
| double | proximity(CharSequence cSeq1, CharSequence cSeq2): Returns the proximity between the specified character sequences. |
Methods inherited from TokenizedDistance: termFrequencyVector, tokenizerFactory, tokenSet, tokenSet

public JaccardDistance(TokenizerFactory factory)
factory - Tokenizer factory for distance.

public double distance(CharSequence cSeq1, CharSequence cSeq2)
cSeq1 - First character sequence.
cSeq2 - Second character sequence.

public double proximity(CharSequence cSeq1, CharSequence cSeq2)
cSeq1 - First character sequence.
cSeq2 - Second character sequence.

Copyright © 2016 Alias-i, Inc. All rights reserved.