public class TfIdfDistance extends TokenizedDistance implements ObjectHandler<CharSequence>
The TfIdfDistance class provides a string distance
based on term frequency (TF) and inverse document frequency (IDF).
The method distance(CharSequence,CharSequence) will return
results in the range between 0 (perfect match) and
1 (no match) inclusive; the method proximity(CharSequence,CharSequence) runs in the opposite
direction, returning 0 for no match and 1
for a perfect match. Full details are provided below.
Terms are produced from the character sequences being compared by a tokenizer factory fixed at construction time. These terms form the dimensions of vectors whose values are the counts for the terms in the strings being compared.
The raw term frequencies are adjusted in scale and by inverse
document frequency. The resulting term vectors are then compared
by cosine. Because the term vectors contain only
non-negative values, the result is a proximity between zero
(0), for completely dissimilar strings, and one
(1), for character-by-character identical strings.
The inverse document frequencies are defined over a collection
of documents. The collection of documents must be provided to this
class one at a time through the handle(CharSequence) method.
Note that there is a range of different distances called "TF/IDF" distance. The one in this class is defined to be symmetric, unlike typical TF/IDF distances defined for information retrieval. It scales inverse document frequencies by logs, and both inverse document frequencies and term frequencies by square roots. This causes the influence of IDF to grow logarithmically, and term frequency comparison to grow linearly.
Suppose we have a collection docs of n
strings, which we will call documents in keeping with tradition.
Further let df(t,docs) be the document frequency of
token t, that is, the number of documents in which the
token t appears. Then the inverse document frequency
(IDF) of t is defined by:
idf(t,docs) = sqrt(log(n/df(t,docs)))
If the document frequency df(t,docs) of a term is
zero, then idf(t,docs) is set to zero. As a result,
only terms that appeared in at least one training document are
used during comparison.
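The IDF definition above can be sketched in a few lines of Java. This is a minimal illustration, not the library's implementation: the class name IdfSketch and its parameters are hypothetical, and the natural logarithm is assumed because the formula does not state a base.

```java
// Hypothetical helper for illustration only; not part of the documented API.
final class IdfSketch {
    private IdfSketch() {}

    /**
     * Returns sqrt(log(n / df)), or 0.0 when the term was never seen in training.
     * Assumption: log is the natural logarithm.
     */
    static double idf(int numDocs, int docFrequency) {
        if (docFrequency == 0) {
            return 0.0; // unseen terms are ignored during comparison
        }
        return Math.sqrt(Math.log((double) numDocs / docFrequency));
    }
}
```

Under this definition, a term that appears in every training document also gets an IDF of zero, because log(n/n) is zero.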
The term vector for a string is then defined by its term
frequencies. If count(t,cs) is the count of term
t in character sequence cs, then
the term frequency (TF) is defined by:
tf(t,cs) = sqrt(count(t,cs))
The term-frequency/inverse-document frequency (TF/IDF) vector
tfIdf(cs,docs) for a character sequence cs
over a collection of documents docs has a value
tfIdf(cs,docs)(t) for term t defined by:
tfIdf(cs,docs)(t) = tf(t,cs) * idf(t,docs)
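As a rough sketch of how such a vector could be built, the following Java combines the TF and IDF definitions above. The class TfIdfVectorSketch, the docFreq map, and whitespace tokenization are assumptions made for illustration; the actual class uses the tokenizer factory supplied at construction time.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the documented API.
final class TfIdfVectorSketch {
    private TfIdfVectorSketch() {}

    /** Builds the sparse TF/IDF vector of cs given document frequencies over numDocs documents. */
    static Map<String, Double> tfIdf(String cs, Map<String, Integer> docFreq, int numDocs) {
        // Count terms; assumption: tokens are separated by whitespace.
        Map<String, Integer> counts = new HashMap<>();
        for (String token : cs.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            if (df == 0) {
                continue; // idf is zero, so the term drops out of the vector
            }
            int count = e.getValue();
            double tf = Math.sqrt(count);                               // tf(t,cs)
            double idf = Math.sqrt(Math.log((double) numDocs / df));    // idf(t,docs)
            vector.put(e.getKey(), tf * idf);
        }
        return vector;
    }
}
```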
The proximity between character sequences cs1 and
cs2 is defined as the cosine of their TF/IDF
vectors:
proximity(cs1,cs2) = cos(tfIdf(cs1,docs), tfIdf(cs2,docs))
Recall that the cosine of two vectors is the dot product of the vectors divided by the product of their lengths:
cos(x,y) = x . y / ( |x| * |y| )
where dot products are defined by:
x . y = Σi x[i] * y[i]
and length is defined by:
|x| = sqrt(x . x)
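Substituting the TF and IDF definitions into the dot product may help show how the square roots interact. Taking x and y to be the TF/IDF vectors of cs1 and cs2, and summing over terms t with nonzero document frequency:

```latex
\begin{aligned}
x \cdot y
  &= \sum_{t} \mathrm{tf}(t,cs_1)\,\mathrm{idf}(t,docs)\;\mathrm{tf}(t,cs_2)\,\mathrm{idf}(t,docs) \\
  &= \sum_{t} \sqrt{\mathrm{count}(t,cs_1)\,\mathrm{count}(t,cs_2)}\;\log\frac{n}{\mathrm{df}(t,docs)}
\end{aligned}
```

Each shared term therefore contributes the geometric mean of its two counts, weighted by log(n/df). Scaling both counts by a factor k scales that contribution by k, which is one way to read the remark that term frequency comparison grows linearly while the influence of IDF grows logarithmically.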
Distance is then just 1 minus the proximity value.
distance(cs1,cs2) = 1 - proximity(cs1,cs2)
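Putting the definitions together, the whole computation can be sketched as a small self-contained Java class. This is an illustration under stated assumptions, not the library's code: the class TfIdfDistanceSketch is hypothetical, tokens are split on whitespace, log is the natural logarithm, and a proximity of 0 is returned when either vector has zero length (the formulas above leave that case undefined).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the formulas above; not the documented TfIdfDistance class.
final class TfIdfDistanceSketch {
    private final Map<String, Integer> docFreq = new HashMap<>();
    private int numDocs = 0;

    /** Counts terms; assumption: tokens are separated by whitespace. */
    private static Map<String, Integer> counts(CharSequence cs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : cs.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Adds one training document, updating document frequencies. */
    void handle(CharSequence doc) {
        for (String term : counts(doc).keySet()) {
            docFreq.merge(term, 1, Integer::sum);
        }
        numDocs++;
    }

    /** idf(t,docs) = sqrt(log(n / df)), or 0 for unseen terms. */
    double idf(String term) {
        int df = docFreq.getOrDefault(term, 0);
        return df == 0 ? 0.0 : Math.sqrt(Math.log((double) numDocs / df));
    }

    /** Sparse vector with value sqrt(count) * idf for each known term. */
    private Map<String, Double> tfIdf(CharSequence cs) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts(cs).entrySet()) {
            double idf = idf(e.getKey());
            if (idf > 0.0) {
                int count = e.getValue();
                vector.put(e.getKey(), Math.sqrt(count) * idf);
            }
        }
        return vector;
    }

    private static double length(Map<String, Double> v) {
        double sumSquares = 0.0;
        for (double value : v.values()) {
            sumSquares += value * value;
        }
        return Math.sqrt(sumSquares);
    }

    /** Cosine of the two TF/IDF vectors. */
    double proximity(CharSequence cs1, CharSequence cs2) {
        Map<String, Double> x = tfIdf(cs1);
        Map<String, Double> y = tfIdf(cs2);
        double lengths = length(x) * length(y);
        if (lengths == 0.0) {
            return 0.0; // assumption: treat an empty vector as "no match"
        }
        double dot = 0.0;
        for (Map.Entry<String, Double> e : x.entrySet()) {
            Double other = y.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        return dot / lengths;
    }

    /** One minus the proximity. */
    double distance(CharSequence cs1, CharSequence cs2) {
        return 1.0 - proximity(cs1, cs2);
    }
}
```

Because of floating-point rounding, the proximity of a string with itself may differ from 1 by a tiny amount, so comparisons against exact 0 or 1 are best made with a small tolerance.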
See also: org.apache.lucene.search.Similarity class documentation.
| Constructor and Description |
|---|
| TfIdfDistance(TokenizerFactory tokenizerFactory): Construct an instance of TF/IDF string distance based on the specified tokenizer factory. |
| Modifier and Type | Method and Description |
|---|---|
| double | distance(CharSequence cSeq1, CharSequence cSeq2): Returns the TF/IDF distance between the specified character sequences. |
| int | docFrequency(String term): Returns the number of training documents that contained the specified term. |
| void | handle(CharSequence cSeq): Adds the specified character sequence as a document for training. |
| double | idf(String term): Returns the inverse document frequency for the specified term. |
| int | numDocuments(): Returns the total number of training documents. |
| int | numTerms(): Returns the number of terms that have been seen during training. |
| double | proximity(CharSequence cSeq1, CharSequence cSeq2): Returns the TF/IDF proximity between the specified character sequences. |
| Set<String> | termSet(): Returns the set of known terms for this distance. |
Methods inherited from class TokenizedDistance: termFrequencyVector, tokenizerFactory, tokenSet, tokenSet

public TfIdfDistance(TokenizerFactory tokenizerFactory)
Parameters: tokenizerFactory - Tokenizer factory for this distance.

public void handle(CharSequence cSeq)
Specified by: handle in interface ObjectHandler<CharSequence>
Parameters: cSeq - Characters to train on.

public double distance(CharSequence cSeq1, CharSequence cSeq2)
Specified by: distance in interface Distance<CharSequence>
Parameters: cSeq1 - First character sequence. cSeq2 - Second character sequence.

public double proximity(CharSequence cSeq1, CharSequence cSeq2)
Specified by: proximity in interface Proximity<CharSequence>
Parameters: cSeq1 - First character sequence. cSeq2 - Second character sequence.

public int docFrequency(String term)
Parameters: term - Term to test.

public double idf(String term)
Parameters: term - The term whose IDF is returned.

public int numDocuments()

public int numTerms()

public Set<String> termSet()
Copyright © 2019 Alias-i, Inc. All rights reserved.