public abstract class WeightedEditDistance extends Object implements Distance<CharSequence>, Proximity<CharSequence>
WeightedEditDistance class implements both the
proximity and distance interfaces based on the negative proximity
weights assigned to independent atomic edit operations.
Weights on edit operations are scaled as log probabilities.
Practically speaking, this means that the larger the weight, the
more likely the edit operation; keep in mind that -1 is larger than
-3, representing 2-1 = 1/2 and
2-3 = 1/8 respectively on a linear
probability scale.
The log probability of a sequence of independent edits is the
sum of the log probabilities of the individual edits. Proximity
between strings s1 and s2 is defined as
the maximum sum of edit weights over sequences of edits that
convert s1 to s2.
Like the individual edit weights, proximity is scaled as a log probability of the complete edit. The larger the proximity, the closer the strings; again, keep in mind that -10 is larger than -20, representing roughly 1/1000 and 1/1,000,000 on the linear probability scale.
Distance is just negative proximity. This scales edit distances in the usual way, with distance of 3 between strings indicating they are further away from each other than strings at distance 1.25.
This class generalizes the behavior of the class
spell.EditDistance without extending it in the inheritance
sense. Weighted edit distance agrees with edit distance (up to
arithmetic precision) as a distance assuming the following weights:
match weight is 0, substitute, insert and delete weights are
-1, and the transposition weight is -1 if
transpositions are allowed in the edit distance and
Double.NEGATIVE_INFINITY otherwise.
If the substitution and transposition weights are symmetric and the insert and delete costs of a character are equal, then weighted edit distance will be symmetric.
If the match weight of all characters is zero, then the distance between a character sequence and itself will be zero.
If transpose weights are negative infinity so that transposition is
not allowed, and if the assignment of substitution weights forms a
metric (see Distance for a definition), and if delete and
insert weights are non-negative and equal for all characters, and
if match weights are all zero, then weighted edit distance will
form a proper metric. Other values may also form metrics, such as
a weight of -1 for all edits other than transpose.
A probabilistic relational model between strings is defined if
the weights are properly scaled as log probabilities. Because
probabilities are between 0 and 1, log probabilities will be
between negative infinity and zero. Proximity between two strings
in and out is defined by:
where the cost of the edit is defined to be:proximity(in,out) = Maxedit(in)=out log2 P(edit)
log2 P(edit)
= log2 P(edit0,...,editn-1)
~ log2 P(edit0) + ... + log P(editn-1)
The last line is an approximation assuming edits are
independent.
In order to create a proper probabilistic channel, exponentiated edit weights must sum to 1.0. This is not technically possible with a local model if transposition is allowed, because of boundary conditions and independence assumptions. It is possible to define a proper channel if transposition is off, and if all edit weights for a position (including all sequences of arbitrarily long insertions) sum to 1.0. In particular, if any edits at all are allowed (have finite weights), then there must be a non-zero weight assigned to matching, otherwise exponentiated edit weight sum would exceed 1.0. It is always possible to add an offset to normalize the values to a probability model (the offset will be negative if the sum exceeds 1.0 and positive if it falls below 1.0 and zero otherwise).
A fully probabilistic model would have to take the sum over all edits rather than the maximum. This class makes the so-called Viterbi approximation, assuming the full probability is close to that of the best probability, or at least proportional to it.
| Constructor and Description |
|---|
WeightedEditDistance()
Construct a weighted edit distance.
|
| Modifier and Type | Method and Description |
|---|---|
abstract double |
deleteWeight(char cDeleted)
Returns the weight of deleting the specified character.
|
double |
distance(CharSequence csIn,
CharSequence csOut)
Returns the weighted edit distance between the specified
character sequences.
|
abstract double |
insertWeight(char cInserted)
Returns the weight of inserting the specified character.
|
abstract double |
matchWeight(char cMatched)
Returns the weight of matching the specified character.
|
double |
proximity(CharSequence csIn,
CharSequence csOut)
Returns the weighted proximity between the specified character
sequences.
|
abstract double |
substituteWeight(char cDeleted,
char cInserted)
Returns the weight of substituting the inserted character for
the deleted character.
|
abstract double |
transposeWeight(char cFirst,
char cSecond)
Returns the weight of transposing the specified characters.
|
public WeightedEditDistance()
public double distance(CharSequence csIn, CharSequence csOut)
This method is thread safe and may be accessed concurrently if the abstract weighting methods are thread safe.
distance in interface Distance<CharSequence>csIn - First character sequence.csOut - Second character sequence.public double proximity(CharSequence csIn, CharSequence csOut)
This method is thread safe and may be accessed concurrently if the abstract weighting methods are thread safe.
proximity in interface Proximity<CharSequence>csIn - First character sequence.csOut - Second character sequence.public abstract double matchWeight(char cMatched)
All weights should be less than or equal to zero, with
heavier weights being larger absolute valued negatives.
Basically, the weights may be treated as unscaled log
probabilities. Thus valid values will range between 0.0
(probablity 1) and Double.NEGATIVE_INFINITY
(probability 0). See the class documentation above for more
information.
cMatched - Character matched.public abstract double deleteWeight(char cDeleted)
All weights should be less than or equal to zero, with
heavier weights being larger absolute valued negatives.
Basically, the weights may be treated as unscaled log
probabilities. Thus valid values will range between 0.0
(probablity 1) and Double.NEGATIVE_INFINITY
(probability 0). See the class documentation above for more
information.
cDeleted - Character deleted.public abstract double insertWeight(char cInserted)
All weights should be less than or equal to zero, with
heavier weights being larger absolute valued negatives.
Basically, the weights may be treated as unscaled log
probabilities. Thus valid values will range between 0.0
(probablity 1) and Double.NEGATIVE_INFINITY
(probability 0). See the class documentation above for more
information.
cInserted - Character inserted.public abstract double substituteWeight(char cDeleted,
char cInserted)
All weights should be less than or equal to zero, with
heavier weights being larger absolute valued negatives.
Basically, the weights may be treated as unscaled log
probabilities. Thus valid values will range between 0.0
(probablity 1) and Double.NEGATIVE_INFINITY
(probability 0). See the class documentation above for more
information.
cDeleted - Deleted character.cInserted - Inserted character.public abstract double transposeWeight(char cFirst,
char cSecond)
All weights should be less than or equal to zero, with
heavier weights being larger absolute valued negatives.
Basically, the weights may be treated as unscaled log
probabilities. Thus valid values will range between 0.0
(probablity 1) and Double.NEGATIVE_INFINITY
(probability 0). See the class documentation above for more
information.
cFirst - First character in input.cSecond - Second character in input.Copyright © 2016 Alias-i, Inc.. All rights reserved.