public class EditDistance extends Object implements Distance<CharSequence>, Proximity<CharSequence>
EditDistance class implements the standard notion
of edit distance, with or without transposition. The distance
without transposition is known as Levenshtein distance, and
with transposition as Damerau-Levenstein distance (see below about
distance-like metric properties).
The edit distance between two strings is defined to be the minimum number of non-matching substring edits that is required to turn one string into the other. The available edits and their corresponding input and output substrings are:
Examples of minimal edit sequences are as follows, with distances as indicated
Operation Input Output Notation Match "a" "a" m(a) Insert "" "a" i(a) Delete "a" "" d(a) Substitute "a" "b" s(b,a) Transpose "ab" "ba" t(ab)
Input
Output
Edit Sequence (w. Transp)
Dist (w. Transp)
it it m(i) m(t) 0
gage gauge m(g) m(a) i(u) m(g) m(e) 1
thhe the m(t) d(h) m(h) m(e) 1
tenser tensor m(t) m(e) m(n) m(s) s(o,e) m(r) 1
hte the d(h) m(t) i(h) m(e) [t(ht) m(e)] 2 [1]
htne then d(h) m(t) s(h,n) m(e) i(n) [t(ht) t(ne)] 3 [2]
pwn4g ownage s(o,p) m(w) m(n) s(a,4) m(g) i(e) 3
Note that, in general, there will be more than one way to edit a
string into another string. For instance, a delete and insert may
replace a transposition, so that "hte" becomes "the" through edits
"d(h) m(t) i(h) m(e)" or "i(t) m(h) d(t) m(e)", as well as many
many more, such as "d(h) d(t) d(e) i(t) i(h) i(e)".
Edit distance without transposition defines a proper metric.
Recall that a distance measure d(x,y) forms a metric
if for all x, y, z, we have (1) d(x,y) >=
0, (2) d(x,y) = d(y,x), (3) d(x,x) =
0, and (4) d(x,y) + d(y,z) >= d(x,z). All
of these properties are easy to verify. But with transposition, we
have strings such as AB, BA and
ACB. With transposition, d(AB,BA)=1,
d(BA,BCA)=1, but d(AB,BCA)= 3 >= d(AB,BA) +
d(BA,BCA) = 1 + 1 = 2.
Implementation Note: This class implements edit distance
using dynamic programming in time O(n*m) where
n and m are the length of the sequences
being compared. Using a sliding window of three lattice slices rather
than allocating the entire lattice at once, the space required is
that for three arrays of integers as long as the shorter of the two
character sequences being compared. For details, see section
12.1.1 of:
| Modifier and Type | Field and Description |
|---|---|
static Distance<CharSequence> |
NON_TRANSPOSING
Edit distance disallowing transposition.
|
static Distance<CharSequence> |
TRANSPOSING
Edit distance allowing transposition.
|
| Constructor and Description |
|---|
EditDistance(boolean allowTransposition)
Construct an edit distance with or without transposition based
on the specified flag.
|
| Modifier and Type | Method and Description |
|---|---|
double |
distance(CharSequence cSeq1,
CharSequence cSeq2)
Returns the edit distance between the specified character
sequences.
|
static int |
editDistance(CharSequence cSeq1,
CharSequence cSeq2,
boolean allowTransposition)
Returns the edit distance between the character sequences with
or without transpositions as specified.
|
double |
proximity(CharSequence cSeq1,
CharSequence cSeq2)
Returns the proximity between the character sequences.
|
String |
toString()
Returns a string representation of this edit distance.
|
public static final Distance<CharSequence> TRANSPOSING
public static final Distance<CharSequence> NON_TRANSPOSING
public EditDistance(boolean allowTransposition)
allowTransposition - Set to true to allow
transposition edits in the constructed distance.public double distance(CharSequence cSeq1, CharSequence cSeq2)
distance in interface Distance<CharSequence>cSeq1 - First character sequence.cSeq2 - Second character sequence.public double proximity(CharSequence cSeq1, CharSequence cSeq2)
and thus proximities will all be negative or zero.proximity(cs1,cs2) = -distance(cs1,cs2)
proximity in interface Proximity<CharSequence>cSeq1 - First character sequence.cSeq2 - Second character sequence.public String toString()
public static int editDistance(CharSequence cSeq1, CharSequence cSeq2, boolean allowTransposition)
cSeq1 - First character sequence.cSeq2 - Second character sequence.allowTransposition - Set to true to allow
transposition edits.Copyright © 2016 Alias-i, Inc.. All rights reserved.