Interface SMStringLevenshtein
- All Superinterfaces:
SimilarityMeasure, SMString
- All Known Implementing Classes:
SMStringLevenshteinImpl
public interface SMStringLevenshtein extends SMString
Compares two strings using the Levenshtein algorithm. The comparison can be case sensitive or case insensitive.
Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the minimum number of deletions, insertions, or substitutions required to transform s into t. For example,
- If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
- If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.
The greater the Levenshtein distance, the more different the strings are. Edit distance can be computed by an algorithm that runs in O(nd) time in the worst case and O(n + d^2) time on average, where n is the length of the strings and d is the edit distance between them.
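The definition and the two examples above can be checked with a short sketch. It uses the textbook dynamic-programming formulation, which runs in O(n*m) time for strings of length n and m; it is not the O(nd) algorithm, and it is not taken from SMStringLevenshteinImpl. The class and method names are hypothetical.

```java
/** Textbook dynamic-programming edit distance; an illustrative sketch only. */
final class LevenshteinSketch {

    /** Returns the minimum number of insertions, deletions, and substitutions that turn s into t. */
    static int distance(String s, String t) {
        // prev[j] holds the distance between the first i-1 characters of s
        // and the first j characters of t; curr is the row being filled in.
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of t takes j insertions
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // turning the first i characters of s into "" takes i deletions
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,  // insertion
                                 prev[j] + 1),     // deletion
                        prev[j - 1] + cost);       // substitution, or match when cost is 0
            }
            // Swap rows so that prev always refers to the most recently completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("test", "test")); // 0
        System.out.println(distance("test", "tent")); // 1
    }
}
```

Keeping only two rows of the table is enough when just the distance is needed; recovering the actual edit script would require the full table or a linear-space method such as Hirschberg's.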
Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance.
Similarity
The similarity between s and t is defined as sim(s,t) = 1 - LD(s,t) / max(length(s),length(t)), so identical strings have similarity 1 and the value falls toward 0 as the distance grows.
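A minimal sketch of the normalization follows. It computes the ratio LD(s,t) / max(length(s),length(t)) and its complement, 1 - ratio. Reporting the complement as the similarity (so that identical strings score 1.0), returning a ratio of 0.0 for two empty strings, and lower-casing with Locale.ROOT for case-insensitive comparison are all assumptions of this sketch, not confirmed behaviour of SMStringLevenshteinImpl. The class and method names are hypothetical.

```java
import java.util.Locale;

/** Length-normalized Levenshtein distance; an illustrative sketch only. */
final class NormalizedLevenshteinSketch {

    /** Textbook two-row dynamic-programming edit distance. */
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    /** Returns LD(s,t) / max(length(s),length(t)), a value in [0, 1]. */
    static double ratio(String s, String t, boolean caseSensitive) {
        if (!caseSensitive) {
            // Assumption: case-insensitive comparison is modelled by lower-casing both inputs.
            s = s.toLowerCase(Locale.ROOT);
            t = t.toLowerCase(Locale.ROOT);
        }
        int longest = Math.max(s.length(), t.length());
        if (longest == 0) {
            return 0.0; // Assumption: two empty strings are treated as identical.
        }
        return (double) distance(s, t) / longest;
    }

    /** Assumption: similarity is the complement of the ratio, so identical strings score 1.0. */
    static double similarity(String s, String t, boolean caseSensitive) {
        return 1.0 - ratio(s, t, caseSensitive);
    }

    public static void main(String[] args) {
        System.out.println(ratio("test", "tent", true));      // 0.25
        System.out.println(similarity("test", "tent", true)); // 0.75
        System.out.println(ratio("Test", "test", false));     // 0.0
    }
}
```

Because LD(s,t) can never exceed the length of the longer string, the ratio always stays between 0 and 1.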
Online References
Other discussions of Levenshtein distance are:
- Michael Gilleland, Merriam Park Software, Levenshtein Distance, in Three Flavors
- Lloyd Allison, Dynamic Programming Algorithm (DPA) for Edit-Distance
- Alex Bogomolny, Distance Between Strings
- Thierry Lecroq, Levenshtein Distance
Paper References
- V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals.
Doklady Akademii Nauk SSSR 163(4) p845-848, 1965, also Soviet Physics Doklady 10(8)
p707-710, Feb 1966.
Discovered the basic DPA for edit distance.
- S. B. Needleman and C. D. Wunsch. A general method applicable to the search for
similarities in the amino acid sequence of two proteins. Jrnl Molec. Biol. 48 p443-453,
1970.
Defined a similarity score on molecular-biology sequences, with an O(n^2) algorithm that is closely related to those discussed here.
- Hirschberg (1975) presented a method of recovering an alignment (of an LCS) in O(n^2) time but in only linear, O(n)-space.
- E. Ukkonen. On approximate string matching. Proc. Int. Conf. on Foundations of Comp. Theory,
Springer-Verlag, LNCS 158 p487-495, 1983.
- Author:
- Rainer Maximini
Field Summary
Fields (modifier and type, field, description):
- static boolean DEFAULT_CASE_SENSITIVE: The default value for case sensitive is true.
- static int DEFAULT_THRESHOLD: The default threshold value is -1.
- static String NAME: Name of similarity measure is "StringLevenshtein".
Fields inherited from interface de.uni_trier.wi2.procake.similarity.SimilarityMeasure
COMPONENT, COMPONENT_KEY, LOG_ORDER_NAME_NOT_FOUND
Method Summary
Methods (modifier and type, method):
- int getThreshold()
- boolean isCaseInsensitive()
- boolean isCaseSensitive()
- void setCaseInsensitive()
- void setCaseSensitive()
- void setThreshold(int threshold)
Methods inherited from interface de.uni_trier.wi2.procake.similarity.SimilarityMeasure
compute, getDataClass, getName, getSystemName, isForceOverride, isReusable, setForceOverride
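To illustrate how the accessors listed above might fit together, the following sketch defines a hypothetical stand-in interface that mirrors only the six method signatures and the two default constants from this page, together with a trivial in-memory implementation. It does not use the real ProCAKE types, it does not show how an SMStringLevenshtein instance is obtained, and it assumes that setCaseSensitive() and setCaseInsensitive() toggle one shared flag. The meaning of the threshold is not described here, so the sketch only stores the value.

```java
/** Hypothetical stand-in for the listed accessors; not the ProCAKE interface itself. */
final class LevenshteinSettingsSketch {

    /** Mirrors the method signatures and default constants listed on this page. */
    interface Settings {
        boolean DEFAULT_CASE_SENSITIVE = true;
        int DEFAULT_THRESHOLD = -1;

        int getThreshold();
        boolean isCaseInsensitive();
        boolean isCaseSensitive();
        void setCaseInsensitive();
        void setCaseSensitive();
        void setThreshold(int threshold);
    }

    /** Trivial in-memory implementation used only for illustration. */
    static final class InMemorySettings implements Settings {
        // Assumption: both case setters act on a single flag.
        private boolean caseSensitive = DEFAULT_CASE_SENSITIVE;
        // The threshold is stored as-is; its semantics are not modelled here.
        private int threshold = DEFAULT_THRESHOLD;

        @Override public int getThreshold() { return threshold; }
        @Override public boolean isCaseInsensitive() { return !caseSensitive; }
        @Override public boolean isCaseSensitive() { return caseSensitive; }
        @Override public void setCaseInsensitive() { caseSensitive = false; }
        @Override public void setCaseSensitive() { caseSensitive = true; }
        @Override public void setThreshold(int threshold) { this.threshold = threshold; }
    }

    public static void main(String[] args) {
        Settings settings = new InMemorySettings();
        System.out.println(settings.isCaseSensitive()); // true, the documented default
        settings.setCaseInsensitive();
        System.out.println(settings.isCaseSensitive()); // false
        System.out.println(settings.getThreshold());    // -1, the documented default
    }
}
```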
Field Detail
-
NAME
static final String NAME
Name of similarity measure is "StringLevenshtein".
- See Also:
- Constant Field Values
-
DEFAULT_CASE_SENSITIVE
static final boolean DEFAULT_CASE_SENSITIVE
The default value for case sensitive is true.
- See Also:
- Constant Field Values
-
DEFAULT_THRESHOLD
static final int DEFAULT_THRESHOLD
The default threshold value is -1.
- See Also:
- Constant Field Values