public class SoundexTokenizerFactory extends ModifyTokenTokenizerFactory implements Serializable
A SoundexTokenizerFactory modifies the output of a base
tokenizer factory to produce tokens in Soundex representation.
Soundex replaces sequences of characters with a crude
four-character approximation of their pronunciation: the initial
letter followed by three digit codes.
The process for converting an input to its Soundex representation is fairly straightforward for inputs that are all ASCII letters. Soundex is case insensitive, but is only defined for strings of ASCII letters. Thus to begin, all characters that are not Latin1 letters are removed, and all Latin1 characters are stripped of their diacritics. The algorithm then proceeds according to its standard definition:
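- The letters A, E, I, O, U, H, W, and Y are not assigned a code; encoding continues with the next letter.
- If the output has fewer than four characters, it is padded with zeros (0).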
The table of individual character encodings is as follows:
Characters                Code
B, F, P, V                1
C, G, J, K, Q, S, X, Z    2
D, T                      3
L                         4
M, N                      5
R                         6
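The encoding can be illustrated with a minimal, self-contained sketch. This is not the LingPipe source; the class name SoundexSketch and its encode method are hypothetical. It assumes that diacritics can be stripped by Unicode decomposition (Latin1 letters that do not decompose, such as ß or Ø, are simply dropped), that A, E, I, O, U, and Y reset the previous code so that a repeated code after them is emitted again, and that H and W are skipped without resetting it.

```java
import java.text.Normalizer;
import java.util.Locale;

final class SoundexSketch {
    // Codes for 'A'..'Z' taken from the table above;
    // '0' marks the uncoded letters A, E, I, O, U, H, W, Y.
    private static final String CODES = "01230120022455012623010202";

    static String encode(String token) {
        // Assumption: decompose accented letters, then keep ASCII letters only.
        String letters = Normalizer.normalize(token, Normalizer.Form.NFD)
                .replaceAll("[^A-Za-z]", "")
                .toUpperCase(Locale.ROOT);
        if (letters.isEmpty()) {
            return "0000"; // nothing to encode, e.g. punctuation or numbers
        }
        StringBuilder out = new StringBuilder(4);
        char first = letters.charAt(0);
        out.append(first); // the initial letter is kept as-is
        char lastCode = CODES.charAt(first - 'A');
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // assumption: skipped, does not separate equal codes
            }
            char code = CODES.charAt(c - 'A');
            if (code == '0') {
                lastCode = '0'; // vowel or Y: uncoded, but resets the last code
                continue;
            }
            if (code != lastCode) {
                out.append(code); // runs of the same code are emitted once
            }
            lastCode = code;
        }
        while (out.length() < 4) {
            out.append('0'); // pad short encodings with zeros
        }
        return out.toString();
    }
}
```

Under these assumptions, SoundexSketch.encode("Jackson") returns J250 and SoundexSketch.encode("Ashcraft") returns A261, in agreement with the examples listed below.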
Here are some examples of translations from the unit tests, drawn from the sources cited below.
Tokens                    Soundex Encoding
Gutierrez                 G362
Pfister                   P236
Jackson                   J250
Tymczak                   T522
Ashcraft                  A261
Robert, Rupert            R163
Euler, Ellery             E460
Gauss, Ghosh              G200
Hilbert, Heilbronn        H416
Knuth, Kant               K530
Lloyd, Liddy              L300
Lukasiewicz, Lissajous    L222
Wachs, Waugh              W200
As a tokenizer filter, the SoundexFilterTokenizer simply replaces each token with its Soundex equivalent. Note that this may produce very many 0000 outputs if it is fed standard text with punctuation, numbers, etc.
Note: In order to produce a deterministic tokenizer filter, names with prefixes are coded with the prefix. Recall that Soundex considers the following set of words to be prefixes, and suggests providing both the Soundex encoding computed with the prefix and the encoding computed without it:
Van, Con, De, Di, La, Le
These are not accorded any special treatment by this implementation.
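Because this implementation does not handle prefixes, a caller who wants both encodings has to produce the two forms of the name before encoding them. The following sketch shows one way this might be done. The class PrefixVariants and its variants method are hypothetical and not part of LingPipe. The detection rule is an assumption made for illustration: a prefix counts only if it is followed by whitespace or an upper-case letter, so that a name such as Dean is not split.

```java
import java.util.ArrayList;
import java.util.List;

final class PrefixVariants {
    // The prefixes named in the documentation above.
    private static final String[] PREFIXES = {"Van", "Con", "De", "Di", "La", "Le"};

    /** Returns the name itself, plus the name without its prefix if one is found. */
    static List<String> variants(String name) {
        List<String> result = new ArrayList<>();
        result.add(name);
        for (String prefix : PREFIXES) {
            int n = prefix.length();
            // Case-insensitive match of the prefix at the start of the name.
            if (name.length() > n && name.regionMatches(true, 0, prefix, 0, n)) {
                char next = name.charAt(n);
                // Assumed heuristic: the prefix must end at a word or case boundary.
                if (Character.isWhitespace(next) || Character.isUpperCase(next)) {
                    String rest = name.substring(n).trim();
                    if (!rest.isEmpty()) {
                        result.add(rest);
                    }
                    break; // strip at most one prefix
                }
            }
        }
        return result;
    }
}
```

Each returned string could then be passed to the static method soundexEncoding(String token) of this class to obtain the encoding with the prefix and the encoding without it.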
Thread Safety
A Soundex tokenizer factory is thread safe if its base tokenizer factory is thread safe.
Serialization
A SoundexTokenizerFactory is serializable if its base tokenizer factory is serializable.
References and Historical Notes
Soundex was invented and patented by Robert C. Russell in 1918. The original version involved eight categories, including one for vowels, without the initial character being treated specially as to coding. The first vowel was retained in the original Soundex. Furthermore, some positional information was added, such as the deletion of final s and z.
The version in this class is the one described by Donald Knuth in The Art of Computer Programming and by the United States National Archives and Records Administration, which has been used for the United States Census.
- Knuth, D. 1973. The Art of Computer Programming Volume 3: Sorting and Searching. Addison-Wesley. 2nd Edition. Pages 394-395.
- Wikipedia. Soundex.
- United States National Archives and Records Administration. Using the Census Soundex. General Information Leaflet 55.
- Robert C. Russell. 1918. United States Patent 1,261,167.
- Robert C. Russell. 1922. United States Patent 1,435,663.
- Since:
- LingPipe 3.8
- Version:
- 4.0.1
- Author:
- Bob Carpenter
- See Also:
- Serialized Form
Constructor Summary
Constructor and Description
SoundexTokenizerFactory(TokenizerFactory factory)
Construct a Soundex-based tokenizer factory that converts tokens produced by the specified base factory into their Soundex representations.
Method Summary
Modifier and Type    Method and Description
String               modifyToken(String token)
                     Returns the Soundex encoding of the specified token.
static String        soundexEncoding(String token)
                     Returns the Soundex encoding of the specified token.
String               toString()
Methods inherited from class com.aliasi.tokenizer.ModifyTokenTokenizerFactory
modify, modifyWhitespace
Methods inherited from class com.aliasi.tokenizer.ModifiedTokenizerFactory
baseTokenizerFactory, tokenizer
Constructor Detail
SoundexTokenizerFactory
public SoundexTokenizerFactory(TokenizerFactory factory)
Construct a Soundex-based tokenizer factory that converts tokens produced by the specified base factory into their Soundex representations.
- Parameters:
factory - Base tokenizer factory.
Method Detail
modifyToken
public String modifyToken(String token)
Returns the Soundex encoding of the specified token. See the class documentation above for more information on the encoding.
- Overrides:
modifyToken in class ModifyTokenTokenizerFactory
- Parameters:
token - Input token.
- Returns:
The Soundex encoding of the input token.
toString
public String toString()
- Overrides:
toString in class ModifyTokenTokenizerFactory
Copyright © 2016 Alias-i, Inc. All rights reserved.