public class ApproxDictionaryChunker extends Object implements Chunker, Serializable
ApproxDictionaryChunker implements a chunker that
produces chunks based on weighted edit distance of strings from
dictionary entries. This is an approximate or "fuzzy"
dictionary matching strategy.
The underlying dictionary is required to be an instance of
TrieDictionary in order to support efficient search for
matches. Other dictionaries can be easily converted to
trie dictionaries by adding their entries to a fresh trie
dictionary.
Entries are matched by weighted edit distance, as supplied by an
implementation of WeightedEditDistance. All substrings
within the maximum distance specified at construction time are
returned as part of the chunking. Keep in mind that weights for
weighted edit distance are specified as proximities, that is, as
negative distances.
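The sign convention can be sketched in plain Java without the library. The class `FuzzySubstringSketch`, its uniform weights, and the brute-force scan over all substrings are assumptions made for illustration; the actual chunker searches a trie and also applies token-boundary constraints, which this sketch ignores. Weights are written as proximities (zero or negative), and the total is negated before comparing it with the positive distance bound.

```java
import java.util.ArrayList;
import java.util.List;

final class FuzzySubstringSketch {

    // Hypothetical uniform weights, written as proximities (negative distances).
    static final double MATCH = 0.0;
    static final double EDIT = -1.0; // insert, delete, or substitute

    // Best (highest) total proximity of any edit sequence turning a into b.
    // Transposition is deliberately not modeled.
    static double proximity(String a, String b) {
        double[][] p = new double[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) p[i][0] = p[i - 1][0] + EDIT;
        for (int j = 1; j <= b.length(); j++) p[0][j] = p[0][j - 1] + EDIT;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                boolean same = a.charAt(i - 1) == b.charAt(j - 1);
                double diag = p[i - 1][j - 1] + (same ? MATCH : EDIT);
                double gap = Math.max(p[i - 1][j], p[i][j - 1]) + EDIT;
                p[i][j] = Math.max(diag, gap);
            }
        }
        return p[a.length()][b.length()];
    }

    // Every substring of text whose distance (negated proximity) to the
    // entry is within maxDistance, in order of start then end position.
    static List<String> matches(String text, String entry, double maxDistance) {
        List<String> found = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                String candidate = text.substring(start, end);
                double distance = -proximity(candidate, entry);
                if (distance <= maxDistance) found.add(candidate);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [abc, b, bc, bcd, c]
        System.out.println(matches("abcd", "bc", 1.0));
    }
}
```

With a bound of 0.0 only the exact substring "bc" survives; raising the bound to 1.0 admits every substring one edit away, which is why overlapping chunks are to be expected in the output.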
Transposition is not implemented in the approximate dictionary chunker, so no matches are possible through transposition. Specifically, the transpose weight method is never called on the underlying weighted edit distance.
The tokenizer factory supplied at construction time is only used to constrain search by enforcing boundary conditions. Chunks are only returned if they start on the first character of a token and end on the last character of a token.
Using an instance of CharacterTokenizerFactory effectively removes
token sensitivity by treating every non-whitespace character as a
token and thus rendering every non-whitespace position a possible
chunk boundary.
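The boundary condition can be illustrated with a small self-contained sketch. The class `TokenBoundarySketch` and both of its methods are hypothetical; whitespace-delimited tokens stand in for whatever the supplied tokenizer factory would produce, and spans are given as a start index and an end index one past the last character.

```java
import java.util.HashSet;
import java.util.Set;

final class TokenBoundarySketch {

    // True if [start, end) begins on the first character of a token and
    // ends on the last character of a token. Tokens here are maximal runs
    // of non-whitespace characters (an assumption for this sketch).
    static boolean isTokenAligned(String text, int start, int end) {
        Set<Integer> tokenStarts = new HashSet<>();
        Set<Integer> tokenEnds = new HashSet<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int tokenStart = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            tokenStarts.add(tokenStart);
            tokenEnds.add(i); // one past the token's last character
        }
        return tokenStarts.contains(start) && tokenEnds.contains(end);
    }

    // Character-level analogue: every non-whitespace character is its own
    // token, so a span only needs to start and end on non-whitespace.
    // Assumes 0 <= start < end <= text.length().
    static boolean isCharacterAligned(String text, int start, int end) {
        return !Character.isWhitespace(text.charAt(start))
            && !Character.isWhitespace(text.charAt(end - 1));
    }

    public static void main(String[] args) {
        String text = "the p53 gene";
        System.out.println(isTokenAligned(text, 4, 7));     // "p53": true
        System.out.println(isTokenAligned(text, 5, 7));     // "53": false
        System.out.println(isCharacterAligned(text, 5, 7)); // "53": true
    }
}
```

Under the token-level check, "53" is rejected because it starts in the middle of "p53"; under the character-level check the same span is a legal chunk.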
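The approach implemented here is very similar to that described in the following paper:
Yoshimasa Tsuruoka and Jun'ichi Tsujii. 2003. Boosting precision and recall of dictionary-based protein name recognition. In Proceedings of the 2003 ACL workshop on NLP in Biomedicine.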
| Modifier and Type | Field and Description |
|---|---|
| static WeightedEditDistance | TT_DISTANCE: This is a weighted edit distance defined by Tsuruoka and Tsujii for matching protein names in biomedical texts. |
| Constructor and Description |
|---|
| ApproxDictionaryChunker(TrieDictionary<String> dictionary, TokenizerFactory tokenizerFactory, WeightedEditDistance editDistance, double distanceThreshold): Construct an approximate dictionary chunker from the specified dictionary, tokenizer factory, weighted edit distance and distance bound. |
| Modifier and Type | Method and Description |
|---|---|
| Chunking | chunk(char[] cs, int start, int end): Return the approximate dictionary-based chunking for the specified character sequence. |
| Chunking | chunk(CharSequence cSeq): Return the approximate dictionary-based chunking for the specified character sequence. |
| TrieDictionary<String> | dictionary(): Returns the trie dictionary underlying this chunker. |
| double | distanceThreshold(): Returns the maximum edit distance a string can be from a dictionary entry in order to be returned by this chunker. |
| WeightedEditDistance | editDistance(): Returns the weighted edit distance for matching with this chunker. |
| void | setMaxDistance(double distanceThreshold): Set the max distance a string can be from a dictionary entry in order to be returned as a chunk by this chunker. |
| TokenizerFactory | tokenizerFactory(): Returns the tokenizer factory for matching with this chunker. |
public static final WeightedEditDistance TT_DISTANCE
Tsuruoka and Tsujii's Weighted Edit Distance

| Operation | Character | Cost |
|---|---|---|
| Insertion | space or hyphen | -10 |
| Insertion | other characters | -100 |
| Deletion | space or hyphen | -10 |
| Deletion | other characters | -100 |
| Substitution | space for hyphen | -10 |
| Substitution | digit for other digit | -10 |
| Substitution | capital for lowercase | -10 |
| Substitution | other characters | -50 |
| Match | any character | 0 |
| Transposition | any characters | Double.NEGATIVE_INFINITY |

Tsuruoka and Tsujii's paper is available online:
Yoshimasa Tsuruoka and Jun'ichi Tsujii. 2003. Boosting precision and recall of dictionary-based protein name recognition. In Proceedings of the 2003 ACL workshop on NLP in Biomedicine.
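To make the cost table concrete, here is a self-contained sketch that applies those costs in a standard dynamic-programming edit distance. It does not use the library's `WeightedEditDistance` class; the name `TtDistanceSketch` and its helper methods are hypothetical. Costs are negated so the result reads as a positive distance, "capital for lowercase" is assumed to mean the same letter in a different case, and transposition is left out, which matches its infinite cost in the table.

```java
final class TtDistanceSketch {

    // Insertion and deletion share the same cost schedule in the table.
    static double insertOrDeleteCost(char c) {
        return (c == ' ' || c == '-') ? 10.0 : 100.0;
    }

    static double substituteCost(char a, char b) {
        if (a == b) return 0.0; // match
        boolean spaceForHyphen = (a == ' ' && b == '-') || (a == '-' && b == ' ');
        if (spaceForHyphen) return 10.0;
        if (Character.isDigit(a) && Character.isDigit(b)) return 10.0;
        // Assumption: same letter, different case.
        if (Character.toLowerCase(a) == Character.toLowerCase(b)) return 10.0;
        return 50.0;
    }

    // Minimum total cost of turning s into t with the costs above.
    static double distance(CharSequence s, CharSequence t) {
        int n = s.length();
        int m = t.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            d[i][0] = d[i - 1][0] + insertOrDeleteCost(s.charAt(i - 1));
        }
        for (int j = 1; j <= m; j++) {
            d[0][j] = d[0][j - 1] + insertOrDeleteCost(t.charAt(j - 1));
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double delete = d[i - 1][j] + insertOrDeleteCost(s.charAt(i - 1));
                double insert = d[i][j - 1] + insertOrDeleteCost(t.charAt(j - 1));
                double substitute = d[i - 1][j - 1]
                    + substituteCost(s.charAt(i - 1), t.charAt(j - 1));
                d[i][j] = Math.min(substitute, Math.min(delete, insert));
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("IL-2", "IL 2")); // prints 10.0
        System.out.println(distance("IL2", "IL-2"));  // prints 10.0
        System.out.println(distance("abc", "abd"));   // prints 50.0
    }
}
```

The low costs for hyphens, spaces, digits and case are what let a distance bound of, say, 10 accept spelling variants such as "IL-2" and "IL 2" while still rejecting an arbitrary letter substitution at cost 50.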
public ApproxDictionaryChunker(TrieDictionary<String> dictionary, TokenizerFactory tokenizerFactory, WeightedEditDistance editDistance, double distanceThreshold)
Parameters:
dictionary - Dictionary to use for matching.
tokenizerFactory - Tokenizer factory for boundary determination.
editDistance - Matching distance measure.
distanceThreshold - Distance threshold for matching.
public TrieDictionary<String> dictionary()
public WeightedEditDistance editDistance()
public TokenizerFactory tokenizerFactory()
public double distanceThreshold()
See Also: setMaxDistance(double)
public void setMaxDistance(double distanceThreshold)
public Chunking chunk(CharSequence cSeq)
public Chunking chunk(char[] cs, int start, int end)
Specified by: chunk in interface Chunker
Parameters:
cs - Underlying characters.
start - Index of first character in the array.
end - Index of one past the last character in the array.
Throws:
IllegalArgumentException - If the indices are out of bounds in the character sequence.
Copyright © 2019 Alias-i, Inc. All rights reserved.