public class CompiledSpellChecker extends Object implements SpellChecker
The CompiledSpellChecker class implements a first-best
spell checker based on models of what users are likely to mean and
what errors they are likely to make in expressing their meaning.
This class is based on a character language model which represents
likely user intentions, and a weighted edit distance, which
represents how noise is introduced into the signal via typos,
brainos, or other sources such as case-normalization, diacritic
removal, bad character encodings, etc.
The usual way of creating a compiled checker is through an
instance of TrainSpellChecker. The result of compiling the
spell checker training class and reading it back in is a compiled
spell checker. Only the basic models, weighted edit distance, and
token set are supplied through compilation; all
other parameters described below need to be set after an instance
is read in from its compiled form. The token set may be null
at construction time and may be set later.
This class adopts the noisy-channel model approach to decoding likely user intentions given received signals. Spelling correction simply returns the most likely intended message given the message actually received. In symbols:
didYouMean(received)
    = ArgMax_intended P(intended | received)
    = ArgMax_intended P(intended, received) / P(received)
    = ArgMax_intended P(intended, received)
    = ArgMax_intended P(intended) * P(received | intended)
The estimator P(intended), called the source
model, estimates which signals are likely to be sent along the
channel. For instance, the source might be a model of a user's
intent in entering information on a web page. The estimator
P(received|intended), called the channel model,
estimates how intended messages are likely to be garbled.
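The decoding rule above can be sketched with a toy source and channel model. All names and probabilities here are hypothetical and hard-coded purely for illustration; the real class evaluates a compiled language model and a weighted edit distance instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of noisy-channel decoding: choose the intended string
// maximizing log2 P(intended) + log2 P(received | intended).
public class NoisyChannelSketch {

    // Hypothetical source model: log2 P(intended).
    static final Map<String, Double> SOURCE = new LinkedHashMap<>();
    // Hypothetical channel model: log2 P("teh" | intended).
    static final Map<String, Double> CHANNEL = new LinkedHashMap<>();
    static {
        SOURCE.put("the", -2.0);  CHANNEL.put("the", -4.0); // transposition
        SOURCE.put("teh", -14.0); CHANNEL.put("teh", -0.1); // exact match
        SOURCE.put("ten", -7.0);  CHANNEL.put("ten", -6.0); // substitution
    }

    public static String didYouMean() {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String intended : SOURCE.keySet()) {
            double score = SOURCE.get(intended) + CHANNEL.get(intended);
            if (score > bestScore) { bestScore = score; best = intended; }
        }
        return best;
    }

    public static void main(String[] args) {
        // "the" wins: -2 + -4 = -6 beats -14.1 ("teh") and -13 ("ten").
        System.out.println(didYouMean());
    }
}
```

Note how the rare intended string "teh" loses despite a near-perfect channel score: the source model dominates because exact matches to implausible messages are penalized by P(intended).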
For this class, the source language model must be a compiled n-gram character language model. Compiled models are required for the efficiency of their suffix-tree encodings in evaluating sequences of characters. Optimizing held-out sample cross-entropy is not necessarily the best approach to building these language models, because they are being used here in a discriminative fashion, much as in language-model-based classification, tagging, or chunking.
For this class, the channel model must be a weighted edit
distance. For traditional spelling correction, this is a model of
typos and brainos. There are two static constant weighted edit
distances supplied in this class which are useful for other
decoding tasks. The CASE_RESTORING distance may be used
to restore case in single-case text. The TOKENIZING model
may be used to tokenize untokenized text, and is used in our Chinese
tokenization demo.
All input is normalized for whitespace: initial and final whitespace is removed and all other whitespace sequences are reduced to a single space character. A single space character is used as the initial context for the source language model. A single uneditable final space character is estimated at the end by the language model, thus adapting the process language model for use as a bounded sequence language model, just as in the language model package itself.
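The whitespace normalization described above can be sketched in plain Java; the helper below is hypothetical and not part of the LingPipe API.

```java
// Sketch of the input whitespace normalization: trim leading and
// trailing whitespace and collapse internal runs to a single space.
public class WhitespaceNorm {
    public static String normalize(String in) {
        return in.trim().replaceAll("\\s+", " ");
    }
    public static void main(String[] args) {
        System.out.println(normalize("  foo\t bar\n baz  ")); // "foo bar baz"
    }
}
```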
This class optionally restricts corrections to sequences of
valid tokens. The valid tokens are supplied as a set either during
construction time or later. If the set of valid tokens is
null, then the output is not token sensitive, and
results may include tokens that are not in the training data.
Token-matching is case sensitive.
If a set of valid tokens is supplied, then a tokenizer factory
should also be supplied to carry out tokenization normalization on
input. This tokenizer factory will be used to separate input
tokens with single spaces. This tokenization may also be done
externally and normalized text passed into the
didYouMean method; this approach makes sense if the
tokenization is happening elsewhere already.
There are a number of tuning parameters for this class. The
coarsest form of tuning simply sets whether or not particular edits
may be performed. For instance, setAllowDelete(boolean)
is used to turn deletion on or off. Although edits with negative
infinity scores will never be used, it is more efficient to simply
disallow them if they are all infinite. This is used in the
Chinese tokenizer, for instance, to allow only insertions and
matches.
There are three scoring parameters that determine how expensive
input characters are to edit. The first of these is setKnownTokenEditCost(double), which provides a penalty to be
added to the cost of editing characters that fall within known
tokens. This value is only used for token-sensitive correctors.
Setting this to a low value makes it less likely to suggest an edit
on a known token. The default value is -2.0, which on a log (base
2) scale makes editing characters in known tokens 1/4 as likely
as editing characters in unknown tokens.
The next two scoring parameters provide penalties for editing
the first or second character in a token, whether it is known or
not. In most cases, users make more mistakes later in words than
in the first few characters. These values are controlled
independently through values provided at construction time or by
using the methods setFirstCharEditCost(double) and setSecondCharEditCost(double). The default values for these are
-2.0 and -1.0 respectively.
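On the log (base 2) scale used by these parameters, a penalty acts as a multiplicative factor on linear probability, which is why a -2.0 penalty makes an edit 1/4 as likely. A minimal sketch of this arithmetic (the helper is hypothetical, not part of the API):

```java
public class EditPenalty {
    // Adding a log2 penalty to an edit score multiplies its linear
    // probability by 2^penalty; -2.0 scales it by 1/4, -1.0 by 1/2.
    public static double linearFactor(double log2Penalty) {
        return Math.pow(2.0, log2Penalty);
    }
    public static void main(String[] args) {
        System.out.println(linearFactor(-2.0)); // 0.25
        System.out.println(linearFactor(-1.0)); // 0.5
    }
}
```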
The final tuning parameter is controlled with setNumConsecutiveInsertionsAllowed(int), which determines how
many characters may be inserted in a row. The default value is 1,
and setting this to 2 or higher may seriously slow down correction,
especially if it is not token sensitive.
Search is further controlled by an n-best parameter, which
specifies the number of ongoing hypotheses considered after
inspecting each character. This value is settable either in the
constructor or for models compiled from a trainer, by using the
method setNBest(int). The lower this value, the faster
the resulting spelling correction. But the danger is that with
low values, there may be search errors, where the correct
hypothesis is pruned because it did not look promising enough
early on. In general, this value should be set as low as
possible without causing search errors.
This class requires external concurrent-read/synchronous-write
(CRSW) synchronization. All of the methods beginning with
set must be executed exclusively in order to guarantee
consistent results; all other methods may be executed concurrently.
The didYouMean(String) method for spelling correction may
be called concurrently with the same blocking and thread safety
constraints as the underlying language model and edit distance,
both of which are called repeatedly by this method. If both the
language model and edit distance are thread safe and non-blocking,
as in all of LingPipe's implementations, then
didYouMean will also be concurrently executable and
non-blocking.
There are two ways to block tokens from being edited. The first
is by setting a minimum length of edited tokens. Standard language
models trained on texts tend to overestimate the likelihood of
queries that contain well-known short words or phrases like
of or a. The method setMinimumTokenLengthToCorrect(int) sets the
minimum length of tokens that will be corrected. The default value is 0.
The second way to block corrections is to provide a set of
tokens that are never corrected. One way to construct such a set
during training is by taking large-count tokens from the counter
returned by TrainSpellChecker.tokenCounter().
Note that these methods are heuristics that move the spelling
corrector in the same direction as two existing parameters. First,
there is the pair of methods setFirstCharEditCost(double)
and setSecondCharEditCost(double) which make it less
likely to edit the first two characters (which are all of the
characters in a two-character token). Second, there is a flexible
penalty for editing known tokens that may be set with setKnownTokenEditCost(double).
Blocking corrections has a positive effect on speed, because it eliminates any search over the tokens that are excluded from correction.
didYouMeanNBest(String) returns an iterator over corrections in
decreasing order of likelihood. Note that the same exact string
may be proposed more than once as a correction because of
alternative edits leading to the same result. For instance,
"a" may be turned into "b" by substitution in one
step, or by deletion and insertion (or insertion then deletion)
in two steps. These alternatives typically have different scores
and only the highest-scoring one is maintained at any given stage
of the algorithm by the first-best analyzer.
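The merging behavior described above, keeping only the highest-scoring edit path per output string, can be sketched as follows; the helper is hypothetical and stands in for the first-best analyzer's internal bookkeeping.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of first-best hypothesis merging: when two edit paths yield
// the same output string, keep only the higher (log) score.
public class HypothesisMerge {
    public static Map<String, Double> merge(String[] strings, double[] scores) {
        Map<String, Double> best = new HashMap<>();
        for (int i = 0; i < strings.length; ++i) {
            Double prev = best.get(strings[i]);
            if (prev == null || scores[i] > prev)
                best.put(strings[i], scores[i]);
        }
        return best;
    }
    public static void main(String[] args) {
        // "b" reachable by substitution (-3.0) or by insert+delete (-7.0);
        // only the higher-scoring path survives.
        Map<String, Double> m =
            merge(new String[] { "b", "b" }, new double[] { -3.0, -7.0 });
        System.out.println(m.get("b")); // -3.0
    }
}
```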
The n-best analyzer needs a much wider n-best list in order to return sensible results, especially for very long inputs. The specified n-best size for the spell checker should, in fact, be substantially larger than the desired number of n-best results.
| Modifier and Type | Field and Description |
|---|---|
static WeightedEditDistance |
CASE_RESTORING
A weighted edit distance ordered by similarity that treats case
variants as zero cost and all other edits as infinite cost.
|
static WeightedEditDistance |
TOKENIZING
A weighted edit distance ordered by similarity that allows free
space insertion.
|
| Constructor and Description |
|---|
CompiledSpellChecker(CompiledNGramProcessLM lm,
WeightedEditDistance editDistance,
Set<String> tokenSet)
Construct a compiled spell checker based on the specified
language model and edit distance, with a null tokenizer
factory, the specified set of valid output tokens, with default
value for n-best size, known token edit cost and first and
second character edit costs.
|
CompiledSpellChecker(CompiledNGramProcessLM lm,
WeightedEditDistance editDistance,
Set<String> tokenSet,
int nBestSize)
Construct a compiled spell checker based on the specified
language model and edit distance, a null tokenizer factory, the
set of valid output tokens, and maximum n-best size, with
default known token and first and second character edit costs.
|
CompiledSpellChecker(CompiledNGramProcessLM lm,
WeightedEditDistance editDistance,
TokenizerFactory factory,
Set<String> tokenSet,
int nBestSize)
Construct a compiled spell checker based on the specified
language model and edit distance, tokenizer factory, the
set of valid output tokens, and maximum n-best size, with
default known token and first and second character edit costs.
|
CompiledSpellChecker(CompiledNGramProcessLM lm,
WeightedEditDistance editDistance,
TokenizerFactory factory,
Set<String> tokenSet,
int nBestSize,
double knownTokenEditCost,
double firstCharEditCost,
double secondCharEditCost)
Construct a compiled spell checker based on the specified
language model and similarity edit distance, set of valid
output tokens, maximum n-best size per character, and the
specified edit penalties for editing known tokens or the first
or second characters of tokens.
|
| Modifier and Type | Method and Description |
|---|---|
boolean |
allowDelete()
Returns
true if this spell checker allows
deletions. |
boolean |
allowInsert()
Returns
true if this spell checker allows
insertions. |
boolean |
allowMatch()
Returns
true if this spell checker allows
matches. |
boolean |
allowSubstitute()
Returns
true if this spell checker allows
substitutions. |
boolean |
allowTranspose()
Returns
true if this spell checker allows
transpositions. |
String |
didYouMean(String receivedMsg)
Returns a first-best hypothesis of the intended message given a
received message.
|
Iterator<ScoredObject<String>> |
didYouMeanNBest(String receivedMsg)
Returns an iterator over the n-best spelling corrections for
the specified input string.
|
Set<String> |
doNotEditTokens()
Returns an unmodifiable view of the set of tokens that will
never be edited in this compiled spell checker.
|
WeightedEditDistance |
editDistance()
Returns the weighted edit distance for this compiled spell
checker.
|
double |
firstCharEditCost()
Returns the cost penalty for editing the first character in a
token.
|
double |
knownTokenEditCost()
Returns the cost penalty for editing a character in a known
token.
|
CompiledNGramProcessLM |
languageModel()
Returns the compiled language model for this spell checker.
|
int |
minimumTokenLengthToCorrect()
Returns the minimum length of token that will be corrected.
|
int |
nBestSize()
Returns the n-best size for this spell checker.
|
int |
numConsecutiveInsertionsAllowed()
Returns the number of consecutive insertions allowed.
|
String |
parametersToString()
Returns a string-based representation of the parameters of
this compiled spell checker.
|
double |
secondCharEditCost()
Returns the cost penalty for editing the second character
in a token.
|
void |
setAllowDelete(boolean allowDelete)
Sets this spell checker to allow deletions if the specified
value is
true and to disallow them if it is
false. |
void |
setAllowInsert(boolean allowInsert)
Sets this spell checker to allow insertions if the specified
value is
true and to disallow them if it is
false. |
void |
setAllowMatch(boolean allowMatch)
Sets this spell checker to allow matches if the specified
value is
true and to disallow them if it is
false. |
void |
setAllowSubstitute(boolean allowSubstitute)
Sets this spell checker to allow substitutions if the specified
value is
true and to disallow them if it is
false. |
void |
setAllowTranspose(boolean allowTranspose)
Sets this spell checker to allow transpositions if the specified
value is
true and to disallow them if it is
false. |
void |
setDoNotEditTokens(Set<String> tokens)
Updates the set of do-not-edit tokens to be the specified
value.
|
void |
setEditDistance(WeightedEditDistance editDistance)
Sets the edit distance for this spell checker to the
specified value.
|
void |
setFirstCharEditCost(double cost)
Set the first character edit cost to the specified value.
|
void |
setKnownTokenEditCost(double cost)
Set the known token edit cost to the specified value.
|
void |
setLanguageModel(CompiledNGramProcessLM lm)
Sets the language model for this spell checker to the
specified value.
|
void |
setMinimumTokenLengthToCorrect(int tokenCharLength)
Sets a minimum character length for tokens to be eligible for
editing.
|
void |
setNBest(int size)
Sets the n-best size to the specified value.
|
void |
setNumConsecutiveInsertionsAllowed(int numAllowed)
Set the number of consecutive insertions allowed to the
specified value.
|
void |
setSecondCharEditCost(double cost)
Set the second character edit cost to the specified value.
|
void |
setTokenizerFactory(TokenizerFactory factory)
Sets the tokenizer factory for input processing to the
specified value.
|
void |
setTokenSet(Set<String> tokenSet)
Sets the set of tokens that can be produced by editing.
|
TokenizerFactory |
tokenizerFactory()
Returns the tokenizer factory for this spell checker.
|
Set<String> |
tokenSet()
Returns an unmodifiable view of the set of tokens for this spell
checker.
|
public static WeightedEditDistance CASE_RESTORING
A weighted edit distance ordered by similarity that treats case
variants as zero cost and all other edits as cost
Double.NEGATIVE_INFINITY. See
WeightedEditDistance for more information on
similarity-based distances.
If this model is used for spelling correction, the result is a system that simply chooses the most likely case for output characters given an input character and does not change anything else.
Case here is based on the methods
Character.isUpperCase(char), Character.isLowerCase(char)
and equality is tested by converting the upper case character to
lower case using Character.toLowerCase(char).
This edit distance is compilable and the result of writing it and reading it is referentially equal to this instance.
public static WeightedEditDistance TOKENIZING
A weighted edit distance ordered by similarity that allows free
space insertion. See
WeightedEditDistance for more information on
similarity-based distances.
If this model is used for spelling correction, the result is a system that retokenizes input containing no spaces. For instance, if the source model is trained on Chinese tokens separated by spaces and the input is a sequence of Chinese characters not separated by spaces, the output is a space-separated tokenization. Similarly, if the source model is trained on valid pronunciations separated by spaces and the input is a sequence of pronunciations not separated by spaces, the result is a tokenization.
This edit distance is compilable and the result of writing it and reading it is referentially equal to this instance.
public CompiledSpellChecker(CompiledNGramProcessLM lm, WeightedEditDistance editDistance, TokenizerFactory factory, Set<String> tokenSet, int nBestSize, double knownTokenEditCost, double firstCharEditCost, double secondCharEditCost)
setDoNotEditTokens(Set).
The weighted edit distance is required to be a similarity
measure for compatibility with the order of log likelihoods in
the source (language) model. See WeightedEditDistance
for more information about similarity versus dissimilarity
distance measures.
If the set of tokens is null, the constructed
spelling checker will not be token-sensitive. That is, it
will allow edits to strings which are not tokens in the token set.
lm - Source language model.
editDistance - Channel edit distance model.
factory - Tokenizer factory for tokenizing inputs.
tokenSet - Set of valid tokens for outputs, or null if output is not token sensitive.
nBestSize - Size of n-best list for spell checking, beyond which any hypothesis is pruned.
knownTokenEditCost - Penalty per edit for editing known tokens.
firstCharEditCost - Penalty for editing while scanning the first character in a token.
secondCharEditCost - Penalty for editing while scanning the second character in a token.
public CompiledSpellChecker(CompiledNGramProcessLM lm, WeightedEditDistance editDistance, TokenizerFactory factory, Set<String> tokenSet, int nBestSize)
setDoNotEditTokens(Set).
lm - Source language model.
editDistance - Channel edit distance model.
factory - Tokenizer factory for tokenizing inputs.
tokenSet - Set of valid tokens for outputs, or null if output is not token sensitive.
nBestSize - Size of n-best list for spell checking, beyond which any hypothesis is pruned.
IllegalArgumentException - If the edit distance is not a similarity measure.
public CompiledSpellChecker(CompiledNGramProcessLM lm, WeightedEditDistance editDistance, Set<String> tokenSet, int nBestSize)
setDoNotEditTokens(Set).
lm - Source language model.
editDistance - Channel edit distance model.
tokenSet - Set of valid tokens for outputs, or null if output is not token sensitive.
nBestSize - Size of n-best list for spell checking, beyond which any hypothesis is pruned.
public CompiledSpellChecker(CompiledNGramProcessLM lm, WeightedEditDistance editDistance, Set<String> tokenSet)
setDoNotEditTokens(Set).
lm - Source language model.
editDistance - Channel edit distance model.
tokenSet - Set of valid tokens for outputs, or null if output is not token sensitive.
public CompiledNGramProcessLM languageModel()
public WeightedEditDistance editDistance()
public TokenizerFactory tokenizerFactory()
public Set<String> tokenSet()
setTokenSet(Set).
public Set<String> doNotEditTokens()
setDoNotEditTokens(Set).
public void setDoNotEditTokens(Set<String> tokens)
tokens - Set of tokens not to edit.
public int nBestSize()
setNBest(int) for more information.
public double knownTokenEditCost()
public double firstCharEditCost()
As a special case, transposition only pays a single penalty based on the penalty of the first character in the transposition.
public double secondCharEditCost()
public void setKnownTokenEditCost(double cost)
cost - New value for known token edit cost.
public void setFirstCharEditCost(double cost)
cost - New value for the first character edit cost.
public void setSecondCharEditCost(double cost)
cost - New value for the second character edit cost.
public int numConsecutiveInsertionsAllowed()
public boolean allowInsert()
Returns true if this spell checker allows insertions.
public boolean allowDelete()
Returns true if this spell checker allows deletions.
public boolean allowMatch()
Returns true if this spell checker allows matches.
public boolean allowSubstitute()
Returns true if this spell checker allows substitutions.
public boolean allowTranspose()
Returns true if this spell checker allows transpositions.
public void setEditDistance(WeightedEditDistance editDistance)
editDistance - Edit distance to use for spell checking.
public void setMinimumTokenLengthToCorrect(int tokenCharLength)
tokenCharLength - Minimum length in characters for a token to be eligible for editing.
IllegalArgumentException - If the character length specified is less than 0.
public int minimumTokenLengthToCorrect()
The default value is 0, but may be set using setMinimumTokenLengthToCorrect(int).
public void setLanguageModel(CompiledNGramProcessLM lm)
lm - New language model for this spell checker.
public void setTokenizerFactory(TokenizerFactory factory)
If the factory is null, no tokenization is performed on the input.
factory - Tokenizer factory for this spell checker.
public final void setTokenSet(Set<String> tokenSet)
If the token set is null, editing will not be token sensitive.
Warning: Spelling correction without tokenization may be slow, especially with a large n-best size.
tokenSet - The new set of tokens, or null if not tokenizing.
public void setNBest(int size)
size - Size of the n-best lists at each character.
IllegalArgumentException - If the size is less than one.
public void setAllowInsert(boolean allowInsert)
Sets this spell checker to allow insertions if the specified value is true and to disallow them if it is false. If the value is false, then the number of consecutive insertions allowed is also set to zero.
allowInsert - New insertion mode.
public void setAllowDelete(boolean allowDelete)
Sets this spell checker to allow deletions if the specified value is true and to disallow them if it is false.
allowDelete - New deletion mode.
public void setAllowMatch(boolean allowMatch)
Sets this spell checker to allow matches if the specified value is true and to disallow them if it is false.
allowMatch - New match mode.
public void setAllowSubstitute(boolean allowSubstitute)
Sets this spell checker to allow substitutions if the specified value is true and to disallow them if it is false.
allowSubstitute - New substitution mode.
public void setAllowTranspose(boolean allowTranspose)
Sets this spell checker to allow transpositions if the specified value is true and to disallow them if it is false.
allowTranspose - New transposition mode.
public void setNumConsecutiveInsertionsAllowed(int numAllowed)
This value only has an effect when insertions are allowed, that is, when allowInsert() is true.
numAllowed - Number of insertions allowed in a row.
IllegalArgumentException - If the number specified is less than zero.
public String didYouMean(String receivedMsg)
Returns a first-best hypothesis of the intended message, or null if the received message is itself the best hypothesis. The exact definition of hypothesis ranking is provided in the class documentation above.
didYouMean in interface SpellChecker
receivedMsg - The message received over the noisy channel.
public Iterator<ScoredObject<String>> didYouMeanNBest(String receivedMsg)
Returns an iterator over instances of ScoredObject, the object of which is the corrected string and the score of which is the joint score of edit (channel) cost and language model (source) cost of the output.
Unlike for HMMs and chunking, this n-best list is not exact,
because of heuristic pruning during spelling correction. The
maximum number of returned results is determined by the n-best
parameter, as set through setNBest(int). The larger the
n-best list, the higher-quality the results, even for results
early on the list. For instance, the first five corrections are
not necessarily the same with a 5-element, 10-element, or
1000-element n-best size (as specified by setNBest(int)).
A rough confidence measure may be determined by comparing the scores, which are log (base 2) edit (channel) plus log (base 2) language model (source) scores. A very crude measure is to compare the score of the first result to the score of the second result; if there is a large gap, confidence is high. A tighter measure is to convert the log probabilities back to linear, add them all up, and then divide. For instance, if there were results:
| Rank | String | Log (2) Prob | Prob | Conf |
|---|---|---|---|---|
| 0 | foo | -2 | 0.250 | 0.571 |
| 1 | for | -3 | 0.125 | 0.285 |
| 2 | food | -4 | 0.062 | 0.143 |
| 3 | of | -10 | 0.001 | 0.002 |

Here there are four results, with log probabilities -2, -3, -4 and -10, which have the corresponding linear probabilities. The sum of these probabilities is 0.438. Hence the confidence in the top-ranked answer is 0.250/0.438 = 0.571.
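The confidence arithmetic above can be sketched as follows; the helper is hypothetical and not part of the LingPipe API.

```java
public class NBestConfidence {
    // Convert log (base 2) scores to linear probabilities, normalize,
    // and report the probability share of the top-ranked hypothesis.
    public static double topConfidence(double[] log2Scores) {
        double sum = 0.0;
        for (double s : log2Scores) sum += Math.pow(2.0, s);
        return Math.pow(2.0, log2Scores[0]) / sum;
    }
    public static void main(String[] args) {
        // Scores from the example: -2, -3, -4, -10 yield roughly 0.570.
        System.out.println(topConfidence(new double[] { -2, -3, -4, -10 }));
    }
}
```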
Warning: Spell checking with n-best output is currently implemented with a very naive algorithm and is thus very slow compared to first-best spelling correction. The reason for this is that dynamic programming is turned off for n-best spelling correction, so a great deal of redundant computation is performed.
receivedMsg - Input message.
public String parametersToString()
Copyright © 2019 Alias-i, Inc.. All rights reserved.