public class TrainSpellChecker extends Object implements ObjectHandler<CharSequence>, Compilable, Serializable
A TrainSpellChecker instance provides a mechanism for
collecting training data for a compiled spell checker. Training
instances are nothing more than character sequences which represent
likely user queries.
In training the source language model, all training data is whitespace normalized so that it begins with a single space, ends with a single space, and has every internal whitespace sequence converted to a single space character.
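The normalization described above can be sketched in plain Java. This is a minimal illustration, not the library's implementation: the class name `WhitespaceNormalizerSketch` and the method `normalize` are hypothetical, and the treatment of empty input (a single space) is an assumption.

```java
/** Hypothetical helper for illustration; not part of LingPipe. */
final class WhitespaceNormalizerSketch {

    private WhitespaceNormalizerSketch() { }

    /**
     * Returns the input with one leading space, one trailing space,
     * and every run of whitespace collapsed to a single space.
     */
    static String normalize(CharSequence cSeq) {
        // Pad with a space on both sides, then collapse every whitespace
        // run (including the padded ends) to exactly one space.
        // Assumption: empty or all-whitespace input yields a single space.
        return (" " + cSeq + " ").replaceAll("\\s+", " ");
    }
}
```

For example, `WhitespaceNormalizerSketch.normalize("new \t york  city")` yields `" new york city "`. A language model trained directly, outside the trainer, would need to apply the same kind of normalization to its training text.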
A tokenization factory may be optionally specified for training token-sensitive spell checkers. With tokenization, input is further normalized to insert a single whitespace between all tokens not already separated by a space in the input. The tokens are then output during compilation and read back into the compiled spell checker. The set of tokens output may be pruned to remove any below a given count threshold. The resulting set of tokens is used to constrain the set of alternative spellings suggested during spelling correction to include only tokens in the observed token set.
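The token handling described above (count tokens, prune rare ones, then restrict suggestions to the surviving set) can be sketched as follows. This is a toy model under stated assumptions: `TokenSetSketch` and its methods are hypothetical names, whitespace splitting stands in for a real tokenizer factory, and a plain `HashMap` stands in for the library's token counter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of a pruned token set; not part of LingPipe. */
final class TokenSetSketch {

    private final Map<String, Integer> counts = new HashMap<>();

    /** Counts tokens; whitespace splitting stands in for a tokenizer factory. */
    void train(String text) {
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    /** Removes every token whose count is below the given minimum. */
    void pruneTokens(int minCount) {
        counts.values().removeIf(c -> c < minCount);
    }

    /** Only tokens that survived pruning may be offered as corrections. */
    boolean isAllowedSuggestion(String candidate) {
        return counts.containsKey(candidate);
    }

    /** The current set of observed tokens. */
    Set<String> tokens() {
        return counts.keySet();
    }
}
```

After training on "the cat the dog" and "the cat" and pruning with a minimum count of 2, only "the" and "cat" remain, so "dog" would no longer be eligible as a suggested spelling in this sketch.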
As an alternative to using the spell checker trainer, a language model may be trained directly and supplied in compiled form along with a weighted edit distance to the public constructors for compiled spell checkers. It's critical that the normalization happens the same way as for the spell checker trainer.
In constructing a spell checker trainer, a compilable weighted edit distance must be specified. This edit distance model will be compiled along with the language model and token set and used as the channel model in the compiled spell checker.
After training, a model is written out through the
Compilable interface using compileTo(ObjectOutput). When this model is read back in, it
will be an instance of CompiledSpellChecker. The compiled
spell checkers allow many runtime parameters to be tuned; see the
class documentation for full details.
Warning: Unlike for serialization, the tokenizer factory
is not serialized along with the model during compilation.
After the compiled spell checker is read back in, use CompiledSpellChecker.setTokenizerFactory(TokenizerFactory) to set
up the tokenizer factory in the compiled model.
A spell checker trainer may also be serialized by writing it to an object output:
    TrainSpellChecker trainer = ...;
    ObjectOutput out = ...;
    out.writeObject(trainer);
And then read back in by reversing this operation:
    ObjectInput in = ...;
    TrainSpellChecker trainer = (TrainSpellChecker) in.readObject();
The resulting round trip produces a trainer that is functionally identical to the original one. Serialization is useful for storing models for which more training data will be available later.
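The round trip can be tried end to end with standard Java serialization. The sketch below uses a hypothetical stand-in class, `ToyTrainer`, rather than a real trainer, and in-memory byte streams rather than files; it only illustrates that a deserialized copy keeps its state and can continue training.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical illustration of a serialization round trip. */
final class SerializationRoundTripSketch {

    private SerializationRoundTripSketch() { }

    /** Stand-in for a trainer; it only tracks how many characters it has seen. */
    static final class ToyTrainer implements Serializable {
        private static final long serialVersionUID = 1L;
        private long numChars = 0L;

        void handle(CharSequence cSeq) {
            numChars += cSeq.length();
        }

        long numTrainingChars() {
            return numChars;
        }
    }

    /** Writes the trainer to a byte array and reads back an independent copy. */
    static ToyTrainer roundTrip(ToyTrainer trainer) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutput out = new ObjectOutputStream(bytes)) {
                out.writeObject(trainer);
            }
            try (ObjectInput in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ToyTrainer) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("Round trip failed", e);
        }
    }
}
```

The copy returned by `roundTrip` can receive further `handle` calls without affecting the original object, which mirrors the use case of storing a trainer now and adding training data later.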
Warning: The object input and output used for
serialization must extend InputStream and OutputStream. The only implementations of ObjectInput and
ObjectOutput as of the 1.6 JDK do extend the streams, so
this will only be a problem with customized object input or output
objects. If you need this method to work with custom input and
output objects that do not extend the corresponding streams, drop
us a line and we can perhaps refactor the output methods to remove
this restriction. [Note: This warning was inherited from NGramProcessLM.]
| Constructor | Description |
|---|---|
| TrainSpellChecker(NGramProcessLM lm, WeightedEditDistance editDistance) | Construct a non-tokenizing spell checker trainer from the specified language model and edit distance. |
| TrainSpellChecker(NGramProcessLM lm, WeightedEditDistance editDistance, TokenizerFactory tokenizerFactory) | Construct a spell checker trainer from the specified n-gram process language model, tokenizer factory and edit distance. |

| Modifier and Type | Method | Description |
|---|---|---|
| void | compileTo(ObjectOutput objOut) | Writes a compiled spell checker to the specified object output. |
| WeightedEditDistance | editDistance() | Returns the weighted edit distance (channel model) underlying this spell checker trainer. |
| void | handle(CharSequence cSeq) | Train the spell checker on the specified character sequence. |
| NGramProcessLM | languageModel() | Returns the n-gram process language model (source model) underlying this spell checker trainer. |
| long | numTrainingChars() | Returns the total length in characters of all text used to train the spell checker. |
| void | pruneLM(int minCount) | Prunes the underlying character language model to remove substring counts of less than the specified minimum. |
| void | pruneTokens(int minCount) | Prunes the set of collected tokens of all tokens with count less than the specified minimum. |
| ObjectToCounterMap<String> | tokenCounter() | Returns the counter for the tokens in the training set. |
| void | train(CharSequence cSeq, int count) | Train the spelling checker on the specified character sequence as if it had appeared with a frequency given by the specified count. |

public TrainSpellChecker(NGramProcessLM lm, WeightedEditDistance editDistance)
Construct a non-tokenizing spell checker trainer from the specified language model and edit distance. See SpellChecker for more information on the language model and edit distance models in the compiled spell checker.
Parameters:
lm - Compilable language model.
editDistance - Compilable weighted edit distance.
Throws:
IllegalArgumentException - If the edit distance is not compilable.

public TrainSpellChecker(NGramProcessLM lm, WeightedEditDistance editDistance, TokenizerFactory tokenizerFactory)
Construct a spell checker trainer from the specified n-gram process language model, tokenizer factory and edit distance. The tokenizer factory may be null, in which case tokens are not saved as part of training and the compiled spell checker is not token sensitive. If the tokenizer factory is specified, it must be compilable.
Parameters:
lm - Compilable language model.
editDistance - Compilable weighted edit distance.
tokenizerFactory - Optional tokenizer factory.
Throws:
IllegalArgumentException - If the edit distance is not compilable or if the tokenizer factory is non-null and not compilable.

public NGramProcessLM languageModel()
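Returns the n-gram process language model (source model) underlying this spell checker trainer. The returned value is a reference to the language model held by the trainer, so any changes to it will affect this spell checker.

public WeightedEditDistance editDistance()
Returns the weighted edit distance (channel model) underlying this spell checker trainer. The returned value is a reference to the edit distance held by the trainer, so any changes to it will affect this spell checker.

public ObjectToCounterMap<String> tokenCounter()
Returns the counter for the tokens in the training set.
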
public void train(CharSequence cSeq, int count)
Train the spelling checker on the specified character sequence as if it had appeared with a frequency given by the specified count.
See the method handle(CharSequence) for information on the normalization carried out on the input character sequence.
Although calling this method is equivalent to calling handle(CharSequence) the specified count number of times, this method is much more efficient because it does not require iteration.
This method may be used to boost the training for a specified input, or just to combine inputs into single method calls.
Parameters:
cSeq - Character sequence for training.
count - Frequency of sequence to train.
Throws:
IllegalArgumentException - If the specified count is negative.

public long numTrainingChars()
Returns the total length in characters of all text used to train the spell checker.
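The relationship between train(CharSequence, int) and handle(CharSequence) described above can be sketched with a toy counter. This is an illustration under assumptions, not the library's code: `CountedTrainingSketch` is a hypothetical class, a `HashMap` stands in for the language model, and having `handle` delegate to `train` with a count of 1 is an assumed design.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of counted training; not part of LingPipe. */
final class CountedTrainingSketch {

    private final Map<String, Long> counts = new HashMap<>();

    /** Adds the whole count in one update instead of looping count times. */
    void train(CharSequence cSeq, int count) {
        if (count < 0) {
            throw new IllegalArgumentException(
                "Count must be non-negative. Found count=" + count);
        }
        if (count == 0) {
            return; // nothing to add
        }
        counts.merge(cSeq.toString(), (long) count, Long::sum);
    }

    /** Assumed design: a single training event is a count of one. */
    void handle(CharSequence cSeq) {
        train(cSeq, 1);
    }

    /** Number of times the sequence has been seen so far. */
    long count(String cSeq) {
        return counts.getOrDefault(cSeq, 0L);
    }
}
```

In this sketch, one call to `train("new york", 3)` leaves the counter in the same state as three calls to `handle("new york")`, but performs a single map update.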

public void handle(CharSequence cSeq)
Train the spell checker on the specified character sequence.
Specified by:
handle in interface ObjectHandler<CharSequence>
Parameters:
cSeq - Characters for training.

public void pruneTokens(int minCount)
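Prunes the set of collected tokens of all tokens with count less than the specified minimum.
Parameters:
minCount - Minimum count of preserved token.

public void pruneLM(int minCount)
Prunes the underlying character language model to remove substring counts of less than the specified minimum.
Parameters:
minCount - Minimum count of preserved substrings.

public void compileTo(ObjectOutput objOut) throws IOException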
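Writes a compiled spell checker to the specified object output. When the model is read back in, it will be an instance of CompiledSpellChecker.
Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this spell checker is written.
Throws:
IOException - If there is an I/O error while writing.

Copyright © 2019 Alias-i, Inc. All rights reserved.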