public class NGramBoundaryLM extends Object implements LanguageModel.Sequence, LanguageModel.Conditional, LanguageModel.Dynamic, Model<CharSequence>, Compilable, Serializable
NGramBoundaryLM provides a dynamic sequence
language model for which training, estimation and pruning may be
interleaved. A sequence language model normalizes probabilities
over all sequences.
This class wraps an n-gram process language model by supplying a
special boundary character boundaryChar at
construction time which will be added to the total number of
characters in defining the estimator. For each training event, the
boundary character is inserted both before and after the character
sequence provided. The unigram count of this boundary character is
then decremented so that the initial boundary is not itself counted in
estimates. During estimation, the initial boundary character is
used as context and the final one is used to estimate the
end-of-stream likelihood. Thus if Ppr
is the underlying process model then the boundary model defines
estimates by:
Pb(c1,...,cN)
= Ppr(boundaryChar|boundaryChar,c1,...,cN)
* Π1<=i<=N
Ppr(ci|boundaryChar,c1,...,ci-1)
= Ppr(boundaryChar,c1,...,cN,boundaryChar)
/ Ppr(boundaryChar)
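The equivalence between the product form and the quotient form can be checked numerically. The sketch below uses a toy bigram (2-gram) process model with made-up probabilities; the class, map, and constant names are illustrative and not part of the LingPipe API:

```java
import java.util.HashMap;
import java.util.Map;

public class BoundaryLmSketch {
    static final char B = '#';            // stand-in boundary character
    static final double P_BOUNDARY = 0.5; // Ppr(#) with empty context (made up)
    static final Map<String, Double> COND = new HashMap<>();
    static {
        // toy bigram conditionals Ppr(second char | first char), made up
        COND.put("#a", 0.6); COND.put("#b", 0.4);
        COND.put("aa", 0.2); COND.put("ab", 0.5); COND.put("a#", 0.3);
        COND.put("ba", 0.5); COND.put("bb", 0.2); COND.put("b#", 0.3);
    }

    /** Process-model probability of seq (which must start with the
     *  boundary) via the chain rule with bigram contexts. */
    static double processProb(String seq) {
        double p = P_BOUNDARY; // probability of the leading boundary
        for (int i = 1; i < seq.length(); i++)
            p *= COND.get(seq.substring(i - 1, i + 1));
        return p;
    }

    /** Boundary-model estimate: each character conditioned on its
     *  predecessor (boundary first), times the end-of-sequence term. */
    static double boundaryProb(String seq) {
        char prev = B;
        double p = 1.0;
        for (char c : seq.toCharArray()) {
            p *= COND.get("" + prev + c);
            prev = c;
        }
        return p * COND.get("" + prev + B); // Ppr(boundary | ..., cN)
    }

    public static void main(String[] args) {
        double direct = boundaryProb("ab");
        double quotient = processProb("#ab#") / P_BOUNDARY;
        System.out.println(Math.abs(direct - quotient) < 1e-12); // true
    }
}
```

Both forms multiply the same conditional probabilities; the quotient form simply includes and then divides out the probability of the leading boundary.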
An n-gram boundary language model may be written to and read back
from a stream. The serialization format is the boundary character
followed by the serialization of the contained writable process
language model.
Models may be pruned by pruning the substring counter returned
by substringCounter(). See the documentation for the
class of the return object, TrieCharSeqCounter, for more
information.
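As a schematic illustration of count-based pruning, the following self-contained sketch uses a plain map in place of TrieCharSeqCounter (the class and method names are illustrative, not the LingPipe API; a real trie prunes whole subtrees at once):

```java
import java.util.HashMap;
import java.util.Map;

public class SubstringCounterSketch {
    final Map<String, Integer> counts = new HashMap<>();
    final int maxLength; // longest substring tracked, like the n-gram bound

    SubstringCounterSketch(int maxLength) { this.maxLength = maxLength; }

    /** Count every substring of cs up to maxLength characters, roughly
     *  what a trie-backed substring counter accumulates in training. */
    void train(String cs) {
        for (int i = 0; i < cs.length(); i++)
            for (int j = i + 1; j <= Math.min(cs.length(), i + maxLength); j++)
                counts.merge(cs.substring(i, j), 1, Integer::sum);
    }

    /** Drop substrings seen fewer than minCount times. The flat map
     *  matches subtree pruning here because an extension can never be
     *  more frequent than its prefix. */
    void prune(int minCount) {
        counts.values().removeIf(c -> c < minCount);
    }

    public static void main(String[] args) {
        SubstringCounterSketch counter = new SubstringCounterSketch(2);
        counter.train("abab");
        counter.prune(2);
        System.out.println(counter.counts); // "ba" (count 1) is gone
    }
}
```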
This class implements both Java's Serializable interface and
LingPipe's Compilable interface.
Serializing and deserializing an n-gram boundary language model
returns a copy of the original, which is again an instance of this
class, NGramBoundaryLM. Compiling and deserializing returns an
instance of CompiledNGramBoundaryLM. The compiled version
is much faster and may also be more compact in memory.
See Also:
LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized

| Constructor and Description |
|---|
NGramBoundaryLM(int maxNGram)
Constructs a dynamic n-gram sequence language model with the
specified maximum n-gram and default values for other
parameters.
|
NGramBoundaryLM(int maxNGram,
int numChars)
Constructs a dynamic n-gram sequence language model with the
specified maximum n-gram, specified maximum number of observed
characters, and default values for other parameters.
|
NGramBoundaryLM(int maxNGram,
int numChars,
double lambdaFactor,
char boundaryChar)
Construct a dynamic n-gram sequence language model with the
specified maximum n-gram length, number of characters,
interpolation ratio hyperparameter and boundary character.
|
NGramBoundaryLM(NGramProcessLM processLm,
char boundaryChar)
Construct an n-gram boundary language model with the specified
boundary character and underlying process language model.
|
| Modifier and Type | Method and Description |
|---|---|
void |
compileTo(ObjectOutput objOut)
Writes a compiled version of this boundary language model to
the specified object output.
|
NGramProcessLM |
getProcessLM()
Returns the underlying n-gram process language model
for this boundary language model.
|
void |
handle(CharSequence cSeq)
Train the language model on the specified character sequence.
|
double |
log2ConditionalEstimate(char[] cs,
int start,
int end)
Returns the log (base 2) of the probability estimate for the
conditional probability of the last character in the specified
slice given the previous characters.
|
double |
log2ConditionalEstimate(CharSequence cs)
Returns the log (base 2) of the probability estimate for the
conditional probability of the last character in the specified
character sequence given the previous characters.
|
double |
log2Estimate(char[] cs,
int start,
int end)
Returns an estimate of the log (base 2) probability of the
specified character slice.
|
double |
log2Estimate(CharSequence cs)
Returns an estimate of the log (base 2) probability of the
specified character sequence.
|
double |
log2Prob(CharSequence cSeq)
This method is a convenience implementation of the
Model interface which delegates the call to log2Estimate(CharSequence). |
char[] |
observedCharacters()
Returns the characters that have been observed for this
language model, including the special boundary character.
|
double |
prob(CharSequence cSeq)
This method is a convenience implementation of the
Model
interface which returns the result of raising 2.0 to the
power of the result of a call to log2Estimate(CharSequence). |
static NGramBoundaryLM |
readFrom(InputStream in)
Reads an n-gram boundary language model from the specified input
stream.
|
TrieCharSeqCounter |
substringCounter()
Returns the underlying substring counter for this language
model.
|
String |
toString()
Returns a string-based representation of this language model.
|
void |
train(char[] cs,
int start,
int end)
Update the model with the training data provided by
the specified character slice.
|
void |
train(char[] cs,
int start,
int end,
int count)
Update the model with the training data provided by the
specified character slice with the specified count.
|
void |
train(CharSequence cs)
Update the model with the training data provided by the
specified character sequence with a count of one.
|
void |
train(CharSequence cs,
int count)
Update the model with the training data provided by the
specified character sequence with the specified count.
|
void |
writeTo(OutputStream out)
Writes this language model to the specified output stream.
|
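The summary above notes that the train methods taking a count behave as if the single-count train were called count times. A minimal counter sketch (plain Java, not the LingPipe API) illustrating that equivalence:

```java
import java.util.HashMap;
import java.util.Map;

public class TrainCountSketch {
    final Map<Character, Integer> counts = new HashMap<>();

    void train(String cs) { train(cs, 1); }

    /** train(cs, n) must behave exactly like n separate calls to
     *  train(cs); here both just bump per-character counts. */
    void train(String cs, int count) {
        for (char c : cs.toCharArray())
            counts.merge(c, count, Integer::sum);
    }

    public static void main(String[] args) {
        TrainCountSketch weighted = new TrainCountSketch();
        TrainCountSketch repeated = new TrainCountSketch();
        weighted.train("abc", 3);
        for (int i = 0; i < 3; i++) repeated.train("abc");
        System.out.println(weighted.counts.equals(repeated.counts)); // true
    }
}
```

Training with a count is just a constant-time shortcut for repeated training, which matters when corpus frequencies are already tabulated.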
public NGramBoundaryLM(int maxNGram)
The default number of characters is Character.MAX_VALUE-1, the default interpolation
parameter ratio is equal to the n-gram length, and the boundary
character is the byte-order marker U+FFFF
maxNGram - Maximum n-gram length in model.public NGramBoundaryLM(int maxNGram,
int numChars)
The default interpolation
parameter ratio is equal to the n-gram length, and the boundary
character is the byte-order marker U+FFFF
maxNGram - Maximum n-gram length in model.numChars - Maximum number of character seen in training
and test sets.public NGramBoundaryLM(int maxNGram,
int numChars,
double lambdaFactor,
char boundaryChar)
U+FFFF or U+FEFF may be used
internally by applications but may not be part of valid unicode
character streams and thus make ideal choices for boundary
characters. See:
Unicode Standard, Chapter 15.8: NonCharactersmaxNGram - Maximum n-gram length in model.numChars - Maximum number of character seen in training
and test sets.lambdaFactor - Interpolation ratio hyperparameter.boundaryChar - Boundary character.public NGramBoundaryLM(NGramProcessLM processLm, char boundaryChar)
This constructor may be used to reconstitute a serialized model. By writing the trie character sequence counter for the underlying process language model, it may be read back in. This may be used to construct a process language model, which may be used to reconstruct a boundary language model using this constructor.
processLm - Underlying process language model.boundaryChar - Character used to encode boundaries.public void writeTo(OutputStream out) throws IOException
A bit output is wrapped around the output stream for writing. The format begins with a delta-encoding of the boundary character plus 1, and is followed by the bit output of the underlying process language model.
out - Output stream from which to read the language model.IOException - If there is an underlying I/O error.public static NGramBoundaryLM readFrom(InputStream in) throws IOException
See writeTo(OutputStream) for a description
of the binary format.
in - Input stream from which to read the model.IOException - If there is an underlying I/O error.public NGramProcessLM getProcessLM()
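The binary format above delta-encodes the boundary character plus 1. Assuming this is the Elias delta code (as used by LingPipe's BitOutput for its delta codes; an assumption, since the format description does not name the code), a sketch of the encoder as a bit string:

```java
public class DeltaCodeSketch {
    /** Elias gamma code of n >= 1: floor(log2 n) zeros followed by the
     *  binary representation of n. */
    static String gamma(long n) {
        String bits = Long.toBinaryString(n);
        return "0".repeat(bits.length() - 1) + bits;
    }

    /** Elias delta code of n >= 1: the gamma code of n's bit length,
     *  then n's binary representation minus its leading 1 bit. */
    static String delta(long n) {
        String bits = Long.toBinaryString(n);
        return gamma(bits.length()) + bits.substring(1);
    }

    public static void main(String[] args) {
        char boundaryChar = '\uFFFF';
        // the +1 offset keeps the coded value positive; Elias codes
        // cannot represent zero
        System.out.println(delta((long) boundaryChar + 1));
    }
}
```

Encoding the character plus 1 guarantees a strictly positive value even for a boundary character of '\u0000'.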
public char[] observedCharacters()

Returns the characters that have been observed for this language
model, including the special boundary character.

Specified by:
observedCharacters in interface LanguageModel.Conditional

public TrieCharSeqCounter substringCounter()

Returns the underlying substring counter for this language model.

public void compileTo(ObjectOutput objOut) throws IOException

Writes a compiled version of this boundary language model to the
specified object output. The object later returned by
ObjectInput.readObject() will be an instance of
CompiledNGramBoundaryLM.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this model is compiled.
Throws:
IOException - If there is an I/O exception during the write.

public void handle(CharSequence cSeq)

Trains the language model on the specified character sequence.
This is a convenience method delegating to train(CharSequence).

Specified by:
handle in interface ObjectHandler<CharSequence>
Parameters:
cSeq - Character sequence on which to train.

public void train(CharSequence cs, int count)

Updates the model with the training data provided by the specified
character sequence with the specified count. Calling train(cs,n)
is equivalent to calling train(cs) a total of n times.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - The character sequence to use as training data.
count - Number of instances to train.

public void train(CharSequence cs)

Updates the model with the training data provided by the specified
character sequence with a count of one.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - The character sequence to use as training data.

public void train(char[] cs, int start, int end)

Updates the model with the training data provided by the specified
character slice.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - The underlying character array for the slice.
start - Index of first character in the slice.
end - Index of one plus the last character in the training slice.

public void train(char[] cs, int start, int end, int count)

Updates the model with the training data provided by the specified
character slice with the specified count. Calling train(cs,n) is
equivalent to calling train(cs) a total of n times.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - The underlying character array for the slice.
start - Index of first character in the slice.
end - Index of one plus the last character in the training slice.
count - Number of instances to train.

public double log2ConditionalEstimate(CharSequence cs)

Returns the log (base 2) of the probability estimate for the
conditional probability of the last character in the specified
character sequence given the previous characters.

Specified by:
log2ConditionalEstimate in interface LanguageModel.Conditional
Parameters:
cs - Character sequence to estimate.

public double log2ConditionalEstimate(char[] cs, int start, int end)

Returns the log (base 2) of the probability estimate for the
conditional probability of the last character in the specified
slice given the previous characters.

Specified by:
log2ConditionalEstimate in interface LanguageModel.Conditional
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus the index of the last character in the slice.

public double log2Estimate(CharSequence cs)

Returns an estimate of the log (base 2) probability of the
specified character sequence.

Specified by:
log2Estimate in interface LanguageModel
Parameters:
cs - Character sequence to estimate.

public double log2Estimate(char[] cs, int start, int end)

Returns an estimate of the log (base 2) probability of the
specified character slice.

Specified by:
log2Estimate in interface LanguageModel
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus index of last character in slice.

public double log2Prob(CharSequence cSeq)

This method is a convenience implementation of the Model interface
which delegates the call to log2Estimate(CharSequence).

Specified by:
log2Prob in interface Model<CharSequence>
Parameters:
cSeq - Character sequence whose probability is returned.

public double prob(CharSequence cSeq)

This method is a convenience implementation of the Model interface
which returns the result of raising 2.0 to the power of the result
of a call to log2Estimate(CharSequence).

Specified by:
prob in interface Model<CharSequence>
Parameters:
cSeq - Character sequence whose probability is returned.

Copyright © 2019 Alias-i, Inc. All rights reserved.