public class UniformBoundaryLM extends Object implements LanguageModel.Dynamic, LanguageModel.Sequence
UniformBoundaryLM implements a uniform sequence
language model with a specified number of outcomes and the same
probability assigned to the end-of-stream marker. The formula
for computing sequence likelihood estimates is:
log2Estimate(cSeq)
= (cSeq.length()+1) * log2 ( 1 / (numOutcomes+1) )
Adding one to the number of outcomes makes the end-of-sequence
just as likely as any other character. Adding one to the
sequence length adds the log likelihood of the end-of-sequence
marker itself.
This model is defined as dynamic for convenience. Calls to the training methods have no effect.
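The estimate described above can be illustrated with a minimal, self-contained sketch. UniformBoundarySketch and its method names are hypothetical stand-ins written for this illustration, not the library's implementation. The sketch assumes that each of the numOutcomes characters and the end-of-sequence marker receives probability 1/(numOutcomes+1), and that the marker contributes one extra symbol to the sequence length.

```java
/** Hypothetical stand-in for illustration only; not the LingPipe class. */
final class UniformBoundarySketch {

    private UniformBoundarySketch() { }

    /** Log (base 2) probability shared by every character and the boundary marker. */
    static double log2PerSymbol(int numOutcomes) {
        // numOutcomes characters plus one end-of-sequence marker, all equally likely
        return -Math.log(numOutcomes + 1.0) / Math.log(2.0);
    }

    /** Log (base 2) estimate of a sequence, counting its end-of-sequence marker. */
    static double log2Estimate(CharSequence cSeq, int numOutcomes) {
        // one term per character, plus one term for the boundary marker
        return (cSeq.length() + 1) * log2PerSymbol(numOutcomes);
    }
}
```

For example, with three outcomes each symbol has probability 1/4, so a three-character sequence plus its boundary marker is estimated at 4 * log2(1/4) = -8 bits.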
Nested interfaces inherited from LanguageModel: LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized

| Modifier and Type | Field and Description |
|---|---|
| static UniformBoundaryLM | ZERO_LM: A constant uniform boundary language model returning zero log estimates. |
| Constructor and Description |
|---|
| UniformBoundaryLM(): Construct a uniform boundary language model with the full set of characters. |
| UniformBoundaryLM(double crossEntropyRate): Create a constant uniform boundary LM with the specified character cross-entropy rate. |
| UniformBoundaryLM(int numOutcomes): Construct a uniform boundary language model with the specified number of outcomes. |
| Modifier and Type | Method and Description |
|---|---|
| void | compileTo(ObjectOutput objOut): Writes a compiled version of this model to the specified object output. |
| void | handle(CharSequence cs): This method for training a character sequence is supplied for compatibility with the dynamic language model interface, but is implemented to do nothing. |
| double | log2Estimate(char[] cs, int start, int end): Returns an estimate of the log (base 2) probability of the specified character slice. |
| double | log2Estimate(CharSequence cSeq): Returns an estimate of the log (base 2) probability of the specified character sequence. |
| int | numOutcomes(): Returns the number of outcomes for this uniform model. |
| void | train(char[] cs, int start, int end): Ignores the training data. |
| void | train(char[] cs, int start, int end, int count): Ignores the training data. |
| void | train(CharSequence cSeq): Ignores the training data. |
| void | train(CharSequence cSeq, int count): Ignores the training data. |
public static final UniformBoundaryLM ZERO_LM
This constant is particularly useful for removing the contribution of whitespace characters to token n-gram language models.
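public UniformBoundaryLM()
Construct a uniform boundary language model with the full set of characters.
public UniformBoundaryLM(int numOutcomes)
Construct a uniform boundary language model with the specified number of outcomes. Each character, and the end-of-sequence marker, is estimated at probability 1/(numOutcomes+1).
Parameters:
numOutcomes - Number of outcomes.
public UniformBoundaryLM(double crossEntropyRate)
Create a constant uniform boundary LM with the specified character cross-entropy rate. The log (base 2) estimate of a character sequence is:
log2 P(cs)
= - crossEntropyRate * (cs.length() + 1)
The number of outcomes is set by raising 2 to the power of the cross-entropy rate, rounding down, and subtracting one for the boundary character:
numOutcomes = (int) 2.0^crossEntropyRate - 1
If the above expression evaluates to less than zero, the number of outcomes is rounded up to zero.
Parameters:
crossEntropyRate - The cross-entropy rate of the model.
Throws:
IllegalArgumentException - If the cross-entropy rate is not finite and non-negative.
public int numOutcomes()
Returns the number of outcomes for this uniform model.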
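The arithmetic of the cross-entropy constructor described above can be sketched in plain Java. CrossEntropySketch and its methods are hypothetical names used only for this illustration. The validation shown is an assumption based on the documented IllegalArgumentException condition, not a copy of the library's code.

```java
/** Hypothetical illustration of the cross-entropy arithmetic; not the LingPipe class. */
final class CrossEntropySketch {

    private CrossEntropySketch() { }

    /** Derives the number of outcomes from a per-character cross-entropy rate. */
    static int numOutcomes(double crossEntropyRate) {
        // assumed check, mirroring the documented "finite and non-negative" requirement
        if (Double.isNaN(crossEntropyRate)
            || Double.isInfinite(crossEntropyRate)
            || crossEntropyRate < 0.0) {
            throw new IllegalArgumentException(
                "Cross-entropy rate must be finite and non-negative; found " + crossEntropyRate);
        }
        // round 2^rate down, then subtract one for the boundary character
        int n = (int) Math.pow(2.0, crossEntropyRate) - 1;
        // never report fewer than zero outcomes
        return Math.max(0, n);
    }

    /** Log (base 2) estimate of a sequence at a constant cross-entropy rate. */
    static double log2Estimate(CharSequence cs, double crossEntropyRate) {
        // the extra 1 accounts for the end-of-sequence marker
        return -crossEntropyRate * (cs.length() + 1);
    }
}
```

With a rate of 2.0 bits per character, this sketch yields three outcomes and an estimate of -8 for a three-character sequence, which agrees with a uniform model over three characters plus the boundary marker. A rate of 0.0 gives zero outcomes and a zero log estimate for every sequence.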
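public void handle(CharSequence cs)
This method for training a character sequence is supplied for compatibility with the dynamic language model interface, but is implemented to do nothing.
Specified by:
handle in interface ObjectHandler<CharSequence>
Parameters:
cs - Ignored.
public void compileTo(ObjectOutput objOut) throws IOException
Writes a compiled version of this model to the specified object output.
Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this model is written.
Throws:
IOException - If there is an I/O error during the write.
public void train(char[] cs, int start, int end)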
Ignores the training data.
Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - Ignored.
start - Ignored.
end - Ignored.
public void train(char[] cs, int start, int end, int count)
Ignores the training data.
Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - Ignored.
start - Ignored.
end - Ignored.
count - Ignored.
public void train(CharSequence cSeq)
Ignores the training data.
Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cSeq - Ignored.
public void train(CharSequence cSeq, int count)
Ignores the training data.
Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cSeq - Ignored.
count - Ignored.
public double log2Estimate(char[] cs, int start, int end)
Returns an estimate of the log (base 2) probability of the specified character slice.
Specified by:
log2Estimate in interface LanguageModel
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus index of last character in slice.
public double log2Estimate(CharSequence cSeq)
Returns an estimate of the log (base 2) probability of the specified character sequence.
Specified by:
log2Estimate in interface LanguageModel
Parameters:
cSeq - Character sequence to estimate.
Copyright © 2016 Alias-i, Inc. All rights reserved.