public class NGramProcessLM extends Object implements Model<CharSequence>, LanguageModel.Process, LanguageModel.Conditional, LanguageModel.Dynamic, ObjectHandler<CharSequence>, Serializable
An NGramProcessLM provides a dynamic conditional
process language model for which training, estimation, and
pruning may be interleaved. A process language model normalizes
probabilities for a given length of input.
The model may be compiled to an object output stream; the
compiled model read back in will be an instance of CompiledNGramProcessLM.
This class implements a generative language model based on the
chain rule, as specified by LanguageModel.Conditional.
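For orientation, here is a minimal usage sketch using only constructors and methods documented on this page; the training strings are made up for illustration:

```java
import com.aliasi.lm.NGramProcessLM;

public class NGramProcessLMDemo {
    public static void main(String[] args) {
        // 5-gram character process language model with default settings.
        NGramProcessLM lm = new NGramProcessLM(5);

        // Training, estimation and pruning may be interleaved.
        lm.train("john ran home");
        lm.train("john ran away");

        // Log (base 2) probability of an entire character sequence.
        double log2Seq = lm.log2Estimate("john ran");

        // Log (base 2) probability of the last character given the rest.
        double log2Last = lm.log2ConditionalEstimate("john r");

        System.out.println(log2Seq + " " + log2Last);
    }
}
```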
The maximum likelihood estimator (see CharSeqCounter)
is smoothed by linear interpolation with the next lower-order context
model:

P'(c_k | c_j,...,c_{k-1})
  = lambda(c_j,...,c_{k-1}) * P_ML(c_k | c_j,...,c_{k-1})
  + (1 - lambda(c_j,...,c_{k-1})) * P'(c_k | c_{j+1},...,c_{k-1})

The P_ML terms in the above definition
are maximum likelihood estimates based on frequency:

P_ML(c_k | c_j,...,c_{k-1}) = count(c_j,...,c_{k-1},c_k) / extCount(c_j,...,c_{k-1})

Here count is just the number of times a given string
has been seen in the data, whereas extCount is the number
of times an extension to the string has been seen in the data:

extCount(c_1,...,c_n) = Σ_d count(c_1,...,c_n,d)
In the parametric Witten-Bell method, the interpolation ratio
lambda is defined based on extensions of the context
of estimation to be:

lambda(c_1,...,c_n)
  = extCount(c_1,...,c_n)
  / (extCount(c_1,...,c_n) + L * numExtensions(c_1,...,c_n))

where c_1,...,c_n is the conditioning context for estimation,
extCount is as defined above, and numExtensions is the
number of extensions of a context:

numExtensions(c_1,...,c_n) = cardinality( { d | count(c_1,...,c_n,d) > 0 } )

and where L is a hyperparameter of the distribution
(described below).
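As an illustration of the definitions above, the following sketch (plain arithmetic, not LingPipe internals) works through one smoothed estimate using made-up counts and a made-up lower-order estimate:

```java
public class WittenBellSketch {
    public static void main(String[] args) {
        // Hypothetical counts: "abc" seen 3 times, "abd" seen once, so
        // extCount("ab") = 4 and numExtensions("ab") = 2 ('c' and 'd').
        double L = 4.0;              // hyperparameter L (the lambda factor)
        double extCount = 3 + 1;     // extCount("ab")
        double numExtensions = 2;    // distinct characters extending "ab"

        // lambda("ab") = extCount / (extCount + L * numExtensions) = 4/12 = 1/3
        double lambda = extCount / (extCount + L * numExtensions);

        double pMl = 3.0 / extCount; // P_ML('c' | "ab") = 3/4

        // Assume the lower-order model gives P'('c' | "b") = 0.5.
        double pLowerOrder = 0.5;

        // P'('c' | "ab") = lambda * P_ML + (1 - lambda) * P'('c' | "b") ≈ 0.583
        double pSmoothed = lambda * pMl + (1.0 - lambda) * pLowerOrder;
        System.out.println(pSmoothed);
    }
}
```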
As a base case, P(c_k) is
interpolated with the uniform distribution
P_U, with interpolation defined
as usual with the argument to lambda being the
empty (i.e. zero length) sequence:

P(d) = lambda() * P_ML(d) + (1 - lambda()) * P_U(d)

The uniform distribution P_U only
depends on the number of possible characters used in training and
tests:

P_U(c) = 1 / alphabetSize

where alphabetSize is the maximum number of distinct
characters in this model.
The free hyperparameter L in the smoothing equation
determines the balance between higher-order and lower-order models.
A higher value for L gives more of the weight to
lower-order contexts. As the amount of data grows against a fixed
alphabet of characters, the impact of L is reduced.
In Witten and Bell's original paper, the hyperparameter
L was set to 1.0, which is not a particularly good
choice for most text sources. A value of the lambda factor that is
roughly the length of the longest n-gram seems to be a good rule of
thumb.
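For example, following that rule of thumb, a model might be constructed and tuned as follows (the numbers are illustrative only):

```java
import com.aliasi.lm.NGramProcessLM;

public class LambdaFactorSketch {
    public static void main(String[] args) {
        int maxNGram = 8;
        int numChars = 64; // hypothetical bound on distinct characters in the data

        // Rule of thumb from above: a lambda factor roughly equal to the
        // longest n-gram length rather than Witten and Bell's original 1.0.
        NGramProcessLM lm = new NGramProcessLM(maxNGram, numChars, (double) maxNGram);

        // The hyperparameter may also be changed later without retraining.
        lm.setLambdaFactor(4.0);
    }
}
```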
Methods are provided for computing a sample cross-entropy rate
for a character sequence. The sample cross-entropy rate
H(c_1,...,c_n; P_M) for
the sequence c_1,...,c_n in
probability model P_M is defined to be the
negative average log (base 2) probability of the characters in the sequence
according to the model. In symbols:

H(c_1,...,c_n; P_M) = - log2 P_M(c_1,...,c_n) / n
The cross-entropy rate of a distribution P'
with respect to a distribution P is defined by:

H(P',P) = - Σ_x P(x) * log2 P'(x)
The Shannon-McMillan-Breiman theorem shows that as the length of
the sample drawn from the true distribution P grows,
the sample cross-entropy rate approaches the actual cross-entropy
rate. In symbols:

H(P,P_M) = lim_{n -> infinity} H(c_1,...,c_n; P_M)
The entropy of a distribution P is defined by its
cross-entropy against itself, H(P,P). A
distribution's entropy is a lower bound on its cross-entropy; in
symbols, H(P',P) >= H(P,P) for all distributions
P'.
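In code, the sample cross-entropy rate can be computed directly from log2Estimate; a small sketch with made-up training and held-out text:

```java
import com.aliasi.lm.NGramProcessLM;

public class CrossEntropySketch {
    public static void main(String[] args) {
        NGramProcessLM lm = new NGramProcessLM(5);
        lm.train("the cat sat on the mat");
        lm.train("the dog sat on the log");

        // H(c_1,...,c_n; P_M) = -log2 P_M(c_1,...,c_n) / n, in bits per character.
        String heldOut = "the cat sat on the log";
        double rate = -lm.log2Estimate(heldOut) / heldOut.length();
        System.out.println("sample cross-entropy rate = " + rate);
    }
}
```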
Models may be pruned by pruning the underlying substring
counter for the language model. This counter is returned by
the method substringCounter(). See the class documentation
for the return type TrieCharSeqCounter for more information.
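A minimal pruning sketch, assuming the TrieCharSeqCounter returned by substringCounter() exposes a prune(int minCount) method as described in its own class documentation:

```java
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.lm.TrieCharSeqCounter;

public class PruningSketch {
    public static void main(String[] args) {
        NGramProcessLM lm = new NGramProcessLM(5);
        lm.train("some training text");

        // Prune low-frequency substrings from the underlying counter
        // (prune(int) is assumed from the TrieCharSeqCounter documentation).
        TrieCharSeqCounter counter = lm.substringCounter();
        counter.prune(2);

        // Training and estimation may continue after pruning.
        lm.train("more training text");
        System.out.println(lm.log2Estimate("more text"));
    }
}
```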
Models may be serialized in the usual way by creating an object output and writing the object:

    NGramProcessLM lm = ...;
    ObjectOutput out = ...;
    out.writeObject(lm);

Reading just reverses the process:

    ObjectInput in = ...;
    NGramProcessLM lm = (NGramProcessLM) in.readObject();

Serialization is based on the methods
writeTo(OutputStream)
and readFrom(InputStream). These write compressed forms of
the model to streams in binary format.
Warning: The object input and output used for
serialization must extend InputStream and OutputStream. The only implementations of ObjectInput and
ObjectOutput as of the 1.6 JDK do extend the streams, so
this will only be a problem with customized object input or output
objects. If you need this method to work with custom input and
output objects that do not extend the corresponding streams, drop
us a line and we can perhaps refactor the output methods to remove
this restriction.
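A round-trip sketch using the standard object streams (which extend the required InputStream and OutputStream):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

import com.aliasi.lm.NGramProcessLM;

public class SerializationSketch {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        NGramProcessLM lm = new NGramProcessLM(5);
        lm.train("serialize me");

        // Write the model to an in-memory byte array.
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        ObjectOutputStream objOut = new ObjectOutputStream(bytesOut);
        objOut.writeObject(lm);
        objOut.close();

        // Read it back; the deserialized object is again an NGramProcessLM.
        ObjectInputStream objIn
            = new ObjectInputStream(new ByteArrayInputStream(bytesOut.toByteArray()));
        NGramProcessLM lm2 = (NGramProcessLM) objIn.readObject();
        objIn.close();

        System.out.println(lm2.maxNGram());
    }
}
```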
For information on the Witten-Bell interpolation method, see:

Witten, Ian H. and Timothy C. Bell. 1991. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory 37(4).

See Also: LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized

| Constructor and Description |
|---|
| NGramProcessLM(int maxNGram) Constructs an n-gram process language model with the specified maximum n-gram length. |
| NGramProcessLM(int numChars, double lambdaFactor, TrieCharSeqCounter counter) Construct an n-gram process language model with the specified number of characters, interpolation parameter and character sequence counter. |
| NGramProcessLM(int maxNGram, int numChars) Construct an n-gram language process with the specified maximum n-gram length and maximum number of characters. |
| NGramProcessLM(int maxNGram, int numChars, double lambdaFactor) Construct an n-gram language process with the specified maximum n-gram length, number of characters, and interpolation ratio hyperparameter. |
| Modifier and Type | Method and Description |
|---|---|
| void | compileTo(ObjectOutput objOut) Writes a compiled version of this process language model to the specified object output. |
| double | getLambdaFactor() Returns the current setting of the interpolation ratio hyperparameter. |
| void | handle(CharSequence cSeq) Implements the object handler interface over character sequences for training. |
| double | log2ConditionalEstimate(char[] cs, int start, int end) Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified slice given the previous characters. |
| double | log2ConditionalEstimate(char[] cs, int start, int end, int maxNGram, double lambdaFactor) Returns the log (base 2) conditional estimate for a specified character slice with a specified maximum n-gram and specified hyperparameter. |
| double | log2ConditionalEstimate(CharSequence cSeq) Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified character sequence given the previous characters. |
| double | log2ConditionalEstimate(CharSequence cSeq, int maxNGram, double lambdaFactor) Returns the log (base 2) conditional estimate of the last character in the specified character sequence given the previous characters based only on counts of n-grams up to the specified maximum n-gram. |
| double | log2Estimate(char[] cs, int start, int end) Returns an estimate of the log (base 2) probability of the specified character slice. |
| double | log2Estimate(CharSequence cSeq) Returns an estimate of the log (base 2) probability of the specified character sequence. |
| double | log2Prob(CharSequence cSeq) Returns the log (base 2) of the probability of the specified object. |
| int | maxNGram() Returns the maximum n-gram length for this model. |
| char[] | observedCharacters() Returns the array of characters that have been observed for this model. |
| double | prob(CharSequence cSeq) Returns the probability of the specified object. |
| static NGramProcessLM | readFrom(InputStream in) Reads a language model from the specified input stream. |
| void | setLambdaFactor(double lambdaFactor) Sets the value of the interpolation ratio hyperparameter to the specified value. |
| void | setNumChars(int numChars) Sets the number of characters for this language model. |
| TrieCharSeqCounter | substringCounter() Returns the substring counter for this language model. |
| String | toString() Returns a string-based representation of this language model. |
| void | train(char[] cs, int start, int end) Update the model with the training data provided by the specified character slice. |
| void | train(char[] cs, int start, int end, int incr) Update the model with the training data provided by the specified character slice with the specified count. |
| void | train(CharSequence cSeq) Update the model with the training data provided by the specified character sequence with a count of one. |
| void | train(CharSequence cSeq, int incr) Update the model with the training data provided by the specified character sequence with the specified count. |
| void | trainConditional(char[] cs, int start, int end, int condEnd) Trains the specified conditional outcome(s) of the specified character slice given the background slice. |
| void | writeTo(OutputStream out) Writes this language model to the specified output stream. |
public NGramProcessLM(int maxNGram)

Constructs an n-gram process language model with the specified maximum n-gram length. The maximum number of characters is set to Character.MAX_VALUE and the interpolation parameter is set to the default of being equal to the n-gram length.

Parameters:
maxNGram - Maximum length n-gram for which counts are stored.

public NGramProcessLM(int maxNGram, int numChars)

Construct an n-gram language process with the specified maximum n-gram length and maximum number of characters.

Parameters:
maxNGram - Maximum length n-gram for which counts are stored.
numChars - Maximum number of characters in training data.
Throws:
IllegalArgumentException - If the maximum n-gram is less than 1 or if the number of characters is not between 1 and the maximum number of characters.

public NGramProcessLM(int maxNGram, int numChars, double lambdaFactor)

Construct an n-gram language process with the specified maximum n-gram length, number of characters, and interpolation ratio hyperparameter.

Parameters:
maxNGram - Maximum length n-gram for which counts are stored.
numChars - Maximum number of characters in training data.
lambdaFactor - Central value of interpolation hyperparameter explored.
Throws:
IllegalArgumentException - If the maximum n-gram is less than 1, the number of characters is not between 1 and the maximum number of characters, or if the lambda factor is not greater than or equal to 0.

public NGramProcessLM(int numChars, double lambdaFactor, TrieCharSeqCounter counter)

Construct an n-gram process language model with the specified number of characters, interpolation parameter and character sequence counter. The counter argument allows serialized counters to be read back in and used to create an n-gram process LM.

Parameters:
numChars - Maximum number of characters in training and test data.
lambdaFactor - Interpolation parameter (see class doc).
counter - Character sequence counter to use.
Throws:
IllegalArgumentException - If the number of characters is not between 1 and the maximum number of characters, or if the lambda factor is not greater than or equal to 0.
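The constructors differ only in which parameters are left at their defaults; a brief sketch with illustrative values:

```java
import com.aliasi.lm.NGramProcessLM;

public class ConstructorSketch {
    public static void main(String[] args) {
        // numChars defaults to Character.MAX_VALUE and the lambda factor
        // defaults to the n-gram length.
        NGramProcessLM lm1 = new NGramProcessLM(5);

        // Bound the alphabet (256 is an illustrative value), which sharpens
        // the uniform base distribution P_U(c) = 1/alphabetSize.
        NGramProcessLM lm2 = new NGramProcessLM(5, 256);

        // Also fix the interpolation hyperparameter explicitly.
        NGramProcessLM lm3 = new NGramProcessLM(5, 256, 5.0);
    }
}
```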
public void writeTo(OutputStream out) throws IOException

A language model is written using a BitOutput
wrapped around the specified output stream. This bit output is
used to delta encode the maximum n-gram, number of characters,
lambda factor times 1,000,000, and then the underlying sequence
counter using TrieCharSeqCounter.writeCounter(CharSeqCounter,TrieWriter,int).
The bit output is flushed, but the output stream is not closed.

A language model can be read and written using the following
code, given a file f:

    NGramProcessLM lm = ...;
    File f = ...;
    OutputStream out = new FileOutputStream(f);
    BufferedOutputStream bufOut = new BufferedOutputStream(out);
    lm.writeTo(bufOut);
    bufOut.close();
    ...
    InputStream in = new FileInputStream(f);
    BufferedInputStream bufIn = new BufferedInputStream(in);
    NGramProcessLM lm2 = NGramProcessLM.readFrom(bufIn);
    bufIn.close();

Parameters:
out - Output stream to which to write language model.
Throws:
IOException - If there is an underlying I/O error.

public static NGramProcessLM readFrom(InputStream in) throws IOException

Reads a language model from the specified input stream.
See writeTo(OutputStream) for information on the binary I/O format.

Parameters:
in - Input stream from which to read a language model.
Throws:
IOException - If there is an underlying I/O error.

public double log2Prob(CharSequence cSeq)

Description copied from interface: Model
Returns the log (base 2) of the probability of the specified object.

Specified by:
log2Prob in interface Model<CharSequence>
Parameters:
cSeq - The object whose probability is returned.

public double prob(CharSequence cSeq)

Description copied from interface: Model
Returns the probability of the specified object.

Specified by:
prob in interface Model<CharSequence>
Parameters:
cSeq - The object whose probability is returned.

public final double log2Estimate(CharSequence cSeq)

Description copied from interface: LanguageModel
Returns an estimate of the log (base 2) probability of the specified character sequence.

Specified by:
log2Estimate in interface LanguageModel
Parameters:
cSeq - Character sequence to estimate.

public final double log2Estimate(char[] cs, int start, int end)

Description copied from interface: LanguageModel
Returns an estimate of the log (base 2) probability of the specified character slice.

Specified by:
log2Estimate in interface LanguageModel
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus index of last character in slice.
public void train(CharSequence cSeq)

Description copied from interface: LanguageModel.Dynamic
Update the model with the training data provided by the specified character sequence with a count of one.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cSeq - The character sequence to use as training data.

public void train(CharSequence cSeq, int incr)

Description copied from interface: LanguageModel.Dynamic
Update the model with the training data provided by the specified character sequence with the specified count. Calling train(cs,n) is equivalent to calling train(cs) a total of n times.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cSeq - The character sequence to use as training data.
incr - Number of instances to train.

public void train(char[] cs, int start, int end)

Description copied from interface: LanguageModel.Dynamic
Update the model with the training data provided by the specified character slice.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - The underlying character array for the slice.
start - Index of first character in the slice.
end - Index of one plus the last character in the training slice.

public void train(char[] cs, int start, int end, int incr)

Description copied from interface: LanguageModel.Dynamic
Update the model with the training data provided by the specified character slice with the specified count. Calling train(cs,n) is equivalent to calling train(cs) a total of n times.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - The underlying character array for the slice.
start - Index of first character in the slice.
end - Index of one plus the last character in the training slice.
incr - Number of instances to train.

public void handle(CharSequence cSeq)

Implements the object handler interface over character sequences for training; the implementation delegates to train(CharSequence).

Specified by:
handle in interface ObjectHandler<CharSequence>
Parameters:
cSeq - Character sequence on which to train.
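Because handle(CharSequence) simply trains the model, an NGramProcessLM can be plugged in wherever an ObjectHandler<CharSequence> is expected; a sketch with a made-up list of training texts (the com.aliasi.corpus package for ObjectHandler is assumed here):

```java
import java.util.Arrays;
import java.util.List;

import com.aliasi.corpus.ObjectHandler;
import com.aliasi.lm.NGramProcessLM;

public class HandlerSketch {
    public static void main(String[] args) {
        NGramProcessLM lm = new NGramProcessLM(5);

        // The model itself is the handler; handle(cSeq) trains on cSeq.
        ObjectHandler<CharSequence> handler = lm;

        List<String> trainingTexts = Arrays.asList("first sentence", "second sentence");
        for (String text : trainingTexts)
            handler.handle(text);

        System.out.println(lm.log2Estimate("first sentence"));
    }
}
```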
public void trainConditional(char[] cs, int start, int end, int condEnd)

Trains the specified conditional outcome(s) of the specified character slice given the background slice. This method is just shorthand for incrementing the counts of
all substrings of cs from position
start to end-1 inclusive, then
decrementing all of the counts of substrings from position
start to condEnd-1. For instance, if
cs is
"abcde".toCharArray(), then calling
trainConditional(cs,0,5,2) will increment the
counts of cde given ab, but will not
increment the counts of ab directly. This increases
the following probabilities:

P('e'|"abcd")
P('e'|"bcd")
P('e'|"cd")
P('e'|"d")
P('e'|"")
P('d'|"abc")
P('d'|"bc")
P('d'|"c")
P('d'|"")
P('c'|"ab")
P('c'|"b")
P('c'|"")

but does not increase the following probabilities:

P('b'|"a")
P('b'|"")
P('a'|"")

Parameters:
cs - Array of characters.
start - Start position for slice.
end - One past end position for slice.
condEnd - One past the end of the conditional portion of the slice.
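The worked "abcde" example above translates directly into code:

```java
import com.aliasi.lm.NGramProcessLM;

public class TrainConditionalSketch {
    public static void main(String[] args) {
        NGramProcessLM lm = new NGramProcessLM(5);

        // Increment counts for "cde" given the context "ab" without
        // training on the context "ab" itself.
        char[] cs = "abcde".toCharArray();
        lm.trainConditional(cs, 0, 5, 2);

        // Estimates such as P('e'|"abcd") now reflect the conditional counts.
        System.out.println(lm.log2ConditionalEstimate("abcde"));
    }
}
```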
public char[] observedCharacters()

Description copied from interface: LanguageModel.Conditional
Returns the array of characters that have been observed for this model.

Specified by:
observedCharacters in interface LanguageModel.Conditional
public void compileTo(ObjectOutput objOut) throws IOException

Writes a compiled version of this process language model to the specified object output. The object written will be an instance of CompiledNGramProcessLM. It may be read in by casting the result of ObjectInput.readObject().

Compilation is time consuming, because it must traverse the entire trie structure and, for each node, estimate its log probability and, if it is internal, its log interpolation value. Given that the time taken is proportional to the size of the trie, pruning first may greatly speed up this operation and reduce the size of the compiled object that is written.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which a compiled version of this language model will be written.
Throws:
IOException - If there is an I/O exception writing the compiled object.
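A compile-and-reload sketch over an in-memory stream; the cast to LanguageModel.Process is an assumption about the compiled class's interfaces, consistent with this model being a process language model:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

import com.aliasi.lm.LanguageModel;
import com.aliasi.lm.NGramProcessLM;

public class CompileSketch {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        NGramProcessLM lm = new NGramProcessLM(5);
        lm.train("compile me");

        // Write the compiled model through an object output stream.
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        ObjectOutputStream objOut = new ObjectOutputStream(bytesOut);
        lm.compileTo(objOut);
        objOut.close();

        // Reading back yields a CompiledNGramProcessLM, used here as a
        // LanguageModel.Process (assumed interface of the compiled class).
        ObjectInputStream objIn
            = new ObjectInputStream(new ByteArrayInputStream(bytesOut.toByteArray()));
        LanguageModel.Process compiledLm = (LanguageModel.Process) objIn.readObject();
        objIn.close();

        System.out.println(compiledLm.log2Estimate("compile me"));
    }
}
```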
public double log2ConditionalEstimate(CharSequence cSeq)

Description copied from interface: LanguageModel.Conditional
Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified character sequence given the previous characters.

Specified by:
log2ConditionalEstimate in interface LanguageModel.Conditional
Parameters:
cSeq - Character sequence to estimate.

public double log2ConditionalEstimate(char[] cs, int start, int end)

Description copied from interface: LanguageModel.Conditional
Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified slice given the previous characters.

Specified by:
log2ConditionalEstimate in interface LanguageModel.Conditional
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus the index of the last character in the slice.

public TrieCharSeqCounter substringCounter()

Returns the substring counter for this language model.

public int maxNGram()

Returns the maximum n-gram length for this model.
public double log2ConditionalEstimate(CharSequence cSeq, int maxNGram, double lambdaFactor)

Returns the log (base 2) conditional estimate of the last character in the specified character sequence given the previous characters based only on counts of n-grams up to the specified maximum n-gram.

Parameters:
cSeq - Character sequence to estimate.
maxNGram - Maximum length of n-gram count to use for estimate.
lambdaFactor - Value of interpolation hyperparameter for this estimate.
Throws:
IllegalArgumentException - If the character sequence is not at least one character long.

public double log2ConditionalEstimate(char[] cs, int start, int end, int maxNGram, double lambdaFactor)

Returns the log (base 2) conditional estimate for a specified character slice with a specified maximum n-gram and specified hyperparameter.

Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
maxNGram - Maximum length of n-gram to use in estimates.
lambdaFactor - Value of interpolation hyperparameter.
Throws:
IndexOutOfBoundsException - If the start index and end index minus one are out of range of the character array or if the character slice is less than one character long.

public double getLambdaFactor()

Returns the current setting of the interpolation ratio hyperparameter.

public final void setLambdaFactor(double lambdaFactor)

Sets the value of the interpolation ratio hyperparameter to the specified value.

Parameters:
lambdaFactor - New value for interpolation ratio hyperparameter.
Throws:
IllegalArgumentException - If the value is not greater than or equal to zero.

public final void setNumChars(int numChars)

Sets the number of characters for this language model.

Parameters:
numChars - New number of characters for this language model.
Throws:
IllegalArgumentException - If the number of characters is less than 0 or more than Character.MAX_VALUE.

Copyright © 2016 Alias-i, Inc. All rights reserved.