public class CompiledNGramProcessLM extends Object implements LanguageModel.Process, LanguageModel.Conditional, Model<CharSequence>
CompiledNGramProcessLM implements a conditional
process language model. Instances are constructed by reading a
serialized version of an instance of NGramProcessLM through
data input.
Compiled models contain precomputed estimates and smoothing parameters. For instance, consider a 3-gram process language model trained on the string "abracadabra", which yields the following counts:
"" 11
  a 5
    b 2
      r 2
    c 1
      a 1
    d 1
      a 1
  b 2
    r 2
      a 2
  c 1
    a 1
      d 1
  d 1
    a 1
      b 1
  r 2
    a 2
      c 1
For instance, count("")=11,
count("a")=5,
count("ab")=2,
count("abr")=2,
count("br")=2,
count("bra")=2,
count("r")=2,
count("ra")=2, and
count("rac")=1.
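One way to reproduce the counts above is a short sketch (not LingPipe source): for each character of the training text, the count of every suffix of the (up to) 3-gram ending at that character, including the empty string, is incremented. This yields count("") equal to the number of characters, 11.

```java
import java.util.HashMap;
import java.util.Map;

// Recompute the substring counts for "abracadabra" with maximum n-gram
// length 3. For each character position, increment the count of each
// suffix of the longest n-gram ending at that character, including "".
public class AbracadabraCounts {
    static Map<String, Integer> nGramCounts(String text, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int end = 1; end <= text.length(); end++)
            for (int n = 0; n <= Math.min(maxN, end); n++)
                counts.merge(text.substring(end - n, end), 1, Integer::sum);
        return counts;
    }
}
```

Running this on "abracadabra" with maxN = 3 reproduces the counts listed above, e.g. count("")=11, count("a")=5, and count("abr")=2.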
Assuming a lambda factor hyperparameter set to 4.0, the compiled
model consists of the following five parallel arrays:
The actual data is contained in the last five columns of the table; thus internal nodes require 18 bytes and terminal nodes require 10 bytes. The indices in the first column and the implicit context represented by the node are shown for convenience. Each of these indices picks out a row in the table that corresponds to a node in the trie structure representing the counts. The nodes are arranged according to a unicode-order breadth-first traversal of the count trie.

Each non-terminal node stores a character, the index of the suffix node (as in a suffix tree), a probability estimate for the character given the context of characters that led to the character, an interpolation value for the context represented by the context leading to the character plus the character itself, and finally a pointer to the first daughter of the node in the array. Note that the last daughter of a node is thus picked out by the first daughter of the next node minus one. For instance, the daughters of node 1 are 6, 7, and 8, and the only daughter of node 9 is 16. The terminal nodes have no daughters; note that memory is not allocated for these terminal cells.
| Index | Context | Char (char, 2) | Suffix (int, 4) | log2 P(char\|context) (float, 4) | log2 (1 - λ(context.char)) (float, 4) | First Dtr (int, 4) |
|---|---|---|---|---|---|---|
| 0 | n/a | n/a | n/a | n/a | -0.6322682 | 1 |
| 1 | "" | a | 0 | -2.6098135 | -0.4150375 | 6 |
| 2 | "" | b | 0 | -3.8987012 | -0.5849625 | 9 |
| 3 | "" | c | 0 | -4.845262 | -0.32192808 | 10 |
| 4 | "" | d | 0 | -4.845262 | -0.32192808 | 11 |
| 5 | "" | r | 0 | -3.8987012 | -0.5849625 | 12 |
| 6 | "a" | b | 2 | -2.5122285 | -0.5849625 | 13 |
| 7 | "a" | c | 3 | -3.4966948 | -0.32192808 | 14 |
| 8 | "a" | d | 4 | -3.4966948 | -0.32192808 | 15 |
| 9 | "b" | r | 5 | -1.4034244 | -0.5849625 | 16 |
| 10 | "c" | a | 1 | -1.5948515 | -0.32192808 | 17 |
| 11 | "d" | a | 1 | -1.5948515 | -0.32192808 | 18 |
| 12 | "r" | a | 1 | -1.1760978 | -0.32192808 | 19 |
| 13 | "ab" | r | 9 | -0.77261907 | | |
| 14 | "ac" | a | 10 | -1.1051782 | | |
| 15 | "ad" | a | 11 | -1.1051782 | | |
| 16 | "br" | a | 12 | -0.6703262 | | |
| 17 | "ca" | d | 8 | -1.8843123 | | |
| 18 | "da" | b | 6 | -1.5554274 | | |
| 19 | "ra" | c | 7 | -1.8843123 | | |

The last two cells of rows 13 through 19 are intentionally left blank: maximum length n-grams are not contexts.
The second column indicates the context derived by walking from the root node (index 0) down to the node. For instance, the path "bra" leads from the root node 0 ("") to the node 2 ("b"), to node 9 ("br"), and finally to node 16 ("bra"). The tree is traversed by starting at the root node and looking up daughters by binary search.
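The traversal can be sketched over the example's character and first-daughter arrays. This is a hypothetical reconstruction, not LingPipe source: the real class holds several more parallel arrays, and the sentinel value 20 closing the last context's daughter range is an assumption of this sketch.

```java
import java.util.Arrays;

// Walk the trie of the worked example by binary search over each node's
// unicode-ordered daughter range.
public class TrieWalk {
    // Node characters for indices 0..19 (index 0 is the root, no char).
    static final char[] CHARS = {
        '\0', 'a', 'b', 'c', 'd', 'r',             // nodes 1-5
        'b', 'c', 'd', 'r', 'a', 'a', 'a',         // nodes 6-12
        'r', 'a', 'a', 'a', 'd', 'b', 'c'          // terminal nodes 13-19
    };
    // First-daughter pointers for context nodes 0..12, plus a sentinel (20)
    // so node 12's daughters are CHARS[19..19]; terminals have no entries.
    static final int[] FIRST_DTR = {1, 6, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20};

    // Returns the index of the node reached by the path, or -1 if absent.
    static int lookup(String path) {
        int node = 0;                               // start at the root
        for (int i = 0; i < path.length(); i++) {
            if (node >= FIRST_DTR.length - 1) return -1;  // terminal node
            int found = Arrays.binarySearch(CHARS, FIRST_DTR[node],
                                            FIRST_DTR[node + 1], path.charAt(i));
            if (found < 0) return -1;               // character not a daughter
            node = found;
        }
        return node;
    }
}
```

For instance, lookup("bra") follows root (0) to node 2, node 9, and node 16, mirroring the walk described above.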
If a full context and node are in the tree, the estimate can
simply be looked up. For instance,
log2 P(b) = -3.898,
log2 P(r|b) = -1.403, and
log2 P(a|br) = -0.670. If the
context is available, but not the outcome, the result is just the
addition of the log one minus lambda estimates down to the first
context for which the outcome exists. These contexts to explore
are all suffixes of the context currently being explored and may be
looked up in the suffix array. For instance, consider:
P(c|br)
= λ(br) PML(c|br)
+ (1-λ(br)) P(c|r)
In this case, the outcome c is not available for the
context br (the only daughter is a),
hence the maximum likelihood estimate is zero, and the result
reduces to the second term:
P(c|br) = (1-λ(br)) P(c|r)
and hence
log2 P(c|br)
= log2((1-λ(br)) P(c|r))
= log2(1-λ(br))
+ log2 P(c|r)
This continues until the outcome is found. In this case,
we continue with the next term in the same way:
= log2 (1-λ(br))
+ log2 (1-λ(r))
+ log2 P(c)
= -0.584 + -0.584 + -4.845
= -6.015
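The arithmetic can be checked directly. Assuming the interpolation ratio is the Witten-Bell-style λ(c) = count(c) / (count(c) + factor · numOutcomes(c)) with the lambda factor of 4.0 given above (an inference from the table's values, not a statement of LingPipe internals), the two log2(1-λ) terms and their sum with log2 P(c) can be recomputed:

```java
// Recompute the chained backoff estimate for log2 P(c|br) from the
// example's counts and the table's log2 P(c) entry.
public class BackoffArithmetic {
    // Assumed interpolation ratio: λ = count / (count + factor * numOutcomes).
    static double log2OneMinusLambda(int count, int numOutcomes, double factor) {
        double lambda = count / (count + factor * numOutcomes);
        return Math.log(1.0 - lambda) / Math.log(2.0);
    }

    static double log2PcGivenBr() {
        double log2PC = -4.845262;                 // table row 3: log2 P(c)
        return log2OneMinusLambda(2, 1, 4.0)       // context "br": count 2, 1 outcome
             + log2OneMinusLambda(2, 1, 4.0)       // context "r": count 2, 1 outcome
             + log2PC;
    }
}
```

Under the same assumption, the root row's value -0.6322682 matches log2(20/31) for count("") = 11 with five observed outcomes.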
Because 3-grams are the upper bound on context length, and terminal
nodes have no daughters, space is saved at the end of the
smoothing and daughter arrays.
In practice, this smoothing may require going all the way down to the uniform model. The uniform model estimate is stored separately. The interpolation parameter for the root node works as for any other node.
For estimates of sequences, the final node used will act as the first potential context for the next estimate. This replaces a number of lookups equal to the n-gram length by binary character search with a simple array lookup.
Implementation Note: The suffix indices are not included in the binary serialization format. Instead, they are initialized right after the binary data is read in. This requires walking the trie structure and computing each node's suffix by a walk from the root. For instance, a million node 8-gram model will require at most 8 million binary character searches during initialization.
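That initialization can be sketched over the example table's arrays. This is a hypothetical reconstruction, not LingPipe source: the suffix of the node for a string is the node for that string minus its first character, found by a walk from the root, and the sentinel value 20 closing the last context's daughter range is an assumption of this sketch.

```java
import java.util.Arrays;

// Compute suffix links for the worked example's trie by depth-first
// traversal, walking from the root for each node's path minus its
// first character.
public class SuffixInit {
    static final char[] CHARS = {
        '\0', 'a', 'b', 'c', 'd', 'r',             // nodes 1-5
        'b', 'c', 'd', 'r', 'a', 'a', 'a',         // nodes 6-12
        'r', 'a', 'a', 'a', 'd', 'b', 'c'          // terminal nodes 13-19
    };
    static final int[] FIRST_DTR = {1, 6, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20};
    static final int[] SUFFIX = new int[CHARS.length];

    // Binary-search walk from the root; returns -1 if the path is absent.
    static int walk(String path) {
        int node = 0;
        for (int i = 0; i < path.length(); i++) {
            if (node >= FIRST_DTR.length - 1) return -1;   // terminal node
            int found = Arrays.binarySearch(CHARS, FIRST_DTR[node],
                                            FIRST_DTR[node + 1], path.charAt(i));
            if (found < 0) return -1;
            node = found;
        }
        return node;
    }

    // Depth-first traversal filling in each node's suffix link.
    static void initSuffixes(int node, String path) {
        SUFFIX[node] = path.isEmpty() ? 0 : walk(path.substring(1));
        if (node >= FIRST_DTR.length - 1) return;          // no daughters
        for (int d = FIRST_DTR[node]; d < FIRST_DTR[node + 1]; d++)
            initSuffixes(d, path + CHARS[d]);
    }
}
```

Calling initSuffixes(0, "") reproduces the Suffix column of the table, e.g. node 9 ("br") gets suffix 5 ("r") and node 17 ("cad") gets suffix 8 ("ad").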
Nested interfaces inherited from interface LanguageModel: LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized

| Modifier and Type | Field and Description |
|---|---|
| static int | ROOT_NODE_INDEX The index of the root node, namely 0. |
| Modifier and Type | Method and Description |
|---|---|
| double | log2ConditionalEstimate(char[] cs, int start, int end) Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified slice given the previous characters. |
| double | log2ConditionalEstimate(CharSequence cSeq) Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified character sequence given the previous characters. |
| double | log2Estimate(char[] cs, int start, int end) Returns an estimate of the log (base 2) probability of the specified character slice. |
| double | log2Estimate(CharSequence cSeq) Returns an estimate of the log (base 2) probability of the specified character sequence. |
| double | log2Estimate(int contextIndex, char nextChar) Returns the log (base 2) estimate of the specified character in the context with the specified index. |
| double | log2Prob(CharSequence cSeq) A convenience implementation of the Model interface which delegates the call to log2Estimate(CharSequence). |
| int | longestContextIndex(String context) Returns the index in the underlying parallel array structure of the maximum-length suffix of the specified string that is defined as a context. |
| int | maxNGram() Returns the maximum length n-gram used for this compiled language model. |
| int | nextContext(int contextIndex, char nextChar) Returns the index of the context formed by appending the specified character to the context of the specified index. |
| int | numNodes() Returns the total number of nodes in this language model's trie structure. |
| char[] | observedCharacters() Returns the array of characters that have been observed for this language model. |
| double | prob(CharSequence cSeq) A convenience implementation of the Model interface which returns the result of raising 2.0 to the power of the result of a call to log2Estimate(CharSequence). |
| String | toString() Returns a string-based representation of this compiled n-gram language model. |
public static final int ROOT_NODE_INDEX
The index of the root node, namely 0.

public char[] observedCharacters()
Returns the array of characters that have been observed for this language model. It is assumed that the observed characters are those with stored unigram probabilities in the trie.
Specified by: observedCharacters in interface LanguageModel.Conditional

public int maxNGram()
Returns the maximum length n-gram used for this compiled language model.

public int numNodes()
Returns the total number of nodes in this language model's trie structure.

public int longestContextIndex(String context)
Returns the index in the underlying parallel array structure of the maximum-length suffix of the specified string that is defined as a context.
Parameters: context - String of context.

public double log2Prob(CharSequence cSeq)
This method is a convenience implementation of the Model interface which delegates the call to log2Estimate(CharSequence).
Specified by: log2Prob in interface Model<CharSequence>
Parameters: cSeq - Character sequence whose probability is returned.

public double prob(CharSequence cSeq)
This method is a convenience implementation of the Model interface which returns the result of raising 2.0 to the power of the result of a call to log2Estimate(CharSequence).
Specified by: prob in interface Model<CharSequence>
Parameters: cSeq - Character sequence whose probability is returned.

public final double log2Estimate(CharSequence cSeq)
Returns an estimate of the log (base 2) probability of the specified character sequence.
Specified by: log2Estimate in interface LanguageModel
Parameters: cSeq - Character sequence to estimate.

public final double log2Estimate(int contextIndex, char nextChar)
Returns the log (base 2) estimate of the specified character in the context with the specified index. The main use of this method is to incrementally compute conditional estimates and contexts, in conjunction with the method nextContext(int,char).
Parameters: contextIndex - Index of context of estimate. nextChar - Character being estimated.

public int nextContext(int contextIndex, char nextChar)
Returns the index of the context formed by appending the specified character to the context of the specified index, for use with log2Estimate(int,char). Note that the index of the root node is always 0.
Parameters: contextIndex - Index of present context. nextChar - Next character.
Throws: IllegalArgumentException - If the context index is less than zero or greater than the last context index.

public final double log2Estimate(char[] cs, int start, int end)
Returns an estimate of the log (base 2) probability of the specified character slice.
Specified by: log2Estimate in interface LanguageModel
Parameters: cs - Underlying array of characters. start - Index of first character in slice. end - One plus index of last character in slice.

public double log2ConditionalEstimate(CharSequence cSeq)
Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified character sequence given the previous characters.
Specified by: log2ConditionalEstimate in interface LanguageModel.Conditional
Parameters: cSeq - Character sequence to estimate.

public double log2ConditionalEstimate(char[] cs, int start, int end)
Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified slice given the previous characters.
Specified by: log2ConditionalEstimate in interface LanguageModel.Conditional
Parameters: cs - Underlying array of characters. start - Index of first character in slice. end - One plus the index of the last character in the slice.

public String toString()
Returns a string-based representation of this compiled n-gram language model.
Copyright © 2019 Alias-i, Inc. All rights reserved.