public class TokenizedLM extends Object implements LanguageModel.Dynamic, LanguageModel.Sequence, LanguageModel.Tokenized, ObjectHandler<CharSequence>
TokenizedLM provides a dynamic sequence language
model which models token sequences with an n-gram model, and
whitespace and unknown tokens with their own sequence language
models.
A tokenized language model factors the probability assigned to a character sequence as follows:

P(cs) = Ptok(toks(cs))
        * Π{t in unknownToks(cs)} Punk(t)
        * Π{w in whitespaces(cs)} Pwhsp(w)

where:

- Ptok is the token model estimate, and toks(cs) replaces known tokens with their integer identifiers, replaces unknown tokens with -1, and adds the boundary symbol -2 at the front and back; the same adjustment is used to remove the initial boundary estimate as in NGramBoundaryLM;
- Punk is the unknown token sequence language model, and unknownToks(cs) is the list of unknown tokens in the input (with duplicates); and
- Pwhsp is the whitespace sequence language model, and whitespaces(cs) is the list of whitespaces in the character sequence (with duplicates).
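This factorization becomes a simple sum in log (base 2) space. The following is a minimal plain-Java sketch of combining the three components (not the LingPipe API; the method name and component values are hypothetical):

```java
public class Factorization {
    // Combines the token-model estimate with per-unknown-token and
    // per-whitespace estimates; products of probabilities become
    // sums of log (base 2) probabilities.
    public static double log2Estimate(double log2TokenModel,
                                      double[] log2UnknownTokens,
                                      double[] log2Whitespaces) {
        double total = log2TokenModel;
        for (double lp : log2UnknownTokens) total += lp; // Π Punk(t) terms
        for (double lp : log2Whitespaces) total += lp;   // Π Pwhsp(w) terms
        return total;
    }
}
```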
The token n-gram model itself uses the same method of counting
and smoothing as described in the class documentation for NGramProcessLM. As in NGramBoundaryLM, boundary tokens are
inserted before and after the other tokens. And like the n-gram
character boundary model, the initial boundary estimate is subtracted
from the overall estimate for normalization purposes.
Tokens are all converted to integer identifiers using an
internal dynamic symbol table. All symbols in symbol tables get
non-negative identifiers; the negative value -1 is
used for the unknown token in models, just as in symbol tables.
The value -2 is used for the boundary marker in the
counters.
In order for all estimates to be non-zero, the integer sequence counter used to back the token model is initialized with a count of 1 for the end-of-stream identifier (-2). The unknown token count for any context is taken to be the number of outcomes in that context. Because unknowns are estimated directly in this manner, there is no need to interpolate the unigram model with a uniform model for unknown outcomes. Instead, the occurrence of an unknown is modeled directly and its identity is modeled by the unknown token language model.
In order to produce a properly normalized sequence model, the tokens and whitespaces returned by the tokenizer should concatenate together to reproduce the original input. Note that this condition is not checked at runtime.

Sequences may, however, be normalized before being used to train or evaluate a language model. For instance, all alphabetic characters might be reduced to lower case, all punctuation characters removed, and all non-empty sequences of whitespace reduced to a single space character. A language model may then be defined over this normalized space of inputs rather than the original space (and may thus use a reduced number of characters for its uniform estimates). Although this normalization may be carried out by a tokenizer in practice, for instance for use in a tokenized classifier, such normalization is consistent with the interface specifications for LanguageModel.Sequence and LanguageModel.Dynamic only if it is performed externally.
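Such a normalization step might be sketched as follows (plain Java; the class name and exact rules are illustrative, not part of this library):

```java
public class Normalizer {
    // Lowercases letters, removes punctuation, and collapses runs of
    // whitespace to a single space, so a language model can be defined
    // over the normalized space of inputs.
    public static String normalize(String in) {
        String lowered = in.toLowerCase();
        String noPunct = lowered.replaceAll("\\p{Punct}+", "");
        return noPunct.replaceAll("\\s+", " ").trim();
    }
}
```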
See Also: LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized

| Modifier and Type | Field and Description |
|---|---|
| static int | BOUNDARY_TOKEN The symbol used for boundaries in the counter, -2. |
| static int | UNKNOWN_TOKEN The symbol used for unknown symbol IDs, -1. |
| Constructor and Description |
|---|
| TokenizedLM(TokenizerFactory factory, int nGramOrder) Constructs a tokenized language model with the specified tokenization factory and n-gram order (see warnings below on where this simple constructor may be used). |
| TokenizedLM(TokenizerFactory tokenizerFactory, int nGramOrder, LanguageModel.Sequence unknownTokenModel, LanguageModel.Sequence whitespaceModel, double lambdaFactor) Constructs a tokenized language model with the specified tokenization factory and n-gram order, sequence models for unknown tokens and whitespace, and an interpolation hyperparameter. |
| TokenizedLM(TokenizerFactory tokenizerFactory, int nGramOrder, LanguageModel.Sequence unknownTokenModel, LanguageModel.Sequence whitespaceModel, double lambdaFactor, boolean initialIncrementBoundary) Constructs a tokenized language model with the specified tokenization factory and n-gram order, sequence models for unknown tokens and whitespace, and an interpolation hyperparameter, as well as a flag indicating whether to automatically increment a null input to avoid numerical problems with zero counts. |
| Modifier and Type | Method and Description |
|---|---|
| double | chiSquaredIndependence(int[] nGram) Returns the maximum value of Pearson's χ² independence test statistic resulting from splitting the specified n-gram in half to derive a contingency matrix. |
| SortedSet<ScoredObject<String[]>> | collocationSet(int nGram, int minCount, int maxReturned) Returns a sorted set of collocations in order of confidence that their token sequences are not independent. |
| void | compileTo(ObjectOutput objOut) Writes a compiled version of this tokenized language model to the specified object output. |
| SortedSet<ScoredObject<String[]>> | frequentTermSet(int nGram, int maxReturned) Returns the most frequent n-gram terms in the training data, up to the specified maximum number. |
| void | handle(CharSequence cs) Trains the language model on the specified character sequence. |
| void | handleNGrams(int nGramLength, int minCount, ObjectHandler<String[]> handler) Visits the n-grams of the specified length with at least the specified minimum count stored in the underlying counter of this tokenized language model and passes them to the specified handler. |
| SortedSet<ScoredObject<String[]>> | infrequentTermSet(int nGram, int maxReturned) Returns the least frequent n-gram terms in the training data, up to the specified maximum number. |
| double | lambdaFactor() Returns the interpolation ratio, or lambda factor, used for interpolation in this tokenized language model. |
| double | log2Estimate(char[] cs, int start, int end) Returns an estimate of the log (base 2) probability of the specified character slice. |
| double | log2Estimate(CharSequence cSeq) Returns an estimate of the log (base 2) probability of the specified character sequence. |
| SortedSet<ScoredObject<String[]>> | newTermSet(int nGram, int minCount, int maxReturned, LanguageModel.Tokenized backgroundLM) Returns a sorted set of scored n-grams ordered by the significance of the degree to which their counts in this model exceed their expected counts in a specified background model. |
| int | nGramOrder() Returns the order of the token n-gram model underlying this tokenized language model. |
| SortedSet<ScoredObject<String[]>> | oldTermSet(int nGram, int minCount, int maxReturned, LanguageModel.Tokenized backgroundLM) Returns a sorted set of scored n-grams in reverse order of significance with respect to the background model. |
| double | processLog2Probability(String[] tokens) Returns the log (base 2) probability of the specified tokens in the underlying token n-gram distribution. |
| TrieIntSeqCounter | sequenceCounter() Returns the integer sequence counter underlying this model. |
| SymbolTable | symbolTable() Returns the symbol table underlying this tokenized language model's token n-gram model. |
| TokenizerFactory | tokenizerFactory() Returns the tokenizer factory for this tokenized language model. |
| double | tokenLog2Probability(String[] tokens, int start, int end) Returns the log (base 2) probability of the specified token slice in the underlying token n-gram distribution. |
| double | tokenProbability(String[] tokens, int start, int end) Returns the probability of the specified token slice in the token n-gram distribution. |
| String | toString() Returns a string-based representation of the token counts for this language model. |
| void | train(char[] cs, int start, int end) Trains the token sequence model, whitespace model (if dynamic) and unknown token model (if dynamic). |
| void | train(char[] cs, int start, int end, int count) Trains the token sequence model, whitespace model (if dynamic) and unknown token model (if dynamic). |
| void | train(CharSequence cSeq) Trains the token sequence model, whitespace model (if dynamic) and unknown token model (if dynamic). |
| void | train(CharSequence cSeq, int count) Trains the token sequence model, whitespace model (if dynamic) and unknown token model (if dynamic) with the specified count of instances. |
| void | trainSequence(CharSequence cSeq, int count) Increments the count of the entire specified sequence. |
| LanguageModel.Sequence | unknownTokenLM() Returns the unknown token sequence language model for this tokenized language model. |
| LanguageModel.Sequence | whitespaceLM() Returns the whitespace language model for this tokenized language model. |
| double | z(int[] nGram, int nGramSampleCount, int totalSampleCount) Returns the z-score of the specified n-gram with the specified count out of a total sample count, as measured against the expectation of this tokenized language model. |
public static final int UNKNOWN_TOKEN
The symbol used for unknown symbol IDs, -1.

public static final int BOUNDARY_TOKEN
The symbol used for boundaries in the counter, -2.
public TokenizedLM(TokenizerFactory factory, int nGramOrder)
The unknown token and whitespace models are both uniform
sequence language models with default parameters as described
in the documentation for the constructor UniformBoundaryLM.UniformBoundaryLM(). The default
interpolation hyperparameter is equal to the n-gram order.

Warning: This constructor is probably only useful if you are using the tokenized LM merely to store n-grams. Because it uses flat constant uniform language models for smoothing tokens and whitespaces, it will provide very high entropy estimates for unseen text. The other constructors allow smoothing LMs to be supplied (which will take up more space to estimate, but will provide more reasonable estimates).

Parameters:
factory - Tokenizer factory for the model.
nGramOrder - N-gram order.
Throws:
IllegalArgumentException - If the n-gram order is less than 0.

public TokenizedLM(TokenizerFactory tokenizerFactory, int nGramOrder, LanguageModel.Sequence unknownTokenModel, LanguageModel.Sequence whitespaceModel, double lambdaFactor)
In order for this model to be serializable, the unknown token and whitespace models should themselves be serializable. If they are not, a runtime exception will be thrown when attempting to serialize this model. If these models implement LanguageModel.Dynamic, they will be trained by calls to the training methods.

Parameters:
tokenizerFactory - Tokenizer factory for the model.
nGramOrder - Length of maximum n-gram for the model.
unknownTokenModel - Sequence model for unknown tokens.
whitespaceModel - Sequence model for all whitespace.
lambdaFactor - Value of the interpolation hyperparameter.
Throws:
IllegalArgumentException - If the n-gram order is less than 1 or the interpolation hyperparameter is not a non-negative number.

public TokenizedLM(TokenizerFactory tokenizerFactory, int nGramOrder, LanguageModel.Sequence unknownTokenModel, LanguageModel.Sequence whitespaceModel, double lambdaFactor, boolean initialIncrementBoundary)
In order for this model to be serializable, the unknown token and whitespace models should themselves be serializable. If they are not, a runtime exception will be thrown when attempting to serialize this model. If these models implement LanguageModel.Dynamic, they will be trained by calls to the training methods.

Parameters:
tokenizerFactory - Tokenizer factory for the model.
nGramOrder - Length of maximum n-gram for the model.
unknownTokenModel - Sequence model for unknown tokens.
whitespaceModel - Sequence model for all whitespace.
lambdaFactor - Value of the interpolation hyperparameter.
initialIncrementBoundary - Flag indicating whether or not to increment the subsequence { BOUNDARY_TOKEN } automatically after construction to avoid NaN error states.
Throws:
IllegalArgumentException - If the n-gram order is less than 1 or the interpolation hyperparameter is not a non-negative number.

public double lambdaFactor()
public TrieIntSeqCounter sequenceCounter()
The identifiers in the counter may be converted to tokens using the symbol table returned by symbolTable(). Changes to this counter affect this tokenized language model.

public SymbolTable symbolTable()
public int nGramOrder()
public TokenizerFactory tokenizerFactory()
public LanguageModel.Sequence unknownTokenLM()
public LanguageModel.Sequence whitespaceLM()
public void compileTo(ObjectOutput objOut) throws IOException
The model read back in will be an instance of CompiledTokenizedLM.
Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which a compiled version of this model is written.
Throws:
IOException - If there is an I/O error writing the output.

public void handleNGrams(int nGramLength,
int minCount,
ObjectHandler<String[]> handler)
Parameters:
nGramLength - Length of n-grams visited.
minCount - Minimum count of a visited n-gram.
handler - Handler whose handle method is called for each visited n-gram.

public void train(CharSequence cSeq)
Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cSeq - Character sequence to train.

public void train(CharSequence cSeq, int count)
Calling train(cs,n) is equivalent to calling train(cs) a total of n times.
Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cSeq - Character sequence to train.
count - Number of instances to train.
Throws:
IllegalArgumentException - If the count is not positive.

public void train(char[] cs, int start, int end)
Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one plus last character in slice.
Throws:
IndexOutOfBoundsException - If the indices are out of range for the character array.

public void handle(CharSequence cs)
This method delegates to the train(CharSequence,int) method.
This method implements the ObjectHandler<CharSequence>
interface.
Specified by:
handle in interface ObjectHandler<CharSequence>
Parameters:
cs - Object to be handled.

public void train(char[] cs,
int start,
int end,
int count)
Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one plus last character in slice.
count - Number of instances of sequence to train.
Throws:
IndexOutOfBoundsException - If the indices are out of range for the character array.
IllegalArgumentException - If the count is negative.

public void trainSequence(CharSequence cSeq, int count)
This method may be used to train a tokenized language model from counts of individual character sequences. Because this method does not train the token smoothing models, a pure token model may be constructed this way; to train token smoothing with character subsequences, call train(CharSequence,int) on character sequences corresponding to unigrams instead of this method.
For instance, with com.aliasi.tokenizer.IndoEuropeanTokenizerFactory, calling trainSequence("the fast computer",5) would extract three tokens, the, fast and computer, and would increment the count of the three-token sequence by 5, but not the counts of any of its subsequences.
If the number of tokens is longer than the maximum n-gram
length, only the final tokens are trained. For instance, with
an n-gram length of 2, and the Indo-European tokenizer factory,
calling trainSequence("a slightly faster computer",93) is equivalent to calling trainSequence("faster computer",93).
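The truncation to the final tokens can be sketched as follows (plain Java; an illustration of the described behavior, not the library's internal code):

```java
import java.util.Arrays;

public class Truncation {
    // Keeps only the final maxNGram tokens, mirroring how sequences
    // longer than the maximum n-gram length are trained.
    public static String[] finalTokens(String[] tokens, int maxNGram) {
        if (tokens.length <= maxNGram) return tokens;
        return Arrays.copyOfRange(tokens, tokens.length - maxNGram, tokens.length);
    }
}
```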
All tokens trained are added to the symbol table. This does not include any initial tokens that are not used because the maximum n-gram length is too short.
Parameters:
cSeq - Character sequence to train.
count - Number of instances to train.
Throws:
IllegalArgumentException - If the count is negative.

public double log2Estimate(CharSequence cSeq)
Specified by:
log2Estimate in interface LanguageModel
Parameters:
cSeq - Character sequence to estimate.

public double log2Estimate(char[] cs,
int start,
int end)
Specified by:
log2Estimate in interface LanguageModel
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus index of last character in slice.

public double tokenProbability(String[] tokens, int start, int end)
Specified by:
tokenProbability in interface LanguageModel.Tokenized
Parameters:
tokens - Underlying array of tokens.
start - Index of first token in slice.
end - Index of one past the last token in the slice.

public double tokenLog2Probability(String[] tokens, int start, int end)
Specified by:
tokenLog2Probability in interface LanguageModel.Tokenized
Parameters:
tokens - Underlying array of tokens.
start - Index of first token in slice.
end - Index of one past the last token in the slice.

public double processLog2Probability(String[] tokens)
Parameters:
tokens - Tokens whose probability is returned.

public SortedSet<ScoredObject<String[]>> collocationSet(int nGram, int minCount, int maxReturned)
Returns a sorted set of collocations, each represented as a String[] containing tokens. The length of n-gram, minimum count for a result, and maximum number of results returned are all specified. The confidence ordering is based on the result of Pearson's χ² independence statistic as computed by chiSquaredIndependence(int[]).

Parameters:
nGram - Length of n-grams to search for collocations.
minCount - Minimum count for a returned n-gram.
maxReturned - Maximum number of results returned.

public SortedSet<ScoredObject<String[]>> newTermSet(int nGram, int minCount, int maxReturned, LanguageModel.Tokenized backgroundLM)
The result is a sorted set of ScoredObject instances whose objects are terms represented as string arrays and whose scores are the collocation scores for the terms. For instance, the new terms may be printed in order of significance by:

SortedSet<ScoredObject<String[]>> terms = lm.newTermSet(3,5,100,bgLM);
for (ScoredObject<String[]> term : terms) {
    String[] tokens = term.getObject();
    double score = term.score();
    ...
}
The exact scoring used is the z-score as defined in BinomialDistribution.z(double,int,int) with the success
probability defined by the n-gram's probability estimate in the
background model, the number of successes being the count of
the n-gram in this model and the number of trials being the
total count in this model.
See oldTermSet(int,int,int,LanguageModel.Tokenized)
for a method that returns the least significant terms in
this model relative to a background model.
Parameters:
nGram - Length of n-grams to search for significant new terms.
minCount - Minimum count for a returned n-gram.
maxReturned - Maximum number of results returned.
backgroundLM - Background language model against which significance is measured.

public SortedSet<ScoredObject<String[]>> oldTermSet(int nGram, int minCount, int maxReturned, LanguageModel.Tokenized backgroundLM)
Note that only terms that exist in the foreground model are
considered. By contrast, reversing the roles of the models in
the sister method newTermSet(int,int,int,LanguageModel.Tokenized) considers
every n-gram in the background model and may return slightly
different results.
Parameters:
nGram - Length of n-grams to search for significant old terms.
minCount - Minimum count in background model for a returned n-gram.
maxReturned - Maximum number of results returned.
backgroundLM - Background language model from which counts are derived.

public SortedSet<ScoredObject<String[]>> frequentTermSet(int nGram, int maxReturned)
See infrequentTermSet(int,int) to retrieve the most
infrequent terms.
Parameters:
nGram - Length of n-grams to search.
maxReturned - Maximum number of results returned.

public SortedSet<ScoredObject<String[]>> infrequentTermSet(int nGram, int maxReturned)
See frequentTermSet(int,int) to retrieve the most
frequent terms.
Parameters:
nGram - Length of n-grams to search.
maxReturned - Maximum number of results returned.

public double chiSquaredIndependence(int[] nGram)
The input n-gram is split into two halves,
Term1 and
Term2, each of which is a
non-empty sequence of integers.
Term1 consists of the tokens
indexed 0 to mid-1 and
Term2 from mid
to end-1.
The contingency matrix for computing the independence statistic is:

|  | +Term2 | -Term2 |
|---|---|---|
| +Term1 | Term(+,+) | Term(+,-) |
| -Term1 | Term(-,+) | Term(-,-) |

where the values for a specified integer sequence nGram and midpoint 0 < mid < end are:
Term(+,+) = count(nGram,0,end)
Term(+,-) = count(nGram,0,mid) - count(nGram,0,end)
Term(-,+) = count(nGram,mid,end) - count(nGram,0,end)
Term(-,-) = totalCount - Term(+,+) - Term(+,-) - Term(-,+)
Note that using the overall total count provides a slight
overapproximation of the count of appropriate-length n-grams.
For further information on the independence test, see the
documentation for Statistics.chiSquaredIndependence(double,double,double,double).
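Given the four contingency counts defined above, Pearson's χ² statistic for a 2×2 table can be sketched as follows (a standalone plain-Java illustration of the standard 2×2 formula, not the library's internal implementation):

```java
public class ChiSquared {
    // Pearson's chi-squared independence statistic for a 2x2 contingency
    // table with cells both = Term(+,+), oneOnly = Term(+,-),
    // twoOnly = Term(-,+), neither = Term(-,-), using the shortcut
    // chi2 = n * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)).
    public static double chiSquared(double both, double oneOnly,
                                    double twoOnly, double neither) {
        double n = both + oneOnly + twoOnly + neither;
        double diff = both * neither - oneOnly * twoOnly;
        return n * diff * diff
            / ((both + oneOnly) * (twoOnly + neither)
               * (both + twoOnly) * (oneOnly + neither));
    }
}
```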
Parameters:
nGram - Array of integers whose independence statistic is returned.
Throws:
IllegalArgumentException - If the specified n-gram is not at least two elements long.

public double z(int[] nGram,
int nGramSampleCount,
int totalSampleCount)
Formulas for z-scores and an explanation of their scaling by
deviation is described in the documentation for the static
method BinomialDistribution.z(double,int,int).
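Assuming BinomialDistribution.z computes the usual normal approximation to the binomial, the z-score can be sketched as follows (plain Java; a standalone illustration, not the library's code):

```java
public class ZScore {
    // z = (successes - trials*p) / sqrt(trials * p * (1-p)):
    // the number of standard deviations by which the observed count
    // exceeds its expectation under success probability p.
    public static double z(double p, int numSuccesses, int numTrials) {
        double expected = numTrials * p;
        double stdDev = Math.sqrt(numTrials * p * (1.0 - p));
        return (numSuccesses - expected) / stdDev;
    }
}
```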
Parameters:
nGram - The n-gram to test.
nGramSampleCount - The number of observations of the n-gram in the sample.
totalSampleCount - The total number of samples.

Copyright © 2016 Alias-i, Inc. All rights reserved.