L - the type of dynamic language model for this classifier
public class DynamicLMClassifier<L extends LanguageModel.Dynamic> extends LMClassifier<L,MultivariateEstimator> implements ObjectHandler<Classified<CharSequence>>, Compilable
A DynamicLMClassifier is a language model classifier
that accepts training events of categorized character sequences.
Training is based on a multivariate estimator for the category
distribution and dynamic language models for the per-category
character sequence estimators. These models also form the basis of
the superclass's implementation of classification.
Because this class implements training and classification, it may be used in tag-a-little, learn-a-little supervised learning without retraining epochs. This makes it ideal for active learning applications, for instance.
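The interleaving of training and classification described above can be illustrated with a toy stand-in. This is a minimal sketch, not LingPipe code: the class `ToyDynamicClassifier`, its add-one smoothed character unigram model, and its method signatures are all assumptions made for illustration. It only shows that counts updated in place take effect on the next classification, with no retraining pass.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a dynamic classifier (hypothetical; not LingPipe code).
// Each category keeps add-one smoothed character unigram counts.
final class ToyDynamicClassifier {
    // Number of distinct Java char values, used as the smoothing alphabet size.
    private static final int ALPHABET_SIZE = Character.MAX_VALUE + 1;

    private final String[] categories;
    private final Map<String, Map<Character, Integer>> charCounts = new HashMap<>();
    private final Map<String, Integer> totalChars = new HashMap<>();
    private final Map<String, Integer> categoryCounts = new HashMap<>();

    ToyDynamicClassifier(String... categories) {
        this.categories = categories.clone();
        for (String category : categories) {
            charCounts.put(category, new HashMap<>());
            totalChars.put(category, 0);
            categoryCounts.put(category, 1); // one initial count per category
        }
    }

    // Training updates counts in place, so no separate retraining pass is needed.
    void train(String category, CharSequence text) {
        Map<Character, Integer> counts = charCounts.get(category);
        if (counts == null) {
            throw new IllegalArgumentException("Unknown category: " + category);
        }
        for (int i = 0; i < text.length(); ++i) {
            counts.merge(text.charAt(i), 1, Integer::sum);
        }
        totalChars.merge(category, text.length(), Integer::sum);
        categoryCounts.merge(category, 1, Integer::sum);
    }

    // Returns the category with the highest joint log score:
    // log count(category) + sum of smoothed log character probabilities.
    String classify(CharSequence text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : categories) {
            Map<Character, Integer> counts = charCounts.get(category);
            double denominator = totalChars.get(category) + (double) ALPHABET_SIZE;
            // The prior's normalizer is the same for every category, so it is omitted.
            double score = Math.log(categoryCounts.get(category));
            for (int i = 0; i < text.length(); ++i) {
                int n = counts.getOrDefault(text.charAt(i), 0);
                score += Math.log((n + 1) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }
}
```

In an active learning loop, a caller would alternate `classify` on unlabeled text with `train` on whichever items a human then labels.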
At any point after adding training events, the classifier may be
compiled to an object output. The classifier read back in will be
a non-dynamic instance of LMClassifier. It will be based
on the compiled version of the multivariate estimator and the
compiled version of the dynamic language models for the categories.
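The compile-then-read-back pattern described above can be sketched with standard Java serialization. This is an illustration under stated assumptions, not LingPipe's implementation: `DynamicCounts`, `FrozenCounts`, and `roundTrip` are hypothetical names, and the only point shown is that a trainable object can write a different, non-trainable object to an `ObjectOutput`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical read-only snapshot: it has no train method.
final class FrozenCounts implements Serializable {
    private static final long serialVersionUID = 1L;
    private final HashMap<String, Integer> counts;

    FrozenCounts(Map<String, Integer> counts) {
        this.counts = new HashMap<>(counts); // defensive copy
    }

    int count(String key) {
        return counts.getOrDefault(key, 0);
    }
}

// Hypothetical trainable model that compiles to a FrozenCounts.
final class DynamicCounts {
    private final Map<String, Integer> counts = new HashMap<>();

    void train(String key, int n) {
        counts.merge(key, n, Integer::sum);
    }

    // Writes a frozen snapshot, not this dynamic object, to the output.
    void compileTo(ObjectOutput out) throws IOException {
        out.writeObject(new FrozenCounts(counts));
    }

    // Compiles to an in-memory buffer and reads the snapshot back.
    static FrozenCounts roundTrip(DynamicCounts model) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                model.compileTo(out);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (FrozenCounts) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Round trip failed", e);
        }
    }
}
```

Training the dynamic object after compiling does not change a snapshot that was already written, which matches the idea that compilation can happen at any point after adding training events.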
Instances of this class allow concurrent read operations but
require writes to run exclusively. Reads in this context are
either calculating estimates or compiling; writes are training.
Extensions to LingPipe's classes may impose tighter restrictions.
For instance, a subclass of MultivariateEstimator
might be used that does not allow concurrent estimates; in that
case, its restrictions are passed on to this classifier. The same
goes for the language models and, in the case of token language
models, the tokenizer factories.
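A caller can enforce the concurrent-read, exclusive-write policy described above with a standard read-write lock. The following is a minimal sketch, assuming a hypothetical `GuardedCounts` class that stands in for any trainable model; it is not part of LingPipe, and the class itself does not do this locking for the caller.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical wrapper (not a LingPipe class): many concurrent readers,
// one exclusive writer.
final class GuardedCounts {
    private final Map<String, Integer> counts = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Write operation (analogous to training): takes the exclusive lock.
    void train(String category, int count) {
        lock.writeLock().lock();
        try {
            counts.merge(category, count, Integer::sum);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Read operation (analogous to estimating or compiling): takes the shared lock.
    int count(String category) {
        lock.readLock().lock();
        try {
            return counts.getOrDefault(category, 0);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The same pattern would apply to a real classifier: wrap training calls in the write lock and classification or compilation calls in the read lock.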
LMClassifier<LanguageModel,MultivariateDistribution>. The actual
language model will be the compiled version of the language model
in the classifier that was compiled, which varies by the type of
dynamic language model created. For instance, the dynamic LM
classifiers produced by the factory methods createNGramBoundary(),
createNGramProcess() and createTokenized() deserialize
with language models that are instances of
LanguageModel.Sequence, LanguageModel.Process and
LanguageModel.Tokenized, respectively.

| Constructor | Description |
|---|---|
| DynamicLMClassifier(String[] categories, L[] languageModels) | Construct a dynamic language model classifier over the specified categories with specified language models per category and an overall category estimator. |

| Modifier and Type | Method | Description |
|---|---|---|
| void | compileTo(ObjectOutput objOut) | Writes a compiled version of this classifier to the specified object output. |
| static DynamicLMClassifier<NGramBoundaryLM> | createNGramBoundary(String[] categories, int maxCharNGram) | Construct a dynamic classifier over the specified categories, using boundary character n-gram models of the specified order. |
| static DynamicLMClassifier<NGramProcessLM> | createNGramProcess(String[] categories, int maxCharNGram) | Construct a dynamic classifier over the specified categories, using process character n-gram models of the specified order. |
| static DynamicLMClassifier<TokenizedLM> | createTokenized(String[] categories, TokenizerFactory tokenizerFactory, int maxTokenNGram) | Construct a dynamic language model classifier over the specified categories using token n-gram language models of the specified order and the specified tokenizer factory for tokenization. |
| void | handle(Classified<CharSequence> classified) | Provides a training instance for the specified character sequence using the best category from the specified classification. |
| void | resetCategory(String category, L lm, int newCount) | Resets the specified category to the specified language model. |
| void | train(String category, CharSequence sampleCSeq, int count) | Provide a training instance for the specified category consisting of the specified sample character sequence with the specified count. |

Methods inherited from LMClassifier: categories, categoryDistribution, classify, classifyJoint, languageModel

public DynamicLMClassifier(String[] categories, L[] languageModels)
The multivariate estimator over categories is initialized
with one count for each category. Technically, initializing
counts involves a uniform Dirichlet prior with
α=1, which is often called Laplace
smoothing.
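As a worked illustration of this prior, the sketch below computes add-one category estimates, P(c) = (count(c) + 1) / (N + K), for K categories and N training instances. The class `AddOneCategoryEstimator` is hypothetical and is not LingPipe's `MultivariateEstimator`; it only mirrors the documented behavior of starting every category at one count. The argument checks follow the exceptions documented for the constructor and for `train`; checking the category before the count is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not LingPipe's MultivariateEstimator.
final class AddOneCategoryEstimator {
    private final Map<String, Long> counts = new LinkedHashMap<>();
    private long total = 0L;

    AddOneCategoryEstimator(String[] categories) {
        if (categories.length < 2) {
            throw new IllegalArgumentException("Need at least two categories.");
        }
        for (String category : categories) {
            // Each category starts with one count (uniform Dirichlet prior, alpha = 1).
            if (counts.put(category, 1L) != null) {
                throw new IllegalArgumentException("Duplicate category: " + category);
            }
            ++total;
        }
    }

    // Adds count training instances for the category; zero counts are ignored.
    void train(String category, int count) {
        if (!counts.containsKey(category)) {
            throw new IllegalArgumentException("Unknown category: " + category);
        }
        if (count < 0) {
            throw new IllegalArgumentException("Count must not be negative: " + count);
        }
        if (count == 0) {
            return;
        }
        counts.merge(category, (long) count, Long::sum);
        total += count;
    }

    // Smoothed estimate: (training count + 1) / (N + K).
    double probability(String category) {
        Long count = counts.get(category);
        if (count == null) {
            throw new IllegalArgumentException("Unknown category: " + category);
        }
        return count / (double) total;
    }
}
```

With two categories and no training, each estimate is 1/2; after two training instances for the first category, the estimates become 3/4 and 1/4.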
categories - Categories used for classification.
languageModels - Dynamic language models for categories.
IllegalArgumentException - If there are not at least two categories, or if the length of the category and language model arrays is not the same, or if there are duplicate categories.

public void train(String category, CharSequence sampleCSeq, int count)
train(String,char[],int,int).
Counts of zero are ignored, whereas counts less than zero raise an exception.
category - Category of this training sequence.
sampleCSeq - Character sequence for training.
count - Number of training instances.
IllegalArgumentException - If the category is not known or if the count is negative.

public void handle(Classified<CharSequence> classified)
handle in interface ObjectHandler<Classified<CharSequence>>
classified - Classified character sequence to treat as training data.

public void compileTo(ObjectOutput objOut) throws IOException
LMClassifier.compileTo in interface Compilable
objOut - Object output to which this classifier is written.
IOException - If there is an I/O exception writing to the output stream.

public void resetCategory(String category, L lm, int newCount)
category - Category to reset.
lm - New dynamic language model for category.
newCount - New count for category.
IllegalArgumentException - If the category is not known.

public static DynamicLMClassifier<NGramProcessLM> createNGramProcess(String[] categories, int maxCharNGram)
See the documentation for the constructor DynamicLMClassifier(String[], LanguageModel.Dynamic[]) for
information on the category multivariate estimate for priors.
categories - Categories used for classification.
maxCharNGram - Maximum length of character sequence counted in model.
IllegalArgumentException - If there are not at least two categories or if there are duplicate categories.

public static DynamicLMClassifier<NGramBoundaryLM> createNGramBoundary(String[] categories, int maxCharNGram)
See the documentation for the constructor DynamicLMClassifier(String[], LanguageModel.Dynamic[]) for
information on the category multivariate estimate for priors.
categories - Categories used for classification.
maxCharNGram - Maximum length of character sequence counted in model.
IllegalArgumentException - If there are not at least two categories or if there are duplicate categories.

public static DynamicLMClassifier<TokenizedLM> createTokenized(String[] categories, TokenizerFactory tokenizerFactory, int maxTokenNGram)
The multivariate estimator over categories is initialized with one count for each category.
The unknown token and whitespace models are uniform sequence models.
categories - Categories used for classification.
maxTokenNGram - Maximum length of token n-grams used.
tokenizerFactory - Tokenizer factory for tokenization.
IllegalArgumentException - If there are not at least two categories or if there are duplicate categories.

Copyright © 2016 Alias-i, Inc. All rights reserved.