public class CharLmRescoringChunker extends AbstractCharLmRescoringChunker<CharLmHmmChunker,NGramProcessLM,NGramBoundaryLM> implements ObjectHandler<Chunking>, Compilable
CharLmRescoringChunker provides a long-distance
character language model-based chunker that operates by rescoring
the output of a contained character language model HMM chunker.
The underlying chunker is an instance of CharLmHmmChunker,
configured with the specified tokenizer factory, n-gram length,
number of characters and interpolation ratio provided in the
constructor. The underlying chunker may be configured after
retrieving it through the superclass's RescoringChunker.baseChunker()
method. The typical use of this is to configure caching.
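For concreteness, here is a minimal sketch of construction and cache configuration. The parameter values are illustrative only, and the getDecoder() and setEmissionLog2Cache(Map) calls assume the LingPipe 4.x HmmChunker and HmmDecoder APIs:

```java
import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.chunk.CharLmRescoringChunker;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.FastCache;

TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
int numChunkingsRescored = 64;     // illustrative rescoring beam size
int nGram = 8;                     // n-gram length for all character LMs
int numChars = 256;                // size of the expected character set
double interpolationRatio = nGram; // common heuristic, not a requirement

CharLmRescoringChunker chunker
    = new CharLmRescoringChunker(tokenizerFactory, numChunkingsRescored,
                                 nGram, numChars, interpolationRatio);

// Retrieve the underlying chunker to configure it, e.g. for caching.
CharLmHmmChunker baseChunker = chunker.baseChunker();
baseChunker.getDecoder()
           .setEmissionLog2Cache(new FastCache<String,double[]>(100000));
```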
The rescoring model used by this chunker is based on a bounded
character language model per chunk type with an additional
process character language model for text not in chunks. The
remaining details are described in the class documentation for
the superclass AbstractCharLmRescoringChunker.
This chunker is trained in the usual way through calls to the
appropriate handle() method. The method handle(Chunking) implements the ObjectHandler<Chunking> interface
and allows for training through chunking examples.
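For example, a single training case might be supplied as follows; the text and entity types are hypothetical, and ChunkingImpl and ChunkFactory are LingPipe's standard chunking implementations:

```java
import com.aliasi.chunk.ChunkFactory;
import com.aliasi.chunk.ChunkingImpl;

String text = "John Smith lives in Washington.";
ChunkingImpl chunking = new ChunkingImpl(text);
chunking.add(ChunkFactory.createChunk(0, 10, "PERSON"));    // "John Smith"
chunking.add(ChunkFactory.createChunk(20, 30, "LOCATION")); // "Washington"
chunker.handle(chunking); // chunker as constructed above
```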
A model is compiled by calling the Compilable interface method compileTo(ObjectOutput).
The compiled model is an instance of AbstractCharLmRescoringChunker,
and its component models may be recovered as follows.
The underlying chunker is recoverable as a character language
model HMM chunker through RescoringChunker.baseChunker(). The
non-chunk process n-gram character language model is returned by
AbstractCharLmRescoringChunker.outLM(), whereas the chunk models are returned
by AbstractCharLmRescoringChunker.chunkLM(String).
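A sketch of the compile-and-deserialize round trip, written directly against the documented compileTo(ObjectOutput) and ObjectInput.readObject() calls; an in-memory byte array stands in for a model file, and the checked exceptions are elided:

```java
import com.aliasi.chunk.Chunker;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
ObjectOutputStream objOut = new ObjectOutputStream(bytesOut);
chunker.compileTo(objOut); // trained chunker from above
objOut.close();

// Read the compiled model back in; in a real program this would
// declare IOException and ClassNotFoundException.
ObjectInputStream objIn
    = new ObjectInputStream(new ByteArrayInputStream(bytesOut.toByteArray()));
Chunker compiledChunker = (Chunker) objIn.readObject();
objIn.close();
```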
The components of a character LM rescoring chunker are accessible
in their training format through methods on this class, as described
above. The compiled models are instances of RescoringChunker,
which allow their underlying chunker to be retrieved through
RescoringChunker.baseChunker() and then configured. The other
run-time models may be retrieved through the superclass's methods
AbstractCharLmRescoringChunker.outLM() and
AbstractCharLmRescoringChunker.chunkLM(String).
The tag BOS is reserved for use by the system
for encoding document start/end positions. See HmmChunker
for more information.
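Once compiled (or directly on the trained chunker), chunks may be extracted through the inherited chunk method. A brief sketch with hypothetical input text:

```java
import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;

Chunking result = compiledChunker.chunk("Jane Doe visited Washington.");
for (Chunk chunk : result.chunkSet()) {
    String span = result.charSequence()
        .subSequence(chunk.start(), chunk.end()).toString();
    System.out.println(chunk.type() + ": " + span);
}
```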
| Constructor and Description |
|---|
| CharLmRescoringChunker(TokenizerFactory tokenizerFactory, int numChunkingsRescored, int nGram, int numChars, double interpolationRatio) Construct a character language model rescoring chunker based on the specified components. |
| CharLmRescoringChunker(TokenizerFactory tokenizerFactory, int numChunkingsRescored, int nGram, int numChars, double interpolationRatio, boolean smoothTags) Construct a character language model rescoring chunker based on the specified components. |
| Modifier and Type | Method and Description |
|---|---|
| void | compileTo(ObjectOutput objOut) Compiles this model to the specified object output stream. |
| void | handle(Chunking chunking) Trains this chunker with the specified chunking. |
| void | trainDictionary(CharSequence cSeq, String type) Provides the specified character sequence data as training data for the language model of the specified type. |
| void | trainOut(CharSequence cSeq) Trains the language model for non-entities using the specified character sequence. |
Methods inherited from class AbstractCharLmRescoringChunker: chunkLM, outLM, rescore, typeToChar

Methods inherited from class RescoringChunker: baseChunker, chunk, chunk, nBest, nBestChunks, numChunkingsRescored, setNumChunkingsRescored

public CharLmRescoringChunker(TokenizerFactory tokenizerFactory, int numChunkingsRescored, int nGram, int numChars, double interpolationRatio)
Construct a character language model rescoring chunker based on the specified components. See also CharLmRescoringChunker(TokenizerFactory,int,int,int,double,boolean).

Parameters:
tokenizerFactory - Tokenizer factory for boundaries.
numChunkingsRescored - Number of underlying chunkings rescored.
nGram - N-gram length for all models.
numChars - Number of characters in the training and run-time character sets.
interpolationRatio - Underlying language-model interpolation ratio.

public CharLmRescoringChunker(TokenizerFactory tokenizerFactory, int numChunkingsRescored, int nGram, int numChars, double interpolationRatio, boolean smoothTags)
Whether tags are smoothed in the underlying model is determined
by the flag in the constructor. See CharLmHmmChunker's
class documentation for more information on the effects of
smoothing.
Parameters:
tokenizerFactory - Tokenizer factory for boundaries.
numChunkingsRescored - Number of underlying chunkings rescored.
nGram - N-gram length for all models.
numChars - Number of characters in the training and run-time character sets.
interpolationRatio - Underlying language-model interpolation ratio.
smoothTags - Set to true to smooth tags in the underlying chunker.

public void handle(Chunking chunking)
Trains this chunker with the specified chunking.

Specified by: handle in interface ObjectHandler<Chunking>

Parameters:
chunking - Training data.

public void compileTo(ObjectOutput objOut) throws IOException
Compiles this model to the specified object output stream. The model may then be read back in through ObjectInput.readObject(); the resulting object will be an instance of AbstractCharLmRescoringChunker.

Specified by: compileTo in interface Compilable

Parameters:
objOut - Object output to which this object is compiled.

Throws:
IOException - If there is an I/O error during the write.
IllegalArgumentException - If the tokenizer factory supplied to the constructor of this class is not compilable.

public void trainDictionary(CharSequence cSeq, String type)

Provides the specified character sequence data as training data for the language model of the specified type.
Warning: It is not sufficient to train a model using this method alone. Annotated data with a representative balance of entities and non-entity text is required to train the overall likelihood of entities and the contexts in which they occur. Use of this method will not bias the likelihoods of entities occurring, but it might cause the common entities in the training data to be overwhelmed if a large dictionary is used. One possibility is to train the basic data multiple times relative to the dictionary (or vice versa).
Parameters:
cSeq - Character sequence for training.
type - Type of character sequence.
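As a sketch of the balance described in the warning above; the entity strings, the corpus variable, and the repetition factor are all hypothetical:

```java
// Seed the per-type chunk models with dictionary entries.
String[] personNames = { "John Smith", "Jane Doe" }; // hypothetical
for (String name : personNames)
    chunker.trainDictionary(name, "PERSON");

// Counterbalance by training the annotated corpus several times
// so the dictionary entries do not overwhelm the corpus statistics.
for (int i = 0; i < 3; ++i)
    for (Chunking annotated : annotatedChunkings) // hypothetical corpus
        chunker.handle(annotated);
```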
public void trainOut(CharSequence cSeq)

Trains the language model for non-entities using the specified character sequence.

Warning: Training using this method biases the likelihood of entities downward, because it does not train the likelihood of a non-entity character sequence ending and being followed by an entity of a specified type. Thus this method is best used to seed a dictionary of common words that are few in number relative to the entity-annotated training data.
Parameters:
cSeq - Data to train the non-entity (out) model.

Copyright © 2019 Alias-i, Inc. All rights reserved.