public class CharLmHmmChunker extends HmmChunker implements ObjectHandler<Chunking>, Compilable
CharLmHmmChunker employs a hidden Markov model
estimator and a tokenizer factory to learn a chunker. The
estimator is an instance of AbstractHmmEstimator,
which carries out the underlying HMM estimation. The tokenizer
factory is used to break chunks down into sequences of tokens and tags.
This class implements ObjectHandler<Chunking>, which
may be used to supply training instances. Every training event is
used to train the underlying HMM. Training instances are supplied
through the chunk handler in the usual way.
Training instances for the tag handler
require the standard BIO tagging scheme in which the first token in
a chunk of type X is tagged
B-X ("begin"), with all subsequent
tokens in the same chunk tagged I-X
("in"). All tokens not in chunks are tagged
O. For example, the tags required for training are:

Yesterday O
afternoon O
, O
John B-PER
J I-PER
. I-PER
Smith I-PER
traveled O
to O
Washington O
. O

This is the same tagging scheme supplied in several corpora (Penn BioIE, CoNLL, etc.). Note that this is not the same tag scheme used for the underlying HMM. The simpler tag scheme shown above is first converted to the more fine-grained tag scheme described in the class documentation for HmmChunker.
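The mapping from the simple BIO scheme to a begin/middle/end/whole tag scheme can be sketched in plain Java. This is an illustration of the mapping only, not LingPipe's actual implementation, which also subdivides the O tags by context:

```java
import java.util.ArrayList;
import java.util.List;

public class BioToFineTags {
    // Convert standard BIO tags to a richer B_/M_/E_/W_ scheme:
    // single-token chunks become W_X; multi-token chunks get
    // B_X ... M_X ... E_X. O tags are left as-is in this sketch.
    static List<String> convert(List<String> bioTags) {
        List<String> fine = new ArrayList<>();
        for (int i = 0; i < bioTags.size(); ++i) {
            String tag = bioTags.get(i);
            if (tag.equals("O")) {
                fine.add("O");
                continue;
            }
            String type = tag.substring(2);
            // Does the chunk continue with another token of this type?
            boolean continues = i + 1 < bioTags.size()
                && bioTags.get(i + 1).equals("I-" + type);
            if (tag.startsWith("B-"))
                fine.add(continues ? "B_" + type : "W_" + type);
            else  // I- tag
                fine.add(continues ? "M_" + type : "E_" + type);
        }
        return fine;
    }

    public static void main(String[] args) {
        List<String> bio = List.of("O", "B-PER", "I-PER", "I-PER", "O");
        System.out.println(convert(bio));
        // [O, B_PER, M_PER, E_PER, O]
    }
}
```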
Dictionary-based training is supported through the method
trainDictionary(CharSequence cSeq, String type).
Calling this method trains the emission probabilities for
the relevant tags determined by tokenizing the specified character
sequence (after conversion to the underlying tag scheme defined
in HmmChunker).
Warning: It is not enough to train with a dictionary alone. Dictionaries do not train the contexts in which elements show up. Ordinary training data must also be supplied, and this data must contain some elements that are not part of chunks in order to train the out tags. If only a dictionary is used for training, null pointer exceptions will show up at run time.
For example, calling
charLmHmmChunker.trainDictionary("Washington", "LOCATION");
would provide the token "Washington" as a training case
for emission from the tag W_LOCATION; the 'W_'
prefix is used because trainDictionary uses the richer tag
set of HmmChunker. Alternatively, calling:
charLmHmmChunker.trainDictionary("John J. Smith", "PERSON");
would train the tag B_PERSON
with the token "John", the tag M_PERSON
with the tokens "J" and ".",
and the tag E_PERSON with the
token "Smith". Furthermore, in this case, the transition
probabilities receive training instances for the three
transitions: B_PERSON to M_PERSON,
M_PERSON to M_PERSON, and finally,
M_PERSON to E_PERSON.
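The tag and transition assignments described above can be sketched in plain Java. The helper names tagsFor and transitions are hypothetical, and the entry is assumed to be pre-tokenized (the real trainDictionary tokenizes the character sequence itself):

```java
import java.util.ArrayList;
import java.util.List;

public class DictionaryTagger {
    // Assign the underlying B_/M_/E_/W_ tags to an already-tokenized
    // dictionary entry of the given chunk type.
    static List<String> tagsFor(List<String> tokens, String type) {
        List<String> tags = new ArrayList<>();
        int n = tokens.size();
        for (int i = 0; i < n; ++i) {
            if (n == 1) tags.add("W_" + type);          // single-token chunk
            else if (i == 0) tags.add("B_" + type);     // first token
            else if (i == n - 1) tags.add("E_" + type); // last token
            else tags.add("M_" + type);                 // interior token
        }
        return tags;
    }

    // Adjacent tag pairs that receive transition training instances.
    static List<String> transitions(List<String> tags) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < tags.size(); ++i)
            pairs.add(tags.get(i) + "->" + tags.get(i + 1));
        return pairs;
    }

    public static void main(String[] args) {
        List<String> tags = tagsFor(List.of("John", "J", ".", "Smith"), "PERSON");
        System.out.println(tags);
        // [B_PERSON, M_PERSON, M_PERSON, E_PERSON]
        System.out.println(transitions(tags));
        // [B_PERSON->M_PERSON, M_PERSON->M_PERSON, M_PERSON->E_PERSON]
    }
}
```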
Note that there is no method to train non-chunk tokens, because the categories assigned to them are context-specific, being determined by the surrounding tokens. An effective way to train out categories in general is to supply them as part of entire sentences that have no chunks in them. Note that this only trains the begin-sentence, end-sentence and internal tags for non-chunked tokens.
To be useful, the dictionary entries must match the chunks that
should be found. For instance, in the MUC training data, there are
many instances of USAir, the name of a United States
airline. It might be thought that stock listings would help the
extraction of company names, but in fact, the company is
"officially" known as USAirways Group.
It is also important that training with dictionaries not be done with huge, diffuse dictionaries that wind up smoothing the language models too much. For example, training just locations with a two-million-entry gazetteer, once per entry, will leave obscure locations with estimates close to those of New York or Beijing.
The constructor CharLmHmmChunker(TokenizerFactory,AbstractHmmEstimator,boolean)
accepts a flag that determines whether to smooth tag transition
probabilities. If the flag is set to true in the
constructor, every time a new symbol is seen in the training data,
all of its relevant underlying tags are added to the symbol table
and all legal transitions among them and all other tags are
incremented by one.
If smoothing is turned off, only tag-tag transitions seen in the training data are allowed.
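The add-one transition smoothing described above can be sketched as follows. This is a simplified illustration: the real implementation increments only legal transitions among the underlying tags, whereas this sketch treats every pair of tags as legal:

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class TagTransitionSmoother {
    private final Set<String> tags = new LinkedHashSet<>();
    private final Map<String, Integer> counts = new HashMap<>();

    // When a new tag is first seen, add one count to every transition
    // between it and all tags seen so far (including itself).
    void addTag(String tag) {
        if (!tags.add(tag)) return;  // tag already in the symbol table
        for (String other : tags) {
            bump(tag, other);
            if (!other.equals(tag)) bump(other, tag);
        }
    }

    private void bump(String from, String to) {
        counts.merge(from + "->" + to, 1, Integer::sum);
    }

    int count(String from, String to) {
        return counts.getOrDefault(from + "->" + to, 0);
    }
}
```

With smoothing in place, every transition between known tags has a count of at least one, so no transition seen at decoding time has zero probability.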
The begin-sentence and end-sentence tags are automatically added
in the constructor, so that if no training data is provided, a
chunking with no chunks is returned. This smoothing may not be
turned off. Thus there will always be a non-zero probability in
the underlying HMM of starting with tag BB_O_BOS or
WW_O_BOS, and of ending with the tag EE_O_BOS
or WW_O_BOS. There will also always be a non-zero
probability of transitioning from
BB_O_BOS to MM_O and
to EE_O_BOS, and of transitioning from MM_O to
MM_O and EE_O_BOS.
This class implements the Compilable interface. To
compile a static model from the current state of training, call the
method compileTo(ObjectOutput). The result of reading an
object from the corresponding object input stream will produce a
compiled HMM chunker of class HmmChunker, with the same
estimates as the current state of the chunker being compiled.
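The compile-and-read-back round trip follows the standard Java serialization pattern. The sketch below uses a plain Serializable stand-in for the compiled model, since compiling a real chunker requires a trained LingPipe model; the real calls would be charLmHmmChunker.compileTo(objOut) on the write side and a cast to HmmChunker on the read side:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CompileRoundTrip {
    // Stand-in for the compiled chunker written by compileTo(ObjectOutput).
    static class CompiledModel implements Serializable {
        final String name;
        CompiledModel(String name) { this.name = name; }
    }

    static CompiledModel roundTrip(CompiledModel model) {
        try {
            // Write side: compileTo(objOut) would serialize the model here.
            ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
            try (ObjectOutputStream objOut = new ObjectOutputStream(bytesOut)) {
                objOut.writeObject(model);
            }
            // Read side: readObject() yields the compiled model.
            try (ObjectInputStream objIn = new ObjectInputStream(
                     new ByteArrayInputStream(bytesOut.toByteArray()))) {
                return (CompiledModel) objIn.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new CompiledModel("hmm-chunker")).name);
        // hmm-chunker
    }
}
```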
Caching is turned off on the HMM decoder for this class by default.
If caching is turned on for instances of this class (through the
method HmmChunker.getDecoder() inherited from
HmmChunker), then training instances will fail to be
reflected in cached estimates and the results may be inconsistent
and may lead to exceptions. Caching may be turned on as long as
there are no more training instances, but in this case, it is
almost always more efficient to just compile the model and turn
caching on for that.
After compilation, the returned chunker will have caching turned off by default. To turn on caching for the compiled model, which is highly recommended for efficiency, retrieve the HMM decoder and set its cache. For instance, to set up caching for both log estimates and linear estimates, use the code:
    ObjectInput objIn = ...;
    HmmChunker chunker = (HmmChunker) objIn.readObject();
    HmmDecoder decoder = chunker.getDecoder();
    decoder.setEmissionCache(new FastCache(1000000));
    decoder.setEmissionLog2Cache(new FastCache(1000000));
The tag BOS is reserved for use by the system
for encoding document start/end positions. See HmmChunker
for more information.
| Constructor and Description |
|---|
CharLmHmmChunker(TokenizerFactory tokenizerFactory,
AbstractHmmEstimator hmmEstimator)
Construct a
CharLmHmmChunker from the specified
tokenizer factory and hidden Markov model estimator. |
CharLmHmmChunker(TokenizerFactory tokenizerFactory,
AbstractHmmEstimator hmmEstimator,
boolean smoothTags)
Construct a
CharLmHmmChunker from the specified
tokenizer factory, HMM estimator and tag-smoothing flag. |
| Modifier and Type | Method and Description |
|---|---|
void |
compileTo(ObjectOutput objOut)
Compiles this model to the specified object output stream.
|
static boolean |
consistentTokens(String[] toks,
String[] whitespaces,
TokenizerFactory tokenizerFactory) |
AbstractHmmEstimator |
getHmmEstimator()
Returns the underlying hidden Markov model estimator for this
chunker estimator.
|
TokenizerFactory |
getTokenizerFactory()
Return the tokenizer factory for this chunker.
|
void |
handle(Chunking chunking)
Handle the specified chunking by tokenizing it, assigning tags
and training the underlying hidden Markov model.
|
String |
toString()
Returns a string representation of the complete topology of the
underlying HMM with log2 transition probabilities.
|
void |
trainDictionary(CharSequence cSeq,
String type)
Train the underlying hidden Markov model based on the specified
character sequence being of the specified type.
|
Methods inherited from class HmmChunker: chunk, chunk, getDecoder, nBest, nBestChunks, nBestConditional

public CharLmHmmChunker(TokenizerFactory tokenizerFactory, AbstractHmmEstimator hmmEstimator)

Construct a CharLmHmmChunker from the specified tokenizer factory and hidden Markov model estimator. Smoothing is turned off by default. See CharLmHmmChunker(TokenizerFactory,AbstractHmmEstimator,boolean) for more information.

Parameters:
tokenizerFactory - Tokenizer factory to tokenize chunks.
hmmEstimator - Underlying HMM estimator.

public CharLmHmmChunker(TokenizerFactory tokenizerFactory, AbstractHmmEstimator hmmEstimator, boolean smoothTags)

Construct a
CharLmHmmChunker from the specified
tokenizer factory, HMM estimator and tag-smoothing flag.
If smoothing is turned on, then every time a new entity type is seen in the training data, all possible underlying tags involving that type are added to the symbol table, and every legal transition among these tags and all other tags is incremented by count 1.
The tokenizer factory must be compilable in order for the model to be compiled. If it is not compilable, then attempting to compile the model will raise an exception.
Parameters:
tokenizerFactory - Tokenizer factory to tokenize chunks.
hmmEstimator - Underlying HMM estimator.
smoothTags - Set to true for tag smoothing.

public AbstractHmmEstimator getHmmEstimator()

Returns the underlying hidden Markov model estimator for this chunker estimator.
public TokenizerFactory getTokenizerFactory()
Return the tokenizer factory for this chunker.

Overrides:
getTokenizerFactory in class HmmChunker

public void trainDictionary(CharSequence cSeq, String type)

Train the underlying hidden Markov model based on the specified character sequence being of the specified type.
Warning: Chunkers cannot only be trained with a dictionary. They require regular training data in order to train the contexts in which dictionary items show up. Attempting to train with only a dictionary will lead to null pointer exceptions when attempting to decode.
Parameters:
cSeq - Character sequence on which to train.
type - Type of chunk.

public void handle(Chunking chunking)

Handle the specified chunking by tokenizing it, assigning tags and training the underlying hidden Markov model.

Specified by:
handle in interface ObjectHandler<Chunking>

Parameters:
chunking - Chunking to use for training.

public void compileTo(ObjectOutput objOut) throws IOException

Compiles this model to the specified object output stream.
The compiled model may then be read back in using ObjectInput.readObject(); the resulting object will be an instance of HmmChunker. See the class documentation above for information on setting the cache for a compiled model.

Specified by:
compileTo in interface Compilable

Parameters:
objOut - Object output to which this object is compiled.

Throws:
IOException - If there is an I/O error during the write.

public String toString()

Returns a string representation of the complete topology of the underlying HMM with log2 transition probabilities.
public static boolean consistentTokens(String[] toks, String[] whitespaces, TokenizerFactory tokenizerFactory)
Copyright © 2019 Alias-i, Inc. All rights reserved.