public class HmmChunker extends Object implements NBestChunker, ConfidenceChunker
HmmChunker uses a hidden Markov model to perform
chunking over tokenized character sequences. Instances contain a
hidden Markov model, a decoder for the model, and a tokenizer factory.
Chunking results are available through three related methods.
The method chunk(CharSequence) and its sister method
chunk(char[],int,int) implement the Chunking
interface by returning the first-best chunking for the argument
character sequence or slice. The method nBest(char[],int,int,int) returns an iterator over complete
chunkings and their joint probability estimates in descending order
of probability, with the last argument supplying an upper bound on
the number of chunkings returned. Finally, the method nBestChunks(char[],int,int,int) returns an iterator over the
chunks themselves, this time in descending order of the chunk's
conditional probability given the input (i.e. descending
confidence), with the final argument providing an upper bound on
the number of such chunks returned.
The chunker requires a hidden Markov model whose states conform to a token-by-token encoding of a chunking. This class assumes the following encoding:
| Tag | Description of Tokens to which it is Assigned | May Follow | May Precede |
|---|---|---|---|
| B_X | Initial (begin) token of chunk of type X | E_Y, W_Y, EE_O_X, WW_O_X | M_X, E_X |
| M_X | Interior (middle) token of chunk of type X | B_X, M_X | M_X, E_X |
| E_X | Final (end) token of chunk of type X | B_X, M_X | B_Y, W_Y, BB_O_X, WW_O_Y |
| W_X | Token by itself comprising a (whole) chunk of type X | E_Y, W_Y, EE_O_X, WW_O_X | B_Y, W_Y, BB_O_X, WW_O_Y |
| BB_O_X | Token not in a chunk, previous token ending a chunk of type X | E_X, W_X | MM_O, EE_O_Y |
| MM_O | Token with previous and following tokens not in a chunk | BB_O_Y, MM_O | MM_O, EE_O_Y |
| EE_O_X | Token and previous token not in a chunk, following token beginning a chunk of type X | BB_O_Y, MM_O | B_X, W_X |
| WW_O_X | Token not in a chunk, previous token ending a chunk and following token beginning a chunk of type X | E_Y, W_Y | B_X, W_X |

The intention here is that the X tags in the
last two columns (legal followers and preceders) match the tag
in the first column, whereas the Y tags
vary freely.
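The follower constraints in the table can be expressed directly in code. The sketch below, assuming a hypothetical model with just the chunk types PER and LOC, computes the "May Precede" column for each tag; the class and method names are illustrative helpers, not part of the LingPipe API:

```java
import java.util.*;

// Sketch of the tag-transition constraints in the table, for a
// hypothetical model with chunk types PER and LOC.  Tag names follow
// the encoding in the class documentation.
public class TagTransitions {
    static final List<String> TYPES = Arrays.asList("PER", "LOC");

    // Returns the set of tags that may legally follow the given tag,
    // i.e. the "May Precede" column of the table.
    static Set<String> successors(String tag) {
        Set<String> next = new HashSet<>();
        if (tag.startsWith("B_") || tag.startsWith("M_")) {
            String x = tag.substring(2);
            next.add("M_" + x);               // continue inside the chunk
            next.add("E_" + x);               // or close it
        } else if (tag.startsWith("E_") || tag.startsWith("W_")) {
            String x = tag.substring(2);
            next.add("BB_O_" + x);            // first out token after a chunk of type x
            for (String y : TYPES) {
                next.add("B_" + y);           // an adjacent chunk starts
                next.add("W_" + y);
                next.add("WW_O_" + y);        // lone out token before the next chunk
            }
        } else if (tag.startsWith("BB_O_") || tag.equals("MM_O")) {
            next.add("MM_O");                 // stay outside any chunk
            for (String y : TYPES)
                next.add("EE_O_" + y);        // or approach the next chunk
        } else if (tag.startsWith("EE_O_") || tag.startsWith("WW_O_")) {
            String x = tag.substring(5);
            next.add("B_" + x);               // the signaled chunk of type x begins
            next.add("W_" + x);
        }
        return next;
    }

    static boolean isLegal(String from, String to) {
        return successors(from).contains(to);
    }
}
```

Transitions ruled out by the table, such as B_PER directly followed by B_PER, simply fall outside the computed successor sets.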
Note that this produces the following number of states:

numTags = (7 * numTypes) + 1

Not all transitions between states are legal; the ones ruled out in the table above must receive zero probability estimates. The number of legal transitions is given by:

numTransitions = 5*numTypes² + 13*numTypes + 1
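For concreteness, the two counts can be evaluated for a small model. The helper below simply encodes the formulas above; the class and method names are illustrative, not part of the LingPipe API:

```java
// Compute the tag and transition counts given in the class documentation.
public class TagCounts {
    // numTags = (7 * numTypes) + 1: seven tags per chunk type plus the single MM_O tag
    static int numTags(int numTypes) {
        return 7 * numTypes + 1;
    }

    // numTransitions = 5*numTypes^2 + 13*numTypes + 1, per the class documentation
    static int numTransitions(int numTypes) {
        return 5 * numTypes * numTypes + 13 * numTypes + 1;
    }

    public static void main(String[] args) {
        // e.g. a MUC-style model with PERSON, LOCATION and ORGANIZATION chunks
        System.out.println(numTags(3));        // 22
        System.out.println(numTransitions(3)); // 85
    }
}
```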
By including an indication of the position in a chunk, an HMM
is able to model tokens that start and end chunks, as well as those
that fall in the middle of chunks or make up chunks on their own.
In addition, it also models tokens that precede or follow chunks of
a given type. For instance, consider the following tokenization
and tagging, with an implicit tag W_OOS for the
out-of-sentence tag:
(W_OOS)
Yesterday BB_O_OOS
afternoon MM_O
, EE_O_PER
John B_PER
J M_PER
. M_PER
Smith E_PER
traveled BB_O_PER
to EE_O_LOC
Washington W_LOC
. WW_O_OOS
(W_OOS)
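A tagging of this form can be decoded back into chunks by scanning for W_X tokens and B_X ... E_X runs. The sketch below, an illustrative helper rather than LingPipe code, recovers the two chunks from the example tagging:

```java
import java.util.*;

public class TagsToChunks {
    // Decode a tag sequence into chunk descriptions "start-end:TYPE",
    // with token offsets and the end offset exclusive.
    static List<String> chunks(String[] tags) {
        List<String> result = new ArrayList<>();
        int begin = -1;                       // start of an open B_X ... E_X run
        for (int i = 0; i < tags.length; ++i) {
            String tag = tags[i];
            if (tag.startsWith("W_")) {       // single-token (whole) chunk
                result.add(i + "-" + (i + 1) + ":" + tag.substring(2));
            } else if (tag.startsWith("B_")) {
                begin = i;                    // open a multi-token chunk
            } else if (tag.startsWith("E_")) {
                result.add(begin + "-" + (i + 1) + ":" + tag.substring(2));
            }                                 // M_X and the *_O_* tags add nothing
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tags = {
            "BB_O_OOS", "MM_O", "EE_O_PER",           // Yesterday afternoon ,
            "B_PER", "M_PER", "M_PER", "E_PER",       // John J . Smith
            "BB_O_PER", "EE_O_LOC",                   // traveled to
            "W_LOC",                                  // Washington
            "WW_O_OOS"                                // .
        };
        System.out.println(chunks(tags));     // prints [3-7:PER, 9-10:LOC]
    }
}
```

Note that the two-character prefixes are unambiguous: a tag such as WW_O_OOS does not match the W_ test because its second character is not an underscore.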
First note that the person chunk John J. Smith consists
of four tokens: John with a begin-person tag,
J and . with interior-person tags, and
Smith with an end-person tag. In contrast, the token
Washington makes up a location chunk all by itself.
There are several flavors of tags assigned to tokens that are
not part of chunks based on the status of the surrounding tokens.
First, BB_O_OOS is the tag assigned to
Yesterday, because it is an out token that follows (an
implicit) OOS tag. That is, it's the first out token
following out-of-sentence. This allows the tag to capture the
capitalization pattern of sentence-initial tokens that are not part
of chunks. The interior token afternoon is simply
assigned MM_O; its context does not allow it to see
any surrounding chunks. At the other end of the sentence, the
final period token is assigned the tag
WW_O_OOS, because it precedes the (implicit) OOS
(out of sentence) chunk. This allows some discrimination between
sentence-final punctuation and other punctuation.
Next note that the token traveled is assigned to
the category of first tokens following person, whereas
to is assigned to the category of a final token
preceding a location. Finally, note the tag MM_O
assigned to the token afternoon, which appears between
two other tokens that are not part of chunks.
If taggings of this sort are required rather than chunkings, the
HMM decoder may be retrieved via getDecoder() and used
along with the tokenizer factory retrieved through getTokenizerFactory() to produce taggings.
The class CharLmHmmChunker may be used to train a
chunker using an HMM estimator such as HmmCharLmEstimator to estimate the HMM. This
estimator uses bounded character language models to estimate
emission probabilities.
| Constructor and Description |
|---|
| HmmChunker(TokenizerFactory tokenizerFactory, HmmDecoder decoder) Construct a chunker from the specified tokenizer factory and hidden Markov model decoder. |
| Modifier and Type | Method and Description |
|---|---|
| Chunking | chunk(char[] cs, int start, int end) Returns a chunking of the specified character slice. |
| Chunking | chunk(CharSequence cSeq) Returns a chunking of the specified character sequence. |
| HmmDecoder | getDecoder() Returns the underlying hidden Markov model decoder for this chunker. |
| TokenizerFactory | getTokenizerFactory() Returns the underlying tokenizer factory for this chunker. |
| Iterator<ScoredObject<Chunking>> | nBest(char[] cs, int start, int end, int maxNBest) Returns a size-bounded iterator over scored objects with joint probability estimates of tags and tokens as scores and chunkings as objects. |
| Iterator<Chunk> | nBestChunks(char[] cs, int start, int end, int maxNBest) Returns an iterator over scored objects with conditional probability estimates for scores and chunks as objects. |
| Iterator<ScoredObject<Chunking>> | nBestConditional(char[] cs, int start, int end, int maxNBest) Returns a size-bounded iterator over scored objects with conditional probability estimates of tags and tokens as scores and chunkings as objects. |
public HmmChunker(TokenizerFactory tokenizerFactory, HmmDecoder decoder)
See the note in the class documentation concerning caching
in the decoder. A typical application will configure the cache
of the decoder before creating an HMM chunker. See the class
documentation for HmmDecoder, as well as the method
documentation for HmmDecoder.setEmissionCache(Map) and
HmmDecoder.setEmissionLog2Cache(Map) for more
information.
Parameters:
tokenizerFactory - Tokenizer factory for tokenization.
decoder - Hidden Markov model decoder.

public HmmDecoder getDecoder()
The decoder provides access to the underlying hidden Markov model for this chunker.
public TokenizerFactory getTokenizerFactory()
public Chunking chunk(char[] cs, int start, int end)
Specified by:
chunk in interface Chunker
Parameters:
cs - Array of characters.
start - Index of first character.
end - Index of one past last character.
Throws:
IndexOutOfBoundsException - If the specified indices are out of bounds of the specified character array.

public Chunking chunk(CharSequence cSeq)
public Iterator<ScoredObject<Chunking>> nBest(char[] cs, int start, int end, int maxNBest)
Specified by:
nBest in interface NBestChunker
Parameters:
cs - Array of characters.
start - Index of first character.
end - Index of one past last character.
maxNBest - Maximum number of results to return.
Throws:
IndexOutOfBoundsException - If the specified indices are out of bounds of the specified character array.
IllegalArgumentException - If the maximum n-best value is not greater than zero.

public Iterator<ScoredObject<Chunking>> nBestConditional(char[] cs, int start, int end, int maxNBest)
Parameters:
cs - Array of characters.
start - Index of first character.
end - Index of one past last character.
maxNBest - Maximum number of results to return.
Throws:
IndexOutOfBoundsException - If the specified indices are out of bounds of the specified character array.
IllegalArgumentException - If the maximum n-best value is not greater than zero.

public Iterator<Chunk> nBestChunks(char[] cs, int start, int end, int maxNBest)
The iterator signals that it is exhausted by returning false when its hasNext() method is called.
Specified by:
nBestChunks in interface ConfidenceChunker
Parameters:
cs - Array of characters.
start - Index of first character.
end - Index of one past last character.
maxNBest - Maximum number of chunks returned.
Throws:
IndexOutOfBoundsException - If the specified indices are out of bounds of the specified character array.

Copyright © 2016 Alias-i, Inc. All rights reserved.