public class HmmDecoder extends Object implements Tagger<String>, NBestTagger<String>, MarginalTagger<String>
HmmDecoder provides implementations of first-best,
n-best and marginal taggers for hidden Markov models (HMMs). A
decoder is constructed from a hidden Markov model.
HMM decoders implement the interface Tagger, which
specifies a first-best tagging method tag(List).
This method returns the most likely (first-best) path of HMM states
(tags) given a sequence of string emissions (outputs). First-best
decoding is implemented using the Viterbi dynamic programming
algorithm.
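The Viterbi recurrence can be illustrated with a self-contained toy model; the sketch below is for exposition only and is not LingPipe's implementation or API (the model parameters and tag/output indices are made up for the example).

```java
import java.util.Arrays;

// Toy illustration of Viterbi first-best decoding: dynamic programming over
// best path scores with back pointers, then a backtrace. Not LingPipe code.
public class ViterbiSketch {

    // start[k]: P(first tag = k); trans[i][j]: P(tag j | tag i);
    // emit[k][o]: P(output o | tag k). Returns the most likely tag path.
    public static int[] viterbi(double[] start, double[][] trans,
                                double[][] emit, int[] outputs) {
        int numToks = outputs.length;
        int numTags = start.length;
        double[][] best = new double[numToks][numTags];
        int[][] backPtr = new int[numToks][numTags];
        for (int k = 0; k < numTags; ++k)
            best[0][k] = start[k] * emit[k][outputs[0]];
        for (int t = 1; t < numToks; ++t) {
            for (int j = 0; j < numTags; ++j) {
                double max = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int i = 0; i < numTags; ++i) {
                    double score = best[t - 1][i] * trans[i][j];
                    if (score > max) { max = score; arg = i; }
                }
                best[t][j] = max * emit[j][outputs[t]];
                backPtr[t][j] = arg;
            }
        }
        // backtrace from the highest scoring final tag
        int end = 0;
        for (int k = 1; k < numTags; ++k)
            if (best[numToks - 1][k] > best[numToks - 1][end]) end = k;
        int[] path = new int[numToks];
        path[numToks - 1] = end;
        for (int t = numToks - 1; t > 0; --t)
            path[t - 1] = backPtr[t][path[t]];
        return path;
    }

    public static void main(String[] args) {
        // tags: 0 and 1; outputs: 0, 1, 2 (illustrative parameters only)
        double[] start = { 0.6, 0.4 };
        double[][] trans = { { 0.7, 0.3 }, { 0.4, 0.6 } };
        double[][] emit = { { 0.1, 0.4, 0.5 }, { 0.6, 0.3, 0.1 } };
        int[] path = viterbi(start, trans, emit, new int[] { 0, 1, 2 });
        System.out.println(Arrays.toString(path)); // prints [1, 0, 0]
    }
}
```

A production decoder works in log space to avoid underflow, as the class description notes; the multiplication here becomes addition of log (base 2) estimates with no change to the argmax.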
HMM decoders also implement the interface NBestTagger,
which specifies the methods tagNBest(List,int) and tagNBestConditional(List,int). These methods both return an
iterator over scored taggings. N-best decoding is implemented
using the Viterbi algorithm in a forward pass and the A*
algorithm in the backward pass using the Viterbi estimates as exact
completion estimates. The variant conditional method further
normalizes outputs to posterior conditional probabilities, and is
a bit more expensive to compute.
HMM decoders also implement the MarginalTagger
interface, which specifies a method tagMarginal(List) for
providing marginal probability estimates for categories for a token
given the input string. Marginal decoding is implemented using the
standard forward-backward algorithm. The lattice is an instance of
TagLattice;
see that class's documentation for information on how to retrieve
cumulative (total) probabilities for input token sequences and
posterior conditional probabilities (confidence scores) per token.
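The forward-backward computation behind marginal decoding can be sketched on the same kind of toy model; this is a minimal illustration of the algorithm, not LingPipe's TagLattice implementation, and the parameter layout is assumed for the example.

```java
// Toy illustration of forward-backward marginal tag probabilities.
// gamma[t][k] = P(tag at position t = k | outputs). Not LingPipe code.
public class ForwardBackwardSketch {

    public static double[][] marginals(double[] start, double[][] trans,
                                       double[][] emit, int[] outputs) {
        int numToks = outputs.length;
        int numTags = start.length;
        double[][] alpha = new double[numToks][numTags]; // forward scores
        double[][] beta = new double[numToks][numTags];  // backward scores
        for (int k = 0; k < numTags; ++k)
            alpha[0][k] = start[k] * emit[k][outputs[0]];
        for (int t = 1; t < numToks; ++t)
            for (int j = 0; j < numTags; ++j) {
                double sum = 0.0;
                for (int i = 0; i < numTags; ++i)
                    sum += alpha[t - 1][i] * trans[i][j];
                alpha[t][j] = sum * emit[j][outputs[t]];
            }
        for (int k = 0; k < numTags; ++k)
            beta[numToks - 1][k] = 1.0;
        for (int t = numToks - 2; t >= 0; --t)
            for (int i = 0; i < numTags; ++i) {
                double sum = 0.0;
                for (int j = 0; j < numTags; ++j)
                    sum += trans[i][j] * emit[j][outputs[t + 1]] * beta[t + 1][j];
                beta[t][i] = sum;
            }
        double total = 0.0; // cumulative (total) probability of the outputs
        for (int k = 0; k < numTags; ++k)
            total += alpha[numToks - 1][k];
        double[][] gamma = new double[numToks][numTags];
        for (int t = 0; t < numToks; ++t)
            for (int k = 0; k < numTags; ++k)
                gamma[t][k] = alpha[t][k] * beta[t][k] / total;
        return gamma;
    }
}
```

The per-token rows of the result are exactly the posterior conditional probabilities (confidence scores) described above; each row sums to one.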
The decoder is able to cache emission probabilities, the
computation of which is often the bottleneck in decoding. Caches
may be provided in the constructor for both ordinary (linear)
probabilities and log probabilities of emissions. A cache is
simply an instance of Map from strings to arrays of doubles.
The first-best and n-best decoders use only log probabilities, whereas the n-best normalized and lattice decoders use linear probabilities. Only the probabilities actually computed are cached, so if a program does only first-best processing, only log estimates are cached.
Any implementation of Map may be used as a cache,
but particular attention must be paid to thread safety and
scalability. A fully synchronized cache can be created with:
Map<String,double[]> cache
= java.util.Collections.synchronizedMap(new HashMap<String,double[]>());
LingPipe's map implementation FastCache is
designed specifically to be used as a cache in settings such as
these.
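The cache contract is plain: token keys map to per-tag arrays of estimates. The sketch below shows a hand-rolled lookup-or-compute step against such a synchronized cache; the stand-in estimate values are invented for illustration, as the real values would come from the HMM's emission model.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of the emission cache contract: tokens map to per-tag estimate
// arrays. The stand-in values below are illustrative, not model output.
public class EmissionCacheSketch {

    public static double[] lookupOrCompute(Map<String,double[]> cache,
                                           String token) {
        double[] estimates = cache.get(token);
        if (estimates == null) {
            // stand-in for the HMM's actual per-tag emission computation
            estimates = new double[] { 0.2, 0.7, 0.1 };
            cache.put(token, estimates);
        }
        return estimates;
    }

    public static void main(String[] args) {
        Map<String,double[]> cache =
            Collections.synchronizedMap(new HashMap<String,double[]>());
        lookupOrCompute(cache, "run");      // computed and cached
        lookupOrCompute(cache, "run");      // second call hits the cache
        System.out.println(cache.size());   // prints 1
    }
}
```

A decoder constructed with such a map fills it as a side effect of decoding, so repeated tokens are estimated only once.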
It is often easy (e.g. on English newswire) to get high token coverage (e.g. 97%) with a rather modestly sized cache (e.g. 100K tokens). Other corpora and languages may vary, and we encourage experimenting with the tradeoff between memory and efficiency for caching. Note that run times speed up as more and more estimates are returned from the cache rather than being computed directly.
This class does not perform any underlying synchronization. If
the hidden Markov model is not thread safe, then access to it must be
synchronized; the same holds for the caches. Note that FastCache, while not synchronized, is thread safe.
Similarly, the compilation of an HMM trained with HmmCharLmEstimator is thread safe; in fact, it allows safe
concurrent access because it is immutable.
For first-best and n-best decoding, a beam may be used to prune
unlikely hypotheses. This beam is set during construction or
through the method setLog2EmissionBeam(double) (setting
and access must be concurrent read/exclusive write synchronized
from the caller). The beam works token by token. As each token is
considered, any tag whose emission log (base 2) likelihood is more
than the beam width below the best emission estimate is eliminated
from further consideration.
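The pruning rule just described can be sketched directly; this is an illustration of the rule only, not LingPipe's internal code, and the example estimates are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of per-token emission beam pruning: a tag survives only if its
// log (base 2) emission estimate is within the beam of the best estimate.
public class EmissionBeamSketch {

    public static List<Integer> survivingTags(double[] log2Emissions,
                                              double log2Beam) {
        double best = Double.NEGATIVE_INFINITY;
        for (double e : log2Emissions)
            if (e > best) best = e;
        List<Integer> kept = new ArrayList<>();
        for (int tag = 0; tag < log2Emissions.length; ++tag)
            if (log2Emissions[tag] >= best - log2Beam)
                kept.add(tag);
        return kept;
    }
}
```

For example, with estimates { -1, -3, -10 } and a beam of 4, the threshold is -5, so the third tag is pruned and only the first two remain in contention.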
| Constructor and Description |
|---|
HmmDecoder(HiddenMarkovModel hmm)
Construct an HMM decoder from the specified HMM.
|
HmmDecoder(HiddenMarkovModel hmm,
Map<String,double[]> emissionCache,
Map<String,double[]> emissionLog2Cache)
Construct an HMM decoder from the specified HMM using the
specified caches for linear and log probabilities.
|
HmmDecoder(HiddenMarkovModel hmm,
Map<String,double[]> emissionCache,
Map<String,double[]> emissionLog2Cache,
double log2Beam,
double log2EmissionBeam)
Construct an HMM decoder from the specified HMM using the
specified caches for linear and log probabilities, with the
specified beam widths.
|
| Modifier and Type | Method and Description |
|---|---|
Map<String,double[]> |
emissionCache()
Returns the mapping used to cache emission probabilities, or
null if not caching. |
Map<String,double[]> |
emissionLog2Cache()
Returns the mapping used to cache log (base 2) emission
probabilities, or
null if not caching. |
HiddenMarkovModel |
getHmm()
Returns the hidden Markov model underlying this decoder.
|
void |
setEmissionCache(Map<String,double[]> cache)
Sets the emission cache to the specified value.
|
void |
setEmissionLog2Cache(Map<String,double[]> cache)
Sets the log emission cache to the specified value.
|
void |
setLog2Beam(double log2Beam)
Sets the value of the log2 beam to the specified value.
|
void |
setLog2EmissionBeam(double log2EmissionBeam)
Sets the log (base 2) emission beam width.
|
Tagging<String> |
tag(List<String> tokens)
Return the tagging for the specified list of tokens.
|
TagLattice<String> |
tagMarginal(List<String> tokens)
Return the marginal tagging for the specified list of
input tokens.
|
Iterator<ScoredTagging<String>> |
tagNBest(List<String> tokens,
int maxResults)
Return an iterator over the
n-best scored taggings for
the specified input tokens up to a specified maximum n. |
Iterator<ScoredTagging<String>> |
tagNBestConditional(List<String> tokens,
int maxResults)
Return an iterator over the
n-best scored taggings for
the specified input tokens up to a specified maximum n,
with scores normalized to conditional probabilities. |
public HmmDecoder(HiddenMarkovModel hmm)
Construct an HMM decoder from the specified HMM.
Parameters:
hmm - Model to use as basis of decoding.

public HmmDecoder(HiddenMarkovModel hmm, Map<String,double[]> emissionCache, Map<String,double[]> emissionLog2Cache)
Construct an HMM decoder from the specified HMM using the specified caches for linear and log probabilities. Either cache may be null, in which case the corresponding values will not be cached.
Parameters:
hmm - Model to use for decoding.
emissionCache - Map to use for emission caching.
emissionLog2Cache - Map to use for log emission caching.

public HmmDecoder(HiddenMarkovModel hmm, Map<String,double[]> emissionCache, Map<String,double[]> emissionLog2Cache, double log2Beam, double log2EmissionBeam)
Construct an HMM decoder from the specified HMM using the specified caches for linear and log probabilities, with the specified beam widths. Either cache may be null, in which case the corresponding values will not be cached.
Parameters:
hmm - Model to use for decoding.
emissionCache - Map to use for emission caching.
emissionLog2Cache - Map to use for log emission caching.
log2Beam - The log (base 2) beam for pruning full hypotheses.
log2EmissionBeam - The log (base 2) beam for pruning emission hypotheses.
Throws:
IllegalArgumentException - If either beam is not a non-negative number.

public HiddenMarkovModel getHmm()
Returns the hidden Markov model underlying this decoder.

public Map<String,double[]> emissionCache()
Returns the mapping used to cache emission probabilities, or null if not caching. This is the actual mapping, so changes to it will affect this decoder.

public Map<String,double[]> emissionLog2Cache()
Returns the mapping used to cache log (base 2) emission probabilities, or null if not caching. This is the actual mapping, so changes to it will affect this decoder.

public void setEmissionCache(Map<String,double[]> cache)
Sets the emission cache to the specified value.
Warning: This method should not be executed concurrently with any calls to decoding, as it may produce an inconsistent result. The typical application will be to set a cache before using a decoder.
Parameters:
cache - Cache for linear emission estimates.

public void setLog2EmissionBeam(double log2EmissionBeam)
Sets the log (base 2) emission beam width.
Parameters:
log2EmissionBeam - Width of beam.
Throws:
IllegalArgumentException - If the beam width is negative.

public void setLog2Beam(double log2Beam)
Sets the value of the log (base 2) beam.
Parameters:
log2Beam - The log (base 2) Viterbi beam.
Throws:
IllegalArgumentException - If the beam width is negative.

public void setEmissionLog2Cache(Map<String,double[]> cache)
Sets the log emission cache to the specified value.
Warning: This method should not be executed concurrently with any calls to decoding, as it may produce an inconsistent result. The typical application will be to set a cache before using a decoder.
Parameters:
cache - Cache for log (base 2) emission estimates.

public Tagging<String> tag(List<String> tokens)
Return the tagging for the specified list of tokens.
Specified by:
tag in interface Tagger<String>
Parameters:
tokens - Input tokens to tag.

public Iterator<ScoredTagging<String>> tagNBest(List<String> tokens, int maxResults)
Return an iterator over the n-best scored taggings for the specified input tokens up to a specified maximum n.
Specified by:
tagNBest in interface NBestTagger<String>
Parameters:
tokens - Input tokens to tag.
maxResults - Maximum number of results to return.

public Iterator<ScoredTagging<String>> tagNBestConditional(List<String> tokens, int maxResults)
Return an iterator over the n-best scored taggings for the specified input tokens up to a specified maximum n, with scores normalized to conditional probabilities. Optional operation.
Specified by:
tagNBestConditional in interface NBestTagger<String>
Parameters:
tokens - Input tokens to tag.
maxResults - Maximum number of results to return.

public TagLattice<String> tagMarginal(List<String> tokens)
Return the marginal tagging for the specified list of input tokens.
Specified by:
tagMarginal in interface MarginalTagger<String>
Parameters:
tokens - Input tokens to tag.

Copyright © 2016 Alias-i, Inc. All rights reserved.