public class ChainCrfChunker extends Object implements Chunker, ConfidenceChunker, NBestChunker, Serializable
ChainCrfChunker implements chunking based on a chain CRF
over string sequences, a tokenizer factory, and a tag-to-chunk
coder/decoder.
The tokenizer factory is used to turn an input sequence into a list of tokens, and the codec is used to convert taggings into chunkings and vice-versa.
For chunking, feature extraction is over the same two implicit
data structures as for chain CRFs, nodes and edges. For chunkers,
the labels are coded and decoded by an instance of TagChunkCodec, such as the BIO-based codec. In order to generate
token-based representations on which to hang tags, an instance of
TokenizerFactory is supplied in the chunker constructor.
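The role the tokenizer factory plays can be sketched with a minimal stand-in: a whitespace tokenizer that records each token's character offsets, so that tags assigned to tokens can later be mapped back to character spans. The class and method names here are hypothetical illustrations, not LingPipe's TokenizerFactory API.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSpans {
    // A token with its character offsets into the underlying text.
    static final class Token {
        final String text;
        final int start; // index of first character
        final int end;   // index of one past the last character
        Token(String text, int start, int end) {
            this.text = text; this.start = start; this.end = end;
        }
    }

    // Split on whitespace, recording each token's character slice.
    static List<Token> tokenize(String s) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
            int start = i;
            while (i < s.length() && !Character.isWhitespace(s.charAt(i))) i++;
            if (i > start) tokens.add(new Token(s.substring(start, i), start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("John ran home")) {
            System.out.println(t.text + " " + t.start + " " + t.end);
        }
        // Tokens: "John" [0,4), "ran" [5,8), "home" [9,13)
    }
}
```

A tag assigned to the token at [0,4) then corresponds to a chunk over that same character slice of the input.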
The estimate() method is used to train a chain
CRF-based chunker. The training data is provided as a corpus of
chunkings. The tag-chunk codec and tokenizer factory are then used
to convert the chunkings to taggings, and the resulting tag corpus
passed off to the chain CRF estimator method. Feature extractors
are the same as for a chain CRF, with one for nodes and one for
edges. The tags passed in to these feature extractors will be
determined by the tag-chunk codec. The remaining inputs are
identical to those for chain CRFs; see the method documentation for
more information.
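The chunking-to-tagging conversion performed before training can be illustrated with a simplified BIO encoding. This is a self-contained sketch in the spirit of what a tag-chunk codec does, not LingPipe's TagChunkCodec implementation; it assumes chunk boundaries align with token boundaries.

```java
import java.util.ArrayList;
import java.util.List;

public class BioEncoder {
    // A labeled character span, e.g. a PER chunk over [0, 10).
    static final class Chunk {
        final int start, end;
        final String type;
        Chunk(int start, int end, String type) {
            this.start = start; this.end = end; this.type = type;
        }
    }

    // Assign a BIO tag to each token span [starts[i], ends[i]):
    // B-T for a token beginning a chunk of type T, I-T for a token
    // continuing one, and O for tokens outside any chunk.
    static List<String> encode(int[] starts, int[] ends, List<Chunk> chunks) {
        List<String> tags = new ArrayList<>();
        for (int i = 0; i < starts.length; i++) {
            String tag = "O";
            for (Chunk c : chunks) {
                if (starts[i] >= c.start && ends[i] <= c.end) {
                    tag = (starts[i] == c.start ? "B-" : "I-") + c.type;
                    break;
                }
            }
            tags.add(tag);
        }
        return tags;
    }

    public static void main(String[] args) {
        // "John Smith ran": tokens at [0,4), [5,10), [11,14),
        // with a single PER chunk spanning [0,10).
        int[] starts = {0, 5, 11};
        int[] ends = {4, 10, 14};
        List<Chunk> chunks = List.of(new Chunk(0, 10, "PER"));
        System.out.println(encode(starts, ends, chunks)); // [B-PER, I-PER, O]
    }
}
```

The resulting tag sequence is what a chain CRF estimator would train on; decoding reverses the process, merging B-/I- runs back into labeled character spans.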
A serialized instance deserializes to a ChainCrfChunker, with components
derived from serialization and deserialization.
| Constructor and Description |
|---|
ChainCrfChunker(ChainCrf<String> crf,
TokenizerFactory tokenizerFactory,
TagChunkCodec codec)
Construct a chunker based on the specified chain conditional
random field, tokenizer factory and tag-chunk coder/decoder.
|
| Modifier and Type | Method and Description |
|---|---|
Chunking |
chunk(char[] cs,
int start,
int end)
Return the chunking of the specified character slice.
|
Chunking |
chunk(CharSequence cSeq)
Return the chunking of the specified character sequence.
|
TagChunkCodec |
codec()
Returns the tag/chunk coder/decoder for this chunker.
|
ChainCrf<String> |
crf()
Returns the underlying CRF for this chunker.
|
static ChainCrfChunker |
estimate(Corpus<ObjectHandler<Chunking>> chunkingCorpus,
TagChunkCodec codec,
TokenizerFactory tokenizerFactory,
ChainCrfFeatureExtractor<String> featureExtractor,
boolean addInterceptFeature,
int minFeatureCount,
boolean cacheFeatureVectors,
RegressionPrior prior,
int priorBlockSize,
AnnealingSchedule annealingSchedule,
double minImprovement,
int minEpochs,
int maxEpochs,
Reporter reporter)
Return the chain CRF-based chunker estimated from the specified
corpus, which is converted to a tagging corpus using the
specified coder/decoder and tokenizer factory, then passed to
the chain CRF estimate method along with the rest of the
arguments.
|
Iterator<ScoredObject<Chunking>> |
nBest(char[] cs,
int start,
int end,
int maxResults)
Return the scored chunkings of the specified character slice
as an iterator in decreasing order of score.
|
Iterator<Chunk> |
nBestChunks(char[] cs,
int start,
int end,
int maxNBest)
Returns the n-best chunks in decreasing order of probability
estimates.
|
Iterator<ScoredObject<Chunking>> |
nBestConditional(char[] cs,
int start,
int end,
int maxResults)
Returns an iterator over n-best chunkings with scores
normalized to conditional probabilities of the output given the
input string slice.
|
TokenizerFactory |
tokenizerFactory()
Return the tokenizer factory for this chunker.
|
String |
toString()
Return a string-based representation of this CRF chunker.
|
public ChainCrfChunker(ChainCrf<String> crf, TokenizerFactory tokenizerFactory, TagChunkCodec codec)

Parameters:
crf - Underlying conditional random field.
tokenizerFactory - Tokenizer factory for converting chunkings to token sequences.
codec - Coder/decoder for converting taggings to chunkings and vice-versa.

public ChainCrf<String> crf()
public TagChunkCodec codec()
public TokenizerFactory tokenizerFactory()
public String toString()
public Chunking chunk(CharSequence cSeq)
Specified by: chunk in interface Chunker

public Chunking chunk(char[] cs, int start, int end)
Specified by: chunk in interface Chunker

public Iterator<ScoredObject<Chunking>> nBest(char[] cs, int start, int end, int maxResults)
Specified by: nBest in interface NBestChunker

Parameters:
cs - Underlying character array.
start - Index of first character to analyze.
end - Index of one past the last character to analyze.
maxResults - The maximum number of results to return.

public Iterator<ScoredObject<Chunking>> nBestConditional(char[] cs, int start, int end, int maxResults)
The results are otherwise as for nBest(char[],int,int,int). Like that
method, the maximum number of results parameter should be set as low
as practical, as it cuts down on the memory requirements for outputs
that will never be returned.
Conditional probability normalization requires an additional forward-backward pass to derive the normalizing factor, but the benefit is that results become comparable across input strings.
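The effect of normalization can be shown numerically. In this toy sketch (hypothetical scores, not LingPipe code) the candidates are assumed to enumerate every possible analysis, so the normalizer is simply the log-sum of all candidate scores; the real chunker derives it with a forward-backward pass instead.

```java
public class ConditionalScores {
    // log2(sum_i 2^scores[i]), computed stably by factoring out the max.
    static double log2SumExp(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) max = Math.max(max, s);
        double sum = 0.0;
        for (double s : scores) sum += Math.pow(2.0, s - max);
        return max + Math.log(sum) / Math.log(2.0);
    }

    // Conditional probability of candidate i given the input:
    // joint log score minus the log normalizing factor, exponentiated.
    static double conditional(double[] jointLog2Scores, int i) {
        return Math.pow(2.0, jointLog2Scores[i] - log2SumExp(jointLog2Scores));
    }

    public static void main(String[] args) {
        double[] scores = {-2.0, -3.0, -3.0}; // joint log2 scores
        // 2^-2 / (2^-2 + 2^-3 + 2^-3) = 0.25 / 0.5 = 0.5
        System.out.println(conditional(scores, 0));
    }
}
```

Because each candidate's score is divided by the same input-dependent total, the resulting probabilities sum to one and are comparable across different input strings.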
Parameters:
cs - Underlying characters.
start - First character in slice.
end - One past the last character in the slice.
maxResults - Maximum number of results to return.

public Iterator<Chunk> nBestChunks(char[] cs, int start, int end, int maxNBest)
Description copied from interface: ConfidenceChunker

The chunks returned implement the Chunk interface, and their scores
are conditional probability estimates of the chunk given the input
character slice.

Specified by: nBestChunks in interface ConfidenceChunker

Parameters:
cs - Underlying character array.
start - Index of first character to analyze.
end - Index of one past the last character to analyze.
maxNBest - The maximum number of chunks to return.

public static ChainCrfChunker estimate(Corpus<ObjectHandler<Chunking>> chunkingCorpus, TagChunkCodec codec, TokenizerFactory tokenizerFactory, ChainCrfFeatureExtractor<String> featureExtractor, boolean addInterceptFeature, int minFeatureCount, boolean cacheFeatureVectors, RegressionPrior prior, int priorBlockSize, AnnealingSchedule annealingSchedule, double minImprovement, int minEpochs, int maxEpochs, Reporter reporter) throws IOException
Estimation is based on regularized stochastic gradient
descent. See ChainCrf.estimate(Corpus,ChainCrfFeatureExtractor,boolean,int,boolean,boolean,RegressionPrior,int,AnnealingSchedule,double,int,int,Reporter)
for more information.
Parameters:
chunkingCorpus - Training corpus of chunkings.
codec - Coder/decoder for translating chunkings to taggings and vice-versa.
tokenizerFactory - Tokenizer factory for converting inputs to token sequences for the underlying chain CRF.
featureExtractor - Feature extractor for the underlying chain CRF.
addInterceptFeature - Set to true to automatically add an intercept feature with constant value 1.0 in position 0.
minFeatureCount - Minimum number of times a feature must show up in the tagging corpus, given the feature extractors, to be retained for training.
cacheFeatureVectors - Flag indicating whether or not to cache extracted feature vectors.
prior - Prior to use to regularize the underlying chain CRF estimates.
priorBlockSize - Number of instances to update by gradient for every prior update.
annealingSchedule - Annealing schedule to determine learning rates for stochastic gradient descent training.
minImprovement - Minimum improvement in an epoch to terminate training (computed with a rolling average).
minEpochs - Minimum number of epochs for which to train.
maxEpochs - Maximum number of epochs for which to train.
reporter - Reporter to which reports of training are sent, or null for silent operation.

Throws:
IOException - If there is an underlying I/O exception reading the corpus.

Copyright © 2019 Alias-i, Inc. All rights reserved.