public class TrainTokenShapeChunker extends Object implements ObjectHandler<Chunking>, Compilable
TrainTokenShapeChunker is used to train a token and
shape-based chunker.
Estimation is based on a joint model of tags
T1,...,TN and tokens W1,...,WN, which is
approximated with a limited history and smoothed using linear
interpolation.
By the chain rule:
P(W1,...,WN,T1,...,TN)
  = P(W1,T1) * P(W2,T2|W1,T1) * P(W3,T3|W1,W2,T1,T2)
    * ... * P(WN,TN|W1,...,WN-1,T1,...,TN-1)
The longer contexts are approximated with the two previous
tokens and one previous tag.
P(WN,TN|W1,...,WN-1,T1,...,TN-1)
~ P(WN,TN|WN-2,WN-1,TN-1)
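Under this approximation, the joint probability of a tag/token sequence factors into a product over positions, each conditioned on the two previous tokens and one previous tag. A minimal sketch of that factorization in log space follows; the JointModel interface and the <BOS> boundary symbol are hypothetical illustrations, not part of the LingPipe API:

```java
public class ChainRule {
    // Hypothetical callback supplying P(Wn,Tn | Wn-2, Wn-1, Tn-1);
    // in practice these estimates come from the trained model.
    interface JointModel {
        double prob(String w, String t, String wPrev2, String wPrev1, String tPrev);
    }

    // Sum of log P(Wn,Tn | limited history) over all positions,
    // padding the short contexts at the start with a boundary symbol.
    static double jointLogProb(String[] words, String[] tags, JointModel m) {
        double logProb = 0.0;
        for (int n = 0; n < words.length; ++n) {
            String w2 = n >= 2 ? words[n - 2] : "<BOS>";
            String w1 = n >= 1 ? words[n - 1] : "<BOS>";
            String t1 = n >= 1 ? tags[n - 1] : "<BOS>";
            logProb += Math.log(m.prob(words[n], tags[n], w2, w1, t1));
        }
        return logProb;
    }
}
```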
The shorter contexts are padded with tags and tokens for the
beginning of a stream, and an additional end-of-stream symbol is
trained after the last symbol in the input.
The joint model is further decomposed into a conditional tag model
and a conditional token model by the chain rule:
P(WN,TN|WN-2,WN-1,TN-1)
= P(TN|WN-2,WN-1,TN-1)
* P(WN|WN-2,WN-1,TN-1,TN)
The token model is further approximated as:
P(WN|WN-2,WN-1,TN-1,TN)
~ P(WN|WN-1,interior(TN-1),TN)
where interior(TN-1) is the interior
version of a tag; for instance:
interior("ST_PERSON").equals("PERSON")
interior("PERSON").equals("PERSON")
This performs what is known as "model tying", and it
amounts to sharing the models for the two contexts.
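The interior() operation above can be sketched as a simple string mapping; this TagUtil helper is hypothetical, written only to illustrate the tag tying, and assumes start tags carry an "ST_" prefix as in the example:

```java
public class TagUtil {
    // Map a start tag such as "ST_PERSON" to its interior form "PERSON";
    // interior tags map to themselves, so both histories share one model.
    public static String interior(String tag) {
        return tag.startsWith("ST_") ? tag.substring(3) : tag;
    }

    public static void main(String[] args) {
        System.out.println(interior("ST_PERSON")); // prints "PERSON"
        System.out.println(interior("PERSON"));    // prints "PERSON"
    }
}
```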
The tag model is also approximated by tying start
and interior tag histories:
P(TN|WN-2,WN-1,TN-1)
~ P(TN|WN-2,WN-1,interior(TN-1))
The tag and token models are themselves simple
linear interpolation models, with smoothing parameters defined
by the Witten-Bell method. The order
of contexts for the token model is:
P(WN|TN,interior(TN-1),WN-1)
  ~ lambda(TN,interior(TN-1),WN-1) * P_ml(WN|TN,interior(TN-1),WN-1)
    + (1 - lambda(TN,interior(TN-1),WN-1)) * P(WN|TN,interior(TN-1))
P(WN|TN,interior(TN-1))
  ~ lambda(TN,interior(TN-1)) * P_ml(WN|TN,interior(TN-1))
    + (1 - lambda(TN,interior(TN-1))) * P(WN|TN)
P(WN|TN)
  ~ lambda(TN) * P_ml(WN|TN)
    + (1 - lambda(TN)) * UNIFORM_ESTIMATE
The last step is degenerate in that SUM_W P(W|T) =
INFINITY, because there are infinitely many possible tokens,
each of which is assigned the uniform estimate. To fix this, a
model of character sequences would be needed that ensures SUM_W
P(W|T) = 1.0. (The steps for the final uniform estimate
are handled by the compiled estimator.)
The tag estimator is smoothed by:
P(TN|interior(TN-1),WN-1,WN-2)
  ~ lambda(interior(TN-1),WN-1,WN-2) * P_ml(TN|interior(TN-1),WN-1,WN-2)
    + (1 - lambda(interior(TN-1),WN-1,WN-2)) * P(TN|interior(TN-1),WN-1)
P(TN|interior(TN-1),WN-1)
  ~ lambda(interior(TN-1),WN-1) * P_ml(TN|interior(TN-1),WN-1)
    + (1 - lambda(interior(TN-1),WN-1)) * P_ml(TN|interior(TN-1))
Note that the smoothing stops at estimating a tag in terms
of the previous tag. This guarantees that only bigram tag
sequences seen in the training data receive non-zero probability
under the estimator.
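The Witten-Bell interpolation used in both models above can be sketched with counts over (context, outcome) pairs. This WittenBell class is a hypothetical illustration using the basic form lambda = n / (n + t), where n is the context count and t the number of distinct outcomes seen in that context; LingPipe's actual implementation additionally exposes a configurable interpolation parameter:

```java
import java.util.HashMap;
import java.util.Map;

public class WittenBell {
    private final Map<String, Integer> contextCount = new HashMap<>();
    private final Map<String, Integer> distinctOutcomes = new HashMap<>();
    private final Map<String, Integer> jointCount = new HashMap<>();

    public void observe(String context, String outcome) {
        contextCount.merge(context, 1, Integer::sum);
        String joint = context + "\u0000" + outcome;
        if (jointCount.merge(joint, 1, Integer::sum) == 1)
            distinctOutcomes.merge(context, 1, Integer::sum);
    }

    // Witten-Bell weight for the maximum-likelihood estimate in a context.
    public double lambda(String context) {
        int n = contextCount.getOrDefault(context, 0);
        int t = distinctOutcomes.getOrDefault(context, 0);
        return n == 0 ? 0.0 : (double) n / (n + t);
    }

    // Smoothed estimate: lambda * P_ml + (1 - lambda) * backoff estimate,
    // mirroring the interpolation equations above.
    public double estimate(String context, String outcome, double backoff) {
        int n = contextCount.getOrDefault(context, 0);
        if (n == 0) return backoff;
        int c = jointCount.getOrDefault(context + "\u0000" + outcome, 0);
        double ml = (double) c / n;
        double lam = lambda(context);
        return lam * ml + (1.0 - lam) * backoff;
    }
}
```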
Sequences of training pairs are added via the handle(Chunking) method.
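A typical training workflow can be sketched as follows. This is a minimal sketch assuming LingPipe's standard companion classes (IndoEuropeanTokenCategorizer, IndoEuropeanTokenizerFactory, ChunkingImpl, ChunkFactory, AbstractExternalizable); check the names against your distribution. The text, span, chunk type, and model file name are illustrative:

```java
import java.io.File;

import com.aliasi.chunk.ChunkFactory;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.ChunkingImpl;
import com.aliasi.chunk.TrainTokenShapeChunker;
import com.aliasi.tokenizer.IndoEuropeanTokenCategorizer;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

public class TrainChunkerExample {
    public static void main(String[] args) throws Exception {
        TrainTokenShapeChunker trainer =
            new TrainTokenShapeChunker(IndoEuropeanTokenCategorizer.CATEGORIZER,
                                       IndoEuropeanTokenizerFactory.INSTANCE);

        // One training chunking: character span [0,8) is a PERSON.
        ChunkingImpl chunking = new ChunkingImpl("John Doe ran home.");
        chunking.add(ChunkFactory.createChunk(0, 8, "PERSON"));
        trainer.handle(chunking); // repeat for each training chunking

        // Compile in memory; the compiled object is a runtime Chunker.
        Chunker chunker = (Chunker) AbstractExternalizable.compile(trainer);

        // Or serialize the compiled model to a file for later use.
        AbstractExternalizable.compileTo(trainer, new File("chunker.model"));
    }
}
```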
| Constructor and Description |
|---|
| TrainTokenShapeChunker(TokenCategorizer categorizer, TokenizerFactory factory) Construct a trainer for a token/shape chunker based on the specified token categorizer and tokenizer factory. |
| TrainTokenShapeChunker(TokenCategorizer categorizer, TokenizerFactory factory, int knownMinTokenCount, int minTokenCount, int minTagCount) Construct a trainer for a token/shape chunker based on the specified token categorizer, tokenizer factory and numerical parameters. |
| Modifier and Type | Method and Description |
|---|---|
| void | compileTo(ObjectOutput objOut) Compiles a chunker based on the training data received by this trainer to the specified object output. |
| void | handle(Chunking chunking) Add the specified chunking as a training event. |
public TrainTokenShapeChunker(TokenCategorizer categorizer, TokenizerFactory factory)

Construct a trainer for a token/shape chunker based on the specified token categorizer and tokenizer factory. This constructor sets defaults of 4.0, the number of tokens to 3,000,000, the known minimum token count to 8, and the min tag and token count for pruning to 1.

Parameters:
categorizer - Token categorizer for unknown tokens.
factory - Tokenizer factory for creating tokenizers.

public TrainTokenShapeChunker(TokenCategorizer categorizer, TokenizerFactory factory, int knownMinTokenCount, int minTokenCount, int minTagCount)

Construct a trainer for a token/shape chunker based on the specified token categorizer, tokenizer factory and numerical parameters.

Parameters:
categorizer - Token categorizer for unknown tokens.
factory - Tokenizer factory for tokenizing data.
knownMinTokenCount - Number of instances required for a token to count as known for unknown-token training.
minTokenCount - Minimum token count for token contexts to survive pruning.
minTagCount - Minimum count for tag contexts to survive pruning.

public void handle(Chunking chunking)

Add the specified chunking as a training event.

Specified by:
handle in interface ObjectHandler<Chunking>

Parameters:
chunking - Chunking for training.

public void compileTo(ObjectOutput objOut) throws IOException

Compiles a chunker based on the training data received by this trainer to the specified object output.

Specified by:
compileTo in interface Compilable

Parameters:
objOut - Object output to which the chunker is written.

Throws:
IOException - If there is an underlying I/O error.

Copyright © 2016 Alias-i, Inc. All rights reserved.