public class TokenizerME extends Object
 This tokenizer uses a statistical model to tokenize text, reproducing
 the tokenization observed in the training data used to create the model.
 The TokenizerModel class encapsulates the model and provides
 methods to create it from its binary representation.
 
 A tokenizer instance is not thread-safe. Each thread must instantiate
 its own tokenizer; the instances can share one TokenizerModel instance
 to save memory.
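
 The threading rule above can be sketched as follows. This is a minimal illustration rather than OpenNLP code: SharedModel and StatefulTokenizer are hypothetical stand-ins for TokenizerModel and TokenizerME, so that the example compiles without OpenNLP on the classpath, and the whitespace splitting is only a placeholder for real tokenization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerThreadTokenizerSketch {

    // Hypothetical stand-in for TokenizerModel: immutable, so one instance can be shared.
    static final class SharedModel {
        final String name;
        SharedModel(String name) { this.name = name; }
    }

    // Hypothetical stand-in for TokenizerME: keeps mutable state between calls,
    // so one instance must never be used by two threads at once.
    static final class StatefulTokenizer {
        private final SharedModel model;
        private final List<String> lastTokens = new ArrayList<>();

        StatefulTokenizer(SharedModel model) { this.model = model; }

        String[] tokenize(String text) {
            lastTokens.clear();
            for (String token : text.trim().split("\\s+")) {
                lastTokens.add(token);
            }
            return lastTokens.toArray(new String[0]);
        }
    }

    // One model for the whole process.
    static final SharedModel MODEL = new SharedModel("shared");

    // One tokenizer per thread, created lazily on first use in that thread.
    static final ThreadLocal<StatefulTokenizer> TOKENIZER =
            ThreadLocal.withInitial(() -> new StatefulTokenizer(MODEL));

    // Tokenizes every text on a fixed-size pool; results keep the input order.
    static List<String[]> tokenizeAll(List<String> texts, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String[]>> futures = new ArrayList<>();
            for (String text : texts) {
                futures.add(pool.submit(() -> TOKENIZER.get().tokenize(text)));
            }
            List<String[]> results = new ArrayList<>();
            for (Future<String[]> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

 With OpenNLP available, the same shape would presumably use ThreadLocal.withInitial(() -> new TokenizerME(model)) around a single loaded TokenizerModel.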
 
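 To train a new model, the train(String, ObjectStream, boolean, TrainingParameters) method
 can be used. That overload is deprecated in favor of
 train(ObjectStream, TokenizerFactory, TrainingParameters).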
 
Sample usage:
 
 InputStream modelIn;
 
 ...
 
 TokenizerModel model = new TokenizerModel(modelIn);
 
 Tokenizer tokenizer = new TokenizerME(model);
 
 String[] tokens = tokenizer.tokenize("A sentence to be tokenized.");
 
See Also:
 Tokenizer, TokenizerModel, TokenSample

| Modifier and Type | Field and Description |
|---|---|
| static Pattern | alphaNumeric - Deprecated. As of release 1.5.2, replaced by Factory.getAlphanumeric(String). |
| static String | NO_SPLIT - Constant indicates no token split. |
| static String | SPLIT - Constant indicates a token split. |

| Constructor and Description | 
|---|
| TokenizerME(TokenizerModel model) | 
| TokenizerME(TokenizerModel model, Factory factory) - Deprecated. Use TokenizerFactory to extend the Tokenizer functionality. |

| Modifier and Type | Method and Description | 
|---|---|
| double[] | getTokenProbabilities() - Returns the probabilities associated with the most recent calls to AbstractTokenizer.tokenize(String) or tokenizePos(String). |
| String[] | tokenize(String s) - Splits a string into its atomic parts. |
| Span[] | tokenizePos(String d) - Tokenizes the string. |
| static TokenizerModel | train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams) - Trains a model for the TokenizerME. |
| static TokenizerModel | train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization) - Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory. |
| static TokenizerModel | train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization, TrainingParameters mlParams) - Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory. |
| static TokenizerModel | train(String languageCode, ObjectStream<TokenSample> samples, Dictionary abbreviations, boolean useAlphaNumericOptimization, TrainingParameters mlParams) - Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory. |
| boolean | useAlphaNumericOptimization() - Returns the value of the alpha-numeric optimization flag. |
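
The relationship between tokenize(String) and tokenizePos(String) listed above can be illustrated with a small sketch. It assumes that each Span holds a start-inclusive, end-exclusive pair of character offsets into the input string and that the token strings are the substrings those offsets cover. Offsets is a hypothetical stand-in for Span, and the whitespace-only splitting is a placeholder for the statistical model.

```java
import java.util.ArrayList;
import java.util.List;

class SpanOffsetsSketch {

    // Hypothetical stand-in for Span: character offsets into the input text.
    static final class Offsets {
        final int start; // inclusive
        final int end;   // exclusive
        Offsets(int start, int end) { this.start = start; this.end = end; }
    }

    // Stand-in for tokenizePos(String): reports where each token sits in the input.
    static List<Offsets> tokenizePos(String text) {
        List<Offsets> spans = new ArrayList<>();
        int start = -1; // -1 means "not inside a token"
        for (int i = 0; i < text.length(); i++) {
            boolean whitespace = Character.isWhitespace(text.charAt(i));
            if (!whitespace && start < 0) {
                start = i;
            } else if (whitespace && start >= 0) {
                spans.add(new Offsets(start, i));
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new Offsets(start, text.length()));
        }
        return spans;
    }

    // Stand-in for tokenize(String): the text covered by each span.
    static String[] tokenize(String text) {
        List<Offsets> spans = tokenizePos(text);
        String[] tokens = new String[spans.size()];
        for (int i = 0; i < tokens.length; i++) {
            Offsets span = spans.get(i);
            tokens[i] = text.substring(span.start, span.end);
        }
        return tokens;
    }
}
```

Offsets are useful when the original text must stay intact, for example to highlight tokens in place. Under the same assumption, getTokenProbabilities() would return one value per token, at the same index as the token it belongs to.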

public static final String SPLIT

public static final String NO_SPLIT

@Deprecated public static final Pattern alphaNumeric
 Deprecated. As of release 1.5.2, replaced by Factory.getAlphanumeric(String).

public TokenizerME(TokenizerModel model)

public TokenizerME(TokenizerModel model, Factory factory)
 Deprecated. Use TokenizerFactory to extend the Tokenizer functionality.

public double[] getTokenProbabilities()
 Returns the probabilities associated with the most recent calls to
 AbstractTokenizer.tokenize(String) or tokenizePos(String).

public Span[] tokenizePos(String d)
 Tokenizes the string.
 Parameters:
  d - The string to be tokenized.

public static TokenizerModel train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams) throws IOException
 Trains a model for the TokenizerME.
 Parameters:
  samples - the samples used for the training.
  factory - a TokenizerFactory to get resources from
  mlParams - the machine learning train parameters
 Returns:
  the trained TokenizerModel
 Throws:
  IOException - if an IOException is thrown during IO operations on a temp file
  which is created during training, or if reading from the ObjectStream fails.

public static TokenizerModel train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization, TrainingParameters mlParams) throws IOException
 Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters)
 and pass in a TokenizerFactory.
 Trains a model for the TokenizerME.
 Parameters:
  languageCode - the language of the natural text
  samples - the samples used for the training.
  useAlphaNumericOptimization - if true, alpha numerics are skipped
  mlParams - the machine learning train parameters
 Returns:
  the trained TokenizerModel
 Throws:
  IOException - if an IOException is thrown during IO operations on a temp file
  which is created during training, or if reading from the ObjectStream fails.

public static TokenizerModel train(String languageCode, ObjectStream<TokenSample> samples, Dictionary abbreviations, boolean useAlphaNumericOptimization, TrainingParameters mlParams) throws IOException
 Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters)
 and pass in a TokenizerFactory.
 Trains a model for the TokenizerME.
 Parameters:
  languageCode - the language of the natural text
  samples - the samples used for the training.
  abbreviations - an abbreviations dictionary
  useAlphaNumericOptimization - if true, alpha numerics are skipped
  mlParams - the machine learning train parameters
 Returns:
  the trained TokenizerModel
 Throws:
  IOException - if an IOException is thrown during IO operations on a temp file
  which is created during training, or if reading from the ObjectStream fails.

public static TokenizerModel train(String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization) throws IOException, ObjectStreamException
 Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters)
 and pass in a TokenizerFactory.
 Trains a model for the TokenizerME with a default cutoff of 5 and 100 iterations.
 Parameters:
  languageCode - the language of the natural text
  samples - the samples used for the training.
  useAlphaNumericOptimization - if true, alpha numerics are skipped
 Returns:
  the trained TokenizerModel
 Throws:
  IOException - if an IOException is thrown during IO operations on a temp file
  which is created during training.
  ObjectStreamException - if reading from the ObjectStream fails

public boolean useAlphaNumericOptimization()
 Returns the value of the alpha-numeric optimization flag.
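
The effect of the alpha-numeric optimization flag can be pictured with the sketch below. It rests on assumptions not stated on this page: that the tokenizer first cuts the text at whitespace, that a SPLIT or NO_SPLIT decision is then made at each position inside a chunk, and that "alpha numerics are skipped" means purely alphanumeric chunks are accepted whole without asking the model. The ALPHANUMERIC pattern and the splitBefore rule are hypothetical stand-ins, not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AlphaNumericOptimizationSketch {

    // Assumed definition of "alpha numeric"; the real pattern comes from the factory.
    static final Pattern ALPHANUMERIC = Pattern.compile("^[A-Za-z0-9]+$");

    // Counts how often the stand-in split decision was consulted.
    int decisionsAsked = 0;

    // Hypothetical stand-in for the statistical model: split wherever a
    // letter or digit meets a character that is neither.
    boolean splitBefore(String chunk, int i) {
        decisionsAsked++;
        return Character.isLetterOrDigit(chunk.charAt(i))
                != Character.isLetterOrDigit(chunk.charAt(i - 1));
    }

    List<String> tokenize(String text, boolean useAlphaNumericOptimization) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue;
            }
            // With the flag set, purely alphanumeric chunks bypass the decision step.
            if (useAlphaNumericOptimization && ALPHANUMERIC.matcher(chunk).matches()) {
                tokens.add(chunk);
                continue;
            }
            int start = 0;
            for (int i = 1; i < chunk.length(); i++) {
                if (splitBefore(chunk, i)) {
                    tokens.add(chunk.substring(start, i));
                    start = i;
                }
            }
            tokens.add(chunk.substring(start));
        }
        return tokens;
    }
}
```

In this sketch the flag changes only the amount of work, not the result: tokenizing "A sentence to be tokenized." yields the same tokens either way, but far fewer split decisions are evaluated with the flag set. With a trained model the outputs could differ, since a model may choose to split inside an alphanumeric chunk.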
Copyright © 2015 The Apache Software Foundation. All rights reserved.