Class TokenizerME
- java.lang.Object
-
- opennlp.tools.tokenize.TokenizerME
-
- All Implemented Interfaces:
Tokenizer
public class TokenizerME extends Object
A Tokenizer for converting raw text into separated tokens. It uses Maximum Entropy to make its decisions. The features are loosely based on Jeff Reynar's UPenn thesis "Topic Segmentation: Algorithms and Applications", which is available from his homepage: http://www.cis.upenn.edu/~jcreynar.

This implementation needs a statistical model to tokenize a text. The tokenizer reproduces the tokenization observed in the training data used to create the model. The TokenizerModel class encapsulates that model and provides methods to create it from the binary representation.

A tokenizer instance is not thread-safe. Each thread must instantiate its own tokenizer; all of them can share one TokenizerModel instance to save memory.

To train a new model, the train(ObjectStream, TokenizerFactory, TrainingParameters) method can be used.

Sample usage:

InputStream modelIn;
...
TokenizerModel model = new TokenizerModel(modelIn);
Tokenizer tokenizer = new TokenizerME(model);
String[] tokens = tokenizer.tokenize("A sentence to be tokenized.");

See Also:
Tokenizer, TokenizerModel, TokenSample
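The one-tokenizer-per-thread rule described above can be sketched with the standard ThreadLocal class. This is a minimal illustration, not part of OpenNLP: the PerThread helper is a hypothetical name, and it is written generically so that it runs without the library on the classpath.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not an OpenNLP class): hands each thread its own
 * instance of an object that is not thread-safe.
 */
final class PerThread<T> {
    private final ThreadLocal<T> local;

    PerThread(Supplier<T> factory) {
        // The factory runs at most once per thread, on that thread's first get().
        this.local = ThreadLocal.withInitial(factory);
    }

    /** Returns the calling thread's private instance. */
    T get() {
        return local.get();
    }
}
```

With a shared TokenizerModel named model, the helper would be created as new PerThread<>(() -> new TokenizerME(model)), so that only the lightweight tokenizer is duplicated per thread while the model is loaded once.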
-
-
Field Summary
Fields
Modifier and Type | Field | Description
static Pattern | alphaNumeric | Deprecated. As of release 1.5.2, replaced by Factory.getAlphanumeric(String)
static String | NO_SPLIT | Constant indicates no token split.
static String | SPLIT | Constant indicates a token split.
-
Constructor Summary
Constructors
Constructor | Description
TokenizerME(String language) | Initializes a TokenizerME by downloading a default model.
TokenizerME(TokenizerModel model) | Instantiates a TokenizerME with an existing TokenizerModel.
TokenizerME(TokenizerModel model, Factory factory) | Deprecated. Use TokenizerFactory to extend the Tokenizer functionality.
-
Method Summary
Modifier and Type | Method | Description
double[] | getTokenProbabilities() |
void | setKeepNewLines(boolean keepNewLines) | Switches whether to keep new lines or not.
String[] | tokenize(String s) | Splits a string into its atomic parts.
Span[] | tokenizePos(String d) | Tokenizes the string.
static TokenizerModel | train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams) | Trains a model for the TokenizerME.
boolean | useAlphaNumericOptimization() |
-
-
-
Field Detail
-
SPLIT
public static final String SPLIT
Constant indicates a token split.
See Also:
Constant Field Values
-
NO_SPLIT
public static final String NO_SPLIT
Constant indicates no token split.
See Also:
Constant Field Values
-
alphaNumeric
@Deprecated public static final Pattern alphaNumeric
Deprecated. As of release 1.5.2, replaced by Factory.getAlphanumeric(String)
Alpha-Numeric Pattern
-
-
Constructor Detail
-
TokenizerME
public TokenizerME(String language) throws IOException
Initializes a TokenizerME by downloading a default model.
Parameters:
language - The language of the tokenizer.
Throws:
IOException - Thrown if the model cannot be downloaded or saved.
-
TokenizerME
public TokenizerME(TokenizerModel model)
Instantiates a TokenizerME with an existing TokenizerModel.
Parameters:
model - The TokenizerModel to be used.
-
TokenizerME
@Deprecated public TokenizerME(TokenizerModel model, Factory factory)
Deprecated. Use TokenizerFactory to extend the Tokenizer functionality.
-
-
Method Detail
-
getTokenProbabilities
public double[] getTokenProbabilities()
Returns:
The probabilities associated with the most recent call to Tokenizer.tokenize(String) or tokenizePos(String). If not applicable, an empty array is returned.
-
tokenizePos
public Span[] tokenizePos(String d)
Tokenizes the string.
Parameters:
d - The string to be tokenized.
Returns:
A Span array containing individual tokens as elements.
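To illustrate why a method would return spans rather than strings, the sketch below records character offsets for each token and recovers the text on demand. It assumes that a span is a pair of offsets into the input, start inclusive and end exclusive. The Offsets class and the whitespaceSpans method are hypothetical stand-ins written for this example; they are neither the OpenNLP Span class nor the maximum entropy tokenizer, and they split on whitespace only.

```java
import java.util.ArrayList;
import java.util.List;

final class SpanSketch {

    /** Hypothetical stand-in for a span: start inclusive, end exclusive. */
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }

        /** Recovers the token text from the original input. */
        String coveredText(CharSequence text) {
            return text.subSequence(start, end).toString();
        }
    }

    /** Splits on whitespace only, recording offsets instead of copying substrings. */
    static List<Offsets> whitespaceSpans(String text) {
        List<Offsets> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i;                        // token begins
            } else if (ws && start >= 0) {
                spans.add(new Offsets(start, i)); // token ends before the whitespace
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new Offsets(start, text.length())); // trailing token
        }
        return spans;
    }
}
```

Because offsets refer back to the original string, a caller can highlight or annotate tokens in place, which is not possible from a plain String array once whitespace has been discarded.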
-
train
public static TokenizerModel train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams) throws IOException
Trains a model for the TokenizerME.
Parameters:
samples - The samples used for the training.
factory - A TokenizerFactory to get resources from.
mlParams - The machine learning training parameters.
Returns:
A trained TokenizerModel.
Throws:
IOException - Thrown during IO operations on a temp file which is created during training, or if reading from the ObjectStream fails.
-
useAlphaNumericOptimization
public boolean useAlphaNumericOptimization()
Returns:
true if the tokenizer uses alphanumeric optimization, false otherwise.
-
tokenize
public String[] tokenize(String s)
Description copied from interface: Tokenizer
Splits a string into its atomic parts.
-
setKeepNewLines
public void setKeepNewLines(boolean keepNewLines)
Switches whether to keep new lines or not.
Parameters:
keepNewLines - true if new lines are kept, false otherwise.
-
-