public class IndoEuropeanTokenizerFactory extends Object implements TokenizerFactory, Serializable
An IndoEuropeanTokenizerFactory creates tokenizers
with built-in support for alphanumerics, numbers, and other
common constructs in Indo-European languages.
The tokenization rules are roughly based on those used in MUC-6, but are necessarily finer grained, because the MUC tokenizers were based on lexical and semantic information such as whether a string was an abbreviation.
A token is any sequence of characters satisfying one of the following patterns.
| Pattern | Description |
|---|---|
| AlphaNumeric | Any sequence of upper or lowercase letters or digits, as defined by Character.isDigit(char) and Character.isLetter(char), and including the Devanagari characters (unicode 0x0900 to 0x097F) |
| Numerical | Any sequence of numbers, commas, and periods. |
| Hyphen Sequence | Any number of hyphens (-) |
| Equals Sequence | Any number of equals signs (=) |
| Double Quotes | Double forward quotes (``) or double backward quotes ('') |

Whitespaces are defined as any sequence of whitespace characters, including the unicode non-breakable space (unicode 160). The tokenizer operates in a longest-leftmost
fashion, returning the longest possible token starting at the
current position in the underlying character array.
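The longest-leftmost rule can be illustrated with a small self-contained sketch. This is not the LingPipe implementation: the class name `LongestLeftmostSketch` and its helper methods are hypothetical, only the patterns listed in the table are modeled, and treating any other non-whitespace character as a one-character token is an assumption made for the sake of the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified sketch of longest-leftmost tokenization (hypothetical, not LingPipe code). */
final class LongestLeftmostSketch {

    // Character.isWhitespace excludes the non-breakable space (unicode 160), so check it explicitly.
    static boolean isSpace(char c) {
        return Character.isWhitespace(c) || c == 0x00A0;
    }

    static boolean isAlphaNumeric(char c) {
        return Character.isLetter(c) || Character.isDigit(c)
            || (c >= 0x0900 && c <= 0x097F); // Devanagari block
    }

    static boolean isNumerical(char c) {
        return Character.isDigit(c) || c == ',' || c == '.';
    }

    // Length of the run of the single character c starting at pos (0 if ch[pos] != c).
    static int runOf(char[] ch, int pos, int end, char c) {
        int i = pos;
        while (i < end && ch[i] == c) ++i;
        return i - pos;
    }

    // Length of the longest token that any pattern can match starting at pos.
    static int matchLength(char[] ch, int pos, int end) {
        int i = pos;
        while (i < end && isAlphaNumeric(ch[i])) ++i;
        int best = i - pos;                                  // AlphaNumeric

        i = pos;
        while (i < end && isNumerical(ch[i])) ++i;
        best = Math.max(best, i - pos);                      // Numerical

        best = Math.max(best, runOf(ch, pos, end, '-'));     // Hyphen Sequence
        best = Math.max(best, runOf(ch, pos, end, '='));     // Equals Sequence

        // Double Quotes: two forward quotes (``) or two backward quotes ('')
        if (pos + 1 < end && ch[pos] == ch[pos + 1]
                && (ch[pos] == '`' || ch[pos] == '\'')) {
            best = Math.max(best, 2);
        }

        // Assumption: any other character forms a token of length one.
        return Math.max(best, 1);
    }

    /** Tokenizes the slice ch[start .. start+length), skipping whitespace between tokens. */
    static List<String> tokenize(char[] ch, int start, int length) {
        List<String> tokens = new ArrayList<>();
        int end = start + length;
        int pos = start;
        while (pos < end) {
            if (isSpace(ch[pos])) {
                ++pos;
                continue;
            }
            int len = matchLength(ch, pos, end);
            tokens.add(new String(ch, pos, len));
            pos += len;
        }
        return tokens;
    }
}
```

In this sketch, the input `x=1,000.5 --- ``ok''` yields the tokens `x`, `=`, `1,000.5`, `---`, ` `` `, `ok`, and `''`. At the character `1`, the AlphaNumeric pattern matches only one character while the Numerical pattern matches seven, so the longer match wins. The real tokenizer may differ in edge cases not covered by the description above.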
Instances of this factory are available through the static singleton INSTANCE. There is no public constructor provided.
The serialized versions of this class deserialize to the
same singleton as produced by INSTANCE.
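One common way to make a serializable class deserialize to its singleton is a `readResolve` method. The sketch below shows that general Java technique under the assumption that something similar is used here; the source does not say how the class achieves it, and `SingletonFactorySketch` and `roundTrip` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical serializable singleton; illustrates readResolve, not LingPipe internals. */
final class SingletonFactorySketch implements Serializable {
    private static final long serialVersionUID = 1L;

    static final SingletonFactorySketch INSTANCE = new SingletonFactorySketch();

    private SingletonFactorySketch() { }

    // Invoked by Java serialization after an object is read; the freshly
    // deserialized copy is discarded and the singleton is returned instead.
    private Object readResolve() {
        return INSTANCE;
    }

    /** Serializes obj to a byte array and reads it back. */
    static Object roundTrip(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }
}
```

With this pattern, `SingletonFactorySketch.roundTrip(SingletonFactorySketch.INSTANCE) == SingletonFactorySketch.INSTANCE` holds, so reference comparisons against the singleton remain valid after deserialization.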
| Modifier and Type | Field and Description |
|---|---|
| static IndoEuropeanTokenizerFactory | INSTANCE The singleton instance of an Indo-European tokenizer factory. |
| Constructor and Description |
|---|
| IndoEuropeanTokenizerFactory() Construct a tokenizer factory for Indo-European languages. |
| Modifier and Type | Method and Description |
|---|---|
| Tokenizer | tokenizer(char[] ch, int start, int length) Returns an Indo-European tokenizer for the specified subsequence of characters. |
| String | toString() Returns the name of this class. |
public static final IndoEuropeanTokenizerFactory INSTANCE
public IndoEuropeanTokenizerFactory()
Implementation Note: All Indo-European tokenizer
factories behave the same way, and they are thread safe, so the
constant INSTANCE may be used anywhere a freshly
constructed Indo-European tokenizer factory is used, without loss
of performance.
public Tokenizer tokenizer(char[] ch, int start, int length)
Specified by: tokenizer in interface TokenizerFactory
Parameters:
ch - Characters to tokenize.
start - Index of first character to tokenize.
length - Number of characters to tokenize.
Copyright © 2019 Alias-i, Inc. All rights reserved.