public class RegExTokenizerFactory extends Object implements Serializable, TokenizerFactory
A RegExTokenizerFactory is a tokenizer factory defined by a
regular expression. The regular expression is supplied as an
instance of Pattern, and matching is carried out with the
java.util.regex package. The pattern provided when the factory is
constructed is used to create instances of Matcher for use in
tokenizers. The method Matcher.find(int) is called to find the
next token in an input sequence.
For instance, consider a regular expression which takes a token to be a sequence of alphabetic characters, a sequence of numeric characters, or a single non-alphanumeric character:
[a-zA-Z]+|[0-9]+|\S
This can be used to construct a tokenizer factory:
String regex = "[a-zA-Z]+|[0-9]+|\\S";
TokenizerFactory tf = new RegExTokenizerFactory(regex);
char[] cs = "abc de 123. ".toCharArray();
Tokenizer tokenizer = tf.tokenizer(cs,0,cs.length);
Note that the backslash character (\) in the regular expression
must itself be escaped with a backslash (\) in the Java string
regex, resulting in \\. None of the disjuncts in the regular
expression matches a space, because the matched tokens should not
contain whitespace. Finally, note the use of Kleene plus (+)
rather than Kleene star (*) to ensure that tokens are at least a
single character long. In fact, the constructor will throw an
exception if the pattern matches the empty string.
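The difference between Kleene plus and Kleene star can be checked with java.util.regex alone. The following is a minimal sketch, not the check the constructor itself performs: the class EmptyMatchCheck and the method matchesEmpty are hypothetical names introduced here for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the documented API.
final class EmptyMatchCheck {

    /** Returns true if the given regular expression matches the empty string. */
    static boolean matchesEmpty(String regex) {
        return Pattern.compile(regex).matcher("").matches();
    }

    public static void main(String[] args) {
        // Kleene plus: every disjunct needs at least one character.
        System.out.println(matchesEmpty("[a-zA-Z]+|[0-9]+|\\S"));
        // Kleene star in the first disjunct: the empty string is a match.
        System.out.println(matchesEmpty("[a-zA-Z]*|[0-9]+|\\S"));
    }
}
```

A pattern for which matchesEmpty returns true is the kind of pattern the text above says the constructor rejects.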
For the input "abc de 123. ", the tokenizer constructed above returns the following tokens, whitespaces and token start offsets:
whitespaces: "", " ", " ", "", " "
tokens: "abc", "de", "123", "."
token starts: 0, 4, 7, 10
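The listing above can be reproduced with java.util.regex alone. The sketch below is an illustration of a Matcher.find(int) scanning loop under stated assumptions, not the library's implementation: the class RegexTokenSketch, its Result holder and the scan method are hypothetical names, and the text between consecutive matches is simply recorded as the whitespace entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; does not use or reproduce the library's classes.
final class RegexTokenSketch {

    /** Tokens, their start offsets, and the character runs around them. */
    static final class Result {
        final List<String> tokens = new ArrayList<>();
        final List<Integer> starts = new ArrayList<>();
        final List<String> whitespaces = new ArrayList<>();
    }

    /**
     * Scans the text with repeated calls to Matcher.find(int).
     * Assumes the regex never matches the empty string; otherwise
     * the loop would not advance.
     */
    static Result scan(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        Result result = new Result();
        int pos = 0; // end offset of the previous token
        while (pos < text.length() && matcher.find(pos)) {
            // Characters skipped before the match precede the token.
            result.whitespaces.add(text.substring(pos, matcher.start()));
            result.tokens.add(matcher.group());
            result.starts.add(matcher.start());
            pos = matcher.end();
        }
        // Whatever remains after the last token is the final entry.
        result.whitespaces.add(text.substring(pos));
        return result;
    }

    public static void main(String[] args) {
        Result r = scan("[a-zA-Z]+|[0-9]+|\\S", "abc de 123. ");
        System.out.println("tokens: " + r.tokens);
        System.out.println("token starts: " + r.starts);
        System.out.println("whitespaces: " + r.whitespaces);
    }
}
```

There is always one more whitespace entry than there are tokens, because an entry is recorded before the first token and after the last one, even when it is empty.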
A regular-expression-based tokenizer factory is completely thread safe.
A regular-expression-based tokenizer factory may be serialized.
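The backing java.util.regex.Pattern is itself Serializable, which is one ingredient of a serializable factory. The sketch below round-trips a Pattern through Java object serialization; it is an assumption-laden illustration that does not touch the factory class, and PatternSerializationSketch and roundTrip are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.regex.Pattern;

// Hypothetical illustration of serializing the underlying Pattern only.
final class PatternSerializationSketch {

    /** Serializes the pattern to a byte array and reads it back. */
    static Pattern roundTrip(Pattern pattern) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(pattern);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Pattern) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Pattern round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Pattern original = Pattern.compile("[a-zA-Z]+|[0-9]+|\\S");
        Pattern copy = roundTrip(original);
        // The copy carries the same regular expression and flags.
        System.out.println(copy.pattern().equals(original.pattern()));
        System.out.println(copy.flags() == original.flags());
    }
}
```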
| Constructor | Description |
|---|---|
| RegExTokenizerFactory(Pattern pattern) | Construct a regular expression tokenizer factory with the specified pattern for matching. |
| RegExTokenizerFactory(String regex) | Construct a regular expression tokenizer factory using the specified regular expression for matching. |
| RegExTokenizerFactory(String regex, int flags) | Construct a regular expression tokenizer factory using the specified regular expression for matching according to the specified flags. |
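The flags accepted by the two-argument constructor are the match flags of java.util.regex.Pattern, which are combined with the bitwise OR operator. The sketch below shows the combination using Pattern.compile(String,int) directly; the class name FlagsSketch and the helper caseInsensitive are hypothetical, and passing the same flags value to the constructor is assumed to behave equivalently.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of combining match flags.
final class FlagsSketch {

    /** Compiles the regex with case-insensitive matching, including Unicode case folding. */
    static Pattern caseInsensitive(String regex) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        Pattern pattern = caseInsensitive("[a-z]+|[0-9]+|\\S");
        // With CASE_INSENSITIVE set, [a-z]+ also matches upper-case letters.
        System.out.println(pattern.matcher("ABC").matches());
        // The flag bits can be read back from the compiled pattern.
        System.out.println((pattern.flags() & Pattern.CASE_INSENSITIVE) != 0);
    }
}
```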
| Modifier and Type | Method | Description |
|---|---|---|
| Pattern | pattern() | Returns the regular expression pattern backing this tokenizer factory. |
| Tokenizer | tokenizer(char[] cs, int start, int length) | Returns a tokenizer for the specified subsequence of characters. |
| String | toString() | Return a description of this regex-based tokenizer factory including its pattern's regular expression and flags. |
public RegExTokenizerFactory(String regex)
Parameters:
regex - The regular expression.
Throws:
PatternSyntaxException - If the expression's syntax is invalid.

public RegExTokenizerFactory(String regex, int flags)
The flags argument is formed as the bitwise OR ("|") of the
following flags: Pattern.CASE_INSENSITIVE, Pattern.MULTILINE, Pattern.DOTALL, Pattern.UNICODE_CASE and Pattern.CANON_EQ.
See Pattern.compile(String,int) for more information.
Parameters:
regex - The regular expression.
flags - The match flags.
Throws:
PatternSyntaxException - If the expression's syntax is invalid.
IllegalArgumentException - If bit values other than those corresponding to defined match flags are set in the flags.

public RegExTokenizerFactory(Pattern pattern)
Parameters:
pattern - Pattern to use for matching.

public Pattern pattern()

public Tokenizer tokenizer(char[] cs, int start, int length)
Specified by:
tokenizer in interface TokenizerFactory
Parameters:
cs - Characters to tokenize.
start - Index of first character to tokenize.
length - Number of characters to tokenize.

Copyright © 2019 Alias-i, Inc. All rights reserved.