public abstract class Tokenizer extends Object implements Iterable<String>
Tokenizer serves as a base for tokenizer
implementations, which provide streams of tokens, whitespaces,
and positions.
A tokenizer acts as an iterator over both space and token
streams. The next space is returned through nextWhitespace(), and the next token through nextToken(). Some tokenizers may implement lastTokenStartPosition(), which returns the offset of the
previous token's first character in an underlying character stream.
Tokenizers implement the Iterable interface to allow
easy iteration over just the tokens using for-each loops.
The entire underlying character sequence may be reconstructed by
alternating the next whitespace and next token, beginning with the
first whitespace, until both streams are exhausted. Offsets
returned by lastTokenStartPosition() are not guaranteed to
be offsets into this reconstructed sequence of characters.
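The alternation described above can be sketched as follows. This is a minimal illustration, not library code: SketchTokenizer, SpaceSplitTokenizer, and Reconstruct are hypothetical stand-ins that mirror only the nextToken()/nextWhitespace() contract described on this page, and the end-of-stream handling (stop when nextToken() returns null) is an assumption based on that contract.

```java
// Hypothetical stand-in mirroring the contract described above; not the library class.
abstract class SketchTokenizer {
    // Returns the next token, or null when no tokens remain.
    abstract String nextToken();

    // Returns the whitespace that precedes the next token.
    abstract String nextWhitespace();
}

// Hypothetical tokenizer: tokens are maximal runs of non-whitespace characters.
class SpaceSplitTokenizer extends SketchTokenizer {
    private final String text;
    private int pos = 0;

    SpaceSplitTokenizer(String text) { this.text = text; }

    @Override
    String nextWhitespace() {
        int start = pos;
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
        return text.substring(start, pos);
    }

    @Override
    String nextToken() {
        nextWhitespace(); // flush any whitespace that has not been returned
        if (pos >= text.length()) return null;
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
        return text.substring(start, pos);
    }
}

class Reconstruct {
    // Alternates whitespace and token, starting with whitespace, until tokens run out.
    static String reconstruct(SketchTokenizer tokenizer) {
        StringBuilder sb = new StringBuilder();
        while (true) {
            sb.append(tokenizer.nextWhitespace());
            String token = tokenizer.nextToken();
            if (token == null) break; // the trailing whitespace was already appended
            sb.append(token);
        }
        return sb.toString();
    }
}
```

Under these assumptions, Reconstruct.reconstruct(new SpaceSplitTokenizer(text)) yields text again, because the loop appends one whitespace string before each token and one after the last.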
Concrete subclasses must implement nextToken() to
return the next token. They may override nextWhitespace()
to return the next space string; it is implemented in this class to
return a single space, Strings.SINGLE_SPACE_STRING.
Subclasses may also implement lastTokenStartPosition(),
which otherwise will throw an
UnsupportedOperationException.
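The subclassing contract above can be illustrated with a small sketch. BaseSketch is a hypothetical stand-in for the base class (it is not the library class and reproduces only the defaults described here), and FixedListTokenizer and OffsetTrackingTokenizer are likewise invented for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for the base class contract described above; not the library class.
abstract class BaseSketch {
    // Subclasses must supply tokens; null signals the end of the stream.
    abstract String nextToken();

    // Default: a single space (the page describes this as Strings.SINGLE_SPACE_STRING).
    String nextWhitespace() { return " "; }

    // Default: optional operation, unsupported unless overridden.
    int lastTokenStartPosition() {
        throw new UnsupportedOperationException("offsets not supported");
    }
}

// Minimal subclass: implements only nextToken() and inherits both defaults.
class FixedListTokenizer extends BaseSketch {
    private final Iterator<String> it;

    FixedListTokenizer(String... tokens) { it = Arrays.asList(tokens).iterator(); }

    @Override
    String nextToken() { return it.hasNext() ? it.next() : null; }
}

// Fuller subclass: splits on runs of spaces and tracks the start offset of the last token.
class OffsetTrackingTokenizer extends BaseSketch {
    private final String text;
    private int pos = 0;
    private int lastStart = -1; // -1 until a token has been returned

    OffsetTrackingTokenizer(String text) { this.text = text; }

    @Override
    String nextToken() {
        while (pos < text.length() && text.charAt(pos) == ' ') pos++; // skip spaces
        if (pos >= text.length()) return null;
        lastStart = pos;
        while (pos < text.length() && text.charAt(pos) != ' ') pos++;
        return text.substring(lastStart, pos);
    }

    @Override
    int lastTokenStartPosition() { return lastStart; }
}
```

A caller that needs offsets would have to know, or test, whether the concrete tokenizer supports them, since the inherited default throws rather than returning a sentinel value.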
| Constructor and Description |
|---|
| Tokenizer(): Construct a tokenizer. |
| Modifier and Type | Method and Description |
|---|---|
| Iterator<String> | iterator(): Returns an iterator over the tokens remaining in this tokenizer. |
| int | lastTokenEndPosition(): Returns the offset of one position past the last character of the most recently returned token (optional operation). |
| int | lastTokenStartPosition(): Returns the offset of the first character of the most recently returned token (optional operation). |
| abstract String | nextToken(): Returns the next token in the stream, or null if there are no more tokens. |
| String | nextWhitespace(): Returns the next whitespace. |
| String[] | tokenize(): Returns the remaining tokens in an array of strings. |
| void | tokenize(List<? super String> tokens, List<? super String> whitespaces): Adds the remaining tokens and whitespaces to the specified lists. |
public Iterator<String> iterator()
Returns an iterator over the tokens remaining in this tokenizer.
The returned iterator is not thread safe with respect to the
underlying tokenizer. Specifically, it maintains a handle to
this tokenizer. Calls to the iterator's hasNext() and
next() methods call this tokenizer's
nextToken() method.
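Because the iterator delegates to the tokenizer, consuming tokens through both at once interleaves them. The following sketch shows one way such an iterator could be written; IterableSketchTokenizer and WordTokenizer are hypothetical stand-ins, and the one-token lookahead buffer is an assumption about how hasNext() might be supported, not a description of the library's implementation.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in: an Iterable tokenizer whose iterator shares state with it.
abstract class IterableSketchTokenizer implements Iterable<String> {
    abstract String nextToken(); // null when exhausted

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private String buffered; // token fetched by hasNext() but not yet handed out

            @Override
            public boolean hasNext() {
                if (buffered == null) buffered = nextToken(); // advances the tokenizer
                return buffered != null;
            }

            @Override
            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                String token = buffered;
                buffered = null;
                return token;
            }
        };
    }
}

// Hypothetical concrete tokenizer over whitespace-separated words.
class WordTokenizer extends IterableSketchTokenizer {
    private final String[] words;
    private int i = 0;

    WordTokenizer(String text) {
        String trimmed = text.trim();
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    String nextToken() { return i < words.length ? words[i++] : null; }
}
```

With this sketch, a for-each loop such as `for (String token : new WordTokenizer("a b c"))` visits each remaining token once. Calling nextToken() directly on the tokenizer between calls to the iterator removes tokens that the iterator will then never see, which is why the two should not be shared across threads or mixed carelessly.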
public abstract String nextToken()
Returns the next token in the stream, or null if there are no more tokens. Flushes any whitespace that has not been returned.
Returns: The next token, or null if there are no more tokens.
public String nextWhitespace()
nextToken.
The default implementation in this class is to return
a single space, Strings.SINGLE_SPACE_STRING.
public int lastTokenStartPosition()
Returns the offset of the first character of the most recently returned token (optional operation), or -1 if no token has been returned yet.
The position returned is relative to the beginning of the slice of the character array being tokenized, not the beginning of the array itself.
The implementation here simply throws an UnsupportedOperationException. Subclasses should override this method if they support character offset indexing.
Returns: The offset of the first character of the most recently returned token, or -1 if no token has yet been returned.
Throws: UnsupportedOperationException - If this method is not supported.
public int lastTokenEndPosition()
Returns the offset of one position past the last character of the most recently returned token (optional operation), or -1 if no token has been returned yet.
The position returned is relative to the beginning of the slice of the character array being tokenized, not the beginning of the array itself.
The implementation here throws an UnsupportedOperationException. Subclasses should override this method to support offset indexing.
Returns: The offset of one position past the last character of the most recently returned token, or -1 if no token has yet been returned.
Throws: UnsupportedOperationException - If the method is not supported.
public void tokenize(List<? super String> tokens, List<? super String> whitespaces)
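Adds the remaining tokens and whitespaces to the specified lists.
Parameters:
tokens - List to which tokens are added.
whitespaces - List to which whitespaces are added.
public String[] tokenize()
Returns the remaining tokens in an array of strings.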
Copyright © 2016 Alias-i, Inc. All rights reserved.