public class Tokenization extends Object implements Serializable
Tokenization represents the result of tokenizing a
string. Tokenizations are constructed from a character sequence
and a tokenizer factory. A tokenization contains the underlying
text, tokens, and token start/end positions in the text.
Hash codes are consistent with equality. They only depend on the text and number of tokens.
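
To make the relationship between text, tokens, whitespaces, and positions concrete, the following is a minimal standalone sketch. It assumes a tokenizer that simply splits on whitespace. The class `TokenizationSketch` and its fields are hypothetical stand-ins, not LingPipe code, and a real tokenizer factory may split or normalize tokens differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in that splits on whitespace; not LingPipe's tokenizer. */
final class TokenizationSketch {
    final String text;
    final List<String> tokens = new ArrayList<>();
    final List<String> whitespaces = new ArrayList<>();
    final List<Integer> starts = new ArrayList<>();
    final List<Integer> ends = new ArrayList<>();

    TokenizationSketch(String text) {
        this.text = text;
        int pos = 0;
        while (true) {
            // Collect the (possibly empty) run of whitespace before the next token.
            int wsStart = pos;
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            whitespaces.add(text.substring(wsStart, pos));
            if (pos == text.length()) {
                break; // that run was the final whitespace
            }
            // Collect the token and record its half-open span [start, end).
            int tokStart = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            tokens.add(text.substring(tokStart, pos));
            starts.add(tokStart);
            ends.add(pos);
        }
    }
}
```

For the text `"  Hello world "`, this sketch yields the tokens `Hello` and `world` with spans [2, 7) and [8, 13), and three whitespaces, one more than the number of tokens.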
| Constructor | Description |
|---|---|
| `Tokenization(char[] cs, int start, int length, TokenizerFactory factory)` | Construct a tokenization from the specified text and tokenizer factory. |
| `Tokenization(String text, List<String> tokens, List<String> whitespaces, int[] tokenStarts, int[] tokenEnds)` | Construct a tokenization from the specified components. |
| `Tokenization(String text, TokenizerFactory factory)` | Construct a tokenization from the specified text and tokenizer factory. |
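
The consistency checks that the component-based constructor is documented to perform can be sketched as follows. This is an illustration under stated assumptions, not the library's source: the class `ComponentChecks` and method `validate` are hypothetical names, and the array-length check is an assumption that the documentation does not spell out.

```java
import java.util.List;

/** Hypothetical validator mirroring the documented IllegalArgumentException cases. */
final class ComponentChecks {
    static void validate(String text, List<String> tokens, List<String> whitespaces,
                         int[] tokenStarts, int[] tokenEnds) {
        // Documented: there must be exactly one more whitespace than tokens.
        if (whitespaces.size() != tokens.size() + 1) {
            throw new IllegalArgumentException("Need one more whitespace than tokens.");
        }
        // Assumption (not stated in the documentation): one offset pair per token.
        if (tokenStarts.length != tokens.size() || tokenEnds.length != tokens.size()) {
            throw new IllegalArgumentException("Need one start and one end per token.");
        }
        for (int i = 0; i < tokens.size(); i++) {
            // Documented: a token start may not occur after its end.
            if (tokenStarts[i] > tokenEnds[i]) {
                throw new IllegalArgumentException("Token " + i + " starts after it ends.");
            }
            // Documented: starts and ends must lie within the text.
            if (tokenStarts[i] < 0 || tokenEnds[i] > text.length()) {
                throw new IllegalArgumentException("Token " + i + " is out of bounds.");
            }
        }
    }
}
```

A call such as `ComponentChecks.validate("a b", Arrays.asList("a", "b"), Arrays.asList("", " ", ""), new int[] {0, 2}, new int[] {1, 3})` passes, while dropping the final whitespace or moving an end offset past the text length raises `IllegalArgumentException`.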
| Modifier and Type | Method | Description |
|---|---|---|
| `boolean` | `equals(Object that)` | Returns true if the specified object is a tokenization that is equal to this one. |
| `int` | `hashCode()` | Returns the hash code for this tokenization. |
| `int` | `numTokens()` | Returns the number of tokens in this tokenization. |
| `String` | `text()` | Returns the underlying text for this tokenization. |
| `String` | `token(int n)` | Returns the token at the specified input position. |
| `int` | `tokenEnd(int n)` | Returns the position one past the last character of the token at the specified input position. |
| `List<String>` | `tokenList()` | Returns an unmodifiable view of the list of tokens for this tokenization. |
| `String[]` | `tokens()` | Returns the array of tokens underlying this tokenization. |
| `int` | `tokenStart(int n)` | Returns the position of the first character of the token at the specified input position. |
| `String` | `whitespace(int n)` | Returns the whitespace before the token at the specified input position, or the last whitespace if the specified position is the number of tokens. |
| `List<String>` | `whitespaceList()` | Returns an unmodifiable view of the list of whitespaces for this tokenization. |
| `String[]` | `whitespaces()` | Returns the array of whitespaces for this tokenization. |
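
The equality and hash code contract stated for this class can be sketched as below. The class `TokenizationKey` is a hypothetical stand-in, and the hash formula is an assumption chosen for illustration; the documentation only says that the hash depends on the text and the number of tokens.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical value class mirroring the documented equality contract. */
final class TokenizationKey {
    private final String text;
    private final List<String> tokens;
    private final List<String> whitespaces;
    private final int[] starts;
    private final int[] ends;

    TokenizationKey(String text, List<String> tokens, List<String> whitespaces,
                    int[] starts, int[] ends) {
        this.text = text;
        this.tokens = tokens;
        this.whitespaces = whitespaces;
        this.starts = starts;
        this.ends = ends;
    }

    @Override
    public boolean equals(Object that) {
        if (this == that) return true;
        if (!(that instanceof TokenizationKey)) return false;
        TokenizationKey other = (TokenizationKey) that;
        // Equal only if text, tokens, whitespaces, and both offset arrays match.
        return text.equals(other.text)
            && tokens.equals(other.tokens)
            && whitespaces.equals(other.whitespaces)
            && Arrays.equals(starts, other.starts)
            && Arrays.equals(ends, other.ends);
    }

    @Override
    public int hashCode() {
        // Illustrative formula: uses only the text and the token count,
        // so objects that are equal always hash identically.
        return 31 * text.hashCode() + tokens.size();
    }
}
```

Because equal objects necessarily share the same text and token count, a hash built from just those two values stays consistent with equality, at the cost of collisions between different tokenizations of the same text that happen to produce the same number of tokens.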
public Tokenization(char[] cs, int start, int length, TokenizerFactory factory)
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
length - Length of slice.
factory - Tokenizer factory to use for tokenization.
Throws:
IndexOutOfBoundsException - If the start and length indices are outside the bounds of the array.

public Tokenization(String text, TokenizerFactory factory)
Parameters:
text - Underlying text for tokenization.
factory - Tokenizer factory to perform tokenization.

public Tokenization(String text, List<String> tokens, List<String> whitespaces, int[] tokenStarts, int[] tokenEnds)
Parameters:
text - Underlying text.
tokens - List of tokens.
whitespaces - List of whitespaces.
tokenStarts - Offsets of the first character of each token.
tokenEnds - Offsets of the last character plus one of each token.
Throws:
IllegalArgumentException - If the number of whitespaces is not equal to the number of tokens plus one, a token start occurs after its token end, or a token start or end is out of bounds for the text.

public String text()

public int numTokens()

public String token(int n)
Parameters:
n - Position of token.
Throws:
IndexOutOfBoundsException - If the position is less than 0 or greater than or equal to the number of tokens.

public String whitespace(int n)
Parameters:
n - Position of token.
Throws:
IndexOutOfBoundsException - If the position is less than 0 or greater than the number of tokens.

public int tokenStart(int n)
Parameters:
n - Position of token.
Throws:
IndexOutOfBoundsException - If the position is less than 0 or greater than or equal to the number of tokens.

public int tokenEnd(int n)
Parameters:
n - Position of token.
Throws:
IndexOutOfBoundsException - If the position is less than 0 or greater than or equal to the number of tokens.

public String[] tokens()
The array is copied from the underlying list of tokens, so modifying it will not affect this tokenization.
public String[] whitespaces()
The array is copied from the underlying list of whitespaces, so modifying it will not affect this tokenization.
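
The difference between the copied arrays returned by tokens() and whitespaces() and the unmodifiable views returned by tokenList() and whitespaceList() can be illustrated with a small standalone sketch. The class `TokenHolder` is hypothetical and only imitates the documented behavior using standard collections; it is not how the library is necessarily implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder imitating the documented copy and view semantics. */
final class TokenHolder {
    private final List<String> tokens;

    TokenHolder(List<String> tokens) {
        // Copy on construction so later changes to the argument are not seen.
        this.tokens = new ArrayList<>(tokens);
    }

    /** Returns a fresh array on every call; callers may modify it freely. */
    String[] tokens() {
        return tokens.toArray(new String[0]);
    }

    /** Returns a read-only view; mutators throw UnsupportedOperationException. */
    List<String> tokenList() {
        return Collections.unmodifiableList(tokens);
    }
}
```

Writing to the array returned by `tokens()` leaves the holder unchanged, whereas calling `add` or `set` on the list returned by `tokenList()` fails with `UnsupportedOperationException`.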
public List<String> tokenList()
public List<String> whitespaceList()
public boolean equals(Object that)
Returns true if the specified object is a tokenization that is equal to this one. Equality is defined as having the same text, tokens, whitespaces, and token start and end positions.

Copyright © 2019 Alias-i, Inc. All rights reserved.