public class TokenSuffixArray extends Object
TokenSuffixArray implements a suffix array of tokens.
See CharSuffixArray for a description of suffix arrays
and their applications.
If the maximum length is less than the length of the array, strings are truncated to be at most this length before comparison. The result isn't a standard, fully sorted suffix array, but can be faster to create and will suffice for many applications. The indexes will be sorted relative to the truncated strings, so they will be in order up to the specified length.
If the tokenization corresponds to multiple documents, the boundary token should be used to separate them.
See CharSuffixArray for details and an example.
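The length-limited comparison described above can be illustrated with a small standalone sketch. It does not use LingPipe: the class `TruncatedTokenSuffixSort` and its methods are hypothetical names chosen for illustration, tokens are plain strings, and document-boundary handling is not modeled. It only shows how capping the comparison at a maximum number of tokens leaves suffixes ordered up to that length rather than fully sorted.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is not the LingPipe implementation.
class TruncatedTokenSuffixSort {

    // Compare the suffixes starting at token positions i and j,
    // looking at no more than maxLen tokens of each.
    static int compareSuffixes(String[] tokens, int i, int j, int maxLen) {
        for (int k = 0; k < maxLen; ++k) {
            boolean endI = i + k >= tokens.length;
            boolean endJ = j + k >= tokens.length;
            if (endI || endJ) {
                // A suffix that runs out of tokens first sorts before the longer one.
                return endI == endJ ? 0 : (endI ? -1 : 1);
            }
            int c = tokens[i + k].compareTo(tokens[j + k]);
            if (c != 0) return c;
        }
        return 0; // equal over the first maxLen tokens
    }

    // Return token positions ordered by their (possibly truncated) suffixes.
    static Integer[] sortSuffixes(String[] tokens, int maxLen) {
        Integer[] sa = new Integer[tokens.length];
        for (int i = 0; i < sa.length; ++i) sa[i] = i;
        // Arrays.sort on objects is stable, so ties keep their original order.
        Arrays.sort(sa, (a, b) -> compareSuffixes(tokens, a, b, maxLen));
        return sa;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "b", "a", "b", "a"};
        // Fully sorted suffix array.
        System.out.println(Arrays.toString(sortSuffixes(tokens, Integer.MAX_VALUE)));
        // Sorted on the first token only.
        System.out.println(Arrays.toString(sortSuffixes(tokens, 1)));
    }
}
```

With a maximum length of 1, all suffixes beginning with the same token compare as equal, so their relative order is whatever the sort leaves behind; a stable sort keeps them in original position order.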
| Modifier and Type | Field and Description |
|---|---|
| static String | DEFAULT_DOCUMENT_BOUNDARY_TOKEN: The default boundary token for documents. |
| Constructor and Description |
|---|
| TokenSuffixArray(Tokenization tokenization): Construct a token suffix array with no limit on suffix length and the default document-boundary token. |
| TokenSuffixArray(Tokenization tokenization, int maxSuffixLength): Construct a suffix array from the specified tokenization, comparing suffixes up to the specified maximum suffix length and using the default document-boundary token. |
| TokenSuffixArray(Tokenization tokenization, int maxSuffixLength, String documentBoundaryToken): Construct a suffix array from the specified tokenization, comparing suffixes up to the specified maximum suffix length and using the specified document-boundary token. |
| Modifier and Type | Method and Description |
|---|---|
| String | documentBoundaryToken(): Returns the token used to separate documents in this suffix array. |
| int | maxSuffixLength(): Returns the maximum suffix length for this token suffix array. |
| List<int[]> | prefixMatches(int minMatchLength): Returns a list of maximal spans of suffix array indexes which refer to suffixes that share a prefix of at least the specified minimum match length. |
| String | substring(int idx, int maxTokens): Returns the substring of the original string that's spanned by the tokens starting at the specified suffix array index and running the specified maximum number of tokens (or until the token sequence ends). |
| int | suffixArray(int idx): Returns the value of the suffix array at the specified index. |
| int | suffixArrayLength(): Returns the number of tokens in the suffix array. |
| Tokenization | tokenization(): Returns the tokenization underlying this suffix array. |
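To make the idea behind prefixMatches concrete, here is a standalone sketch of how maximal spans of suffix array indexes sharing a prefix might be found. It is not LingPipe code: the class `PrefixMatchSpans` and its helpers are hypothetical, tokens are plain strings, and representing each span as a two-element array of start (inclusive) and end (exclusive) is an assumption made for this example, not something the summary above specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not the LingPipe implementation.
class PrefixMatchSpans {

    // Number of leading tokens shared by the suffixes starting at i and j.
    static int sharedPrefixLength(String[] tokens, int i, int j) {
        int k = 0;
        while (i + k < tokens.length && j + k < tokens.length
                && tokens[i + k].equals(tokens[j + k])) {
            ++k;
        }
        return k;
    }

    // Plain, untruncated suffix sort over token positions.
    static Integer[] sortSuffixes(String[] tokens) {
        Integer[] sa = new Integer[tokens.length];
        for (int i = 0; i < sa.length; ++i) sa[i] = i;
        Arrays.sort(sa, (a, b) -> {
            int k = sharedPrefixLength(tokens, a, b);
            boolean endA = a + k >= tokens.length;
            boolean endB = b + k >= tokens.length;
            // A suffix that is a prefix of the other sorts first.
            if (endA || endB) return endA == endB ? 0 : (endA ? -1 : 1);
            return tokens[a + k].compareTo(tokens[b + k]);
        });
        return sa;
    }

    // Maximal runs of adjacent suffix array indexes whose suffixes share at
    // least minMatchLength leading tokens. Assumption: each span is reported
    // as {start, end} with start inclusive and end exclusive.
    static List<int[]> prefixMatches(String[] tokens, Integer[] sa, int minMatchLength) {
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= sa.length; ++i) {
            boolean extendsRun = i < sa.length
                && sharedPrefixLength(tokens, sa[i - 1], sa[i]) >= minMatchLength;
            if (!extendsRun) {
                if (i - start > 1) spans.add(new int[] {start, i});
                start = i;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "b", "a", "b", "a"};
        Integer[] sa = sortSuffixes(tokens);
        for (int[] span : prefixMatches(tokens, sa, 2)) {
            System.out.println(Arrays.toString(span));
        }
    }
}
```

Because the suffixes are sorted, checking only adjacent entries is enough: if every neighboring pair in a run shares at least the minimum number of tokens, every pair in the run does.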
public static final String DEFAULT_DOCUMENT_BOUNDARY_TOKEN
public TokenSuffixArray(Tokenization tokenization)
tokenization - Tokenization on which to base the suffix array.
public TokenSuffixArray(Tokenization tokenization, int maxSuffixLength)
tokenization - Tokenization on which to base suffix array.
maxSuffixLength - Maximum length of token sequences to compare.
public TokenSuffixArray(Tokenization tokenization, int maxSuffixLength, String documentBoundaryToken)
tokenization - Tokenization on which to base suffix array.
maxSuffixLength - Maximum length of token sequences to compare.
documentBoundaryToken - Token used to separate documents.
public String documentBoundaryToken()
public int maxSuffixLength()
public Tokenization tokenization()
public int suffixArray(int idx)
idx - Suffix array index.
public int suffixArrayLength()
public String substring(int idx, int maxTokens)
idx - Index in suffix array of first token.
maxTokens - Maximum number of tokens to include in string.
public List<int[]> prefixMatches(int minMatchLength)
minMatchLength - Minimum number of tokens required to match.
Copyright © 2016 Alias-i, Inc. All rights reserved.