public class DocumentTokenSuffixArray extends Object
DocumentTokenSuffixArray implements a suffix array over a
collection of named documents.
The documents are concatenated with a specified distinguished token as a separator. The separator acts as an end-of-document marker that terminates comparisons.
A document suffix array is constructed from a mapping of identifiers to documents, together with a tokenizer factory, a maximum suffix length, and a separator token.
The underlying suffix array may be retrieved using suffixArray() and manipulated like any other token-based suffix array. The method textPositionToDocId(int) maps a position in the underlying token array to the document that spans that position.
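The concatenation and position lookup described above can be sketched in plain Java. This is a minimal illustration, not LingPipe's implementation: the class DocBoundarySketch, its whitespace tokenization, and the choice to attribute a separator position to the document it terminates are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not part of LingPipe.
class DocBoundarySketch {
    // Tokens of all documents in order, each document followed by the boundary token.
    private final List<String> tokens = new ArrayList<>();
    // Maps the first token position of each document to its identifier.
    private final TreeMap<Integer, String> startToDocId = new TreeMap<>();

    DocBoundarySketch(Map<String, String> idToDocMap, String boundaryToken) {
        for (Map.Entry<String, String> entry : idToDocMap.entrySet()) {
            startToDocId.put(tokens.size(), entry.getKey());
            String text = entry.getValue().trim();
            if (!text.isEmpty()) {
                // Assumption: whitespace splitting stands in for a TokenizerFactory.
                for (String token : text.split("\\s+")) {
                    tokens.add(token);
                }
            }
            // The boundary token acts as the end-of-document marker.
            tokens.add(boundaryToken);
        }
    }

    // Returns the identifier of the document whose span covers the position.
    // Assumption: a boundary position belongs to the document it terminates.
    String positionToDocId(int position) {
        if (position < 0 || position >= tokens.size()) {
            throw new IndexOutOfBoundsException("position=" + position);
        }
        return startToDocId.floorEntry(position).getValue();
    }

    List<String> tokens() {
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a", "the cat sat");
        docs.put("b", "a dog");
        DocBoundarySketch sketch = new DocBoundarySketch(docs, "<EOD>");
        System.out.println(sketch.tokens());           // [the, cat, sat, <EOD>, a, dog, <EOD>]
        System.out.println(sketch.positionToDocId(4)); // b
    }
}
```

The sketch keeps document start positions in a sorted map so that a lookup is a single floor query. The real class may store and search its offsets differently.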
| Constructor and Description |
|---|
| DocumentTokenSuffixArray(Map<String,String> idToDocMap, TokenizerFactory tf, int maxSuffixLength, String documentBoundaryToken) - Construct a suffix array from the specified identified document collection using the specified tokenizer factory, limiting comparisons to the specified maximum suffix length and separating documents with the specified boundary token. |
| Modifier and Type | Method and Description |
|---|---|
| int | docEndToken(String docId) - Returns the index of the next token past the last token of the specified document. |
| int | docStartToken(String docId) - Returns the starting token position in the underlying token suffix array of the document with the specified identifier in the overall set of documents. |
| Set<String> | documentNames() - Returns an unmodifiable view of the set of document names in the collection. |
| String | documentText(String docName) - Return the text of the document with the specified name. |
| static int | largestWithoutGoingOver(int[] vals, int val) - Given an increasing array of values and a specified value, return the largest index into the array such that the array's value at the index is smaller than or equal to the specified value. |
| int | numDocuments() - Returns the number of documents in the collection. |
| TokenSuffixArray | suffixArray() - Return the token suffix array backing this document suffix array. |
| String | textPositionToDocId(int textPosition) - Return the identifier of the document that contains the specified position in the underlying text. |
public DocumentTokenSuffixArray(Map<String,String> idToDocMap, TokenizerFactory tf, int maxSuffixLength, String documentBoundaryToken)
For this class to work properly, the tokenizer factory must tokenize the document boundary token into a single token when surrounded by spaces.
Parameters:
idToDocMap - Mapping from document identifiers to document texts.
tf - Tokenizer factory to use for matching.
maxSuffixLength - Maximum suffix length (in tokens) for comparisons.
documentBoundaryToken - Distinguished token used to separate documents.
Throws:
IllegalArgumentException - If the tokenizer factory does not tokenize the document boundary token, surrounded by single spaces, into a single token consisting of the boundary token.
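The boundary-token requirement can be illustrated with a small check. This is a sketch under stated assumptions, not the constructor's actual code: the class BoundaryTokenCheckSketch, the method requireSingleToken, and the whitespace tokenizer are hypothetical, and a plain Function stands in for LingPipe's TokenizerFactory.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; not part of LingPipe.
class BoundaryTokenCheckSketch {
    // Fails unless the tokenizer turns " boundary " into exactly one token
    // that equals the boundary token itself.
    static void requireSingleToken(Function<String, List<String>> tokenizer, String boundaryToken) {
        List<String> tokens = tokenizer.apply(" " + boundaryToken + " ");
        if (tokens.size() != 1 || !tokens.get(0).equals(boundaryToken)) {
            throw new IllegalArgumentException(
                "Boundary token must tokenize to itself, found tokens=" + tokens);
        }
    }

    // Assumption: a simple whitespace tokenizer used only for this example.
    static List<String> whitespaceTokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // Accepted: the boundary survives tokenization as one token.
        requireSingleToken(BoundaryTokenCheckSketch::whitespaceTokenize, "<EOD>");
        try {
            // Rejected: a boundary containing a space splits into two tokens.
            requireSingleToken(BoundaryTokenCheckSketch::whitespaceTokenize, "END DOC");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A boundary token that the tokenizer splits or rewrites would no longer appear as a single separator in the token array, which is why a check of this kind is useful before building the array.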
// raise exception if find boundary in tokens of doc?
public TokenSuffixArray suffixArray()
public String textPositionToDocId(int textPosition)
Parameters:
textPosition - Position in underlying list of concatenated documents.
public String documentText(String docName)
Parameters:
docName - Name of document.
Throws:
NullPointerException - If the document name is not known.
public int numDocuments()
public Set<String> documentNames()
public int docStartToken(String docId)
Parameters:
docId - Document identifier.
Returns:
The starting token position of the document, or -1 if the document is not part of the collection.
public int docEndToken(String docId)
Parameters:
docId - Document identifier.
Returns:
The index of the next token past the last token of the document, or -1 if the document is not part of the collection.
public static int largestWithoutGoingOver(int[] vals, int val)
Warning: No test is made that the values are in increasing order. If they are not, the behavior of this method is not specified.
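The lookup this method describes can be sketched as a binary search. The version below is an illustration only and is not taken from the LingPipe source. The class name is hypothetical, and returning -1 when every array value exceeds the specified value is an assumption, since the documentation does not state what happens in that case.

```java
// Hypothetical sketch; not part of LingPipe.
class LargestWithoutGoingOverSketch {
    // Returns the largest index i such that vals[i] <= val.
    // Assumption: returns -1 if no such index exists (including an empty array).
    // Like the documented method, this does not verify that vals is increasing.
    static int largestWithoutGoingOver(int[] vals, int val) {
        int low = 0;
        int high = vals.length - 1;
        int best = -1;
        while (low <= high) {
            int mid = low + (high - low) / 2; // avoids overflow of low + high
            if (vals[mid] <= val) {
                best = mid;      // candidate found; look for a larger index
                low = mid + 1;
            } else {
                high = mid - 1;  // value too large; search to the left
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] docStarts = {0, 4, 7};
        System.out.println(largestWithoutGoingOver(docStarts, 5)); // 1
    }
}
```

With an array of document start positions as input, the returned index identifies the document whose span covers the queried position, which is how a position-to-document lookup can be built on top of it.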
Parameters:
vals - Array of values, sorted in ascending order.
val - Specified value to search.
Copyright © 2016 Alias-i, Inc. All rights reserved.