public class ExactDictionaryChunker extends Object implements Chunker
All dictionary entry categories are converted to strings from
generic objects using Object.toString().
An exact dictionary chunker may be configured either to
extract all matching chunks, or to restrict the results to a
consistent set of non-overlapping chunks. These non-overlapping
chunks are selected by preferring, in order, the left-most,
longest-matching, highest-scoring, and alphabetically first
chunk type, according to the following definitions. A chunk with span
(start1,end1) overlaps a chunk with span
(start2,end2) if and only if either end
point of the second chunk lies within the first chunk:
start1 <= start2 < end1, or
start1 < end2 <= end1.
For example, (0,1) and (1,3) do
not overlap, but
(0,1) overlaps (0,2),
(1,2) overlaps (0,2), and
(1,7) overlaps (2,3).
A chunk chunk1=(start1,end1):type1@score1 dominates
another chunk chunk2=(start2,end2):type2@score2 if and
only if the chunks overlap and:

- start1 < start2 (leftmost), or
- start1 == start2 and end1 > end2 (longest), or
- start1 == start2, end1 == end2 and score1 > score2 (highest scoring), or
- start1 == start2, end1 == end2, score1 == score2 and type1 < type2 (alphabetical).
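The overlap and dominance tests above can be sketched directly in Java. The following is an illustrative standalone version; the class and method names are hypothetical and not part of LingPipe:

```java
// Illustrative sketch of the overlap and dominance definitions above.
// ChunkSpec and these method names are hypothetical, not LingPipe code.
class ChunkSpec {
    final int start, end;
    final String type;
    final double score;

    ChunkSpec(int start, int end, String type, double score) {
        this.start = start; this.end = end; this.type = type; this.score = score;
    }

    // (start1,end1) overlaps (start2,end2) iff either end point of the
    // second chunk lies within the first chunk.
    static boolean overlaps(ChunkSpec c1, ChunkSpec c2) {
        return (c1.start <= c2.start && c2.start < c1.end)
            || (c1.start < c2.end && c2.end <= c1.end);
    }

    // c1 dominates c2 iff they overlap and c1 is leftmost, else longest,
    // else highest scoring, else alphabetically first by type.
    static boolean dominates(ChunkSpec c1, ChunkSpec c2) {
        if (!overlaps(c1, c2)) return false;
        if (c1.start != c2.start) return c1.start < c2.start;
        if (c1.end != c2.end) return c1.end > c2.end;
        if (c1.score != c2.score) return c1.score > c2.score;
        return c1.type.compareTo(c2.type) < 0;
    }
}
```

On the examples above, overlaps((0,1),(1,3)) is false while overlaps((1,7),(2,3)) is true, and (1,7) dominates (2,3) by the leftmost rule.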
If the chunker is specified to be case sensitive, dictionary
entries must match the text exactly. If it is not case sensitive, all
matching is done after normalizing strings with
String.toLowerCase().
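Case folding in Java is locale-dependent, which is why the constructors pin case sensitivity to Locale.ENGLISH and why non-English text may need external case normalization. The following standalone snippet (not LingPipe code) illustrates the difference with Turkish lowercasing:

```java
import java.util.Locale;

// Standalone illustration (not LingPipe code): lowercasing is locale-dependent.
class CaseFoldingDemo {
    public static void main(String[] args) {
        String s = "TITLE";
        // English-style lowercasing, as used for case-insensitive matching:
        System.out.println(s.toLowerCase(Locale.ENGLISH));         // prints "title"
        // Turkish lowercasing maps 'I' to dotless 'ı':
        System.out.println(s.toLowerCase(new Locale("tr", "TR"))); // prints "tıtle"
    }
}
```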
Matching is token-based and ignores whitespace. The tokenizer
factory should provide accurate start and end token positions, as
these are used to determine chunk boundaries.
Chunking is thread safe and may be run concurrently. The method
setReturnAllMatches(boolean) should not be called while chunking
is running, as it may affect whether the running chunking returns
all matches. Once constructed, the tokenizer factory's behavior
should not change.
Implementation Note: This class is implemented using the Aho-Corasick algorithm, a generalization of the Knuth-Morris-Pratt string-matching algorithm to sets of strings. Aho-Corasick is linear in the number of tokens in the input plus the number of output chunks. Memory requirements are only an array of integers as long as the longest phrase (a circular queue for holding start points of potential chunks) and the memory required by the chunking implementation for the result (which may be as large as quadratic in the size of the input, or may be very small if there are not many matches). Compilation of the Aho-Corasick tree is done in the constructor and is linear in number of dictionary entries with a constant factor as high as the maximum phrase length; this can be improved to a constant factor using suffix-tree like speedups, but it didn't seem worth the complexity here when the dictionaries would be long-lived.
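The token-level Aho-Corasick matching described in the note can be sketched as follows. This is a minimal illustrative implementation with hypothetical names, not LingPipe's internals; it omits the chunker's scoring, span bookkeeping, and overlap resolution:

```java
import java.util.*;

// Minimal token-level Aho-Corasick sketch (hypothetical, not LingPipe code).
// Matching is linear in the number of input tokens plus output matches.
class AhoCorasick {
    private final List<Map<String, Integer>> next = new ArrayList<>();
    private final List<Integer> fail = new ArrayList<>();
    private final List<List<String>> out = new ArrayList<>();

    AhoCorasick(List<List<String>> phrases) {
        newNode(); // root = state 0
        // Build the trie over token sequences.
        for (List<String> p : phrases) {
            int s = 0;
            for (String tok : p) {
                Integer t = next.get(s).get(tok);
                if (t == null) { t = newNode(); next.get(s).put(tok, t); }
                s = t;
            }
            out.get(s).add(String.join(" ", p));
        }
        // Breadth-first computation of failure links.
        Deque<Integer> queue = new ArrayDeque<>();
        for (int t : next.get(0).values()) queue.add(t); // root children fail to root
        while (!queue.isEmpty()) {
            int s = queue.remove();
            for (Map.Entry<String, Integer> e : next.get(s).entrySet()) {
                int t = e.getValue();
                int f = fail.get(s);
                while (f != 0 && !next.get(f).containsKey(e.getKey()))
                    f = fail.get(f);
                Integer ft = next.get(f).get(e.getKey());
                fail.set(t, ft == null ? 0 : ft);
                out.get(t).addAll(out.get(fail.get(t))); // inherit suffix matches
                queue.add(t);
            }
        }
    }

    private int newNode() {
        next.add(new HashMap<>()); fail.add(0); out.add(new ArrayList<>());
        return next.size() - 1;
    }

    // Returns all matched phrases, in order of their end positions.
    List<String> match(List<String> tokens) {
        List<String> hits = new ArrayList<>();
        int s = 0;
        for (String tok : tokens) {
            while (s != 0 && !next.get(s).containsKey(tok)) s = fail.get(s);
            Integer t = next.get(s).get(tok);
            s = (t == null) ? 0 : t;
            hits.addAll(out.get(s));
        }
        return hits;
    }
}
```

The failure links are what let matching continue without rescanning: after matching "new york", the automaton falls back to the state for "york" and can still complete "york city".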
| Constructor and Description |
|---|
| ExactDictionaryChunker(Dictionary<String> dict, TokenizerFactory factory) Construct an exact dictionary chunker from the specified dictionary and tokenizer factory which is case sensitive and returns all matches. |
| ExactDictionaryChunker(Dictionary<String> dict, TokenizerFactory factory, boolean returnAllMatches, boolean caseSensitive) Construct an exact dictionary chunker from the specified dictionary and tokenizer factory, returning all matches or not as specified. |
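For orientation, typical construction and use look roughly like the following. This sketch assumes LingPipe's MapDictionary, DictionaryEntry, and IndoEuropeanTokenizerFactory classes and a LingPipe jar on the classpath; it is not taken from this class's documentation:

```java
import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.chunk.ExactDictionaryChunker;
import com.aliasi.dict.DictionaryEntry;
import com.aliasi.dict.MapDictionary;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

// Sketch of typical usage; assumes the LingPipe jar is available.
class DictionaryChunkDemo {
    public static void main(String[] args) {
        MapDictionary<String> dictionary = new MapDictionary<String>();
        dictionary.addEntry(new DictionaryEntry<String>("New York", "LOCATION", 1.0));
        dictionary.addEntry(new DictionaryEntry<String>("York", "LOCATION", 1.0));

        // Case insensitive, restricted to non-overlapping chunks.
        ExactDictionaryChunker chunker =
            new ExactDictionaryChunker(dictionary,
                                       IndoEuropeanTokenizerFactory.INSTANCE,
                                       false,   // returnAllMatches
                                       false);  // caseSensitive
        Chunking chunking = chunker.chunk("I live in new york.");
        for (Chunk chunk : chunking.chunkSet())
            System.out.println(chunk.start() + "-" + chunk.end()
                               + ":" + chunk.type());
    }
}
```

With non-overlapping results, the longer match "new york" dominates the embedded "york", so only one LOCATION chunk is returned here.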
| Modifier and Type | Method and Description |
|---|---|
| boolean | caseSensitive() Returns true if this dictionary chunker is case sensitive. |
| Chunking | chunk(char[] cs, int start, int end) Returns the chunking for the specified character slice. |
| Chunking | chunk(CharSequence cSeq) Returns the chunking for the specified character sequence. |
| boolean | returnAllMatches() Returns true if this chunker returns all matches. |
| void | setReturnAllMatches(boolean returnAllMatches) Set whether to return all matches. |
| TokenizerFactory | tokenizerFactory() Returns the tokenizer factory underlying this chunker. |
| String | toString() Returns a string-based representation of this chunker. |
public ExactDictionaryChunker(Dictionary<String> dict, TokenizerFactory factory)

Construct an exact dictionary chunker from the specified dictionary and tokenizer factory which is case sensitive and returns all matches. After construction, this class does not use the dictionary and will not be sensitive to changes in the underlying dictionary.

Parameters:
dict - Dictionary forming the basis of the chunker.
factory - Tokenizer factory underlying the chunker.

public ExactDictionaryChunker(Dictionary<String> dict, TokenizerFactory factory, boolean returnAllMatches, boolean caseSensitive)

Construct an exact dictionary chunker from the specified dictionary and tokenizer factory, returning all matches or not as specified. After construction, this class does not use the dictionary and will not be sensitive to changes in the underlying dictionary.

Case sensitivity is defined using Locale.ENGLISH. For other languages, case sensitivity must be defined externally by passing in case-normalized text.

Parameters:
dict - Dictionary forming the basis of the chunker.
factory - Tokenizer factory underlying the chunker.
returnAllMatches - true if the chunker should return all matches.
caseSensitive - true if the chunker is case sensitive.

public TokenizerFactory tokenizerFactory()

Returns the tokenizer factory underlying this chunker.

public boolean caseSensitive()

Returns true if this dictionary chunker is case sensitive. Case sensitivity must be defined at construction time and may not be reset.

public boolean returnAllMatches()

Returns true if this chunker returns all matches.

public void setReturnAllMatches(boolean returnAllMatches)

Set whether to return all matches. Note that calling this while a chunking is running in another thread may affect that chunking.

Parameters:
returnAllMatches - true if all matches should be returned.

public Chunking chunk(CharSequence cSeq)

Returns the chunking for the specified character sequence.

public Chunking chunk(char[] cs, int start, int end)

Returns the chunking for the specified character slice.

public String toString()

Returns a string-based representation of this chunker.
Copyright © 2016 Alias-i, Inc. All rights reserved.