public class TrieCharSeqCounter extends Object implements CharSeqCounter
TrieCharSeqCounter stores counts for substrings of
strings. When the counter is constructed, a maximum length is
specified, and counts are only stored for strings up to that
length. For instance, an n-gram language model needs only counts
for strings up to length n.
Strings may be added to the counter using incrementSubstrings(char[],int,int), which increments the counts
for all substrings of the specified character slice up to the
specified maximum length substring. The method incrementPrefixes(char[],int,int) increments only the prefixes of
the specified string. All substrings are incremented by
incrementing prefixes for each suffix. A substring counter may be
pruned using prune(int), which removes all substrings with
count below the specified threshold.
There are a wide range of reporting methods for trie-based counters.
Implementation Note: The trie counters are a heavily unfolded implementation of a character-based Patricia (PAT) trie.
| Constructor and Description |
|---|
TrieCharSeqCounter(int maxLength)
Construct a substring counter that stores substrings
up to the specified maximum length.
|
| Modifier and Type | Method and Description |
|---|---|
char[] |
charactersFollowing(char[] cs,
int start,
int end)
Returns the array of characters that have been observed
following the specified character slice in unicode order.
|
long |
count(char[] cs,
int start,
int end)
Returns the count for the specified character sequence.
|
long |
count(CharSequence cSeq)
Returns the count in the training corpus for the specified
sequence of characters.
|
void |
decrementSubstrings(char[] cs,
int start,
int end)
Decrements all of the substrings of the specified character
slice by one.
|
void |
decrementUnigram(char c)
Decrements the unigram count for the specified character.
|
void |
decrementUnigram(char c,
int count)
Decrements the unigram count by the specified amount for the
specified character.
|
long |
extensionCount(char[] cs,
int start,
int end)
Returns the sum of the counts of all character sequences one
character longer than the specified character slice.
|
long |
extensionCount(CharSequence cSeq)
Returns the sum of the counts of all character sequences one
character longer than the specified character sequence.
|
void |
incrementPrefixes(char[] cs,
int start,
int end)
Increments the count of all prefixes of the specified
character sequence up to the maximum length specified in the
constructor.
|
void |
incrementPrefixes(char[] cs,
int start,
int end,
int count)
Increments the count of all prefixes of the specified
character sequence up to the maximum length specified in the
constructor.
|
void |
incrementSubstrings(char[] cs,
int start,
int end)
Increments the count of all substrings of the specified
character array slice up to the maximum length specified in the
constructor.
|
void |
incrementSubstrings(char[] cs,
int start,
int end,
int count)
Increments by the specified count all substrings of the
specified character array slice up to the maximum length
specified in the constructor.
|
void |
incrementSubstrings(CharSequence cSeq)
Increments the count of all substrings of the specified
character sequence up to the maximum length specified in the
constructor.
|
void |
incrementSubstrings(CharSequence cSeq,
int count)
Increments by the specified count all substrings of the
specified character sequence up to the maximum length specified
in the constructor.
|
int[] |
nGramFrequencies(int nGramOrder)
Returns an array of frequency counts for n-grams of the
specified n-gram order sorted in descending frequency order.
|
int |
numCharactersFollowing(char[] cs,
int start,
int end)
Returns the number of characters that when appended to the end
of the specified character slice produce an extended slice with
a non-zero count.
|
char[] |
observedCharacters()
Returns an array consisting of the characters with non-zero
count in unicode order.
|
void |
prune(int minCount)
Removes strings with counts below the specified minimum.
|
static TrieCharSeqCounter |
readCounter(TrieReader reader,
int maxNGram)
Reads a trie character sequence counter from the specified
trie reader, restricting the result to the specified maximum
n-gram.
|
static TrieCharSeqCounter |
readFrom(InputStream in)
Reads a trie character sequence counter from the specified
input stream.
|
ObjectToCounterMap<String> |
topNGrams(int nGramOrder,
int maxReturn)
Returns a counter of occurrences of the highest frequency
n-grams of a specified n-gram order.
|
String |
toString()
Returns a string representation of the trie structure of counts
underlying this counter.
|
long |
totalSequenceCount()
Returns the sum of counts for all non-empty character
sequences.
|
long |
totalSequenceCount(int length)
Returns the sum of the counts of all character sequences of
the specified length.
|
long |
uniqueSequenceCount()
Returns the number of character sequences with non-zero counts,
including the empty (zero length) character sequence.
|
long |
uniqueSequenceCount(int nGramOrder)
Returns the number of character sequences of the specified length
with non-zero counts.
|
long[][] |
uniqueTotalNGramCount()
Returns the array of unique and total n-gram counts for each
n-gram length.
|
static void |
writeCounter(CharSeqCounter counter,
TrieWriter writer,
int maxNGram)
Writes the specified sequence counter to the specified trie
writer, restricting output to n-grams not longer than the
specified maximum.
|
void |
writeTo(OutputStream out)
Writes an encoding of this counter to the specified output
stream.
|
public TrieCharSeqCounter(int maxLength)
maxLength - Maximum length of substrings stored by this
counter.IllegalArgumentException - If the maximum length is
negative.public long count(char[] cs,
int start,
int end)
CharSeqCountercount in interface CharSeqCountercs - Underlying character array.start - Index of first character in slice.end - Index of one past last character in slice.public long extensionCount(char[] cs,
int start,
int end)
CharSeqCounterextensionCount in interface CharSeqCountercs - Underlying character array.start - Index of first character in slice.end - Index of one past last character in slice.public char[] observedCharacters()
CharSeqCountercharactersFollowing(new
char[0],0,0).observedCharacters in interface CharSeqCounterpublic char[] charactersFollowing(char[] cs,
int start,
int end)
CharSeqCountercharactersFollowing in interface CharSeqCountercs - Underlying character array.start - Index of first character in slice.end - One plus index of last character in slice.public int numCharactersFollowing(char[] cs,
int start,
int end)
CharSeqCounter
numCharactersFollowing(cSlice)
= | { c | count(cSlice.c) > 0 } |
where count(cSlice.c) represents the count
of the character slice cSlice suffixed with the
character c.numCharactersFollowing in interface CharSeqCountercs - Underlying character array.start - Index of first character in slice.end - One plus index of last character in slice.public long totalSequenceCount()
public long totalSequenceCount(int length)
public long uniqueSequenceCount()
public long uniqueSequenceCount(int nGramOrder)
public void prune(int minCount)
minCount - Minimum count required to retain a substring
count.IllegalArgumentException - If the count is less than
1.public int[] nGramFrequencies(int nGramOrder)
nGramOrder - Order of n-gram counted.public long[][] uniqueTotalNGramCount()
0 for unique counts and 1 for total
counts. Thus for 0<=n<=maxLength():
uniqueTotalNGramCount()[n][0] == uniqueNGramCount(n)
and
uniqueTotalNGramCount()[n][1] == totalNGramCount(n)
If unique and total counts are required for several
n-gram depths, this method is much more efficient than
calling all of the individual methods separately.public ObjectToCounterMap<String> topNGrams(int nGramOrder, int maxReturn)
CharSequence.
The maximum number of results returned must be specified, because the entire set of n-grams is usually too large to return as a counter.
nGramOrder - Order of n-gram to count.maxReturn - Maximum number of objects returned.public long count(CharSequence cSeq)
cSeq - Character sequence.public long extensionCount(CharSequence cSeq)
cSeq - Character sequence.public void incrementSubstrings(char[] cs,
int start,
int end)
cs - Underlying character array.start - Index of first character in slice.end - Index of one past last character in slice.IndexOutOfBoundsException - If the specified start and one plus
end point are not in the bounds of character sequence.public void incrementSubstrings(char[] cs,
int start,
int end,
int count)
cs - Underlying character array.start - Index of first character in slice.end - Index of one past last character in slice.count - Amount to increment.IndexOutOfBoundsException - If the specified start and one plus
end point are not in the bounds of character sequence.public void incrementSubstrings(CharSequence cSeq)
cSeq - Character sequence.public void incrementSubstrings(CharSequence cSeq, int count)
cSeq - Character sequence.count - Amount to increment.public void incrementPrefixes(char[] cs,
int start,
int end)
cs - Underlying character array.start - Index of first character in slice.end - Index of one past last character in slice.IndexOutOfBoundsException - If the specified start and one plus
end point are not in the bounds of character sequence.public void incrementPrefixes(char[] cs,
int start,
int end,
int count)
cs - Underlying character array.start - Index of first character in slice.end - Index of one past last character in slice.count - Amount to increment.IndexOutOfBoundsException - If the specified start and one plus
end point are not in the bounds of character sequence.public void decrementSubstrings(char[] cs,
int start,
int end)
incrementSubstrings(char[],int,int) to implement
counts for conditional probability estimates without affecting
underlying estimates. For example, the following code:
will increment the substrings ofchar[] cs = "abcdefghi".toCharArray(); counter.incrementSubstrings(cs,3,7); counter.decrementSubstrings(cs,3,5);
"defg"
and then decrement the substrings of "de",
causing the net effect of incrementing the counts of substrings
"defg",
"efg",
"fg",
"g",
"def",
"ef", and
"f". This has the effect of increasing
the estimate of g given def, without
increasing the estimate of d in an empty context.cs - Underlying array of characters in slice.start - Index of first character in slice.end - Index of one past last character in slice.IllegalArgumentException - If the array slice is valid.public String toString()
Warning: The resulting string will be very large if the number of substrings is large. To avoid blowing out memory, do not call this method for large counters.
public void decrementUnigram(char c)
c - Decrement the unigram count for the specified
character.public void decrementUnigram(char c,
int count)
c - Decrement the unigram count for the specified
character.count - Amount to decrement.public void writeTo(OutputStream out) throws IOException
readFrom(InputStream).
The output is produced using a BitTrieWriter wrapped
around a BitOutput wrapped around the specified
underlying output stream. First, the bit output is used to
delta-code the maximum n-gram plus 1. Then, the trie is
encoded as described in BitTrieWriter. Finally, the
bit output is flushed. The underlying output stream is neither
flushed nor closed, allowing them to be used for other pruposes
after this counter is written.
If necessary for efficiency, streams should be buffered before being passed to this method.
out - Underlying output stream for writing.IOException - If there is an underlying I/O error.public static void writeCounter(CharSeqCounter counter, TrieWriter writer, int maxNGram) throws IOException
counter - Counter to write.writer - Trie writer to which counter is written.maxNGram - Maximum length n-gram written.IOException - If there is an underlying I/O error.public static TrieCharSeqCounter readFrom(InputStream in) throws IOException
The expected encoding is described in writeTo(OutputStream).
If necessary for efficiency, streams should be buffered before being passed to this method.
in - Underlying input stream for reading.IOException - If there is an underlying I/O error.public static TrieCharSeqCounter readCounter(TrieReader reader, int maxNGram) throws IOException
reader - Reader from which to read the trie.maxNGram - Maximum length n-gram to read.IOException - If there is an underlying I/O error.Copyright © 2019 Alias-i, Inc.. All rights reserved.