public interface CharSeqCounter
A CharSeqCounter provides counts for sequences
of characters.
The method count(char[],int,int) returns the count of
the specified character array slice. The method extensionCount(char[],int,int) counts the number of
single-character extensions of the specified character array slice.
The maximum likelihood estimator can be computed directly from
these counts by:
PML(cN | c0,...,cN-1)
  = count({c0,...,cN}, 0, N+1)
  / extensionCount({c0,...,cN-1}, 0, N)
The denominator is not a simple count of the context because of
the way final suffix counts are incremented: the last character of a
training sequence is counted but is never extended. For
instance, consider counts of all substrings of
"abab"; the maximum likelihood estimate of
P(a|b) is
count(ba)/extensionCount(b)=1/1, not
count(ba)/count(b)=1/2.
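The "abab" example can be checked with a small sketch. The class ToySubstringCounter below is hypothetical and is not the library's implementation: it assumes counts are kept for every substring of a single training string in a hash map, and its mle method is an illustrative helper that is not part of this interface.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy counter for illustration only: stores a count for
// every substring of one training string.
class ToySubstringCounter {
    private final Map<String, Long> counts = new HashMap<>();

    ToySubstringCounter(String text) {
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                counts.merge(text.substring(i, j), 1L, Long::sum);
            }
        }
    }

    // Count of the slice cs[start..end).
    long count(char[] cs, int start, int end) {
        return counts.getOrDefault(new String(cs, start, end - start), 0L);
    }

    // Sum of the counts of all sequences one character longer than the slice.
    long extensionCount(char[] cs, int start, int end) {
        String context = new String(cs, start, end - start);
        long sum = 0L;
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            String s = e.getKey();
            if (s.length() == context.length() + 1 && s.startsWith(context)) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    // Illustrative helper: maximum likelihood estimate of the last character
    // of the slice given the characters before it.
    double mle(char[] cs, int start, int end) {
        long denominator = extensionCount(cs, start, end - 1);
        return denominator == 0L ? 0.0 : (double) count(cs, start, end) / denominator;
    }

    public static void main(String[] args) {
        ToySubstringCounter counter = new ToySubstringCounter("abab");
        char[] ba = {'b', 'a'};
        System.out.println(counter.count(ba, 0, 2));          // count(ba) = 1
        System.out.println(counter.count(ba, 0, 1));          // count(b) = 2
        System.out.println(counter.extensionCount(ba, 0, 1)); // extensionCount(b) = 1
        System.out.println(counter.mle(ba, 0, 2));            // P(a|b) = 1.0
    }
}
```

Dividing by count(b) would give 1/2 because the final "b" is counted even though nothing follows it; dividing by extensionCount(b) leaves that occurrence out.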
The method observedCharacters() returns an array of all
characters that appear in at least one substring. The method
charactersFollowing(char[],int,int) returns the array of
characters observed following the specified character slice,
whereas numCharactersFollowing(char[],int,int) returns the
number of such characters. These methods are useful for computing the Witten-Bell
estimator used in NGramProcessLM.
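As a rough illustration of how these counts might feed a Witten-Bell style estimate, the sketch below computes an interpolation weight from an extension count and a number of following characters. The formula, the class WittenBellSketch, and the lambdaFactor parameter are assumptions made for illustration; this page does not specify how NGramProcessLM combines the counts.

```java
// Hypothetical sketch of Witten-Bell style interpolation; the exact formula
// used by any particular language model is an assumption here.
class WittenBellSketch {

    // Weight given to the context's maximum likelihood estimate:
    // extensionCount / (extensionCount + lambdaFactor * numCharactersFollowing).
    static double lambda(long extensionCount, int numCharactersFollowing, double lambdaFactor) {
        if (extensionCount == 0L) {
            return 0.0; // unseen context: rely entirely on the backoff estimate
        }
        return extensionCount / (extensionCount + lambdaFactor * numCharactersFollowing);
    }

    // Mixes the estimate for the full context with the shorter-context estimate.
    static double interpolate(double lambda, double mlEstimate, double backoffEstimate) {
        return lambda * mlEstimate + (1.0 - lambda) * backoffEstimate;
    }

    public static void main(String[] args) {
        // Context "b" in "abab": extensionCount = 1, one distinct following character.
        double weight = lambda(1L, 1, 1.0);
        System.out.println(weight);                        // 0.5
        System.out.println(interpolate(weight, 1.0, 0.5)); // 0.75
    }
}
```

Under this assumed form, a context followed by many distinct characters relative to its extension count gets a smaller weight, so more probability mass shifts to the shorter context.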
| Modifier and Type | Method and Description |
|---|---|
| char[] | charactersFollowing(char[] cs, int start, int end): Returns the array of characters that have been observed following the specified character slice in Unicode order. |
| long | count(char[] cs, int start, int end): Returns the count for the specified character sequence. |
| long | extensionCount(char[] cs, int start, int end): Returns the sum of the counts of all character sequences one character longer than the specified character slice. |
| int | numCharactersFollowing(char[] cs, int start, int end): Returns the number of characters that when appended to the end of the specified character slice produce an extended slice with a non-zero count. |
| char[] | observedCharacters(): Returns an array consisting of the characters with non-zero count in Unicode order. |
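The relationship between the three character-listing methods summarized above can be sketched with a toy counter. ToyFollowingCounter is hypothetical and assumes counts for every substring of one training string; it is meant only to show that numCharactersFollowing is the size of the charactersFollowing result and that observedCharacters corresponds to the empty slice.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy counter for illustration only.
class ToyFollowingCounter {
    private final Map<String, Long> counts = new HashMap<>();

    ToyFollowingCounter(String text) {
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                counts.merge(text.substring(i, j), 1L, Long::sum);
            }
        }
    }

    // Characters c such that the slice followed by c has a non-zero count,
    // sorted by char value.
    char[] charactersFollowing(char[] cs, int start, int end) {
        String context = new String(cs, start, end - start);
        TreeSet<Character> following = new TreeSet<>();
        for (String s : counts.keySet()) {
            if (s.length() == context.length() + 1 && s.startsWith(context)) {
                following.add(s.charAt(s.length() - 1));
            }
        }
        char[] result = new char[following.size()];
        int i = 0;
        for (char c : following) {
            result[i++] = c;
        }
        return result;
    }

    // Number of distinct characters that extend the slice.
    int numCharactersFollowing(char[] cs, int start, int end) {
        return charactersFollowing(cs, start, end).length;
    }

    // Every observed character follows the empty slice.
    char[] observedCharacters() {
        return charactersFollowing(new char[0], 0, 0);
    }

    public static void main(String[] args) {
        ToyFollowingCounter counter = new ToyFollowingCounter("abracadabra");
        char[] a = {'a'};
        System.out.println(new String(counter.charactersFollowing(a, 0, 1))); // bcd
        System.out.println(counter.numCharactersFollowing(a, 0, 1));          // 3
        System.out.println(new String(counter.observedCharacters()));        // abcdr
    }
}
```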
long count(char[] cs,
int start,
int end)
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
Throws:
IndexOutOfBoundsException - If the start index or the end index minus one is not within the bounds of the character array.

long extensionCount(char[] cs,
int start,
int end)
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
Throws:
IndexOutOfBoundsException - If the start index or the end index minus one is not within the bounds of the character array.

int numCharactersFollowing(char[] cs,
int start,
int end)
numCharactersFollowing(cSlice)
= | { c | count(cSlice.c) > 0 } |
where count(cSlice.c) represents the count
of the character slice cSlice suffixed with the
character c.
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - One plus index of last character in slice.
Throws:
IndexOutOfBoundsException - If the start index or the end index minus one is not within the bounds of the character array.

char[] charactersFollowing(char[] cs,
int start,
int end)
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - One plus index of last character in slice.
Throws:
IndexOutOfBoundsException - If the start index or the end index minus one is not within the bounds of the character array.

char[] observedCharacters()
Returns the same result as charactersFollowing(new char[0],0,0).

Copyright © 2016 Alias-i, Inc. All rights reserved.