public class HeuristicSentenceModel extends AbstractSentenceModel
A HeuristicSentenceModel determines sentence
boundaries based on sets of tokens, a pair of flags, and an
overridable method describing boundary conditions.
There are three sets of tokens specified for a heuristic model:
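Possible stops: tokens on which a sentence may stop, such as periods (.) and double quotes (").
Impossible penultimates: tokens that may not precede a stop, such as the abbreviation "Mr".
Impossible starts: tokens that may not follow a stop, such as the token (''). Note that there is
a method, described below, which may enforce additional conditions
on start tokens.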
There are also two flags in the constructor that determine aspects of sentence boundary detection:
balanceParens: if true, parentheses are balanced. Square brackets ("[", "]") and round brackets ("(", ")")
are balanced separately. The brackets need not be nested, and
extra close parentheses (")") and brackets
("]") are ignored.
forceFinalStop: if true, the final token in any input is taken to be a
sentence terminator, whether or not it is a possible stop token. This
is useful for dealing with truncated inputs, such as those in
MEDLINE abstracts.
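The bracket handling described above can be illustrated with a pair of counters. This is a minimal sketch, not LingPipe's implementation: the class `BracketBalanceSketch` and its methods are hypothetical names, and the idea that a stop would be suppressed while `insideBrackets()` is true is an assumption, since the text above does not spell out how the counts are used.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of LingPipe): counts open round and square brackets separately. */
final class BracketBalanceSketch {
    private int openRound = 0;
    private int openSquare = 0;

    /** Updates the counters for a single token. */
    void update(String token) {
        switch (token) {
            case "(": openRound++; break;
            case "[": openSquare++; break;
            case ")": if (openRound > 0) openRound--; break;   // an extra ")" is ignored
            case "]": if (openSquare > 0) openSquare--; break; // an extra "]" is ignored
            default: break;                                    // other tokens do not affect the counts
        }
    }

    /** True while at least one round or square bracket is still open. */
    boolean insideBrackets() {
        return openRound > 0 || openSquare > 0;
    }

    /** Convenience: feeds all tokens through a fresh counter and reports the final state. */
    static boolean insideAfter(List<String> tokens) {
        BracketBalanceSketch b = new BracketBalanceSketch();
        for (String t : tokens) {
            b.update(t);
        }
        return b.insideBrackets();
    }

    public static void main(String[] args) {
        // Brackets need not nest: ")" closes the round bracket, "[" stays open.
        System.out.println(insideAfter(Arrays.asList("(", "see", "[", "1", ")")));
    }
}
```

Because the two bracket types are counted independently, the sequence `( see [ 1 )` leaves the square bracket open even though the round bracket has closed.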
A further condition is imposed by the overridable method
possibleStart(String[],String[],int,int). This method
checks a given token in a sequence of tokens and whitespaces to
determine if it is a possible sentence start. The default
implementation in this class rules out tokens that start with
lowercase letters.
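The default check can be sketched in a few lines. This follows the description in the method detail further down (the first token must be non-empty and its first character must not be lower case according to `Character.isLowerCase(char)`); the class name `PossibleStartSketch` is hypothetical and the code is an illustration, not the library source.

```java
/** Hypothetical stand-alone version of the default start check described in this document. */
final class PossibleStartSketch {

    /**
     * Returns true if tokens[start] could begin a sentence under the default rule:
     * the token is non-empty and its first character is not lower case.
     * The whitespaces and end arguments are unused here; they are kept only to
     * mirror the documented signature so that a stricter variant could use them.
     */
    static boolean possibleStart(String[] tokens, String[] whitespaces, int start, int end) {
        String first = tokens[start];
        return first.length() > 0 && !Character.isLowerCase(first.charAt(0));
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "The", "3", "("};
        String[] whitespaces = {"", " ", " ", " ", ""};
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -> " + possibleStart(tokens, whitespaces, i, tokens.length - 1));
        }
    }
}
```

Note that the rule only excludes lowercase initial characters, so digits and punctuation such as `3` or `(` still count as possible starts.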
The final condition is that a token cannot be a stop unless it is followed by non-empty whitespace.
The resulting model will miss boundaries at tokens that act as both sentence boundaries and end-of-abbreviation markers for known abbreviations. It will also add spurious sentence boundaries after unknown abbreviations that are followed by whitespace and a capitalized word.
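To make the combined conditions and their failure modes concrete, here is a simplified, self-contained sketch. It is not the LingPipe implementation: the class `HeuristicBoundarySketch` is hypothetical, parenthesis balancing is left out, the whitespace layout (`whitespaces[i + 1]` follows `tokens[i]`) is an assumption, and so is the treatment of the input-final token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of the boundary conditions; not the library's code. */
final class HeuristicBoundarySketch {

    static List<Integer> boundaries(String[] tokens, String[] whitespaces,
                                    Set<String> possibleStops,
                                    Set<String> impossiblePenultimate,
                                    Set<String> impossibleStarts,
                                    boolean forceFinalStop) {
        List<Integer> result = new ArrayList<>();
        int last = tokens.length - 1;
        for (int i = 0; i < last; i++) {
            if (!possibleStops.contains(tokens[i])) continue;        // must be a possible stop
            if (whitespaces[i + 1].isEmpty()) continue;              // must be followed by whitespace (assumed layout)
            if (i > 0 && impossiblePenultimate.contains(tokens[i - 1])) continue; // e.g. "Mr" before "."
            String next = tokens[i + 1];
            if (impossibleStarts.contains(next)) continue;           // next token may not start a sentence
            if (next.isEmpty() || Character.isLowerCase(next.charAt(0))) continue; // default start check
            result.add(i);
        }
        if (last >= 0) {
            // Assumption: the final token needs no following whitespace or start token.
            boolean stop = possibleStops.contains(tokens[last])
                    && !(last > 0 && impossiblePenultimate.contains(tokens[last - 1]));
            if (forceFinalStop || stop) result.add(last);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("."));
        Set<String> penult = new HashSet<>(Arrays.asList("Mr"));
        Set<String> starts = new HashSet<>(Arrays.asList("''"));
        // "Dr. Smith left." where "Dr" is an unknown abbreviation
        String[] tokens = {"Dr", ".", "Smith", "left", "."};
        String[] whitespaces = {"", "", " ", " ", "", ""};
        System.out.println(boundaries(tokens, whitespaces, stops, penult, starts, false));
    }
}
```

With `"Mr"` in the impossible-penultimate set, the period after `Mr` is never a boundary, even in a text where it also ends a sentence. With the unknown abbreviation `Dr`, the period followed by whitespace and the capitalized `Smith` is reported as a spurious boundary.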
Our approach is loosely based on the article:
Mikheev, Andrei. 2002. Periods, Capitalized Words, etc. Computational Linguistics 28(3):289-318.
| Constructor and Description |
|---|
| HeuristicSentenceModel(Set<String> possibleStops, Set<String> impossiblePenultimate, Set<String> impossibleStarts) Constructs a capitalization-sensitive heuristic sentence model with the specified set of possible stop tokens, impossible penultimate tokens, and impossible sentence start tokens. |
| HeuristicSentenceModel(Set<String> possibleStops, Set<String> impossiblePenultimate, Set<String> impossibleStarts, boolean forceFinalStop, boolean balanceParens) Constructs a heuristic sentence model with the specified sets of possible stop tokens, impossible penultimate tokens, impossible start tokens, and flags for whether the final token is forced to be a stop, and whether parentheses are balanced. |
| Modifier and Type | Method and Description |
|---|---|
| boolean | balanceParens() Returns true if this model does parenthesis balancing. |
| void | boundaryIndices(String[] tokens, String[] whitespaces, int start, int length, Collection<Integer> indices) Adds the sentence-final token indices as Integer instances to the specified collection, only considering tokens between index start and start+length-1 inclusive. |
| boolean | forceFinalStop() Returns true if this model treats any input-final token as a stop. |
| protected boolean | possibleStart(String[] tokens, String[] whitespaces, int start, int end) Returns true if the specified start index can be a sentence start in the specified array of tokens and whitespaces running up to the end token. |
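Methods inherited from class AbstractSentenceModel: boundaryIndices, boundaryIndices, verifyBounds, verifyTokensWhitespaces

public HeuristicSentenceModel(Set<String> possibleStops, Set<String> impossiblePenultimate, Set<String> impossibleStarts)
Constructs a capitalization-sensitive heuristic sentence model with the specified set of possible stop tokens, impossible penultimate tokens, and impossible sentence start tokens. … false.
Parameters:
possibleStops - Possible tokens on which to stop a sentence.
impossiblePenultimate - Tokens that may not precede a stop.
impossibleStarts - Tokens that may not follow a stop.

public HeuristicSentenceModel(Set<String> possibleStops, Set<String> impossiblePenultimate, Set<String> impossibleStarts, boolean forceFinalStop, boolean balanceParens)
Constructs a heuristic sentence model with the specified sets of possible stop tokens, impossible penultimate tokens, impossible start tokens, and flags for whether the final token is forced to be a stop, and whether parentheses are balanced.
Parameters:
possibleStops - Possible tokens on which to stop a sentence.
impossiblePenultimate - Tokens that may not precede a stop.
impossibleStarts - Tokens that may not follow a stop.

public boolean forceFinalStop()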
Returns true if this model treats any input-final
token as a stop. This ensures that in truncated inputs, every
token either is or is followed by a sentence boundary. For
instance, if the input is the array of tokens
{"a", "b", ".", "c", "d"} and "d" is not in the set of possible
stops, then the tokens "c" and "d" will not be assigned to a sentence.
If the force-final-stop flag is true, then "d", being final in the
input, will be taken to end a sentence.
The value is set in the constructor HeuristicSentenceModel(Set,Set,Set,boolean,boolean).
See the class documentation for more information.
Returns: true if any token may be a stop if it is final in the input.

public boolean balanceParens()
Returns true if this model does parenthesis
balancing. Note that the value is set in the constructor
HeuristicSentenceModel(Set,Set,Set,boolean,boolean).
See the class documentation for more information.
Returns: true if this model does parenthesis balancing.

public void boundaryIndices(String[] tokens, String[] whitespaces, int start, int length, Collection<Integer> indices)
Adds the sentence-final token indices as Integer
instances to the specified collection, only considering tokens
between index start and start+length-1 inclusive.
boundaryIndices in interface SentenceModel
boundaryIndices in class AbstractSentenceModel
Parameters:
tokens - Array of tokens to annotate.
whitespaces - Array of whitespaces to annotate.
start - Index of first token to annotate.
length - Number of tokens to annotate.
indices - Collection into which to write the boundary indices.

protected boolean possibleStart(String[] tokens, String[] whitespaces, int start, int end)
Returns true if the specified start index can
be a sentence start in the specified array of tokens and
whitespaces running up to the end token.
The implementation in this class requires the first token to
be non-empty and have a first character that is not lower case
according to Character.isLowerCase(char).
The start and end indices should be within range for the
tokens and whitespaces as a precondition to this method being
called. For a precise definition, see AbstractSentenceModel.verifyBounds(String[],String[],int,int). All calls from the
abstract sentence model obey this constraint.
Parameters:
tokens - Array of tokens to check.
whitespaces - Array of whitespaces to check.
start - Index of first token to check.
end - Index of last token to check.

Copyright © 2019 Alias-i, Inc. All rights reserved.