public class MedlineSentenceModel extends HeuristicSentenceModel implements Serializable
MedlineSentenceModel is a heuristic sentence model
designed for operating over biomedical research abstracts as found
in MEDLINE.
The MEDLINE model assumes that parentheses are balanced as
defined in the class documentation for HeuristicSentenceModel. It also assumes the final token is a
sentence boundary, overriding any other possible checks. This is
set because there are many truncated MEDLINE abstracts, and this
ensures that every token falls within a sentence in the result.
The sets required by the superclass constructor HeuristicSentenceModel.HeuristicSentenceModel(Set,Set,Set,boolean,boolean)
determine which tokens are possible sentence stops, which are
disallowed before stops, and which are disallowed as starts. These
three sets are:
| Set | Contents |
|---|---|
| Possible Stops | . .. ! ? |
| Impossible Penultimates | some scientific and publishing terms; personal/professional titles/suffixes; months, times; corporate designators; common abbreviations; back quotes, commas |
| Impossible Sentence Starts | possible stops (see above); close parens, brackets, braces; ; : - -- --- % |
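The role of the three sets can be illustrated with a minimal, self-contained sketch. It is not the LingPipe implementation: the class name `BoundarySetsSketch` and the method `boundaries` are hypothetical, the penultimate entries ("Fig", "Dr", "Inc") are placeholder examples of the listed categories, and the stop and start tokens are one reading of the table above. Parenthesis balancing and the possible-start check are deliberately omitted; only the three set checks and the forced final stop are shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names and set contents are assumptions for illustration.
final class BoundarySetsSketch {

    // Tokens that may end a sentence.
    static final Set<String> POSSIBLE_STOPS =
        new HashSet<>(Arrays.asList(".", "..", "!", "?"));

    // Tokens that may not immediately precede a stop (placeholder examples only).
    static final Set<String> IMPOSSIBLE_PENULTIMATES =
        new HashSet<>(Arrays.asList("Fig", "Dr", "Inc"));

    // Tokens that may not begin a sentence.
    static final Set<String> IMPOSSIBLE_STARTS =
        new HashSet<>(Arrays.asList(
            ".", "..", "!", "?", ")", "]", "}", ";", ":", "-", "--", "---", "%"));

    // Returns the indices of tokens treated as sentence-final.
    static List<Integer> boundaries(String[] tokens) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (i == tokens.length - 1) {
                result.add(i);          // final token is always a boundary
                break;
            }
            if (!POSSIBLE_STOPS.contains(tokens[i])) continue;
            if (i > 0 && IMPOSSIBLE_PENULTIMATES.contains(tokens[i - 1])) continue;
            if (IMPOSSIBLE_STARTS.contains(tokens[i + 1])) continue;
            result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"Fig", ".", "2", "shows", "growth", ".", "It", "works"};
        System.out.println(boundaries(tokens));
    }
}
```

In this sketch the period after "Fig" is rejected by the penultimate check, the period after "growth" is accepted, and the last token is a boundary regardless of its form, mirroring the forced final stop described above.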
This class overrides the default implementation of the possible
start token method to allow a sentence start to be any sequence of
tokens uninterrupted by spaces that contains a non-lowercase letter
character. This behavior is described with examples in its
implementing method's documentation: possibleStart(String[],String[],int,int).
The static field INSTANCE may be used anywhere a MEDLINE sentence model is needed.
| Modifier and Type | Field and Description |
|---|---|
| static MedlineSentenceModel | INSTANCE: A single instance which may be used anywhere a MEDLINE sentence model is needed. |
| Constructor and Description |
|---|
| MedlineSentenceModel(): Construct a MEDLINE sentence model. |
| Modifier and Type | Method and Description |
|---|---|
| protected boolean | possibleStart(String[] tokens, String[] whitespaces, int start, int end): Returns true if the specified start index can be a sentence start in the specified array of tokens and whitespaces running up to the end token. |
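Methods inherited from class HeuristicSentenceModel: balanceParens, boundaryIndices, forceFinalStop
Methods inherited from class AbstractSentenceModel: boundaryIndices, boundaryIndices, verifyBounds, verifyTokensWhitespaces
public static final MedlineSentenceModel INSTANCE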
public MedlineSentenceModel()
protected boolean possibleStart(String[] tokens, String[] whitespaces, int start, int end)
Returns true if the specified start index can
be a sentence start in the specified array of tokens and
whitespaces running up to the end token.
For MEDLINE, this implementation returns true
if the sequence of contiguous tokens starting with the
specified token contains an uppercase or digit character. Each
token is considered, beginning with the specified start token
and continuing through all tokens that are not separated by
non-empty whitespace, up to the token with the end index minus
one. If any of the tokens contains an uppercase or digit
character, then the result is true. Otherwise,
the result is false.
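The rule just described can be sketched in a few lines of standalone Java. This is an illustration written from the description above, not the library's source: the class name `PossibleStartSketch` and the helper `hasUpperOrDigit` are hypothetical, and the sketch assumes that `whitespaces[i]` is the whitespace preceding `tokens[i]` and that `end` is exclusive.

```java
// Hypothetical sketch of the possible-start rule; not the LingPipe source.
final class PossibleStartSketch {

    // True if the token has at least one uppercase letter or digit.
    static boolean hasUpperOrDigit(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isUpperCase(c) || Character.isDigit(c)) {
                return true;
            }
        }
        return false;
    }

    // Scans tokens[start..end-1], stopping at the first token that is
    // separated from its predecessor by non-empty whitespace.
    static boolean possibleStart(String[] tokens, String[] whitespaces,
                                 int start, int end) {
        for (int i = start; i < end; i++) {
            if (i > start && !whitespaces[i].isEmpty()) {
                return false;   // contiguous run ended without a match
            }
            if (hasUpperOrDigit(tokens[i])) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] tokens = {"correlation", ".", "p", "-", "53", "was"};
        String[] whitespaces = {" ", "", " ", "", "", " ", ""};
        // Start at "p": the contiguous run "p", "-", "53" contains digits.
        System.out.println(possibleStart(tokens, whitespaces, 2, tokens.length));
        // Start at "and" in "and Foo": the run ends at the space before "Foo".
        System.out.println(possibleStart(
            new String[] {"and", "Foo"}, new String[] {"", " ", ""}, 0, 2));
    }
}
```

The whitespace test is skipped for the start token itself, since the whitespace at the start index precedes the candidate sentence rather than interrupting it.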
For example, if the first token is "Therefore", then it can be a sentence start because it contains the non-lowercase letter "T". Similarly, the token "pH" can be a sentence start, as can "p53", because they have non-lowercase characters "H" and "5" respectively. If the underlying sequence is " correlation. p-53 was...", then the array of tokens and whitespaces is:
| Index | Whitespace | Token |
|---|---|---|
| 0 | " " | correlation |
| 1 | "" | . |
| 2 | " " | p |
| 3 | "" | - |
| 4 | "" | 53 |
| 5 | " " | was |
| 6 | " " | ... |
Tokenization of: " correlation. p-53 was ..."
Here, "p" is a valid sentence start token even though it is only a single lowercase character, because it is followed by a hyphen (-) and then by "53", which contains digits, with no intervening whitespace.
By way of contrast, the first token "and" in the sequence "and Foo" can't start a sentence because it is separated from the following token by non-empty whitespace. Recall that the whitespace with the same index as a token precedes the token.
Overrides: possibleStart in class HeuristicSentenceModel
Parameters:
tokens - Array of tokens to check.
whitespaces - Array of whitespaces to check.
start - Index of first token to check.
end - Index of last token to check.
Copyright © 2019 Alias-i, Inc. All rights reserved.