public class LineTaggingParser extends StringParser<ObjectHandler<Tagging<String>>>
The parser is specified by means of three regular expressions. If the ignore regular expression is matched, an input line is ignored. This is useful for ignoring empty lines and comments in some inputs. The eos regular expression recognizes lines that are ends of sentences. Whenever such a line is found, the zone currently being processed is sent to the handler. Finally, the match regular expression is used to extract tags and tokens from input lines, with the token index and tag index specifying the subgroup matched in the regular expression.
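The control flow described above can be sketched in plain Java using only `java.util.regex`. This is an illustrative stand-in, not the LingPipe implementation: the class `LineLoopSketch`, the sample input, and the three simple patterns are made up for the example. It also assumes that each pattern must match the whole line, that lines matching none of the patterns are skipped, and that a trailing sentence is flushed at end of input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of an ignore / eos / match line loop (not LingPipe code).
class LineLoopSketch {

    // Returns one inner list of "token/tag" strings per sentence.
    static List<List<String>> group(String input,
                                    String matchRegex, int tokenGroup, int tagGroup,
                                    String ignoreRegex, String eosRegex) {
        Pattern match = Pattern.compile(matchRegex);
        Pattern ignore = Pattern.compile(ignoreRegex);
        Pattern eos = Pattern.compile(eosRegex);

        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();

        // Split on "\r\n", "\r", or "\n"; the -1 limit keeps trailing empty lines.
        for (String line : input.split("\r\n|\r|\n", -1)) {
            if (ignore.matcher(line).matches()) {
                continue; // ignored line, e.g. a comment
            }
            if (eos.matcher(line).matches()) {
                // End of sentence: emit what has been collected so far.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            Matcher m = match.matcher(line);
            if (m.matches()) {
                current.add(m.group(tokenGroup) + "/" + m.group(tagGroup));
            }
            // Assumption: lines matching none of the patterns are silently skipped.
        }
        if (!current.isEmpty()) {
            sentences.add(current); // assumption: flush at end of input
        }
        return sentences;
    }

    public static void main(String[] args) {
        String input = "# a comment line\nJohn B-PER\nruns O\n\nMary B-PER\n";
        // Made-up patterns: "token tag" lines, "#" comments, blank end-of-sentence lines.
        System.out.println(group(input, "(\\S+)\\s+(\\S+)", 1, 2, "#.*", "\\s*"));
        // prints [[John/B-PER, runs/O], [Mary/B-PER]]
    }
}
```

Checking the ignore pattern first means a comment line can never be mistaken for a token line, even when the match pattern is permissive.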
Here is a worked example for the CoNLL 2002 data set, a subsequence of which looks like:
-DOCSTART- -DOCSTART- O
Met Prep O
tien Num O
miljoen Num O
komen V O
we Pron O
, Punc O
denk V O
ik Pron O
, Punc O
al Adv O
een Art O
heel Adj O
eind N O
. Punc O
Dirk N B-PER
...
And here are the regular expressions used to parse it:
String TOKEN_TAG_LINE_REGEX
= "(\\S+)\\s(\\S+\\s)?(O|[BI]-\\S+)"; // token ?posTag entityTag
int TOKEN_GROUP = 1; // token
int TAG_GROUP = 3; // entityTag
String IGNORE_LINE_REGEX
= "-DOCSTART(.*)"; // lines that start with "-DOCSTART"
String EOS_REGEX
= "\\A\\Z"; // empty/blank lines
LineTaggingParser parser
= new LineTaggingParser(TOKEN_TAG_LINE_REGEX,
TOKEN_GROUP, TAG_GROUP,
IGNORE_LINE_REGEX,
EOS_REGEX);
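To see what each of the three patterns does to an individual line, the following sketch applies them with `java.util.regex` alone. The class `ConllLineRegexDemo` and its `classify` method are hypothetical helpers for illustration. The sketch assumes whole-line matching via `Matcher.matches()` and writes the tag character class as `[BI]`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for exploring the example patterns (not LingPipe code).
class ConllLineRegexDemo {
    static final Pattern TOKEN_TAG =
        Pattern.compile("(\\S+)\\s(\\S+\\s)?(O|[BI]-\\S+)"); // token ?posTag entityTag
    static final Pattern IGNORE = Pattern.compile("-DOCSTART(.*)");
    static final Pattern EOS = Pattern.compile("\\A\\Z");

    static final int TOKEN_GROUP = 1;
    static final int TAG_GROUP = 3;

    // Describes how a single input line would be treated.
    static String classify(String line) {
        if (IGNORE.matcher(line).matches()) {
            return "IGNORE";
        }
        if (EOS.matcher(line).matches()) {
            return "EOS";
        }
        Matcher m = TOKEN_TAG.matcher(line);
        if (m.matches()) {
            return m.group(TOKEN_GROUP) + " -> " + m.group(TAG_GROUP);
        }
        return "NO MATCH";
    }

    public static void main(String[] args) {
        System.out.println(classify("-DOCSTART- -DOCSTART- O")); // prints IGNORE
        System.out.println(classify(""));                        // prints EOS
        System.out.println(classify("Met Prep O"));              // prints Met -> O
        System.out.println(classify("Dirk N B-PER"));            // prints Dirk -> B-PER
        System.out.println(classify("Met O"));                   // prints Met -> O
    }
}
```

The last call shows that the part-of-speech column is optional: group 2 is simply left unmatched when the line has only a token and an entity tag.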
Lines starting with "-DOCSTART" are ignored and blank lines end sentences. Tokens and entity tags are extracted by matching the regular expression and pulling out match group 1 as the token and match group 3 as the tag. An optional part-of-speech tag between the token and the entity tag on the line is ignored.
Input lines may be terminated by a line feed ("\n"), a carriage return ("\r"), or a carriage return followed by a line feed ("\r\n").

| Constructor and Description |
|---|
| LineTaggingParser(String matchRegex, int tokenGroup, int tagGroup, String ignoreRegex, String eosRegex) Construct a regular expression tagging parser from the specified regular expressions and indexes. |

| Modifier and Type | Method and Description |
|---|---|
| void | parseString(char[] cs, int start, int end) Parse the specified character slice as a string input. |

Methods inherited from class StringParser: parse
Methods inherited from class Parser: getHandler, parse, parse, parse, parse, parseString, setHandler

public LineTaggingParser(String matchRegex, int tokenGroup, int tagGroup, String ignoreRegex, String eosRegex)
matchRegex - Regular expression for matching tokens and tags.
tokenGroup - Index of group in regular expression for token.
tagGroup - Index of group in regular expression for tag.
ignoreRegex - Lines matching this regular expression are skipped.
eosRegex - Matches end of sentence for grouping handle events.

public void parseString(char[] cs, int start, int end)
Specified by: parseString in class Parser<ObjectHandler<Tagging<String>>>
cs - Characters underlying slice.
start - Index of first character in slice.
end - One past the index of the last character in slice.

Copyright © 2019 Alias-i, Inc. All rights reserved.