public class BioTagChunkCodec extends Object implements Serializable
BioTagChunkCodec implements a chunk to tag
coder/decoder based on the BIO encoding scheme and a
specified tokenizer factory.
The first token in a chunk of type X is labeled B_X, and subsequent tokens
in a chunk of type X are labeled I_X.
All tokens outside of chunks are labeled with the tag O.
For instance, consider the following input string:

    John Jones Mary and Mr. J. J. Jones ran to Washington.
    012345678901234567890123456789012345678901234567890123
    0         1         2         3         4         5

and chunking consisting of the string and chunks:

    (0,10):PER, (11,15):PER, (24,35):PER, (43,53):LOC

Recall that indexing is of the first character and one past the last character. Note that the two person names "John Jones" and "Mary" are separate chunks of type PER (for persons), and the location chunk for "Washington" ends before the period.

If we have a tokenizer that breaks on whitespace and punctuation, we have tokens starting at + and continuing through the - signs.

    John Jones Mary and Mr. J. J. Jones ran to Washington.
    +--- +---- +--- +-- +-+ ++ ++ +---- +-- +- +---------+

In particular, note that the four periods form their own tokens, even though they are adjacent to characters in other tokens. Writing the tokens out in a column, we show the tags used by the default BIO encoding to the right:

| Token | Tag |
|---|---|
| John | B_PER |
| Jones | I_PER |
| Mary | B_PER |
| and | O |
| Mr | O |
| . | O |
| J | B_PER |
| . | I_PER |
| J | I_PER |
| . | I_PER |
| Jones | I_PER |
| ran | O |
| to | O |
| Washington | B_LOC |
| . | O |

Note that chunks may be any number of tokens long.
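The encoding walked through above can be reproduced with a small standalone sketch. It does not use LingPipe: the class `BioEncodeSketch`, its `tokenize` and `encode` methods, and the representation of chunks as `{start, end}` offset pairs with a parallel array of types are all hypothetical names chosen for illustration. The tokenizer is an assumed stand-in for one that breaks on whitespace and punctuation, and the sketch assumes the chunks do not overlap and are consistent with the tokenization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not part of LingPipe.
final class BioEncodeSketch {

    // Breaks on whitespace and makes each punctuation character its own token.
    // Each token is returned as {start, end} with an exclusive end offset.
    static List<int[]> tokenize(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                spans.add(new int[] {start, i});
            } else {
                spans.add(new int[] {i, i + 1});
                i++;
            }
        }
        return spans;
    }

    // Tags each token with B_X, I_X, or O using the default prefixes.
    // Assumes non-overlapping chunks whose boundaries fall on token boundaries.
    static String[] encode(String text, int[][] chunks, String[] types) {
        List<int[]> tokens = tokenize(text);
        String[] tags = new String[tokens.size()];
        Arrays.fill(tags, "O");
        for (int t = 0; t < tags.length; t++) {
            int[] tok = tokens.get(t);
            for (int c = 0; c < chunks.length; c++) {
                boolean inside = tok[0] >= chunks[c][0] && tok[1] <= chunks[c][1];
                if (inside) {
                    // The first token of a chunk gets the begin prefix.
                    tags[t] = (tok[0] == chunks[c][0] ? "B_" : "I_") + types[c];
                }
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        String text = "John Jones Mary and Mr. J. J. Jones ran to Washington.";
        int[][] chunks = {{0, 10}, {11, 15}, {24, 35}, {43, 53}};
        String[] types = {"PER", "PER", "PER", "LOC"};
        System.out.println(String.join(" ", encode(text, chunks, types)));
    }
}
```

Run on the example string and chunks, the sketch yields the same fifteen tags as the token/tag table, including `O` for the final period that follows the `LOC` chunk.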
The default prefix for begin tags is B_, the default prefix
for in tags is I_, and the default out tag is O. Different
begin and in prefixes and out tags may be specified using the constructor
BioTagChunkCodec(TokenizerFactory,boolean,String,String,String).
The set of tags consists of the out tag O, as well as tags
B_X and I_X for
all chunk types X.
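As a rough illustration of how the tag set depends on the chunk types and on the configurable prefixes, the following standalone sketch builds it by hand. The class `TagSetSketch` and its four-argument `tagSet` method are hypothetical and only mirror the description above; they are not the codec's own implementation.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch only; not part of LingPipe.
final class TagSetSketch {

    // Builds the out tag plus one begin tag and one in tag per chunk type.
    static Set<String> tagSet(Set<String> chunkTypes,
                              String beginPrefix, String inPrefix, String outTag) {
        Set<String> tags = new LinkedHashSet<>();
        tags.add(outTag);
        for (String type : chunkTypes) {
            tags.add(beginPrefix + type);
            tags.add(inPrefix + type);
        }
        return tags;
    }

    public static void main(String[] args) {
        Set<String> types = new LinkedHashSet<>(Arrays.asList("PER", "LOC"));
        // Default prefixes and out tag.
        System.out.println(tagSet(types, "B_", "I_", "O"));
        // Alternative prefixes and out tag, chosen here only as an example.
        System.out.println(tagSet(types, "B-", "I-", "OUT"));
    }
}
```

For n chunk types the set always has 2n + 1 tags, which is the number of output categories a token tagger trained on this encoding needs to distinguish.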
| Tag | Legal Following Tags |
|---|---|
| O | O, B_X |
| B_X | O, I_X, B_Y |
| I_X | O, I_X, B_Y |

Note that the begin and in tags have the same legal followers.
Attempts to encode taggings with illegal tag sequences will result in exceptions.
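The follower rules can be checked mechanically. The sketch below is a standalone illustration with hypothetical names (`BioLegalitySketch`, `legalFollower`, `legalSubSequence`, `legalComplete`); it hard-codes the default prefixes and out tag. It also makes one assumption that the follower table does not state: a complete sequence may not begin with an in tag, whereas a subsequence may.

```java
// Illustrative sketch only; not part of LingPipe.
final class BioLegalitySketch {

    static final String OUT = "O";
    static final String BEGIN = "B_";
    static final String IN = "I_";

    // A tag is well formed if it is O, or a prefix followed by a non-empty type.
    static boolean isTag(String tag) {
        return tag.equals(OUT)
            || (tag.startsWith(BEGIN) && tag.length() > BEGIN.length())
            || (tag.startsWith(IN) && tag.length() > IN.length());
    }

    // Both default prefixes have length 2, so the type starts at index 2.
    static String type(String tag) {
        return tag.substring(2);
    }

    // True if next may directly follow prev according to the follower table.
    static boolean legalFollower(String prev, String next) {
        if (!next.startsWith(IN)) {
            return true; // O and B_Y may follow any tag
        }
        if (prev.equals(OUT)) {
            return false; // I_X may not follow O
        }
        return type(prev).equals(type(next)); // I_X needs B_X or I_X before it
    }

    // All tags well formed and every adjacent pair legal; empty input is legal.
    static boolean legalSubSequence(String... tags) {
        for (int i = 0; i < tags.length; i++) {
            if (!isTag(tags[i])) return false;
            if (i > 0 && !legalFollower(tags[i - 1], tags[i])) return false;
        }
        return true;
    }

    // Assumption: a complete sequence additionally may not start with an in tag.
    static boolean legalComplete(String... tags) {
        return legalSubSequence(tags)
            && (tags.length == 0 || !tags[0].startsWith(IN));
    }

    public static void main(String[] args) {
        System.out.println(legalComplete("B_PER", "I_PER", "B_PER", "O"));
        System.out.println(legalSubSequence("I_PER", "I_PER", "O"));
        System.out.println(legalSubSequence("O", "I_PER"));
    }
}
```

Under these rules `B_PER I_LOC` is rejected because the in tag's type differs from the preceding begin tag's type, while `B_PER B_PER` is accepted as two adjacent one-token chunks.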
If the consistency flag is set on the constructor, attempts to encode chunkings or decode taggings that are inconsistent with the tokenizer will throw illegal argument exceptions.
In order for a tokenizer to be consistent with a chunking, the tokenization of the character sequence for the chunking must be such that every chunk start coincides with a token start and every chunk end coincides with a token end. The same rule applies when decoding a tagging, in that the chunking produced has to obey it.
For example, if a regular-expression based tokenizer that breaks on whitespace were used for the above example, the character sequence "Washington." would be a single token, including the final period. This conflicts with the location-type entity, which ends with the last character before the period.
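The conflict can be made concrete with a standalone sketch that tokenizes on whitespace only and tests each chunk boundary against the token boundaries. The names `ConsistencySketch`, `whitespaceTokens`, and `isConsistent` are hypothetical, and the hand-written whitespace splitter is an assumed stand-in for the regular-expression based tokenizer mentioned above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; not part of LingPipe.
final class ConsistencySketch {

    // Tokens are maximal runs of non-whitespace, returned as {start, end}
    // with an exclusive end offset.
    static List<int[]> whitespaceTokens(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            spans.add(new int[] {start, i});
        }
        return spans;
    }

    // A chunk is consistent if it starts at a token start and ends at a token end.
    static boolean isConsistent(List<int[]> tokens, int chunkStart, int chunkEnd) {
        Set<Integer> starts = new HashSet<>();
        Set<Integer> ends = new HashSet<>();
        for (int[] token : tokens) {
            starts.add(token[0]);
            ends.add(token[1]);
        }
        return starts.contains(chunkStart) && ends.contains(chunkEnd);
    }

    public static void main(String[] args) {
        String text = "John Jones Mary and Mr. J. J. Jones ran to Washington.";
        List<int[]> tokens = whitespaceTokens(text);
        int[][] chunks = {{0, 10}, {11, 15}, {24, 35}, {43, 53}};
        for (int[] c : chunks) {
            System.out.println("(" + c[0] + "," + c[1] + ") consistent: "
                + isConsistent(tokens, c[0], c[1]));
        }
    }
}
```

With whitespace tokenization the three PER chunks pass, but the LOC chunk (43,53) fails because the only token covering it ends at offset 54, after the period.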
| Modifier and Type | Field and Description |
|---|---|
| static String | BEGIN_TAG_PREFIX: Default prefix for begin tags, "B_". |
| static String | IN_TAG_PREFIX: Default prefix for continuation tags, "I_". |
| static String | OUT_TAG: Default name of out tag, "O". |
| Constructor and Description |
|---|
| BioTagChunkCodec(): Construct a BIO-encoded tag/chunk decoder with a null tokenizer that does not enforce consistency and uses the default begin, in, and out tags. |
| BioTagChunkCodec(TokenizerFactory tokenizerFactory, boolean enforceConsistency): Construct a BIO-encoding based tag-chunk coder/decoder based on the specified tokenizer factory, enforcing consistency of the chunkings and taggings coded if the specified flag is set, and using the default begin, in, and out tags. |
| BioTagChunkCodec(TokenizerFactory tokenizerFactory, boolean enforceConsistency, String beginTagPrefix, String inTagPrefix, String outTag): Construct a BIO-encoding based tag-chunk coder/decoder based on the specified tokenizer factory, enforcing consistency of the chunkings and taggings coded if the specified flag is set. |
| Modifier and Type | Method and Description |
|---|---|
| boolean | enforceConsistency(): Returns true if this codec enforces consistency of the chunkings relative to the tokenizer factory. |
| boolean | isDecodable(StringTagging tagging): Returns true if the specified tagging may be consistently decoded into a chunking. |
| boolean | isEncodable(Chunking chunking): Returns true if the specified chunking may be consistently encoded as a tagging. |
| boolean | legalTags(String... tags): Returns true if the specified sequence of tags is a complete legal tag sequence. |
| boolean | legalTagSubSequence(String... tags): Returns true if the specified sequence of tags is a legal subsequence of tags. |
| Iterator<Chunk> | nBestChunks(TagLattice<String> lattice, int[] tokenStarts, int[] tokenEnds, int maxResults): Returns an iterator over chunks extracted in order of highest probability up to the specified maximum number of results. |
| Set<String> | tagSet(Set<String> chunkTypes): Returns the complete set of tags used by this codec for the specified set of chunk types. |
| Chunking | toChunking(StringTagging tagging): Return the result of decoding the specified tagging into a chunking. |
| TokenizerFactory | tokenizerFactory(): Return the tokenizer factory for this codec. |
| String | toString(): Return a string-based representation of this codec. |
| StringTagging | toStringTagging(Chunking chunking): Return the string tagging that fully encodes the specified chunking. |
| Tagging<String> | toTagging(Chunking chunking): Return the tagging that partially encodes the specified chunking. |
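To make the decoding direction concrete, here is a standalone sketch of turning a tag sequence plus token offsets back into chunks. It is not the codec's `toChunking` implementation: the class `BioDecodeSketch`, its `decode` method, and the `(start,end):TYPE` string output are hypothetical, the default prefixes are hard-coded, and the input is assumed to be a legal tag sequence. The `tokenStarts` and `tokenEnds` arrays are parallel to the tags, with exclusive end offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not part of LingPipe.
final class BioDecodeSketch {

    // Decodes a legal BIO tag sequence into chunks formatted as (start,end):TYPE.
    static List<String> decode(String[] tags, int[] tokenStarts, int[] tokenEnds) {
        List<String> chunks = new ArrayList<>();
        String type = null; // type of the chunk currently being built, if any
        int start = -1;
        int end = -1;
        for (int i = 0; i < tags.length; i++) {
            if (tags[i].startsWith("B_")) {
                // A begin tag closes any open chunk and opens a new one.
                if (type != null) chunks.add(format(start, end, type));
                type = tags[i].substring(2);
                start = tokenStarts[i];
                end = tokenEnds[i];
            } else if (tags[i].startsWith("I_")) {
                // An in tag extends the open chunk (legality is assumed).
                end = tokenEnds[i];
            } else {
                // The out tag closes any open chunk.
                if (type != null) chunks.add(format(start, end, type));
                type = null;
            }
        }
        if (type != null) chunks.add(format(start, end, type));
        return chunks;
    }

    static String format(int start, int end, String type) {
        return "(" + start + "," + end + "):" + type;
    }

    public static void main(String[] args) {
        String[] tags = {"B_PER", "I_PER", "B_PER", "O", "O", "O", "B_PER", "I_PER",
                         "I_PER", "I_PER", "I_PER", "O", "O", "B_LOC", "O"};
        int[] starts = {0, 5, 11, 16, 20, 22, 24, 25, 27, 28, 30, 36, 40, 43, 53};
        int[] ends = {4, 10, 15, 19, 22, 23, 25, 26, 28, 29, 35, 39, 42, 53, 54};
        System.out.println(decode(tags, starts, ends));
    }
}
```

Applied to the tags and token offsets of the running example, the sketch recovers the four original chunks, which is the round trip that the consistency flag is meant to guarantee.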
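public static final String OUT_TAG
Default name of out tag, "O".

public static final String BEGIN_TAG_PREFIX
Default prefix for begin tags, "B_".

public static final String IN_TAG_PREFIX
Default prefix for continuation tags, "I_".

public BioTagChunkCodec()
Construct a BIO-encoded tag/chunk decoder with a null tokenizer that does not enforce consistency and uses the default begin, in, and out tags.
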
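public BioTagChunkCodec(TokenizerFactory tokenizerFactory, boolean enforceConsistency)
Construct a BIO-encoding based tag-chunk coder/decoder based on the specified tokenizer factory, enforcing consistency of the chunkings and taggings coded if the specified flag is set, and using the default begin, in, and out tags.
Parameters:
tokenizerFactory - Tokenizer factory for generating tokens.
enforceConsistency - Set to true to ensure all coded chunkings and decoded taggings are consistent for round trips.

public BioTagChunkCodec(TokenizerFactory tokenizerFactory, boolean enforceConsistency, String beginTagPrefix, String inTagPrefix, String outTag)
Construct a BIO-encoding based tag-chunk coder/decoder based on the specified tokenizer factory, enforcing consistency of the chunkings and taggings coded if the specified flag is set.
Parameters:
tokenizerFactory - Tokenizer factory for generating tokens.
enforceConsistency - Set to true to ensure all coded chunkings and decoded taggings are consistent for round trips.

public boolean enforceConsistency()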
Returns true if this codec enforces consistency of the chunkings relative to the tokenizer factory. Consistency requires each chunk to start on the first character of a token and requires each chunk to end on the last character of a token (as usual, ends are one past the last character).
Returns:
true if this codec enforces consistency of chunkings relative to tokenization.

public Set<String> tagSet(Set<String> chunkTypes)
Returns the complete set of tags used by this codec for the specified set of chunk types. Modifying the returned set will not affect the codec.
Specified by:
tagSet in interface TagChunkCodec
Parameters:
chunkTypes - Set of types for chunks.

public boolean legalTagSubSequence(String... tags)
Returns true if the specified sequence of tags is a legal subsequence of tags. See the companion method TagChunkCodec.legalTags(String[]) to test if a complete sequence is legal.
A sequence of tags is a legal subsequence if a legal sequence may be created by adding more tags to the front and/or end of the specified sequence.
Providing an empty sequence of tags always returns true. The result for a single input tag determines if the tag itself is legal. For longer sequences, the tags must all be legal and their order must be legal.
Specified by:
legalTagSubSequence in interface TagChunkCodec
Parameters:
tags - Sequence of tags to test.
Returns:
true if the sequence of tags is legal as a subsequence of some larger sequence.

public boolean legalTags(String... tags)
Returns true if the specified sequence of tags is a complete legal tag sequence. The companion method TagChunkCodec.legalTagSubSequence(String[]) tests if a subsequence of tags is legal.
Specified by:
legalTags in interface TagChunkCodec
Parameters:
tags - Variable length array of tags.
Returns:
true if the specified sequence of tags is a complete legal tag sequence.

public Chunking toChunking(StringTagging tagging)
Return the result of decoding the specified tagging into a chunking.
Specified by:
toChunking in interface TagChunkCodec
Parameters:
tagging - Tagging to decode.

public StringTagging toStringTagging(Chunking chunking)
Return the string tagging that fully encodes the specified chunking.
Specified by:
toStringTagging in interface TagChunkCodec
Parameters:
chunking - Chunking to encode.
Throws:
UnsupportedOperationException - If the tokenizer factory is null.

public Tagging<String> toTagging(Chunking chunking)
Return the tagging that partially encodes the specified chunking. For the full encoding, see TagChunkCodec.toStringTagging(Chunking).
This method will typically be more efficient than toStringTagging(), but implementations may just return the same value, because StringTagging extends Tagging<String>.
This method may be implemented by delegating to TagChunkCodec.toStringTagging(Chunking), but a direct implementation is often more efficient.
Specified by:
toTagging in interface TagChunkCodec
Parameters:
chunking - Chunking to encode.
Throws:
UnsupportedOperationException - If the tokenizer factory is null.

public Iterator<Chunk> nBestChunks(TagLattice<String> lattice, int[] tokenStarts, int[] tokenEnds, int maxResults)
Returns an iterator over chunks extracted in order of highest probability up to the specified maximum number of results.
Specified by:
nBestChunks in interface TagChunkCodec
Parameters:
lattice - Lattice from which chunks are extracted.
maxResults - Maximum number of chunks to return.

public String toString()
Return a string-based representation of this codec.

public TokenizerFactory tokenizerFactory()
Return the tokenizer factory for this codec.

public boolean isEncodable(Chunking chunking)
Returns true if the specified chunking may be consistently encoded as a tagging. A chunking is encodable if none of the chunks overlap, and if all chunks begin on the first character of a token and end on the character one past the last character in a token.
Subclasses may enforce further conditions as defined in their class documentation.
Specified by:
isEncodable in interface TagChunkCodec
Parameters:
chunking - Chunking to test.
Returns:
true if the chunking is consistently encodable.
Throws:
UnsupportedOperationException - If the tokenizer is null so that this is only a decoder.

public boolean isDecodable(StringTagging tagging)
Returns true if the specified tagging may be consistently decoded into a chunking. A tagging is decodable if its tokens are the tokens produced by the tokenizer for this codec and if the tags form a legal sequence.
Specified by:
isDecodable in interface TagChunkCodec
Parameters:
tagging - Tagging to test for decodability.
Returns:
true if decoding then encoding produces the specified tagging.
Throws:
UnsupportedOperationException - If the tokenizer is null so that this is only a decoder.

Copyright © 2016 Alias-i, Inc. All rights reserved.