public class IoTagChunkCodec extends Object implements Serializable
IoTagChunkCodec implements a chunk to tag
coder/decoder based on the IO encoding scheme and a
specified tokenizer factory.
Although this is a compact encoding in number of tags, it is
degenerate in that it does not allow adjacent chunks of the same
type. The isEncodable(Chunking) method reflects this
behavior.
If consistency is not being enforced, two adjacent chunks of the same type will simply be run together as a single chunk.
In the IO encoding, all tokens that are part of a chunk of type X are tagged as X
and all tokens that are not part of an entity are tagged as O.
For instance, consider the following input string:

    John Jones Mary and Mr. J. J. Jones ran to Washington.
    012345678901234567890123456789012345678901234567890123
    0         1         2         3         4         5

and a chunking consisting of the string and the chunks:

    (0,10):PER, (11,15):PER, (24,35):PER, (43,53):LOC

Recall that indexing is of the first character and one past the last character. Note that the two person names "John Jones" and "Mary" are separate chunks of type PER (for persons), and the location chunk for "Washington" ends before the period.

If we have a tokenizer that breaks on whitespace and punctuation, we have tokens starting at + and continuing through the - signs.

    John Jones Mary and Mr. J. J. Jones ran to Washington.
    +--- +---- +--- +-- +-+ ++ ++ +---- +-- +- +---------+

In particular, note that the four periods form their own tokens, even though they are adjacent to characters in other tokens. Writing the tokens out in a column, we show the tags used by the IO encoding to the right:

| Token | Tag |
|---|---|
| John | PER |
| Jones | PER |
| Mary | PER |
| and | O |
| Mr | O |
| . | O |
| J | PER |
| . | PER |
| J | PER |
| . | PER |
| Jones | PER |
| ran | O |
| to | O |
| Washington | LOC |
| . | O |

Note that chunks may be any number of tokens long.
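The token-to-tag mapping in the example can be sketched in plain Java. This is an illustration only, not LingPipe's implementation: the class name `IoEncodeSketch`, the `encode` helper, and the regular expression standing in for a tokenizer that breaks on whitespace and punctuation are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IoEncodeSketch {

    // Assumed stand-in tokenizer: a run of letters/digits, or one punctuation character.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

    /**
     * Assigns one IO tag per token. chunks[i] holds {start, end} with end one
     * past the last character; types[i] is the type of chunk i.
     */
    static List<String> encode(String text, int[][] chunks, String[] types) {
        List<String> tags = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String tag = "O"; // default: token is outside every chunk
            for (int i = 0; i < chunks.length; i++) {
                // A token inside chunk i is tagged with that chunk's type.
                if (m.start() >= chunks[i][0] && m.end() <= chunks[i][1]) {
                    tag = types[i];
                    break;
                }
            }
            tags.add(tag);
        }
        return tags;
    }

    public static void main(String[] args) {
        String text = "John Jones Mary and Mr. J. J. Jones ran to Washington.";
        int[][] chunks = {{0, 10}, {11, 15}, {24, 35}, {43, 53}};
        String[] types = {"PER", "PER", "PER", "LOC"};
        System.out.println(encode(text, chunks, types));
    }
}
```

Running `main` on the example string and chunks yields the same fifteen tags as the table, in order.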
The set of tags consists of the tag O, as well as a tag
X for each chunk type X.
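Because the only tags are O and one tag per chunk type, a decoder has no way to tell where one chunk stops and an adjacent chunk of the same type starts. The following sketch shows this by decoding the first four tags of the example. It is not LingPipe's decoder; the class name `IoDecodeSketch` and the `decode` helper are hypothetical, and the rule used (a maximal run of equal non-O tags becomes one chunk) is an assumption that matches the run-together behavior described above.

```java
import java.util.ArrayList;
import java.util.List;

class IoDecodeSketch {

    /**
     * Decodes IO tags into "(start,end):TYPE" strings. Each maximal run of
     * identical non-O tags is emitted as a single chunk.
     */
    static List<String> decode(String[] tags, int[] tokenStarts, int[] tokenEnds) {
        List<String> chunks = new ArrayList<>();
        int i = 0;
        while (i < tags.length) {
            if (tags[i].equals("O")) {
                i++;
                continue;
            }
            int j = i;
            // Extend the run while the next token carries the same tag.
            while (j + 1 < tags.length && tags[j + 1].equals(tags[i])) {
                j++;
            }
            chunks.add("(" + tokenStarts[i] + "," + tokenEnds[j] + "):" + tags[i]);
            i = j + 1;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Tokens: John Jones Mary and
        String[] tags = {"PER", "PER", "PER", "O"};
        int[] starts = {0, 5, 11, 16};
        int[] ends = {4, 10, 15, 19};
        System.out.println(decode(tags, starts, ends)); // prints [(0,15):PER]
    }
}
```

The original chunks (0,10):PER and (11,15):PER come back as the single chunk (0,15):PER, which is why such a chunking is not encodable for a round trip.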
If the consistency flag is set on the constructor, attempts to encode chunkings or decode taggings that are inconsistent with the tokenizer will throw illegal argument exceptions.
In order for a tokenizer to be consistent with a chunking, the tokenization of the character sequence for the chunking must be such that every chunk start and end occurs at a token start or end. The same rule applies for tagging, in that the chunking produced has to obey the same rules.
For example, if a regular-expression based tokenizer that breaks on whitespace were used for the above example, the character sequence "Washington." is a token, including the final period. This conflicts with the location-type entity, which ends with the last character before the period.
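The consistency condition can be checked mechanically by comparing chunk boundaries with token boundaries. The sketch below is a self-contained illustration, not the codec's own check: the class name `ConsistencySketch`, the `isConsistent` helper, and both regular expressions are assumptions chosen to mimic the two tokenizers discussed above.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsistencySketch {

    /**
     * Returns true if every chunk starts at a token start and ends at a token
     * end (ends are one past the last character). chunks[i] = {start, end}.
     */
    static boolean isConsistent(String text, Pattern tokenPattern, int[][] chunks) {
        Set<Integer> starts = new HashSet<>();
        Set<Integer> ends = new HashSet<>();
        Matcher m = tokenPattern.matcher(text);
        while (m.find()) {
            starts.add(m.start());
            ends.add(m.end());
        }
        for (int[] chunk : chunks) {
            if (!starts.contains(chunk[0]) || !ends.contains(chunk[1])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "John Jones Mary and Mr. J. J. Jones ran to Washington.";
        int[][] chunks = {{0, 10}, {11, 15}, {24, 35}, {43, 53}};

        // Assumed whitespace-only tokenizer: "Washington." is one token ending at 54.
        Pattern whitespace = Pattern.compile("\\S+");
        // Assumed tokenizer that also splits off punctuation characters.
        Pattern punctuation = Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

        System.out.println(isConsistent(text, whitespace, chunks));  // prints false
        System.out.println(isConsistent(text, punctuation, chunks)); // prints true
    }
}
```

With the whitespace-only pattern, no token ends at position 53, so the LOC chunk (43,53) has no matching token boundary.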
| Constructor and Description |
|---|
| IoTagChunkCodec()<br>Construct an IO-encoding based tag-chunk coder with a null tokenizer factory that does not enforce consistency. |
| IoTagChunkCodec(TokenizerFactory tokenizerFactory, boolean enforceConsistency)<br>Construct an IO-encoding based tag-chunk coder/decoder based on the specified tokenizer factory, enforcing consistency of chunkings and taggings if the specified flag is set. |
| Modifier and Type | Method and Description |
|---|---|
| boolean | enforceConsistency()<br>Returns true if this codec enforces consistency of the chunkings relative to the tokenizer factory. |
| boolean | isDecodable(StringTagging tagging)<br>Returns true if the specified tagging may be consistently decoded into a chunking. |
| boolean | isEncodable(Chunking chunking)<br>Returns true if the specified chunking may be consistently encoded as a tagging. |
| boolean | legalTags(String... tags)<br>Returns true if the specified sequence of tags is a complete legal tag sequence. |
| boolean | legalTagSubSequence(String... tags)<br>Returns true if the specified sequence of tags is a legal subsequence of tags. |
| Iterator<Chunk> | nBestChunks(TagLattice<String> lattice, int[] tokenStarts, int[] tokenEnds, int maxResults)<br>Returns an iterator over chunks extracted in order of highest probability up to the specified maximum number of results. |
| Set<String> | tagSet(Set<String> chunkTypes)<br>Returns the complete set of tags used by this codec for the specified set of chunk types. |
| Chunking | toChunking(StringTagging tagging)<br>Return the result of decoding the specified tagging into a chunking. |
| TokenizerFactory | tokenizerFactory()<br>Return the tokenizer factory for this codec. |
| String | toString()<br>Return a string-based representation of this codec. |
| StringTagging | toStringTagging(Chunking chunking)<br>Return the string tagging that fully encodes the specified chunking. |
| Tagging<String> | toTagging(Chunking chunking)<br>Return the tagging that partially encodes the specified chunking. |
public IoTagChunkCodec()
public IoTagChunkCodec(TokenizerFactory tokenizerFactory, boolean enforceConsistency)
Parameters:
tokenizerFactory - Tokenizer factory for generating tokens.
enforceConsistency - Set to true to ensure all coded chunkings and decoded taggings are consistent for round trips.

public Set<String> tagSet(Set<String> chunkTypes)
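Description copied from interface: TagChunkCodec
Returns the complete set of tags used by this codec for the specified set of chunk types. Modifying the returned set will not affect the codec.
Specified by:
tagSet in interface TagChunkCodec
Parameters:
chunkTypes - Set of types for chunks.

public boolean legalTagSubSequence(String... tags)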
Description copied from interface: TagChunkCodec
Returns true if the specified sequence of tags is a legal subsequence of tags. See the companion method TagChunkCodec.legalTags(String[]) to test if a complete sequence is legal.
A sequence of tags is a legal subsequence if a legal sequence may be created by adding more tags to the front and/or end of the specified sequence.
Providing an empty sequence of tags always returns true. The result for a single input tag determines if the tag itself is legal. For longer sequences, the tags must all be legal and their order must be legal.
Specified by:
legalTagSubSequence in interface TagChunkCodec
Parameters:
tags - Sequence of tags to test.
Returns:
true if the sequence of tags is legal as a subsequence of some larger sequence.

public boolean legalTags(String... tags)
Description copied from interface: TagChunkCodec
Returns true if the specified sequence of tags is a complete legal tag sequence. The companion method TagChunkCodec.legalTagSubSequence(String[]) tests if a subsequence of tags is legal.
Specified by:
legalTags in interface TagChunkCodec
Parameters:
tags - Variable length array of tags.
Returns:
true if the specified sequence of tags is a complete legal tag sequence.

public Chunking toChunking(StringTagging tagging)
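Description copied from interface: TagChunkCodec
Return the result of decoding the specified tagging into a chunking.
Specified by:
toChunking in interface TagChunkCodec
Parameters:
tagging - Tagging to decode.

public StringTagging toStringTagging(Chunking chunking)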
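Description copied from interface: TagChunkCodec
Return the string tagging that fully encodes the specified chunking.
Specified by:
toStringTagging in interface TagChunkCodec
Parameters:
chunking - Chunking to encode.
Throws:
UnsupportedOperationException - If the tokenizer factory is null.

public Tagging<String> toTagging(Chunking chunking)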
Description copied from interface: TagChunkCodec
Return the tagging that partially encodes the specified chunking. For the full encoding, see TagChunkCodec.toStringTagging(Chunking).
This method will typically be more efficient than toStringTagging(), but implementations may just return the same value, because StringTagging extends Tagging<String>.
This method may be implemented by delegating to a call to TagChunkCodec.toStringTagging(Chunking), but a direct implementation is often more efficient.
Specified by:
toTagging in interface TagChunkCodec
Parameters:
chunking - Chunking to encode.
Throws:
UnsupportedOperationException - If the tokenizer factory is null.

public Iterator<Chunk> nBestChunks(TagLattice<String> lattice, int[] tokenStarts, int[] tokenEnds, int maxResults)
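Description copied from interface: TagChunkCodec
Returns an iterator over chunks extracted in order of highest probability up to the specified maximum number of results.
Specified by:
nBestChunks in interface TagChunkCodec
Parameters:
lattice - Lattice from which chunks are extracted.
maxResults - Maximum number of chunks to return.

public String toString()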
public boolean enforceConsistency()
Returns true if this codec enforces consistency of the chunkings relative to the tokenizer factory. Consistency requires each chunk to start on the first character of a token and requires each chunk to end on the last character of a token (as usual, ends are one past the last character).
Returns:
true if this codec enforces consistency of chunkings relative to tokenization.

public TokenizerFactory tokenizerFactory()
public boolean isEncodable(Chunking chunking)
Returns true if the specified chunking may be consistently encoded as a tagging. A chunking is encodable if none of the chunks overlap, and if all chunks begin on the first character of a token and end one past the last character of a token.
Subclasses may enforce further conditions as defined in their class documentation.
Specified by:
isEncodable in interface TagChunkCodec
Parameters:
chunking - Chunking to test.
Returns:
true if the chunking is consistently encodable.
Throws:
UnsupportedOperationException - If the tokenizer is null so that this is only a decoder.

public boolean isDecodable(StringTagging tagging)
Returns true if the specified tagging may be consistently decoded into a chunking. A tagging is decodable if its tokens are the tokens produced by the tokenizer for this codec and if the tags form a legal sequence.
Specified by:
isDecodable in interface TagChunkCodec
Parameters:
tagging - Tagging to test for decodability.
Returns:
true if decoding then encoding produces the specified tagging.
Throws:
UnsupportedOperationException - If the tokenizer is null so that this is only a decoder.

Copyright © 2016 Alias-i, Inc. All rights reserved.