| Interface | Description |
|---|---|
| TokenCategorizer | A TokenCategorizer supplies a string-based category for string-based tokens. |
| TokenizerFactory | A TokenizerFactory constructs tokenizers from subsequences of character arrays. |
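
The division of labor between the two interfaces can be pictured with a minimal, self-contained sketch. The types `SimpleTokenizer`, `SimpleTokenizerFactory`, and `SimpleTokenCategorizer` below are hypothetical stand-ins defined only for illustration; they are not the package's own interfaces, and their method signatures and the category names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tokenizer: a stream of tokens.
interface SimpleTokenizer {
    String nextToken(); // assumed convention: null once the input is exhausted
}

// Hypothetical stand-in for a factory that builds tokenizers
// over a subsequence (start, length) of a character array.
interface SimpleTokenizerFactory {
    SimpleTokenizer tokenizer(char[] cs, int start, int length);
}

// Hypothetical stand-in for a categorizer: string token in, string category out.
interface SimpleTokenCategorizer {
    String categorize(String token);
}

final class InterfaceSketch {
    // Factory whose tokenizers split the given slice on whitespace.
    static final SimpleTokenizerFactory WHITESPACE_FACTORY =
        (cs, start, length) -> new SimpleTokenizer() {
            private int pos = start;
            private final int end = start + length;

            @Override
            public String nextToken() {
                // Skip whitespace preceding the next token.
                while (pos < end && Character.isWhitespace(cs[pos])) pos++;
                if (pos >= end) return null;
                int tokenStart = pos;
                while (pos < end && !Character.isWhitespace(cs[pos])) pos++;
                return new String(cs, tokenStart, pos - tokenStart);
            }
        };

    // Categorizer based on a crude notion of character "shape".
    static final SimpleTokenCategorizer SHAPE_CATEGORIZER = token -> {
        if (token.isEmpty()) return "EMPTY";
        if (token.chars().allMatch(Character::isDigit)) return "DIGITS";
        if (token.chars().allMatch(Character::isLetter)) return "LETTERS";
        return "OTHER";
    };

    // Drains a tokenizer built over the whole text into a list.
    static List<String> tokenize(SimpleTokenizerFactory factory, String text) {
        char[] cs = text.toCharArray();
        SimpleTokenizer tokenizer = factory.tokenizer(cs, 0, cs.length);
        List<String> tokens = new ArrayList<>();
        for (String t = tokenizer.nextToken(); t != null; t = tokenizer.nextToken()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize(WHITESPACE_FACTORY, "  call 911 now! ")) {
            System.out.println(token + " -> " + SHAPE_CATEGORIZER.categorize(token));
        }
    }
}
```

The factory is stateless and can be shared, while each tokenizer it returns carries its own position in the input; that separation is the design point the sketch is meant to show.
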

| Class | Description |
|---|---|
| CharacterTokenCategorizer | Returns a category for tokens made up of a single character. |
| CharacterTokenizerFactory | A CharacterTokenizerFactory considers each non-whitespace character in the input to be a distinct token. |
| EnglishStopTokenizerFactory | An EnglishStopTokenizerFactory applies an English stop list to a contained base tokenizer factory. |
| IndoEuropeanTokenCategorizer | An IndoEuropeanTokenCategorizer is a generic token categorizer for Indo-European languages that is based on character "shape". |
| IndoEuropeanTokenizerFactory | An IndoEuropeanTokenizerFactory creates tokenizers with built-in support for alphanumerics, numbers, and other common constructs in Indo-European languages. |
| LineTokenizerFactory | A LineTokenizerFactory treats each line of an input as a token. |
| LowerCaseTokenizerFactory | A LowerCaseTokenizerFactory filters the tokenizers produced by a base tokenizer factory to produce lower case output. |
| ModifiedTokenizerFactory | A ModifiedTokenizerFactory is an abstract tokenizer factory that modifies a tokenizer returned by a base tokenizer factory. |
| ModifyTokenTokenizerFactory | The abstract base class ModifyTokenTokenizerFactory adapts token and whitespace modifiers to modify tokenizer factories. |
| NGramTokenizerFactory | An NGramTokenizerFactory creates n-gram tokenizers of a specified minimum and maximum length. |
| PorterStemmerTokenizerFactory | A PorterStemmerTokenizerFactory applies Porter's stemmer to the tokenizers produced by a base tokenizer factory. |
| RegExFilteredTokenizerFactory | A RegExFilteredTokenizerFactory modifies the tokens returned by a base tokenizer factory's tokenizer by removing those that do not match a regular expression pattern. |
| RegExTokenizerFactory | A RegExTokenizerFactory creates a tokenizer factory out of a regular expression. |
| SoundexTokenizerFactory | A SoundexTokenizerFactory modifies the output of a base tokenizer factory to produce tokens in Soundex representation. |
| StopTokenizerFactory | A StopTokenizerFactory modifies a base tokenizer factory by removing tokens in a specified stop set. |
| TokenChunker | A TokenChunker provides an implementation of the Chunker interface based on an underlying tokenizer factory. |
| TokenFeatureExtractor | A TokenFeatureExtractor produces feature vectors from character sequences representing token counts. |
| Tokenization | A Tokenization represents the result of tokenizing a string. |
| Tokenizer | The abstract class Tokenizer serves as a base for tokenizer implementations, which provide streams of tokens, whitespaces, and positions. |
| TokenLengthTokenizerFactory | A TokenLengthTokenizerFactory filters the tokenizers produced by a base tokenizer to only return tokens between specified lower and upper length limits. |
| TokenNGramTokenizerFactory | A TokenNGramTokenizerFactory wraps a base tokenizer to produce token n-gram tokens of a specified size. |
| WhitespaceNormTokenizerFactory | A WhitespaceNormTokenizerFactory filters the tokenizers produced by a base tokenizer factory to convert non-empty whitespaces to a single space and leave empty (zero-length) whitespaces alone. |
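
Many of the classes above share one pattern: a factory wraps a base factory and modifies the tokens it yields. The following self-contained sketch illustrates that pattern with lower-casing, stop-set removal, and length filtering, plus a character n-gram tokenizer with minimum and maximum lengths. It is a simplification under stated assumptions: `ListTokenizerFactory` and every method of `FilterChainSketch` are hypothetical names invented here, the factory returns a token list directly rather than a streaming tokenizer, and the order in which n-grams are emitted is an arbitrary choice, not necessarily the package's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical simplification: a factory maps input text straight to a token list.
interface ListTokenizerFactory {
    List<String> tokenize(CharSequence input);
}

final class FilterChainSketch {
    // Base factory: split on runs of whitespace.
    static ListTokenizerFactory whitespace() {
        return input -> {
            String trimmed = input.toString().trim();
            if (trimmed.isEmpty()) return new ArrayList<String>();
            return Arrays.asList(trimmed.split("\\s+"));
        };
    }

    // Wraps a base factory and lower-cases every token it produces.
    static ListTokenizerFactory lowerCase(ListTokenizerFactory base) {
        return input -> base.tokenize(input).stream()
            .map(t -> t.toLowerCase(Locale.ENGLISH))
            .collect(Collectors.toList());
    }

    // Wraps a base factory and drops tokens found in the stop set.
    static ListTokenizerFactory stop(ListTokenizerFactory base, Set<String> stopSet) {
        return input -> base.tokenize(input).stream()
            .filter(t -> !stopSet.contains(t))
            .collect(Collectors.toList());
    }

    // Wraps a base factory and keeps tokens whose length is in [min, max].
    static ListTokenizerFactory lengthBetween(ListTokenizerFactory base, int min, int max) {
        return input -> base.tokenize(input).stream()
            .filter(t -> t.length() >= min && t.length() <= max)
            .collect(Collectors.toList());
    }

    // Character n-grams of every length from min to max (shortest first; assumed order).
    static ListTokenizerFactory charNGrams(int min, int max) {
        return input -> {
            String s = input.toString();
            List<String> grams = new ArrayList<>();
            for (int n = min; n <= max; n++) {
                for (int i = 0; i + n <= s.length(); i++) {
                    grams.add(s.substring(i, i + n));
                }
            }
            return grams;
        };
    }

    public static void main(String[] args) {
        Set<String> stopSet = new HashSet<>(Arrays.asList("the", "of"));
        // Filters compose from the inside out: split, lower-case, stop, length-filter.
        ListTokenizerFactory chain =
            lengthBetween(stop(lowerCase(whitespace()), stopSet), 2, 10);
        System.out.println(chain.tokenize("The Tokenizer of  a Package"));
        System.out.println(charNGrams(1, 2).tokenize("abc"));
    }
}
```

The order of wrapping matters in this sketch: lower-casing is applied before the stop set is consulted, so a lower-case stop list also catches "The".
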

Copyright © 2016 Alias-i, Inc. All rights reserved.