| Package | Description |
|---|---|
| com.aliasi.chunk | Classes for extracting meaningful chunks (spans) of text. |
| com.aliasi.classify | Classes for classifying data and evaluating classifiers. |
| com.aliasi.cluster | Classes for clustering data and evaluating clusterings. |
| com.aliasi.coref | Classes for determining entity coreference within documents. |
| com.aliasi.crf | Classes and interfaces for conditional random fields. |
| com.aliasi.dict | Classes for handling dictionaries. |
| com.aliasi.lm | Classes for character- and token-based language models. |
| com.aliasi.sentences | Classes for sentence-boundary detection. |
| com.aliasi.spell | Classes for spelling correction and edit distance. |
| com.aliasi.suffixarray | Classes for suffix arrays. |
| com.aliasi.test.unit.tokenizer | |
| com.aliasi.tokenizer | Classes for tokenizing character sequences. |
| Modifier and Type | Method and Description |
|---|---|
| TokenizerFactory | CharLmHmmChunker.getTokenizerFactory() Returns the tokenizer factory for this chunker. |
| TokenizerFactory | HmmChunker.getTokenizerFactory() Returns the underlying tokenizer factory for this chunker. |
| Modifier and Type | Method and Description |
|---|---|
| static boolean | CharLmHmmChunker.consistentTokens(String[] toks, String[] whitespaces, TokenizerFactory tokenizerFactory) |
| Constructor and Description |
|---|
| BioTagChunkCodec(TokenizerFactory tokenizerFactory, boolean enforceConsistency) Construct a BIO-encoding based tag-chunk coder/decoder based on the specified tokenizer factory, enforcing consistency of chunkings and taggings if the specified flag is set, and using the default begin, in, and out tags. |
| BioTagChunkCodec(TokenizerFactory tokenizerFactory, boolean enforceConsistency, String beginTagPrefix, String inTagPrefix, String outTag) Construct a BIO-encoding based tag-chunk coder/decoder based on the specified tokenizer factory, enforcing consistency of chunkings and taggings if the specified flag is set. |
| CharLmHmmChunker(TokenizerFactory tokenizerFactory, AbstractHmmEstimator hmmEstimator) Construct a CharLmHmmChunker from the specified tokenizer factory and hidden Markov model estimator. |
| CharLmHmmChunker(TokenizerFactory tokenizerFactory, AbstractHmmEstimator hmmEstimator, boolean smoothTags) Construct a CharLmHmmChunker from the specified tokenizer factory, HMM estimator and tag-smoothing flag. |
| CharLmRescoringChunker(TokenizerFactory tokenizerFactory, int numChunkingsRescored, int nGram, int numChars, double interpolationRatio) Construct a character language model rescoring chunker based on the specified components. |
| CharLmRescoringChunker(TokenizerFactory tokenizerFactory, int numChunkingsRescored, int nGram, int numChars, double interpolationRatio, boolean smoothTags) Construct a character language model rescoring chunker based on the specified components. |
| HmmChunker(TokenizerFactory tokenizerFactory, HmmDecoder decoder) Construct a chunker from the specified tokenizer factory and hidden Markov model decoder. |
| IoTagChunkCodec(TokenizerFactory tokenizerFactory, boolean enforceConsistency) Construct an IO-encoding based tag-chunk coder/decoder based on the specified tokenizer factory, enforcing consistency of chunkings and taggings if the specified flag is set. |
| TrainTokenShapeChunker(TokenCategorizer categorizer, TokenizerFactory factory) Construct a trainer for a token/shape chunker based on the specified token categorizer and tokenizer factory. |
| TrainTokenShapeChunker(TokenCategorizer categorizer, TokenizerFactory factory, int knownMinTokenCount, int minTokenCount, int minTagCount) Construct a trainer for a token/shape chunker based on the specified token categorizer, tokenizer factory and numerical parameters. |
| Modifier and Type | Method and Description |
|---|---|
| static DynamicLMClassifier<TokenizedLM> | DynamicLMClassifier.createTokenized(String[] categories, TokenizerFactory tokenizerFactory, int maxTokenNGram) Construct a dynamic language model classifier over the specified categories using token n-gram language models of the specified order and the specified tokenizer factory for tokenization. |
| Constructor and Description |
|---|
| NaiveBayesClassifier(String[] categories, TokenizerFactory tokenizerFactory) Construct a naive Bayes classifier with the specified categories and tokenizer factory. |
| NaiveBayesClassifier(String[] categories, TokenizerFactory tokenizerFactory, int charSmoothingNGram) Construct a naive Bayes classifier with the specified categories, tokenizer factory and level of character n-gram for smoothing token estimates. |
| NaiveBayesClassifier(String[] categories, TokenizerFactory tokenizerFactory, int charSmoothingNGram, int maxObservedChars) Construct a naive Bayes classifier with the specified categories, tokenizer factory and level of character n-gram for smoothing token estimates, along with a specification of the total number of characters in test and training instances. |
| TradNaiveBayesClassifier(Set<String> categorySet, TokenizerFactory tokenizerFactory) Constructs a naive Bayes classifier over the specified categories, using the specified tokenizer factory. |
| TradNaiveBayesClassifier(Set<String> categorySet, TokenizerFactory tokenizerFactory, double categoryPrior, double tokenInCategoryPrior, double lengthNorm) Constructs a naive Bayes classifier over the specified categories, using the specified tokenizer factory, priors and length normalization. |
| Modifier and Type | Method and Description |
|---|---|
| static int[] | LatentDirichletAllocation.tokenizeDocument(CharSequence text, TokenizerFactory tokenizerFactory, SymbolTable symbolTable) Tokenizes the specified text document using the specified tokenizer factory, returning only tokens that exist in the symbol table. |
| static int[][] | LatentDirichletAllocation.tokenizeDocuments(CharSequence[] texts, TokenizerFactory tokenizerFactory, SymbolTable symbolTable, int minCount) Tokenize an array of text documents represented as character sequences into a form usable by LDA, using the specified tokenizer factory and symbol table. |
| Constructor and Description |
|---|
| AbstractMentionFactory(TokenizerFactory tokenizerFactory) Construct an abstract mention factory with the specified tokenizer factory. |
| Modifier and Type | Method and Description |
|---|---|
| TokenizerFactory | ChainCrfChunker.tokenizerFactory() Returns the tokenizer factory for this chunker. |
| Modifier and Type | Method and Description |
|---|---|
| static ChainCrfChunker | ChainCrfChunker.estimate(Corpus<ObjectHandler<Chunking>> chunkingCorpus, TagChunkCodec codec, TokenizerFactory tokenizerFactory, ChainCrfFeatureExtractor<String> featureExtractor, boolean addInterceptFeature, int minFeatureCount, boolean cacheFeatureVectors, RegressionPrior prior, int priorBlockSize, AnnealingSchedule annealingSchedule, double minImprovement, int minEpochs, int maxEpochs, Reporter reporter) Returns the chain CRF-based chunker estimated from the specified corpus, which is converted to a tagging corpus using the specified coder/decoder and tokenizer factory, then passed to the chain CRF estimate method along with the rest of the arguments. |
| Constructor and Description |
|---|
| ChainCrfChunker(ChainCrf<String> crf, TokenizerFactory tokenizerFactory, TagChunkCodec codec) Construct a chunker based on the specified chain conditional random field, tokenizer factory and tag-chunk coder/decoder. |
| Modifier and Type | Method and Description |
|---|---|
| TokenizerFactory | ApproxDictionaryChunker.tokenizerFactory() Returns the tokenizer factory for matching with this chunker. |
| TokenizerFactory | ExactDictionaryChunker.tokenizerFactory() Returns the tokenizer factory underlying this chunker. |
| Constructor and Description |
|---|
| ApproxDictionaryChunker(TrieDictionary<String> dictionary, TokenizerFactory tokenizerFactory, WeightedEditDistance editDistance, double distanceThreshold) Construct an approximate dictionary chunker from the specified dictionary, tokenizer factory, weighted edit distance and distance bound. |
| ExactDictionaryChunker(Dictionary<String> dict, TokenizerFactory factory) Construct an exact dictionary chunker from the specified dictionary and tokenizer factory which is case sensitive and returns all matches. |
| ExactDictionaryChunker(Dictionary<String> dict, TokenizerFactory factory, boolean returnAllMatches, boolean caseSensitive) Construct an exact dictionary chunker from the specified dictionary and tokenizer factory, returning all matches or not as specified. |
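As an illustrative sketch (not from the original documentation), the exact dictionary chunker above might be wired up as follows. The dictionary entry and the PERSON chunk type are invented for this example, and the code assumes the LingPipe jar is on the classpath:

```java
import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.dict.DictionaryEntry;
import com.aliasi.dict.ExactDictionaryChunker;
import com.aliasi.dict.MapDictionary;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

public class DictionaryChunkDemo {
    public static void main(String[] args) {
        // Build a small in-memory dictionary; the phrase, category,
        // and score below are hypothetical.
        MapDictionary<String> dictionary = new MapDictionary<String>();
        dictionary.addEntry(
            new DictionaryEntry<String>("John Smith", "PERSON", 1.0));

        // Case-insensitive chunker that returns all matches.
        ExactDictionaryChunker chunker =
            new ExactDictionaryChunker(dictionary,
                                       IndoEuropeanTokenizerFactory.INSTANCE,
                                       true,   // returnAllMatches
                                       false); // caseSensitive
        Chunking chunking = chunker.chunk("I spoke to john smith today.");
        for (Chunk chunk : chunking.chunkSet())
            System.out.println(chunk.start() + "-" + chunk.end()
                               + ":" + chunk.type());
    }
}
```

With case sensitivity disabled, the lower-case mention "john smith" still matches the dictionary phrase.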
| Modifier and Type | Method and Description |
|---|---|
| TokenizerFactory | TokenizedLM.tokenizerFactory() Returns the tokenizer factory for this tokenized language model. |
| Constructor and Description |
|---|
| TokenizedLM(TokenizerFactory factory, int nGramOrder) Constructs a tokenized language model with the specified tokenizer factory and n-gram order (see warnings below on where this simple constructor may be used). |
| TokenizedLM(TokenizerFactory tokenizerFactory, int nGramOrder, LanguageModel.Sequence unknownTokenModel, LanguageModel.Sequence whitespaceModel, double lambdaFactor) Construct a tokenized language model with the specified tokenizer factory and n-gram order, sequence models for unknown tokens and whitespace, and an interpolation hyperparameter. |
| TokenizedLM(TokenizerFactory tokenizerFactory, int nGramOrder, LanguageModel.Sequence unknownTokenModel, LanguageModel.Sequence whitespaceModel, double lambdaFactor, boolean initialIncrementBoundary) Construct a tokenized language model with the specified tokenizer factory and n-gram order, sequence models for unknown tokens and whitespace, and an interpolation hyperparameter, as well as a flag indicating whether to automatically increment the boundary count to avoid numerical problems with zero counts. |
| Modifier and Type | Method and Description |
|---|---|
| TokenizerFactory | SentenceChunker.tokenizerFactory() Returns the tokenizer factory for this chunker. |
| Constructor and Description |
|---|
| SentenceChunker(TokenizerFactory tf, SentenceModel sm) Construct a sentence chunker from the specified tokenizer factory and sentence model. |
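A hedged sketch of sentence chunking with this constructor, pairing the Indo-European tokenizer factory and sentence model that ship with the library (the sample text is invented, and the LingPipe jar is assumed on the classpath):

```java
import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.sentences.IndoEuropeanSentenceModel;
import com.aliasi.sentences.SentenceChunker;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

public class SentenceDemo {
    public static void main(String[] args) {
        // Sentence chunker = tokenizer factory + sentence model.
        SentenceChunker chunker =
            new SentenceChunker(IndoEuropeanTokenizerFactory.INSTANCE,
                                new IndoEuropeanSentenceModel());

        String text = "Mr. Smith arrived. He left at 3 p.m.";
        Chunking chunking =
            chunker.chunk(text.toCharArray(), 0, text.length());

        // Each chunk is a sentence span over the input text.
        for (Chunk sentence : chunking.chunkSet())
            System.out.println(
                text.substring(sentence.start(), sentence.end()));
    }
}
```

The sentence model handles abbreviations such as "Mr." so that the period there is not treated as a sentence boundary.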
| Modifier and Type | Method and Description |
|---|---|
| TokenizerFactory | TokenizedDistance.tokenizerFactory() Return the tokenizer factory for this tokenized distance. |
| TokenizerFactory | CompiledSpellChecker.tokenizerFactory() Returns the tokenizer factory for this spell checker. |

| Modifier and Type | Method and Description |
|---|---|
| void | CompiledSpellChecker.setTokenizerFactory(TokenizerFactory factory) Sets the tokenizer factory for input processing to the specified value. |
| Constructor and Description |
|---|
| CompiledSpellChecker(CompiledNGramProcessLM lm, WeightedEditDistance editDistance, TokenizerFactory factory, Set<String> tokenSet, int nBestSize) Construct a compiled spell checker based on the specified language model and edit distance, tokenizer factory, the set of valid output tokens, and maximum n-best size, with default known token and first and second character edit costs. |
| CompiledSpellChecker(CompiledNGramProcessLM lm, WeightedEditDistance editDistance, TokenizerFactory factory, Set<String> tokenSet, int nBestSize, double knownTokenEditCost, double firstCharEditCost, double secondCharEditCost) Construct a compiled spell checker based on the specified language model and similarity edit distance, set of valid output tokens, maximum n-best size per character, and the specified edit penalties for editing known tokens or the first or second characters of tokens. |
| JaccardDistance(TokenizerFactory factory) Construct an instance of Jaccard string distance using the specified tokenizer factory. |
| TfIdfDistance(TokenizerFactory tokenizerFactory) Construct an instance of TF/IDF string distance based on the specified tokenizer factory. |
| TokenizedDistance(TokenizerFactory tokenizerFactory) Construct a tokenized distance from the specified tokenizer factory. |
| TrainSpellChecker(NGramProcessLM lm, WeightedEditDistance editDistance, TokenizerFactory tokenizerFactory) Construct a spell checker trainer from the specified n-gram process language model, tokenizer factory and edit distance. |
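A minimal sketch of the token-based string distances above, here using JaccardDistance over the Indo-European tokenizer factory (the comparison strings are invented; the LingPipe jar is assumed on the classpath):

```java
import com.aliasi.spell.JaccardDistance;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

public class JaccardDemo {
    public static void main(String[] args) {
        // Jaccard proximity = |shared tokens| / |union of tokens|.
        JaccardDistance jaccard =
            new JaccardDistance(IndoEuropeanTokenizerFactory.INSTANCE);

        // Token sets share "the", "quick", "fox" out of five
        // distinct tokens overall.
        double proximity = jaccard.proximity("the quick brown fox",
                                             "the quick red fox");
        System.out.println(proximity);
    }
}
```

Because the distance is defined over token sets, the choice of tokenizer factory (e.g. adding stemming or stop-listing wrappers) directly changes the scores.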
| Constructor and Description |
|---|
| DocumentTokenSuffixArray(Map<String,String> idToDocMap, TokenizerFactory tf, int maxSuffixLength, String documentBoundaryToken) Construct a suffix array from the specified identified document collection using the specified tokenizer factory, limiting comparisons to the specified maximum suffix length and separating documents with the specified boundary token. |
| Modifier and Type | Class and Description |
|---|---|
| class | ConstantTokenizerFactory |

| Modifier and Type | Method and Description |
|---|---|
| static void | TokenizerTest.assertFactory(TokenizerFactory factory, String input, String... tokens) |
| static void | TokenizerTest.assertFactory(TokenizerFactory factory, String input, String[] tokens, String[] whitespaces) |
| static void | TokenizerTest.assertTokenization(TokenizerFactory factory, String input, String[] tokens, String[] whitespaces) |
| protected void | RegExTokenizerFactoryTest.assertTokenize(String input, String[] whitespaces, String[] tokens, int[] starts, TokenizerFactory factory) |
| Modifier and Type | Class and Description |
|---|---|
| class | CharacterTokenizerFactory A CharacterTokenizerFactory considers each non-whitespace character in the input to be a distinct token. |
| class | EnglishStopTokenizerFactory An EnglishStopTokenizerFactory applies an English stop list to a contained base tokenizer factory. |
| class | IndoEuropeanTokenizerFactory An IndoEuropeanTokenizerFactory creates tokenizers with built-in support for alpha-numerics, numbers, and other common constructs in Indo-European languages. |
| class | LineTokenizerFactory A LineTokenizerFactory treats each line of an input as a token. |
| class | LowerCaseTokenizerFactory A LowerCaseTokenizerFactory filters the tokenizers produced by a base tokenizer factory to produce lower case output. |
| class | ModifiedTokenizerFactory A ModifiedTokenizerFactory is an abstract tokenizer factory that modifies a tokenizer returned by a base tokenizer factory. |
| class | ModifyTokenTokenizerFactory The abstract base class ModifyTokenTokenizerFactory adapts token and whitespace modifiers to modify tokenizer factories. |
| class | NGramTokenizerFactory An NGramTokenizerFactory creates n-gram tokenizers of a specified minimum and maximum length. |
| class | PorterStemmerTokenizerFactory A PorterStemmerTokenizerFactory applies Porter's stemmer to the tokenizers produced by a base tokenizer factory. |
| class | RegExFilteredTokenizerFactory A RegExFilteredTokenizerFactory modifies the tokens returned by a base tokenizer factory's tokenizer by removing those that do not match a regular expression pattern. |
| class | RegExTokenizerFactory A RegExTokenizerFactory creates a tokenizer factory out of a regular expression. |
| class | SoundexTokenizerFactory A SoundexTokenizerFactory modifies the output of a base tokenizer factory to produce tokens in Soundex representation. |
| class | StopTokenizerFactory A StopTokenizerFactory modifies a base tokenizer factory by removing tokens in a specified stop set. |
| class | TokenLengthTokenizerFactory A TokenLengthTokenizerFactory filters the tokenizers produced by a base tokenizer factory to only return tokens between specified lower and upper length limits. |
| class | TokenNGramTokenizerFactory A TokenNGramTokenizerFactory wraps a base tokenizer factory to produce token n-gram tokens of a specified size. |
| class | WhitespaceNormTokenizerFactory A WhitespaceNormTokenizerFactory filters the tokenizers produced by a base tokenizer factory to convert non-empty whitespaces to a single space and leave empty (zero-length) whitespaces alone. |
| Modifier and Type | Field and Description |
|---|---|
| static TokenizerFactory | CharacterTokenizerFactory.INSTANCE An instance of a character tokenizer factory, which may be used wherever a character tokenizer factory is needed. |

| Modifier and Type | Method and Description |
|---|---|
| TokenizerFactory | TokenNGramTokenizerFactory.baseTokenizerFactory() Return the base tokenizer factory used to generate the underlying tokens from which n-grams are generated. |
| TokenizerFactory | ModifiedTokenizerFactory.baseTokenizerFactory() Return the base tokenizer factory. |
| TokenizerFactory | TokenChunker.tokenizerFactory() Return the tokenizer factory for this token chunker. |
| Constructor and Description |
|---|
| EnglishStopTokenizerFactory(TokenizerFactory factory) Construct an English stop tokenizer factory with the specified base factory. |
| LowerCaseTokenizerFactory(TokenizerFactory factory) Construct a lowercasing tokenizer factory from the specified base factory using the locale Locale.ENGLISH. |
| LowerCaseTokenizerFactory(TokenizerFactory factory, Locale locale) Construct a lowercasing tokenizer factory from the specified base factory using the specified locale. |
| ModifiedTokenizerFactory(TokenizerFactory baseFactory) Construct a modified tokenizer factory with the specified base factory. |
| ModifyTokenTokenizerFactory(TokenizerFactory factory) Construct a token-modifying tokenizer factory with the specified base factory. |
| PorterStemmerTokenizerFactory(TokenizerFactory factory) Construct a tokenizer factory that applies Porter stemming to the tokenizers produced by the specified base factory. |
| RegExFilteredTokenizerFactory(TokenizerFactory factory, Pattern pattern) Construct a regular-expression filtered tokenizer factory from the specified base factory and regular expression pattern that accepted tokens must match. |
| SoundexTokenizerFactory(TokenizerFactory factory) Construct a Soundex-based tokenizer factory that converts tokens produced by the specified base factory into their Soundex representations. |
| StopTokenizerFactory(TokenizerFactory factory, Set<String> stopSet) Construct a tokenizer factory that removes tokens in the specified stop set from tokenizers produced by the specified base factory. |
| TokenChunker(TokenizerFactory tokenizerFactory) Construct a chunker from the specified tokenizer factory. |
| TokenFeatureExtractor(TokenizerFactory factory) Construct a token-based feature extractor from the specified tokenizer factory. |
| Tokenization(char[] cs, int start, int length, TokenizerFactory factory) Construct a tokenization from the specified text and tokenizer factory. |
| Tokenization(String text, TokenizerFactory factory) Construct a tokenization from the specified text and tokenizer factory. |
| TokenLengthTokenizerFactory(TokenizerFactory factory, int shortestTokenLength, int longestTokenLength) Construct a token-length filtered tokenizer factory from the specified factory that removes tokens shorter than the shortest or longer than the longest length. |
| TokenNGramTokenizerFactory(TokenizerFactory factory, int min, int max) Construct a token n-gram tokenizer factory using the specified base factory that produces n-grams within the specified minimum and maximum length bounds. |
| WhitespaceNormTokenizerFactory(TokenizerFactory factory) Construct a whitespace-normalizing tokenizer factory from the specified base factory. |
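The wrapping constructors above compose: each filtering factory takes a base factory and returns another TokenizerFactory. A minimal sketch chaining the Indo-European base factory through lowercasing and English stop-listing (the sample text is invented; the LingPipe jar is assumed on the classpath):

```java
import com.aliasi.tokenizer.EnglishStopTokenizerFactory;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.LowerCaseTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

public class TokenizerPipelineDemo {
    public static void main(String[] args) {
        // Wrap the base factory: tokenize, then lowercase,
        // then drop English stop words.
        TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
        factory = new LowerCaseTokenizerFactory(factory);
        factory = new EnglishStopTokenizerFactory(factory);

        char[] cs = "The Quick Brown Fox".toCharArray();
        String[] tokens = factory.tokenizer(cs, 0, cs.length).tokenize();
        for (String token : tokens)
            System.out.println(token);
    }
}
```

Order matters in such pipelines: lowercasing must precede the stop list here, since the English stop list matches lower-case forms.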
Copyright © 2016 Alias-i, Inc. All rights reserved.