| BigramTokenizer |
Advanced tokenizer that lowercases, adds start and end tags, deduplicates
tokens and builds bigrams.
|
| DocumentSimilarity |
Simple distance measure wrapper for debugging string similarity measures.
|
| HMM |
Hidden Markov Model implementation for multiple observations, covering all
three problems an HMM aims to solve (decoding, likelihood estimation, and
unsupervised/supervised learning).
|
| MarkovChain |
Markov chain that can "learn" the state transition probabilities from a given
input and return the probability of a given sequence of states.
|
| MinHash |
Linear MinHash algorithm to find near duplicates faster or to speed up nearest
neighbour searches.
|
| SparseVectorDocumentMapper |
Mapper that maps sparse vectors into a set of their indices so they can be
used in the InvertedIndex for fast lookup.
|
| StandardTokenizer |
Basic tokenizer that splits on certain attributes and normalizes the tokens.
|
| TokenizerUtils |
Text utility, mainly for tokenization tasks.
|
| VectorizerUtils |
Vectorization utility for basic TF-IDF and word-count vectorization of
tokens/strings.
|
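
To make the MarkovChain row more concrete, here is a minimal sketch of the general idea: estimate transition probabilities by counting consecutive state pairs, then score a sequence by multiplying the probabilities of its transitions. The class `MarkovChainSketch` and its methods `train` and `sequenceProbability` are hypothetical names chosen for illustration; they are not the API of the `MarkovChain` class listed above, and encoding states as integers is an assumption.

```java
import java.util.Arrays;

/**
 * Hypothetical illustration only; NOT the library's MarkovChain class.
 * States are assumed to be integers in the range [0, numStates).
 */
class MarkovChainSketch {

    private final int numStates;
    private final double[][] transitionProbabilities;

    MarkovChainSketch(int numStates) {
        this.numStates = numStates;
        this.transitionProbabilities = new double[numStates][numStates];
    }

    /** Estimates transition probabilities by counting consecutive state pairs. */
    void train(int[][] sequences) {
        double[][] counts = new double[numStates][numStates];
        for (int[] sequence : sequences) {
            for (int i = 0; i + 1 < sequence.length; i++) {
                counts[sequence[i]][sequence[i + 1]]++;
            }
        }
        // Normalize each row so the outgoing probabilities of a state sum to 1.
        for (int from = 0; from < numStates; from++) {
            double rowSum = Arrays.stream(counts[from]).sum();
            for (int to = 0; to < numStates; to++) {
                transitionProbabilities[from][to] =
                        rowSum == 0 ? 0 : counts[from][to] / rowSum;
            }
        }
    }

    /**
     * Returns the probability of the transitions in the sequence, conditioned
     * on its first state. Summing logarithms avoids numeric underflow on long
     * sequences; an unseen transition yields a probability of 0.
     */
    double sequenceProbability(int[] sequence) {
        double logProbability = 0;
        for (int i = 0; i + 1 < sequence.length; i++) {
            logProbability +=
                    Math.log(transitionProbabilities[sequence[i]][sequence[i + 1]]);
        }
        return Math.exp(logProbability);
    }
}
```

As a usage example, training on the single sequence `0, 1, 0, 1, 1` gives P(0→1) = 1 and P(1→0) = P(1→1) = 0.5, so the sequence `0, 1, 1` scores 0.5 while `0, 0` scores 0. A real implementation would likely add smoothing so that unseen transitions do not zero out the whole sequence.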