public final class VectorizerUtils extends Object
| Modifier and Type | Field and Description |
|---|---|
| static String | OUT_OF_VOCABULARY |
| Constructor and Description |
|---|
| VectorizerUtils() |
| Modifier and Type | Method and Description |
|---|---|
| static String[] | buildDictionary(java.util.stream.Stream<String[]> tokenizedDocuments) - Builds a sorted dictionary of tokens from a list of (tokenized) documents. |
| static String[] | buildDictionary(java.util.stream.Stream<String[]> tokenizedDocuments, float stopWordPercentage, int minFrequency) - Builds a sorted dictionary of tokens from a list of (tokenized) documents. |
| static int[][] | buildInvertedIndexArray(List<String[]> tokenizedDocuments, String[] dictionary) - Builds an inverted index based on the given dictionary, adding just the document index mappings to it. |
| static int[] | buildInvertedIndexDocumentCount(List<String[]> tokenizedDocuments, String[] dictionary) - Builds an inverted index document count based on the given dictionary, so that each dimension of the returned array holds the count of how many documents contained that token. |
| static com.google.common.collect.HashMultimap<String,Integer> | buildInvertedIndexMap(List<String[]> tokenizedDocuments, String[] dictionary) - Builds an inverted index as a multimap. |
| static int[] | buildTransitionVector(String[] dict, String[] doc) - Builds a transition array by traversing the document and checking the dictionary. |
| static <E> ArrayList<com.google.common.collect.Multiset.Entry<E>> | getMostFrequentItems(com.google.common.collect.Multiset<E> set) - Given a multiset of generic elements, returns a list of all the elements, sorted descending by their frequency. |
| static <E> ArrayList<com.google.common.collect.Multiset.Entry<E>> | getMostFrequentItems(com.google.common.collect.Multiset<E> set, com.google.common.base.Predicate<com.google.common.collect.Multiset.Entry<E>> filter) - Given a multiset of generic elements, returns a list of all the elements, sorted descending by their frequency. |
| static de.jungblut.math.DoubleVector[] | hashVectorize(de.jungblut.math.DoubleVector[] features, int n, com.google.common.hash.HashFunction hashFunction) - Hashes the given vectors into a new n-dimensional feature space. |
| static de.jungblut.math.DoubleVector | hashVectorize(de.jungblut.math.DoubleVector inputFeature, int n, com.google.common.hash.HashFunction hashFunction) - Hashes the given vector into a new n-dimensional feature space. |
| static java.util.stream.Stream<de.jungblut.math.DoubleVector> | sparseHashVectorize(java.util.stream.Stream<String[]> documents, com.google.common.hash.HashFunction hashFunction, java.util.function.Supplier<de.jungblut.math.DoubleVector> vectorFactory) - Uses the hashing trick to provide a sparse numeric representation of the given input. |
| static de.jungblut.math.DoubleVector | sparseHashVectorize(String[] doc, com.google.common.hash.HashFunction hashFunction, java.util.function.Supplier<de.jungblut.math.DoubleVector> vectorFactory) - Uses the hashing trick to provide a sparse numeric representation of the given input. |
| static de.jungblut.math.DoubleVector | tfIdfVectorize(int numDocuments, String[] document, String[] dictionary, int[] termDocumentCount) - Vectorizes the given single document by the TF-IDF weighting. |
| static List<de.jungblut.math.DoubleVector> | tfIdfVectorize(List<String[]> tokenizedDocuments, String[] dictionary, int[] termDocumentCount) - Vectorizes the given documents by the TF-IDF weighting. |
| static java.util.stream.Stream<de.jungblut.math.DoubleVector> | wordFrequencyVectorize(java.util.stream.Stream<String[]> tokenizedDocuments) - Vectorizes a given list of documents. |
| static java.util.stream.Stream<de.jungblut.math.DoubleVector> | wordFrequencyVectorize(java.util.stream.Stream<String[]> tokenizedDocuments, String[] dictionary) - Vectorizes a given list of documents against a dictionary. |
| static java.util.stream.Stream<de.jungblut.math.DoubleVector> | wordFrequencyVectorize(String[]... vars) - Vectorizes the given documents. |
public static final String OUT_OF_VOCABULARY

public static String[] buildDictionary(java.util.stream.Stream<String[]> tokenizedDocuments)
Builds a sorted dictionary of tokens from a list of (tokenized) documents.
Parameters:
tokenizedDocuments - the documents that are already tokenized.

public static String[] buildDictionary(java.util.stream.Stream<String[]> tokenizedDocuments, float stopWordPercentage, int minFrequency)
Builds a sorted dictionary of tokens from a list of (tokenized) documents.
Parameters:
tokenizedDocuments - the documents that are the base for the dictionary.
stopWordPercentage - the percentage of documents that must contain a token before it is classified as a stop word. Ranges between 0f and 1f, where 0f will actually return an empty dictionary.
minFrequency - the minimum frequency a token must occur globally (strictly greater than the supplied value).

public static int[] buildTransitionVector(String[] dict, String[] doc)
Builds a transition array by traversing the document and checking the dictionary.
Parameters:
dict - the dictionary.
doc - the document to build a transition vector for.
See Also:
MarkovChain

public static com.google.common.collect.HashMultimap<String,Integer> buildInvertedIndexMap(List<String[]> tokenizedDocuments, String[] dictionary)
Builds an inverted index as a multimap.
Parameters:
tokenizedDocuments - the documents to index, already tokenized.
dictionary - the dictionary of words that should be used to build this index.
Returns:
a HashMultimap that contains a set of integers (indices of the documents in the given input list) mapped by each token that was contained in the documents.

public static int[][] buildInvertedIndexArray(List<String[]> tokenizedDocuments, String[] dictionary)
Builds an inverted index based on the given dictionary, adding just the document index mappings to it.
Parameters:
tokenizedDocuments - the documents to index, already tokenized.
dictionary - the dictionary of words that should be used to build this index.

public static int[] buildInvertedIndexDocumentCount(List<String[]> tokenizedDocuments, String[] dictionary)
Builds an inverted index document count based on the given dictionary, so that each dimension of the returned array holds the count of how many documents contained that token.
Parameters:
tokenizedDocuments - the documents to index, already tokenized.
dictionary - the dictionary of words that should be used to build this index.

public static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize(String[]... vars)
Vectorizes the given documents.
Parameters:
vars - the array of documents.

public static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize(java.util.stream.Stream<String[]> tokenizedDocuments)
Vectorizes a given list of documents.
Parameters:
tokenizedDocuments - the list of documents.

public static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize(java.util.stream.Stream<String[]> tokenizedDocuments, String[] dictionary)
Vectorizes a given list of documents against a dictionary.
Parameters:
tokenizedDocuments - the list of documents.
dictionary - the dictionary; must be sorted.

public static List<de.jungblut.math.DoubleVector> tfIdfVectorize(List<String[]> tokenizedDocuments, String[] dictionary, int[] termDocumentCount)
Vectorizes the given documents by the TF-IDF weighting.
Parameters:
tokenizedDocuments - the documents to vectorize.
dictionary - the extracted dictionary.
termDocumentCount - the document count per token. This information can be retrieved through buildInvertedIndexDocumentCount(List, String[]).

public static de.jungblut.math.DoubleVector tfIdfVectorize(int numDocuments, String[] document, String[] dictionary, int[] termDocumentCount)
Vectorizes the given single document by the TF-IDF weighting.
Parameters:
numDocuments - the number of documents in the corpus.
document - the document to vectorize.
dictionary - the extracted dictionary.
termDocumentCount - the document count per token.

public static <E> ArrayList<com.google.common.collect.Multiset.Entry<E>> getMostFrequentItems(com.google.common.collect.Multiset<E> set)
Given a multiset of generic elements, returns a list of all the elements, sorted descending by their frequency.
Parameters:
set - the given multiset.

public static <E> ArrayList<com.google.common.collect.Multiset.Entry<E>> getMostFrequentItems(com.google.common.collect.Multiset<E> set, com.google.common.base.Predicate<com.google.common.collect.Multiset.Entry<E>> filter)
Given a multiset of generic elements, returns a list of all the elements, sorted descending by their frequency.
Parameters:
set - the given multiset.
filter - if not null, filters the entries by the given Predicate.

public static de.jungblut.math.DoubleVector hashVectorize(de.jungblut.math.DoubleVector inputFeature, int n, com.google.common.hash.HashFunction hashFunction)
Hashes the given vector into a new n-dimensional feature space.
Parameters:
inputFeature - the (usually) sparse feature vector.
n - the target dimension of the vector.
hashFunction - the hash function to use, for example Hashing.murmur3_128().

public static de.jungblut.math.DoubleVector[] hashVectorize(de.jungblut.math.DoubleVector[] features, int n, com.google.common.hash.HashFunction hashFunction)
Hashes the given vectors into a new n-dimensional feature space.
Parameters:
features - the (usually) sparse feature vectors.
n - the target dimension of the vectors.
hashFunction - the hash function to use, for example Hashing.murmur3_128().

public static java.util.stream.Stream<de.jungblut.math.DoubleVector> sparseHashVectorize(java.util.stream.Stream<String[]> documents, com.google.common.hash.HashFunction hashFunction, java.util.function.Supplier<de.jungblut.math.DoubleVector> vectorFactory)
Uses the hashing trick to provide a sparse numeric representation of the given input. In contrast to hashVectorize(DoubleVector, int, com.google.common.hash.HashFunction), it takes raw tokenized documents directly and uses only their hash values to find the respective index in the newly created vector.
Parameters:
documents - the tokenized documents.
hashFunction - the hasher. It will be ignored when a parallel stream is passed; in that case String.hashCode() is used, as it is thread-safe.
vectorFactory - the factory to create a new vector of size x.

public static de.jungblut.math.DoubleVector sparseHashVectorize(String[] doc, com.google.common.hash.HashFunction hashFunction, java.util.function.Supplier<de.jungblut.math.DoubleVector> vectorFactory)
Uses the hashing trick to provide a sparse numeric representation of the given input. In contrast to hashVectorize(DoubleVector, int, com.google.common.hash.HashFunction), it takes a raw tokenized document directly and uses only its hash values to find the respective index in the newly created vector.
Parameters:
doc - the tokenized document.
hashFunction - the hasher. If null, the Java hashCode for strings is used.
vectorFactory - the factory to create a new vector of size x.

Copyright © 2016. All rights reserved.
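To illustrate what buildDictionary conceptually does, the following is a standalone sketch (not this library's implementation): collect the distinct tokens of all documents into a sorted array. The real method also supports stop-word and minimum-frequency filtering; the class and method names here are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class DictionarySketch {

  // Collect all distinct tokens into a sorted array (TreeSet is
  // sorted and deduplicated by construction).
  public static String[] buildDictionary(List<String[]> tokenizedDocuments) {
    TreeSet<String> tokens = new TreeSet<>();
    for (String[] doc : tokenizedDocuments) {
      tokens.addAll(Arrays.asList(doc));
    }
    return tokens.toArray(new String[0]);
  }

  public static void main(String[] args) {
    List<String[]> docs = Arrays.asList(
        new String[] {"the", "quick", "fox"},
        new String[] {"the", "lazy", "dog"});
    // prints dog,fox,lazy,quick,the
    System.out.println(String.join(",", buildDictionary(docs)));
  }
}
```

A sorted dictionary matters because the vectorization methods below can then locate a token's dimension with a binary search instead of an extra lookup table.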
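The idea behind wordFrequencyVectorize with a dictionary can be sketched like this (a simplified standalone version: the real methods return sparse de.jungblut.math.DoubleVector instances, we use a plain double[]; names are hypothetical). Each dimension counts how often the corresponding dictionary token occurs in the document:

```java
import java.util.Arrays;

public class WordFrequencySketch {

  // One dimension per dictionary entry; the value is the token's
  // occurrence count in this document.
  public static double[] vectorize(String[] doc, String[] sortedDictionary) {
    double[] vector = new double[sortedDictionary.length];
    for (String token : doc) {
      // binarySearch is why the dictionary must be sorted
      int index = Arrays.binarySearch(sortedDictionary, token);
      if (index >= 0) {
        vector[index] += 1d;
      } // out-of-vocabulary tokens are simply skipped in this sketch
    }
    return vector;
  }

  public static void main(String[] args) {
    String[] dict = {"dog", "fox", "the"};
    double[] v = vectorize(new String[] {"the", "fox", "the"}, dict);
    System.out.println(Arrays.toString(v)); // [0.0, 1.0, 2.0]
  }
}
```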
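buildInvertedIndexDocumentCount can be sketched the same way (standalone, hypothetical names): for each dictionary token, count in how many documents it appears, counting each document at most once per token.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class InvertedIndexSketch {

  // counts[i] = number of documents containing sortedDictionary[i]
  public static int[] buildInvertedIndexDocumentCount(
      List<String[]> tokenizedDocuments, String[] sortedDictionary) {
    int[] counts = new int[sortedDictionary.length];
    for (String[] doc : tokenizedDocuments) {
      // deduplicate so a document is counted once per token
      Set<String> unique = new HashSet<>(Arrays.asList(doc));
      for (String token : unique) {
        int index = Arrays.binarySearch(sortedDictionary, token);
        if (index >= 0) {
          counts[index]++;
        }
      }
    }
    return counts;
  }
}
```

This per-token document count is exactly the input the TF-IDF methods expect as termDocumentCount.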
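For tfIdfVectorize, a common formulation of TF-IDF weighting looks like the sketch below: term frequency multiplied by the log of the inverse document frequency. The exact weighting variant the library uses is not stated in this reference, so treat this as illustrative only (standalone, hypothetical names):

```java
import java.util.Arrays;

public class TfIdfSketch {

  // weight[i] = tf(token i in document) * log(numDocuments / documentCount(i))
  public static double[] tfIdfVectorize(int numDocuments, String[] document,
      String[] sortedDictionary, int[] termDocumentCount) {
    double[] vector = new double[sortedDictionary.length];
    for (String token : document) {
      int index = Arrays.binarySearch(sortedDictionary, token);
      if (index >= 0) {
        vector[index] += 1d; // raw term frequency
      }
    }
    for (int i = 0; i < vector.length; i++) {
      if (vector[i] > 0 && termDocumentCount[i] > 0) {
        vector[i] *= Math.log((double) numDocuments / termDocumentCount[i]);
      }
    }
    return vector;
  }
}
```

Tokens that occur in every document get weight log(1) = 0, which is the point of the IDF factor: ubiquitous tokens carry no discriminative information.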
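The hashing trick behind sparseHashVectorize can be sketched without the library: instead of a dictionary lookup, a token's hash value (reduced modulo n) picks its dimension, so no vocabulary needs to be built or stored. This sketch uses String.hashCode(), which the reference mentions as the fallback hasher; the real methods accept a Guava HashFunction. Names are hypothetical.

```java
public class HashingTrickSketch {

  // Map each token to dimension floorMod(hash, n) and count occurrences.
  // Hash collisions simply add their counts together.
  public static double[] sparseHashVectorize(String[] doc, int n) {
    double[] vector = new double[n];
    for (String token : doc) {
      int index = Math.floorMod(token.hashCode(), n); // index in [0, n)
      vector[index] += 1d;
    }
    return vector;
  }
}
```

The trade-off is a fixed, dictionary-free feature space at the cost of occasional collisions, which is why n (the target dimension) should be chosen large relative to the vocabulary.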
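Finally, getMostFrequentItems sorts multiset entries descending by count. A Guava-free sketch of the same idea, counting into a HashMap and sorting the entries (standalone, hypothetical names):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FrequencySketch {

  // Count occurrences, then sort entries descending by count.
  public static <E> List<Map.Entry<E, Integer>> mostFrequent(Collection<E> items) {
    Map<E, Integer> counts = new HashMap<>();
    for (E item : items) {
      counts.merge(item, 1, Integer::sum);
    }
    List<Map.Entry<E, Integer>> entries = new ArrayList<>(counts.entrySet());
    entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
    return entries;
  }
}
```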