Package de.jungblut.nlp
Class VectorizerUtils
- java.lang.Object
- de.jungblut.nlp.VectorizerUtils
public final class VectorizerUtils extends java.lang.Object

Vectorizing utility for basic tf-idf and word-count vectorizing of tokens/strings. It can also build inverted indices and dictionaries.

- Author:
- thomas.jungblut
-
-
Field Summary
Fields
- static java.lang.String OUT_OF_VOCABULARY
-
Constructor Summary
Constructors
- VectorizerUtils()
-
Method Summary
All Methods · Static Methods · Concrete Methods

- static java.lang.String[] buildDictionary(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments)
  Builds a sorted dictionary of tokens from a list of (tokenized) documents.
- static java.lang.String[] buildDictionary(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments, float stopWordPercentage, int minFrequency)
  Builds a sorted dictionary of tokens from a list of (tokenized) documents.
- static int[][] buildInvertedIndexArray(java.util.List<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary)
  Builds an inverted index based on the given dictionary; only the document index mappings are added to it.
- static int[] buildInvertedIndexDocumentCount(java.util.List<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary)
  Builds an inverted index document count based on the given dictionary: at each dimension of the returned array there is a count of how many documents contained that token.
- static com.google.common.collect.HashMultimap<java.lang.String,java.lang.Integer> buildInvertedIndexMap(java.util.List<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary)
  Builds an inverted index as a multimap.
- static int[] buildTransitionVector(java.lang.String[] dict, java.lang.String[] doc)
  Builds a transition array by traversing the document and checking the dictionary.
- static <E> java.util.ArrayList<com.google.common.collect.Multiset.Entry<E>> getMostFrequentItems(com.google.common.collect.Multiset<E> set)
  Returns a list of all the elements of the multiset, sorted descending by their frequency.
- static <E> java.util.ArrayList<com.google.common.collect.Multiset.Entry<E>> getMostFrequentItems(com.google.common.collect.Multiset<E> set, com.google.common.base.Predicate<com.google.common.collect.Multiset.Entry<E>> filter)
  Returns a filtered list of all the elements of the multiset, sorted descending by their frequency.
- static de.jungblut.math.DoubleVector hashVectorize(de.jungblut.math.DoubleVector inputFeature, int n, com.google.common.hash.HashFunction hashFunction)
  Hashes the given vector into a new representation in an n-dimensional feature space.
- static de.jungblut.math.DoubleVector[] hashVectorize(de.jungblut.math.DoubleVector[] features, int n, com.google.common.hash.HashFunction hashFunction)
  Hashes the given vectors into a new representation in an n-dimensional feature space.
- static de.jungblut.math.DoubleVector sparseHashVectorize(java.lang.String[] doc, com.google.common.hash.HashFunction hashFunction, java.util.function.Supplier<de.jungblut.math.DoubleVector> vectorFactory)
  Uses the hashing trick to provide a sparse numeric representation of the given input.
- static java.util.stream.Stream<de.jungblut.math.DoubleVector> sparseHashVectorize(java.util.stream.Stream<java.lang.String[]> documents, com.google.common.hash.HashFunction hashFunction, java.util.function.Supplier<de.jungblut.math.DoubleVector> vectorFactory)
  Uses the hashing trick to provide a sparse numeric representation of the given input.
- static de.jungblut.math.DoubleVector tfIdfVectorize(int numDocuments, java.lang.String[] document, java.lang.String[] dictionary, int[] termDocumentCount)
  Vectorizes the given single document by TF-IDF weighting.
- static java.util.List<de.jungblut.math.DoubleVector> tfIdfVectorize(java.util.List<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary, int[] termDocumentCount)
  Vectorizes the given documents by TF-IDF weighting.
- static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize(java.lang.String[]... vars)
  Vectorizes a given list of documents by word frequency.
- static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments)
  Vectorizes a given list of documents by word frequency.
- static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary)
  Vectorizes a given list of documents against a dictionary by word frequency.
-
-
-
Field Detail
-
OUT_OF_VOCABULARY
public static final java.lang.String OUT_OF_VOCABULARY
- See Also:
- Constant Field Values
-
-
Method Detail
-
buildDictionary
public static java.lang.String[] buildDictionary(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments)
Builds a sorted dictionary of tokens from a list of (tokenized) documents. Tokens that are contained in at least 90% of all documents are treated as spam and won't be included in the final dictionary. This method is compatible with parallel streams.
- Parameters:
tokenizedDocuments - the documents that are already tokenized.
- Returns:
- a sorted String array with tokens in it.
-
buildDictionary
public static java.lang.String[] buildDictionary(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments, float stopWordPercentage, int minFrequency)
Builds a sorted dictionary of tokens from a list of (tokenized) documents. Tokens that are contained in at least "stopWordPercentage" percent of all documents are treated as spam and won't be included in the final dictionary. This method is compatible with parallel streams.
- Parameters:
tokenizedDocuments - the documents that are the base for the dictionary.
stopWordPercentage - the fraction of documents that must contain a token for it to be classified as spam. Ranges between 0f and 1f, where 0f will return an empty dictionary.
minFrequency - the minimum frequency a token must occur globally (strictly greater than the supplied value).
- Returns:
- a sorted String array with tokens in it.
-
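The stop-word and minimum-frequency filtering described above can be sketched in plain Java. This is a hypothetical re-implementation of the documented behavior, not the library's actual code:

```java
import java.util.*;

public class DictionarySketch {

    // Count the documents each token appears in, drop tokens contained in at
    // least stopWordPercentage of all documents, keep only tokens whose global
    // frequency is strictly greater than minFrequency, then sort.
    public static String[] buildDictionary(List<String[]> docs,
                                           float stopWordPercentage, int minFrequency) {
        Map<String, Integer> docCount = new HashMap<>();   // documents containing a token
        Map<String, Integer> totalCount = new HashMap<>(); // global token occurrences
        for (String[] doc : docs) {
            for (String t : new HashSet<>(Arrays.asList(doc))) {
                docCount.merge(t, 1, Integer::sum);
            }
            for (String t : doc) {
                totalCount.merge(t, 1, Integer::sum);
            }
        }
        double stopThreshold = docs.size() * (double) stopWordPercentage;
        return docCount.entrySet().stream()
                .filter(e -> e.getValue() < stopThreshold)       // not a stop word
                .map(Map.Entry::getKey)
                .filter(t -> totalCount.get(t) > minFrequency)   // strictly greater
                .sorted()
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[] { "the", "cat", "sat" },
                new String[] { "the", "dog", "ran" },
                new String[] { "the", "cat", "ran" });
        // "the" occurs in 100% of the documents and is dropped at a 0.9 threshold.
        System.out.println(Arrays.toString(buildDictionary(docs, 0.9f, 0)));
        // prints [cat, dog, ran, sat]
    }
}
```

Note how a stopWordPercentage of 0f makes the threshold zero, so every token is filtered, which matches the documented "0f will return an empty dictionary" behavior.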
buildTransitionVector
public static int[] buildTransitionVector(java.lang.String[] dict, java.lang.String[] doc)
Builds a transition array by traversing the document and checking the dictionary. If a token is not found in the dictionary, the out-of-vocabulary index is set. This transition array is ready to be fed into a MarkovChain.
- Parameters:
dict - the dictionary.
doc - the document to build a transition for.
- Returns:
- the transition array.
-
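A minimal sketch of the idea. The concrete out-of-vocabulary index the library assigns is not documented here; this sketch assumes the index one past the dictionary end:

```java
import java.util.Arrays;

public class TransitionSketch {

    // Maps every token of the document to its dictionary index; tokens not in
    // the (sorted) dictionary get an assumed out-of-vocabulary index of
    // dict.length.
    public static int[] buildTransitionVector(String[] dict, String[] doc) {
        int[] transitions = new int[doc.length];
        for (int i = 0; i < doc.length; i++) {
            int idx = Arrays.binarySearch(dict, doc[i]); // dict must be sorted
            transitions[i] = idx >= 0 ? idx : dict.length;
        }
        return transitions;
    }

    public static void main(String[] args) {
        String[] dict = { "cat", "dog", "the" };
        // "ran" is out of vocabulary and maps to index 3.
        System.out.println(Arrays.toString(
                buildTransitionVector(dict, new String[] { "the", "cat", "ran" })));
        // prints [2, 0, 3]
    }
}
```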
buildInvertedIndexMap
public static com.google.common.collect.HashMultimap<java.lang.String,java.lang.Integer> buildInvertedIndexMap(java.util.List<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary)
Builds an inverted index as a multimap.
- Parameters:
tokenizedDocuments - the documents to index, already tokenized.
dictionary - the dictionary of words that should be used to build this index.
- Returns:
- a HashMultimap that contains a set of integers (indices of the documents in the given input list) mapped by each token that was contained in those documents.
-
buildInvertedIndexArray
public static int[][] buildInvertedIndexArray(java.util.List<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary)
Builds an inverted index based on the given dictionary; only the document index mappings are added to it.
- Parameters:
tokenizedDocuments - the documents to index, already tokenized.
dictionary - the dictionary of words that should be used to build this index.
- Returns:
- a two-dimensional integer array that contains the document ids (indices in the given document list) at the index to which the dictionary maps the token.
-
buildInvertedIndexDocumentCount
public static int[] buildInvertedIndexDocumentCount(java.util.List<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary)
Builds an inverted index document count based on the given dictionary: at each dimension of the returned array there is a count of how many documents contained that token.
- Parameters:
tokenizedDocuments - the documents to index, already tokenized.
dictionary - the dictionary of words that should be used to build this index.
- Returns:
- a one-dimensional integer array that contains, at the index to which the dictionary maps the token, the number of documents containing that token.
-
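As an illustration of the returned structure, a self-contained sketch (assuming a sorted dictionary, which allows binary search; not the library's actual code):

```java
import java.util.*;

public class DocumentCountSketch {

    // For every dictionary token, counts how many documents contain it at
    // least once; each document contributes at most one count per token.
    public static int[] documentCount(List<String[]> docs, String[] dictionary) {
        int[] counts = new int[dictionary.length];
        for (String[] doc : docs) {
            // the HashSet deduplicates repeated tokens within one document
            for (String token : new HashSet<>(Arrays.asList(doc))) {
                int idx = Arrays.binarySearch(dictionary, token);
                if (idx >= 0) {
                    counts[idx]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[] { "the", "cat" },
                new String[] { "the", "dog" },
                new String[] { "the", "cat", "cat" });
        // "cat" is in 2 documents, "dog" in 1, "the" in 3.
        System.out.println(Arrays.toString(
                documentCount(docs, new String[] { "cat", "dog", "the" })));
        // prints [2, 1, 3]
    }
}
```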
wordFrequencyVectorize
public static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize(java.lang.String[]... vars)
Vectorizes a given list of documents. Each vector will have the dimension of the built dictionary; each word has its own mapping in the vector. The value at a certain index (determined by the position in the dictionary) is the frequency of the word in the document.
- Parameters:
vars - the array of documents.
- Returns:
- a stream of sparse vectors, representing the documents as vectors based on word frequency.
-
wordFrequencyVectorize
public static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments)
Vectorizes a given list of documents. Each vector will have the dimension of the built dictionary; each word has its own mapping in the vector. The value at a certain index (determined by the position in the dictionary) is the frequency of the word in the document.
- Parameters:
tokenizedDocuments - the list of documents.
- Returns:
- a stream of sparse vectors, representing the documents as vectors based on word frequency.
-
wordFrequencyVectorize
public static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary)
Vectorizes a given list of documents against a dictionary. Each vector will have the dimension of the dictionary; each word has its own mapping in the vector. The value at a certain index (determined by the position in the dictionary) is the frequency of the word in the document.
- Parameters:
tokenizedDocuments - the list of documents.
dictionary - the dictionary, must be sorted.
- Returns:
- a stream of sparse vectors, representing the documents as vectors based on word frequency.
-
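The per-document mapping can be sketched with a plain double array standing in for the sparse DoubleVector (a hypothetical illustration, not the library's code):

```java
import java.util.Arrays;

public class WordFrequencySketch {

    // One dimension per dictionary entry; the value at a token's dictionary
    // position is the number of times the token occurs in the document.
    public static double[] vectorize(String[] document, String[] dictionary) {
        double[] v = new double[dictionary.length];
        for (String token : document) {
            int idx = Arrays.binarySearch(dictionary, token); // dictionary must be sorted
            if (idx >= 0) {
                v[idx]++;
            }
        }
        return v;
    }

    public static void main(String[] args) {
        String[] dictionary = { "cat", "dog", "the" };
        System.out.println(Arrays.toString(
                vectorize(new String[] { "the", "cat", "cat" }, dictionary)));
        // prints [2.0, 0.0, 1.0]
    }
}
```

Tokens missing from the dictionary (out-of-vocabulary) are simply skipped in this sketch.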
tfIdfVectorize
public static java.util.List<de.jungblut.math.DoubleVector> tfIdfVectorize(java.util.List<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary, int[] termDocumentCount)
Vectorizes the given documents by TF-IDF weighting.
- Parameters:
tokenizedDocuments - the documents to vectorize.
dictionary - the extracted dictionary.
termDocumentCount - the document count per token. This information can be retrieved through buildInvertedIndexDocumentCount(List, String[]).
- Returns:
- a list of sparse tf-idf weighted vectors.
-
tfIdfVectorize
public static de.jungblut.math.DoubleVector tfIdfVectorize(int numDocuments, java.lang.String[] document, java.lang.String[] dictionary, int[] termDocumentCount)
Vectorizes the given single document by TF-IDF weighting.
- Parameters:
numDocuments - the number of documents in the corpus.
document - the document to vectorize.
dictionary - the extracted dictionary.
termDocumentCount - the document count per token.
- Returns:
- a sparse tf-idf weighted vector.
-
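The documentation does not pin down the exact weighting variant; a common formulation is tf · log(N / df), sketched here with a plain double array. The formula is an assumption, not necessarily the library's exact one:

```java
import java.util.Arrays;

public class TfIdfSketch {

    // tf-idf with raw term frequency and the common log(N / df) inverse
    // document frequency; the library's exact variant may differ.
    public static double[] tfIdfVectorize(int numDocuments, String[] document,
                                          String[] dictionary, int[] termDocumentCount) {
        double[] v = new double[dictionary.length];
        for (String token : document) {
            int idx = Arrays.binarySearch(dictionary, token); // sorted dictionary
            if (idx >= 0) {
                v[idx]++; // raw term frequency
            }
        }
        for (int i = 0; i < v.length; i++) {
            if (v[i] > 0) {
                v[i] *= Math.log((double) numDocuments / termDocumentCount[i]);
            }
        }
        return v;
    }

    public static void main(String[] args) {
        String[] dictionary = { "cat", "the" };
        int[] termDocumentCount = { 1, 10 }; // e.g. from buildInvertedIndexDocumentCount
        double[] v = tfIdfVectorize(10, new String[] { "cat", "cat", "the" },
                dictionary, termDocumentCount);
        // "the" occurs in every document, so its idf (and thus its weight) is 0.
        System.out.println(Arrays.toString(v));
    }
}
```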
getMostFrequentItems
public static <E> java.util.ArrayList<com.google.common.collect.Multiset.Entry<E>> getMostFrequentItems(com.google.common.collect.Multiset<E> set)
Given a multiset of generic elements, returns a list of all the elements, sorted descending by their frequency.
- Parameters:
set - the given multiset.
- Returns:
- a descending sorted list by frequency.
-
getMostFrequentItems
public static <E> java.util.ArrayList<com.google.common.collect.Multiset.Entry<E>> getMostFrequentItems(com.google.common.collect.Multiset<E> set, com.google.common.base.Predicate<com.google.common.collect.Multiset.Entry<E>> filter)
Given a multiset of generic elements, returns a list of all the elements, sorted descending by their frequency. A filter can also be applied to the multiset, for example a filter for word frequency > 1.
- Parameters:
set - the given multiset.
filter - if not null, filters by the given Predicate.
- Returns:
- a descending sorted list by frequency.
-
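The same frequency-descending sort, sketched with a plain Map in place of Guava's Multiset (a hypothetical stand-in to keep the example self-contained):

```java
import java.util.*;
import java.util.stream.Collectors;

public class MostFrequentSketch {

    // Sorts the (element, count) entries descending by count; a Map replaces
    // Guava's Multiset for this self-contained sketch.
    public static <E> List<Map.Entry<E, Integer>> mostFrequent(Map<E, Integer> counts) {
        return counts.entrySet().stream()
                .sorted((a, b) -> Integer.compare(b.getValue(), a.getValue()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("rare", 1);
        counts.put("common", 5);
        counts.put("medium", 3);
        System.out.println(mostFrequent(counts).get(0).getKey()); // prints common
    }
}
```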
hashVectorize
public static de.jungblut.math.DoubleVector hashVectorize(de.jungblut.math.DoubleVector inputFeature, int n, com.google.common.hash.HashFunction hashFunction)
Hashes the given vector into a new representation in an n-dimensional feature space. The hash is computed on the non-zero feature indices, so this vectorization method is best suited to text data that has a sparse representation of its features.
- Parameters:
inputFeature - the (usually) sparse feature vector.
n - the target dimension of the vector.
hashFunction - the hash function to use, for example Hashing.murmur3_128().
- Returns:
- the new n-dimensional dense vector vectorized via the hashing trick.
-
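The index-hashing idea can be sketched with plain arrays, with String.hashCode() standing in for Guava's HashFunction (the real method takes a murmur3-style HashFunction):

```java
import java.util.Arrays;

public class HashVectorizeSketch {

    // Maps every non-zero input index into an n-dimensional space by hashing
    // the index; colliding indices simply accumulate their values.
    public static double[] hashVectorize(double[] inputFeature, int n) {
        double[] out = new double[n];
        for (int i = 0; i < inputFeature.length; i++) {
            if (inputFeature[i] != 0d) {
                // String.hashCode() stands in for the Guava HashFunction here.
                int bucket = Math.floorMod(String.valueOf(i).hashCode(), n);
                out[bucket] += inputFeature[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A 4-dimensional sparse input squeezed into 2 dimensions.
        System.out.println(Arrays.toString(
                hashVectorize(new double[] { 1, 0, 2, 3 }, 2)));
    }
}
```

The output is dense and fixed-size regardless of the input dimension, which is the point of the hashing trick: no dictionary is needed, at the cost of hash collisions.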
hashVectorize
public static de.jungblut.math.DoubleVector[] hashVectorize(de.jungblut.math.DoubleVector[] features, int n, com.google.common.hash.HashFunction hashFunction)
Hashes the given vectors into a new representation in an n-dimensional feature space. The hash is computed on the non-zero feature indices, so this vectorization method is best suited to text data that has a sparse representation of its features.
- Parameters:
features - the (usually) sparse feature vectors.
n - the target dimension of the vectors.
hashFunction - the hash function to use, for example Hashing.murmur3_128().
- Returns:
- the new n-dimensional dense vectors vectorized via the hashing trick.
-
sparseHashVectorize
public static java.util.stream.Stream<de.jungblut.math.DoubleVector> sparseHashVectorize(java.util.stream.Stream<java.lang.String[]> documents, com.google.common.hash.HashFunction hashFunction, java.util.function.Supplier<de.jungblut.math.DoubleVector> vectorFactory)
Uses the hashing trick to provide a sparse numeric representation of the given input. This is different from hashVectorize(DoubleVector, int, com.google.common.hash.HashFunction), as it takes raw tokenized documents directly and uses only their hash values to find the respective index in the newly created vector.
- Parameters:
documents - the tokenized documents.
hashFunction - the hasher. This will be ignored when a parallel stream is passed; in that case String.hashCode() is used, as it is thread-safe.
vectorFactory - creates a new vector of the target size.
- Returns:
- a stream of DoubleVectors
-
sparseHashVectorize
public static de.jungblut.math.DoubleVector sparseHashVectorize(java.lang.String[] doc, com.google.common.hash.HashFunction hashFunction, java.util.function.Supplier<de.jungblut.math.DoubleVector> vectorFactory)
Uses the hashing trick to provide a sparse numeric representation of the given input. This is different from hashVectorize(DoubleVector, int, com.google.common.hash.HashFunction), as it takes a raw tokenized document directly and uses only the tokens' hash values to find the respective index in the newly created vector.
- Parameters:
doc - the tokenized document.
hashFunction - the hasher. If null, the Java hash code for strings is used.
vectorFactory - creates a new vector of the target size.
- Returns:
- a DoubleVector
-
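Token-level hashing can be sketched similarly, with String.hashCode() in place of the Guava hasher and a plain array in place of the vector factory (a hypothetical illustration):

```java
import java.util.Arrays;

public class SparseHashSketch {

    // Each token's hash value (mod n) picks the dimension to increment; no
    // dictionary is needed, at the cost of possible hash collisions.
    public static double[] sparseHashVectorize(String[] doc, int n) {
        double[] v = new double[n];
        for (String token : doc) {
            v[Math.floorMod(token.hashCode(), n)] += 1d;
        }
        return v;
    }

    public static void main(String[] args) {
        double[] v = sparseHashVectorize(new String[] { "the", "cat", "the" }, 8);
        // Three tokens were hashed in, regardless of which buckets they hit.
        System.out.println(Arrays.stream(v).sum()); // prints 3.0
    }
}
```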
-