Class VectorizerUtils


  • public final class VectorizerUtils
    extends java.lang.Object
Vectorizing utility for basic tf-idf and word-count vectorization of tokens/strings. It can also build inverted indices and dictionaries.
    Author:
    thomas.jungblut
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static java.lang.String OUT_OF_VOCABULARY  
    • Constructor Summary

      Constructors 
      Constructor Description
      VectorizerUtils()  
    • Method Summary

      All Methods Static Methods Concrete Methods 
      Modifier and Type Method Description
      static java.lang.String[] buildDictionary​(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments)
      Builds a sorted dictionary of tokens from a list of (tokenized) documents.
      static java.lang.String[] buildDictionary​(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments, float stopWordPercentage, int minFrequency)
      Builds a sorted dictionary of tokens from a list of (tokenized) documents.
      static int[][] buildInvertedIndexArray​(java.util.List<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary)
Builds an inverted index based on the given dictionary, storing only the document index mappings.
      static int[] buildInvertedIndexDocumentCount​(java.util.List<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary)
      Builds an inverted index document count based on the given dictionary, so at each dimension of the returned array, there is a count of how many documents contained that token.
      static com.google.common.collect.HashMultimap<java.lang.String,​java.lang.Integer> buildInvertedIndexMap​(java.util.List<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary)
      Builds an inverted index as multi map.
      static int[] buildTransitionVector​(java.lang.String[] dict, java.lang.String[] doc)
      Builds a transition array by traversing the documents and checking the dictionary.
      static <E> java.util.ArrayList<com.google.common.collect.Multiset.Entry<E>> getMostFrequentItems​(com.google.common.collect.Multiset<E> set)
      Given a multiset of generic elements we are going to return a list of all the elements, sorted descending by their frequency.
      static <E> java.util.ArrayList<com.google.common.collect.Multiset.Entry<E>> getMostFrequentItems​(com.google.common.collect.Multiset<E> set, com.google.common.base.Predicate<com.google.common.collect.Multiset.Entry<E>> filter)
      Given a multiset of generic elements we are going to return a list of all the elements, sorted descending by their frequency.
      static de.jungblut.math.DoubleVector[] hashVectorize​(de.jungblut.math.DoubleVector[] features, int n, com.google.common.hash.HashFunction hashFunction)
Hashes the given vectors into a new n-dimensional feature space.
      static de.jungblut.math.DoubleVector hashVectorize​(de.jungblut.math.DoubleVector inputFeature, int n, com.google.common.hash.HashFunction hashFunction)
Hashes the given vector into a new n-dimensional feature space.
      static de.jungblut.math.DoubleVector sparseHashVectorize​(java.lang.String[] doc, com.google.common.hash.HashFunction hashFunction, java.util.function.Supplier<de.jungblut.math.DoubleVector> vectorFactory)
      Uses the hashing trick to provide a sparse numeric representation of the given input.
      static java.util.stream.Stream<de.jungblut.math.DoubleVector> sparseHashVectorize​(java.util.stream.Stream<java.lang.String[]> documents, com.google.common.hash.HashFunction hashFunction, java.util.function.Supplier<de.jungblut.math.DoubleVector> vectorFactory)
      Uses the hashing trick to provide a sparse numeric representation of the given input.
      static de.jungblut.math.DoubleVector tfIdfVectorize​(int numDocuments, java.lang.String[] document, java.lang.String[] dictionary, int[] termDocumentCount)
      Vectorizes the given single document by the TF-IDF weighting.
      static java.util.List<de.jungblut.math.DoubleVector> tfIdfVectorize​(java.util.List<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary, int[] termDocumentCount)
      Vectorizes the given documents by the TF-IDF weighting.
      static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize​(java.lang.String[]... vars)
      Vectorizes a given list of documents.
      static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize​(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments)
      Vectorizes a given list of documents.
      static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize​(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments, java.lang.String[] dictionary)
      Vectorizes a given list of documents and a dictionary.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • OUT_OF_VOCABULARY

        public static final java.lang.String OUT_OF_VOCABULARY
        See Also:
        Constant Field Values
    • Constructor Detail

      • VectorizerUtils

        public VectorizerUtils()
    • Method Detail

      • buildDictionary

        public static java.lang.String[] buildDictionary​(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments)
Builds a sorted dictionary of tokens from a list of (tokenized) documents. Tokens that are contained in at least 90% of all documents are treated as spam and won't be included in the final dictionary.

This method is compatible with parallel streams.

        Parameters:
        tokenizedDocuments - the documents that are already tokenized.
        Returns:
        a sorted String array with tokens in it.
      • buildDictionary

        public static java.lang.String[] buildDictionary​(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments,
                                                         float stopWordPercentage,
                                                         int minFrequency)
Builds a sorted dictionary of tokens from a list of (tokenized) documents. Tokens that are contained in at least a "stopWordPercentage" fraction of all documents are treated as spam and won't be included in the final dictionary.

This method is compatible with parallel streams.

        Parameters:
        tokenizedDocuments - the documents that are the base for the dictionary.
stopWordPercentage - the fraction of documents that must contain a token before it is classified as spam. Ranges between 0f and 1f, where 0f returns an empty dictionary.
minFrequency - the minimum frequency a token must occur globally (strictly greater than the supplied value).
        Returns:
        a sorted String array with tokens in it.
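The filtering described above can be sketched in plain JDK Java. This is an illustrative stand-in, not the library's code: the class and method names here are hypothetical, a `List` replaces the `Stream` input, and the exact tie-breaking of the real implementation may differ.

```java
import java.util.*;

public class DictionarySketch {

  // Illustrative sketch: drop tokens that appear in at least
  // stopWordPercentage of all documents, or whose global frequency is
  // not strictly greater than minFrequency; return the rest sorted.
  public static String[] buildDictionary(List<String[]> docs,
      float stopWordPercentage, int minFrequency) {
    int numDocs = docs.size();
    Map<String, Integer> docCount = new HashMap<>();    // documents containing token
    Map<String, Integer> globalCount = new HashMap<>(); // total occurrences
    for (String[] doc : docs) {
      for (String t : doc) {
        globalCount.merge(t, 1, Integer::sum);
      }
      for (String t : new HashSet<>(Arrays.asList(doc))) {
        docCount.merge(t, 1, Integer::sum);
      }
    }
    return docCount.entrySet().stream()
        .filter(e -> e.getValue() < stopWordPercentage * numDocs)
        .map(Map.Entry::getKey)
        .filter(t -> globalCount.get(t) > minFrequency)
        .sorted()
        .toArray(String[]::new);
  }
}
```

Note how `stopWordPercentage = 0f` naturally yields an empty dictionary, since no document count can be below zero.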
      • buildTransitionVector

        public static int[] buildTransitionVector​(java.lang.String[] dict,
                                                  java.lang.String[] doc)
        Builds a transition array by traversing the documents and checking the dictionary. If nothing was found in the dictionary, we set the out of vocabulary index. This transition array is ready to be fed into a MarkovChain.
        Parameters:
        dict - the dictionary.
doc - the document to build a transition for.
        Returns:
        the transition array.
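A minimal sketch of the token-to-index mapping with an out-of-vocabulary fallback, in plain JDK Java (illustrative names; here the OOV index is assumed to be `dict.length`, one past the last valid index — the library may use a different convention):

```java
import java.util.Arrays;

public class TransitionSketch {

  // Illustrative sketch: map each token to its index in the sorted
  // dictionary; unknown tokens get an out-of-vocabulary index.
  public static int[] buildTransitionVector(String[] dict, String[] doc) {
    int[] transitions = new int[doc.length];
    for (int i = 0; i < doc.length; i++) {
      int idx = Arrays.binarySearch(dict, doc[i]); // dict must be sorted
      transitions[i] = idx >= 0 ? idx : dict.length; // OOV fallback
    }
    return transitions;
  }
}
```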
      • buildInvertedIndexMap

        public static com.google.common.collect.HashMultimap<java.lang.String,​java.lang.Integer> buildInvertedIndexMap​(java.util.List<java.lang.String[]> tokenizedDocuments,
                                                                                                                             java.lang.String[] dictionary)
        Builds an inverted index as multi map.
        Parameters:
        tokenizedDocuments - the documents to index, already tokenized.
        dictionary - the dictionary of words that should be used to build this index.
        Returns:
        a HashMultimap that contains a set of integers (index of the documents in the given input list) mapped by a token that was contained in the documents.
      • buildInvertedIndexArray

        public static int[][] buildInvertedIndexArray​(java.util.List<java.lang.String[]> tokenizedDocuments,
                                                      java.lang.String[] dictionary)
Builds an inverted index based on the given dictionary, storing only the document index mappings.
        Parameters:
        tokenizedDocuments - the documents to index, already tokenized.
        dictionary - the dictionary of words that should be used to build this index.
        Returns:
        a two dimensional integer array, that contains the document ids (index in the given document list) on the same index that the dictionary maps the token.
      • buildInvertedIndexDocumentCount

        public static int[] buildInvertedIndexDocumentCount​(java.util.List<java.lang.String[]> tokenizedDocuments,
                                                            java.lang.String[] dictionary)
        Builds an inverted index document count based on the given dictionary, so at each dimension of the returned array, there is a count of how many documents contained that token.
        Parameters:
        tokenizedDocuments - the documents to index, already tokenized.
        dictionary - the dictionary of words that should be used to build this index.
        Returns:
        a one dimensional integer array, that contains the number of documents on the same index that the dictionary maps the token.
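The per-token document count can be sketched as follows in plain JDK Java (illustrative stand-in; deduplicating each document first ensures a token is counted once per document, no matter how often it occurs inside it):

```java
import java.util.*;

public class DocumentCountSketch {

  // Illustrative sketch: for every dictionary token, count in how many
  // documents it appears at least once. dict must be sorted so that
  // binarySearch can locate the token's dimension.
  public static int[] buildInvertedIndexDocumentCount(
      List<String[]> docs, String[] dict) {
    int[] counts = new int[dict.length];
    for (String[] doc : docs) {
      for (String token : new HashSet<>(Arrays.asList(doc))) {
        int idx = Arrays.binarySearch(dict, token);
        if (idx >= 0) {
          counts[idx]++;
        }
      }
    }
    return counts;
  }
}
```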
      • wordFrequencyVectorize

        public static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize​(java.lang.String[]... vars)
Vectorizes a given list of documents. Each vector will have as many dimensions as there are words in the built dictionary; each word has its own mapping in the vector. The value at a certain index (determined by the position in the dictionary) is the frequency of the word in the document.
        Parameters:
vars - the array of documents.
        Returns:
        a stream of sparse vectors, representing the documents as vectors based on word frequency.
      • wordFrequencyVectorize

        public static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize​(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments)
Vectorizes a given list of documents. Each vector will have as many dimensions as there are words in the built dictionary; each word has its own mapping in the vector. The value at a certain index (determined by the position in the dictionary) is the frequency of the word in the document.
        Parameters:
        tokenizedDocuments - the list of documents.
        Returns:
        a stream of sparse vectors, representing the documents as vectors based on word frequency.
      • wordFrequencyVectorize

        public static java.util.stream.Stream<de.jungblut.math.DoubleVector> wordFrequencyVectorize​(java.util.stream.Stream<java.lang.String[]> tokenizedDocuments,
                                                                                                    java.lang.String[] dictionary)
Vectorizes a given list of documents against a dictionary. Each vector will have as many dimensions as there are words in the dictionary; each word has its own mapping in the vector. The value at a certain index (determined by the position in the dictionary) is the frequency of the word in the document.
        Parameters:
        tokenizedDocuments - the list of documents.
        dictionary - the dictionary, must be sorted.
        Returns:
        a stream of sparse vectors, representing the documents as vectors based on word frequency.
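The dictionary-based word-frequency mapping can be sketched in plain JDK Java, with a `double[]` standing in for the library's `DoubleVector` (illustrative names; the real method returns sparse vectors and operates on streams):

```java
import java.util.Arrays;

public class WordFrequencySketch {

  // Illustrative sketch: one dimension per dictionary token, holding
  // the raw count of that token in the document. The dictionary must
  // be sorted so binarySearch can resolve each token's index.
  public static double[] wordFrequencyVectorize(String[] doc, String[] dict) {
    double[] vector = new double[dict.length];
    for (String token : doc) {
      int idx = Arrays.binarySearch(dict, token);
      if (idx >= 0) {
        vector[idx] += 1d; // tokens outside the dictionary are ignored
      }
    }
    return vector;
  }
}
```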
      • tfIdfVectorize

        public static java.util.List<de.jungblut.math.DoubleVector> tfIdfVectorize​(java.util.List<java.lang.String[]> tokenizedDocuments,
                                                                                   java.lang.String[] dictionary,
                                                                                   int[] termDocumentCount)
        Vectorizes the given documents by the TF-IDF weighting.
        Parameters:
        tokenizedDocuments - the documents to vectorize.
        dictionary - the dictionary extracted.
        termDocumentCount - the document count per token. The information can be retrieved through buildInvertedIndexDocumentCount(List, String[]).
        Returns:
        a list of sparse tf-idf weighted vectors.
      • tfIdfVectorize

        public static de.jungblut.math.DoubleVector tfIdfVectorize​(int numDocuments,
                                                                   java.lang.String[] document,
                                                                   java.lang.String[] dictionary,
                                                                   int[] termDocumentCount)
        Vectorizes the given single document by the TF-IDF weighting.
        Parameters:
        numDocuments - the number of documents used in the corpus.
        document - the document to vectorize.
        dictionary - the dictionary extracted.
        termDocumentCount - the document count per token.
        Returns:
a sparse tf-idf weighted vector.
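One common tf-idf weighting, `tf * log(numDocuments / termDocumentCount)`, can be sketched in plain JDK Java. This is an assumption for illustration — the library's exact formula (smoothing, normalization) may differ, and a `double[]` stands in for `DoubleVector`:

```java
public class TfIdfSketch {

  // Illustrative sketch of tf-idf: raw term frequency per dictionary
  // token, scaled by log(numDocuments / documents-containing-token).
  public static double[] tfIdfVectorize(int numDocuments, String[] document,
      String[] dictionary, int[] termDocumentCount) {
    double[] vector = new double[dictionary.length];
    for (String token : document) {
      int idx = java.util.Arrays.binarySearch(dictionary, token);
      if (idx >= 0) {
        vector[idx] += 1d; // raw term frequency
      }
    }
    for (int i = 0; i < vector.length; i++) {
      if (vector[i] > 0 && termDocumentCount[i] > 0) {
        vector[i] *= Math.log((double) numDocuments / termDocumentCount[i]);
      }
    }
    return vector;
  }
}
```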
      • getMostFrequentItems

        public static <E> java.util.ArrayList<com.google.common.collect.Multiset.Entry<E>> getMostFrequentItems​(com.google.common.collect.Multiset<E> set)
        Given a multiset of generic elements we are going to return a list of all the elements, sorted descending by their frequency.
        Parameters:
        set - the given multiset.
        Returns:
        a descending sorted list by frequency.
      • getMostFrequentItems

        public static <E> java.util.ArrayList<com.google.common.collect.Multiset.Entry<E>> getMostFrequentItems​(com.google.common.collect.Multiset<E> set,
                                                                                                                com.google.common.base.Predicate<com.google.common.collect.Multiset.Entry<E>> filter)
Given a multiset of generic elements we are going to return a list of all the elements, sorted descending by their frequency. A filter can also be applied to the multiset, for example a filter for word frequency > 1.
        Parameters:
        set - the given multiset.
        filter - if not null it filters by the given Predicate.
        Returns:
        a descending sorted list by frequency.
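Without Guava's `Multiset`, the same idea can be sketched with a plain `Map` in JDK Java (illustrative stand-in; the real method returns `Multiset.Entry` objects and accepts a Guava `Predicate` filter):

```java
import java.util.*;

public class FrequencySketch {

  // Illustrative sketch: count elements in a HashMap and return the
  // entries sorted by descending frequency.
  public static <E> List<Map.Entry<E, Integer>> getMostFrequentItems(
      Collection<E> items) {
    Map<E, Integer> counts = new HashMap<>();
    for (E item : items) {
      counts.merge(item, 1, Integer::sum);
    }
    List<Map.Entry<E, Integer>> entries = new ArrayList<>(counts.entrySet());
    entries.sort(Map.Entry.<E, Integer>comparingByValue().reversed());
    return entries;
  }
}
```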
      • hashVectorize

        public static de.jungblut.math.DoubleVector hashVectorize​(de.jungblut.math.DoubleVector inputFeature,
                                                                  int n,
                                                                  com.google.common.hash.HashFunction hashFunction)
Hashes the given vector into a new n-dimensional feature space. The hash function is applied to the non-zero feature indices. Thus this vectorization method should be used for text data that has a sparse representation of its features.
        Parameters:
        inputFeature - the (usually) sparse feature vector.
        n - the target dimension of the vector.
        hashFunction - the hashfunction to use. For example: Hashing.murmur3_128().
        Returns:
        the new n-dimensional dense vector vectorized via the hashing trick.
      • hashVectorize

        public static de.jungblut.math.DoubleVector[] hashVectorize​(de.jungblut.math.DoubleVector[] features,
                                                                    int n,
                                                                    com.google.common.hash.HashFunction hashFunction)
Hashes the given vectors into a new n-dimensional feature space. The supplied hash function (for example Murmur3 128 bit) is applied to the non-zero feature indices. Thus this vectorization method should be used for text data that has a sparse representation of its features.
        Parameters:
features - the (usually) sparse feature vectors.
        n - the target dimension of the vector.
        hashFunction - the hashfunction to use. For example: Hashing.murmur3_128().
        Returns:
        the new n-dimensional dense vectors vectorized via the hashing trick.
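The hashing trick on feature indices can be sketched in plain JDK Java, with `Integer.hashCode` standing in for the Guava `HashFunction` and a `double[]` for `DoubleVector` (illustrative assumptions — the library's bucketing of hash values may differ):

```java
public class HashVectorizeSketch {

  // Illustrative sketch of the hashing trick: every non-zero input
  // dimension is re-mapped into an n-dimensional space by hashing its
  // index; colliding dimensions simply add up.
  public static double[] hashVectorize(double[] inputFeature, int n) {
    double[] dense = new double[n];
    for (int i = 0; i < inputFeature.length; i++) {
      if (inputFeature[i] != 0d) {
        int bucket = Math.abs(Integer.hashCode(i) % n); // hash the index
        dense[bucket] += inputFeature[i];
      }
    }
    return dense;
  }
}
```

The point of the trick is that the target dimension n is fixed up front, so no dictionary has to be built or kept in memory.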
      • sparseHashVectorize

        public static java.util.stream.Stream<de.jungblut.math.DoubleVector> sparseHashVectorize​(java.util.stream.Stream<java.lang.String[]> documents,
                                                                                                 com.google.common.hash.HashFunction hashFunction,
                                                                                                 java.util.function.Supplier<de.jungblut.math.DoubleVector> vectorFactory)
Uses the hashing trick to provide a sparse numeric representation of the given input. This is different from hashVectorize(DoubleVector, int, com.google.common.hash.HashFunction), as it takes raw tokenized documents directly and uses only their hash values to find the respective index in the newly created vector.
        Parameters:
        documents - the tokenized documents.
        hashFunction - the hasher. This will be ignored when a parallel stream is passed, in this case it will use the String.hashCode(), as it is thread-safe.
vectorFactory - factory to create a new vector of the target size.
        Returns:
a stream of DoubleVectors.
      • sparseHashVectorize

        public static de.jungblut.math.DoubleVector sparseHashVectorize​(java.lang.String[] doc,
                                                                        com.google.common.hash.HashFunction hashFunction,
                                                                        java.util.function.Supplier<de.jungblut.math.DoubleVector> vectorFactory)
Uses the hashing trick to provide a sparse numeric representation of the given input. This is different from hashVectorize(DoubleVector, int, com.google.common.hash.HashFunction), as it takes raw tokenized documents directly and uses only their hash values to find the respective index in the newly created vector.
        Parameters:
doc - the tokenized document.
        hashFunction - the hasher. If null it will use the Java hashcode for strings.
vectorFactory - factory to create a new vector of the target size.
        Returns:
a DoubleVector.
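Hashing raw tokens directly can be sketched in plain JDK Java, with `String.hashCode` standing in for the optional Guava hasher and a `double[]` of a fixed size n standing in for the factory-created `DoubleVector` (illustrative assumptions):

```java
public class SparseHashSketch {

  // Illustrative sketch: each token is hashed straight into one of n
  // buckets; the bucket value counts how many tokens landed there.
  public static double[] sparseHashVectorize(String[] doc, int n) {
    double[] vector = new double[n];
    for (String token : doc) {
      int bucket = Math.abs(token.hashCode() % n); // String.hashCode fallback
      vector[bucket] += 1d;
    }
    return vector;
  }
}
```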