Class MinHash


  • public final class MinHash
    extends java.lang.Object
    Linear MinHash algorithm to find near duplicates faster or to speed up nearest neighbour searches.
    Author:
    thomas.jungblut
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  MinHash.HashType  
    • Method Summary

      Modifier and Type Method Description
      static MinHash create​(int numHashes)
      Creates a MinHash instance with the given number of hash functions, using a linear hash function.
      static MinHash create​(int numHashes, long seed)
      Creates a MinHash instance with the given number of hash functions and a seed to be used in parallel systems.
      static MinHash create​(int numHashes, MinHash.HashType type)
      Creates a MinHash instance with the given number of hash functions.
      static MinHash create​(int numHashes, MinHash.HashType type, long seed)
      Creates a MinHash instance with the given number of hash functions and a seed to be used in parallel systems.
      java.util.Set<java.lang.String> createClusterKeys​(int[] minHashes, int keyGroups)
      Generates cluster keys from the minhashes.
      double measureSimilarity​(int[] left, int[] right)
      Measures the similarity between two min hash arrays by comparing the hashes at the same index.
      int[] minHashVector​(de.jungblut.math.DoubleVector vector)
      Minhashes the given vector by iterating over all non-zero items and hashing each byte of its value (cast to an integer).
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • minHashVector

        public int[] minHashVector​(de.jungblut.math.DoubleVector vector)
        Minhashes the given vector by iterating over all non-zero items and hashing each byte of its value (cast to an integer). Each value therefore contributes 4 bytes, which a linear hash function hashes into a single integer.
        Parameters:
        vector - an arbitrary vector.
        Returns:
        an int array of min hashes, one entry per configured hash function.
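
The description above can be illustrated with a minimal, stand-alone sketch. It is not the library's implementation: the class name `LinearMinHashSketch`, the modulus, the coefficient generation and the use of a plain `double[]` in place of `DoubleVector` are all assumptions made for illustration. It shows the general idea of folding the 4 bytes of each non-zero value through a linear hash and keeping the minimum per hash function.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of linear min hashing; not the library's MinHash class. */
final class LinearMinHashSketch {

  // Modulus of the linear hash: the prime 2^31 - 1 (an assumption).
  private static final long PRIME = 2147483647L;

  private final long[] a; // multiplier of each hash function
  private final long[] b; // offset of each hash function

  LinearMinHashSketch(int numHashes, long seed) {
    Random random = new Random(seed);
    a = new long[numHashes];
    b = new long[numHashes];
    for (int i = 0; i < numHashes; i++) {
      a[i] = 1 + random.nextInt(Integer.MAX_VALUE - 1); // never zero
      b[i] = random.nextInt(Integer.MAX_VALUE);
    }
  }

  /** Folds the 4 bytes of the given value into one non-negative int. */
  private int hash(int function, int value) {
    long h = 0;
    for (int shift = 24; shift >= 0; shift -= 8) {
      int oneByte = (value >>> shift) & 0xFF;
      h = (a[function] * (h + oneByte) + b[function]) % PRIME;
    }
    return (int) h;
  }

  /** Returns one min hash per hash function; zero entries are skipped. */
  int[] minHash(double[] values) {
    int[] minHashes = new int[a.length];
    Arrays.fill(minHashes, Integer.MAX_VALUE);
    for (int i = 0; i < a.length; i++) {
      for (double v : values) {
        if (v == 0d) {
          continue; // only non-zero items are hashed
        }
        int h = hash(i, (int) v); // the value is hashed as an integer
        if (h < minHashes[i]) {
          minHashes[i] = h;
        }
      }
    }
    return minHashes;
  }
}
```

Because only the minimum per hash function is kept, the result in this sketch depends on the set of non-zero values, not on their order or position in the array.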
      • measureSimilarity

        public double measureSimilarity​(int[] left,
                                        int[] right)
        Measures the similarity between two min hash arrays by comparing the hashes at the same index. This assumes that both arrays have the same size.
        Returns:
        a similarity between 0 and 1, where 1 is very similar.
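
As a rough illustration of comparing hashes at the same index, the following stand-alone sketch counts matching positions and divides by the array length. The class name `MinHashSimilaritySketch` and the handling of empty or unequal-length arrays are assumptions, and the library's actual computation may differ.

```java
/** Hypothetical sketch of an index-wise comparison; not the library's code. */
final class MinHashSimilaritySketch {

  /** Returns the fraction of indices at which both arrays hold the same hash. */
  static double measureSimilarity(int[] left, int[] right) {
    if (left.length != right.length) {
      // The documented method simply assumes equal sizes; failing fast is an assumption here.
      throw new IllegalArgumentException("arrays must have the same size");
    }
    if (left.length == 0) {
      return 0d; // nothing to compare
    }
    int matches = 0;
    for (int i = 0; i < left.length; i++) {
      if (left[i] == right[i]) {
        matches++;
      }
    }
    return matches / (double) left.length;
  }
}
```

With this definition, `measureSimilarity(new int[] {1, 2, 3, 4}, new int[] {1, 2, 9, 4})` yields 0.75, since three of the four positions agree.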
      • createClusterKeys

        public java.util.Set<java.lang.String> createClusterKeys​(int[] minHashes,
                                                                 int keyGroups)
        Generates cluster keys from the minhashes. If you are going to look up the IDs in a hash table, filter out those that do not reach a specific minimum occurrence. If you use this in parallel, make sure that the seed of the MinHash is consistent across all tasks; otherwise the keys will be completely random.
        Parameters:
        keyGroups - how many key groups there should be; normally it is just a single one per hash.
        Returns:
        a set of string IDs that can serve as cluster identifiers.
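
One way such keys could be built is sketched below. This is an assumption for illustration, not the library's confirmed behaviour: the class name `ClusterKeySketch`, the `_` separator and the wrap-around grouping of `keyGroups` consecutive min hashes are all hypothetical choices.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch of building cluster keys; not the library's code. */
final class ClusterKeySketch {

  /** Joins keyGroups consecutive min hashes (wrapping around) into one key per index. */
  static Set<String> createClusterKeys(int[] minHashes, int keyGroups) {
    Set<String> keys = new LinkedHashSet<>();
    for (int i = 0; i < minHashes.length; i++) {
      StringBuilder key = new StringBuilder();
      for (int j = 0; j < keyGroups; j++) {
        if (j > 0) {
          key.append('_'); // separator is an assumption
        }
        key.append(minHashes[(i + j) % minHashes.length]);
      }
      keys.add(key.toString());
    }
    return keys;
  }
}
```

In this sketch, a single key group produces one key per min hash, while larger groups produce longer and therefore more selective keys. Two vectors that share a key become candidates for the same cluster.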
      • create

        public static MinHash create​(int numHashes)
        Creates a MinHash instance with the given number of hash functions, using a linear hash function.
      • create

        public static MinHash create​(int numHashes,
                                     long seed)
        Creates a MinHash instance with the given number of hash functions and a seed to be used in parallel systems. This method uses a linear hash function.
      • create

        public static MinHash create​(int numHashes,
                                     MinHash.HashType type)
        Creates a MinHash instance with the given number of hash functions.
      • create

        public static MinHash create​(int numHashes,
                                     MinHash.HashType type,
                                     long seed)
        Creates a MinHash instance with the given number of hash functions and a seed to be used in parallel systems.
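
Why a shared seed matters in parallel systems can be shown with a small stand-alone sketch. It assumes, hypothetically, that the parameters of the hash functions are drawn from a `java.util.Random` initialised with the seed; the class name `SeededCoefficientsSketch` and its method are not part of the library.

```java
import java.util.Random;

/** Hypothetical sketch of seed-dependent hash parameters; not the library's code. */
final class SeededCoefficientsSketch {

  /** Draws one pseudo-random coefficient per hash function from the given seed. */
  static int[] coefficients(int numHashes, long seed) {
    Random random = new Random(seed); // same seed, same sequence
    int[] result = new int[numHashes];
    for (int i = 0; i < numHashes; i++) {
      result[i] = random.nextInt();
    }
    return result;
  }
}
```

Two tasks that call `coefficients(8, 7L)` obtain identical arrays, so their min hashes and cluster keys stay comparable. Tasks that use different seeds would hash the same vector differently.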