Package de.jungblut.nlp
Class MinHash
- java.lang.Object
-
- de.jungblut.nlp.MinHash
-
public final class MinHash extends java.lang.Object
Linear MinHash algorithm to find near duplicates faster or to speed up nearest neighbour searches.
- Author:
- thomas.jungblut
-
-
Nested Class Summary
Nested Classes
Modifier and Type    Class               Description
static class         MinHash.HashType
-
Method Summary
Modifier and Type, Method, Description
static MinHash create(int numHashes)
Creates a MinHash instance with the given number of hash functions with a linear hashing function.
static MinHash create(int numHashes, long seed)
Creates a MinHash instance with the given number of hash functions and a seed to be used in parallel systems.
static MinHash create(int numHashes, MinHash.HashType type)
Creates a MinHash instance with the given number of hash functions.
static MinHash create(int numHashes, MinHash.HashType type, long seed)
Creates a MinHash instance with the given number of hash functions and a seed to be used in parallel systems.
java.util.Set<java.lang.String> createClusterKeys(int[] minHashes, int keyGroups)
Generates cluster keys from the minhashes.
double measureSimilarity(int[] left, int[] right)
Measures the similarity between two min hash arrays by comparing the hashes at the same index.
int[] minHashVector(de.jungblut.math.DoubleVector vector)
Minhashes the given vector by iterating over all non-zero items and hashing each byte in its value (as an integer).
-
-
-
Method Detail
-
minHashVector
public int[] minHashVector(de.jungblut.math.DoubleVector vector)
Minhashes the given vector by iterating over all non-zero items and hashing each byte in its value (as an integer), so 4 bytes end up being hashed into a single integer by a linear hash function.
- Parameters:
vector - an arbitrary vector.
- Returns:
- an int array of min hashes whose length depends on how many hashes were configured.
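As a rough illustration of the idea, the following is a minimal sketch of MinHash with linear hash functions of the form (a * x + b) mod p, applied to the non-zero indices of a sparse vector. It is not this class's implementation: the name LinearMinHashSketch, the prime modulus, the way coefficients are drawn, and the choice to hash indices rather than the bytes of each value are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch; not the de.jungblut.nlp.MinHash implementation. */
final class LinearMinHashSketch {
    // Modulus of the linear hash functions (assumption): the prime 2^31 - 1.
    private static final long PRIME = 2147483647L;

    private final long[] a; // multipliers, one per hash function
    private final long[] b; // offsets, one per hash function

    LinearMinHashSketch(int numHashes, long seed) {
        Random random = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + random.nextInt(Integer.MAX_VALUE - 1); // in [1, PRIME - 1]
            b[i] = random.nextInt(Integer.MAX_VALUE);         // in [0, PRIME - 1]
        }
    }

    /** Returns one minimum per hash function over the given non-negative indices. */
    int[] minHash(int[] nonZeroIndices) {
        int[] minHashes = new int[a.length];
        Arrays.fill(minHashes, Integer.MAX_VALUE);
        for (int index : nonZeroIndices) {
            for (int i = 0; i < a.length; i++) {
                // The product stays below 2^62, so it cannot overflow a long.
                int hash = (int) ((a[i] * index + b[i]) % PRIME);
                minHashes[i] = Math.min(minHashes[i], hash);
            }
        }
        return minHashes;
    }

    public static void main(String[] args) {
        LinearMinHashSketch sketch = new LinearMinHashSketch(8, 42L);
        System.out.println(Arrays.toString(sketch.minHash(new int[] {3, 17, 256})));
    }
}
```

Because only the minimum per hash function is kept, the resulting array does not depend on the order in which the non-zero items are visited.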
-
measureSimilarity
public double measureSimilarity(int[] left, int[] right)
Measures the similarity between two min hash arrays by comparing the hashes at the same index. This assumes that both arrays have the same size.
- Returns:
- a similarity between 0 and 1, where 1 is very similar.
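One common way to compare two min hash arrays position by position is to take the fraction of indices at which the hashes agree, which estimates the Jaccard similarity of the underlying sets. The sketch below shows that estimator; whether this method normalizes in exactly this way is an assumption, and the class name MinHashSimilaritySketch is hypothetical.

```java
/** Hypothetical sketch of a position-wise min hash comparison. */
final class MinHashSimilaritySketch {

    /** Returns the fraction of positions at which both arrays hold the same hash. */
    static double similarity(int[] left, int[] right) {
        if (left.length != right.length) {
            throw new IllegalArgumentException("arrays must have the same size");
        }
        if (left.length == 0) {
            return 0.0; // nothing to compare (a choice made for this sketch)
        }
        int matches = 0;
        for (int i = 0; i < left.length; i++) {
            if (left[i] == right[i]) {
                matches++;
            }
        }
        return matches / (double) left.length;
    }

    public static void main(String[] args) {
        // Three of four positions agree, so this prints 0.75.
        System.out.println(similarity(new int[] {1, 2, 3, 4}, new int[] {1, 2, 0, 4}));
    }
}
```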
-
createClusterKeys
public java.util.Set<java.lang.String> createClusterKeys(int[] minHashes, int keyGroups)
Generates cluster keys from the minhashes. If you are going to look up the IDs in a hashtable, filter out those that don't reach a specific minimum occurrence. If you use this in parallel, make sure that the minhash seeds are consistent across all tasks; otherwise the keys will be completely random.
- Parameters:
keyGroups - how many key groups there should be; normally it's just a single one per hash.
- Returns:
- a set of string IDs that can serve as cluster identifiers.
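To make the grouping idea concrete, here is a minimal sketch that splits a min hash array into contiguous groups and turns each group into one string key, in the style of locality-sensitive hashing bands. The key format, the even split, and the class name ClusterKeySketch are assumptions made for illustration; the keys this method actually produces may look different.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch of building cluster keys from groups of min hashes. */
final class ClusterKeySketch {

    /** Builds one key per group; each key is "groupIndex_hash_hash_...". */
    static Set<String> clusterKeys(int[] minHashes, int keyGroups) {
        if (keyGroups <= 0 || minHashes.length % keyGroups != 0) {
            throw new IllegalArgumentException("keyGroups must evenly divide the number of hashes");
        }
        int groupSize = minHashes.length / keyGroups;
        Set<String> keys = new LinkedHashSet<>();
        for (int group = 0; group < keyGroups; group++) {
            // Prefix with the group index so equal hashes in different groups do not collide.
            StringBuilder key = new StringBuilder().append(group);
            for (int i = group * groupSize; i < (group + 1) * groupSize; i++) {
                key.append('_').append(minHashes[i]);
            }
            keys.add(key.toString());
        }
        return keys;
    }

    public static void main(String[] args) {
        // Prints [0_5_9, 1_7_1]
        System.out.println(clusterKeys(new int[] {5, 9, 7, 1}, 2));
    }
}
```

Two items share a key only if all hashes in that group match, so fewer, larger groups make keys more selective, while more, smaller groups make collisions between similar items more likely.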
-
create
public static MinHash create(int numHashes)
Creates a MinHash instance with the given number of hash functions with a linear hashing function.
-
create
public static MinHash create(int numHashes, long seed)
Creates a MinHash instance with the given number of hash functions and a seed to be used in parallel systems. This method uses a linear hash function.
-
create
public static MinHash create(int numHashes, MinHash.HashType type)
Creates a MinHash instance with the given number of hash functions.
-
create
public static MinHash create(int numHashes, MinHash.HashType type, long seed)
Creates a MinHash instance with the given number of hash functions and a seed to be used in parallel systems.
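The reason a shared seed matters in parallel systems can be shown with java.util.Random alone: two generators created with the same seed yield the same sequence, so independent tasks can derive identical hash parameters without communicating. The sketch below assumes hash parameters are drawn from such a generator; the class SeededCoefficientsSketch and its coefficients method are hypothetical and do not reflect how this class uses its seed internally.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch: deriving reproducible hash parameters from a seed. */
final class SeededCoefficientsSketch {

    /** Draws numHashes pseudo-random coefficients from a generator with the given seed. */
    static int[] coefficients(int numHashes, long seed) {
        Random random = new Random(seed);
        int[] result = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            result[i] = random.nextInt();
        }
        return result;
    }

    public static void main(String[] args) {
        // Two "tasks" that only share the seed still agree on every coefficient.
        int[] taskA = coefficients(4, 1234L);
        int[] taskB = coefficients(4, 1234L);
        System.out.println(Arrays.equals(taskA, taskB)); // prints true
    }
}
```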
-
-