Type Parameters: E - the type of objects being clustered

public class KMeansClusterer<E> extends Object implements Clusterer<E>
KMeansClusterer provides an implementation of
k-means(++) clustering based on vectors constructed by feature
extractors. An instance fixes a specific value of K, the
number of clusters returned. Initialization may be either by
the traditional k-means random assignment, or by the k-means++
initialization strategy.
This clustering class is defined so as to be able to cluster arbitrary objects. These objects are converted to (sparse) vectors using a feature extractor specified during construction.
Feature Parsing to Cluster Objects
The elements being clustered are first converted into d-dimensional feature vectors using a feature extractor. These feature vectors are then evenly distributed among the clusters by random assignment. Feature extractors may normalize their input in any number of ways. Run time is dominated by the density of the object vectors.
Centroids
In k-means, each cluster is modeled by the centroid of the feature vectors assigned to it. The centroid of a set of points is just the mean of the points, computed by dimension:
centroid({v[0],...,v[n-1]}) = (v[0] + ... + v[n-1]) / n
Centroids are thus located in the same vector space
as the objects; namely, they are d-dimensional vectors.
The code represents centroids as dense vectors, and objects
as sparse vectors.
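To make the computation concrete, here is a minimal sketch of computing a centroid from sparse vectors, under the assumption that vectors are represented as maps from feature name to value; the class itself uses its own sparse and dense vector representations.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CentroidSketch {
    // Compute the centroid of sparse vectors (maps from feature to value)
    // as a dense map of per-dimension means; missing entries count as zero.
    static Map<String,Double> centroid(List<Map<String,Double>> vectors) {
        Map<String,Double> sums = new HashMap<String,Double>();
        for (Map<String,Double> v : vectors)
            for (Map.Entry<String,Double> e : v.entrySet())
                sums.merge(e.getKey(), e.getValue(), Double::sum);
        for (Map.Entry<String,Double> e : sums.entrySet())
            e.setValue(e.getValue() / vectors.size()); // mean by dimension
        return sums;
    }
}
```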
Euclidean Distance
Feature vectors are always compared to cluster centroids using squared Euclidean distance, defined by:
distance(x,y)² = (x - y) * (x - y)
              = Σ_i (x[i] - y[i])²
The centroid of a set of points is the point that minimizes the sum
of squared distances to the points in the set.
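As a minimal dense-vector sketch (the implementation compares sparse object vectors to dense centroids, but the arithmetic is the same):

```java
class DistanceSketch {
    // Squared Euclidean distance between two dense vectors of equal length.
    static double squaredDistance(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; ++i) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```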
K-means
The k-means algorithm then iteratively improves cluster assignments. Each epoch consists of two stages, reassignment and recomputing the means.
Cluster Assignment: In each epoch, we assign each object to the cluster represented by the closest centroid. Ties are broken in favor of the lowest-indexed cluster.
Mean Recomputation: At the end of each epoch, the centroids are recomputed as the means of the points assigned to each cluster.
Convergence: If no objects change cluster during an epoch, the algorithm has converged and the results are returned. We also consider the algorithm converged if the relative reduction in error from epoch to epoch falls below a set threshold. Finally, the algorithm returns if the maximum number of epochs has been reached.
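The two stages of an epoch can be sketched over dense points as follows, reusing the squaredDistance sketch above; the array and assignment representations are illustrative, not the class's internal ones.

```java
class EpochSketch {
    // One k-means epoch: reassign each point to its nearest centroid
    // (ties to the lowest-indexed cluster), then recompute centroids as
    // means. Returns true if any assignment changed, false at convergence.
    static boolean epoch(double[][] points, double[][] centroids, int[] assignment) {
        boolean changed = false;
        for (int i = 0; i < points.length; ++i) {
            int best = 0;
            for (int k = 1; k < centroids.length; ++k)
                if (DistanceSketch.squaredDistance(points[i], centroids[k])
                        < DistanceSketch.squaredDistance(points[i], centroids[best]))
                    best = k; // strict < keeps ties on the lowest index
            if (assignment[i] != best) {
                assignment[i] = best;
                changed = true;
            }
        }
        int d = points[0].length;
        double[][] sums = new double[centroids.length][d];
        int[] counts = new int[centroids.length];
        for (int i = 0; i < points.length; ++i) {
            ++counts[assignment[i]];
            for (int j = 0; j < d; ++j)
                sums[assignment[i]][j] += points[i][j];
        }
        for (int k = 0; k < centroids.length; ++k)
            if (counts[k] > 0) // an empty cluster keeps its old centroid
                for (int j = 0; j < d; ++j)
                    centroids[k][j] = sums[k][j] / counts[k];
        return changed;
    }
}
```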
K-means as Minimization
K-means clustering may be viewed as an iterative approach to the minimization of the average squared distance between items and their cluster centers, which is:

Err(cs) = Σ_{c in cs} Σ_{x in c} distance(x,centroid(x))²

where cs is the set of clusters and centroid(x)
is the centroid (mean or average) of the cluster containing x.
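In the same illustrative dense representation as the epoch sketch above, the error is computed by summing per-point scatter:

```java
class ErrorSketch {
    // Err(cs): sum of squared distances from each point to the centroid
    // of the cluster it is assigned to.
    static double error(double[][] points, double[][] centroids, int[] assignment) {
        double err = 0.0;
        for (int i = 0; i < points.length; ++i)
            err += DistanceSketch.squaredDistance(points[i], centroids[assignment[i]]);
        return err;
    }
}
```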
Convergence Guarantees
K-means clustering is guaranteed to converge to a local minimum of error because both steps of k-means reduce error. First, assigning each object to its closest centroid minimizes error given the centroids. Second, recalculating centroids as the means of the elements assigned to them minimizes error given the clustering. Because error is bounded below by zero and there are only finitely many distinct clusterings, k-means must eventually converge. While there are exponentially many possible clusterings in theory, in practice k-means converges very quickly.
Local Minima and Multiple Runs
Like the EM algorithm, k-means clustering is highly sensitive to the initial assignment of elements. In practice, it is often helpful to apply k-means clustering repeatedly to the same input data, returning the clustering with minimum error.
At the start of each iteration, the error of the previous assignment is reported (at convergence, this will be the final error).
Multiple runs may be used to provide bootstrap estimates of the relatedness of any two elements. Bootstrap estimates work by subsampling the elements to cluster with replacement and then running k-means on them. This is repeated multiple times, and the percentage of runs in which two elements fall in the same cluster forms the bootstrap estimate of the likelihood that they are in the same cluster.
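A hedged sketch of such a bootstrap estimate for a single pair of elements, using the cluster(Set,Random,Reporter) method described below; the items list and numRuns are illustrative, and conditioning on both elements surviving the resample is a detail the description above leaves open. LingPipe imports are omitted.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

class BootstrapSketch {
    // Fraction of bootstrap runs in which elements a and b land in the
    // same cluster, among runs where both were sampled. Duplicates from
    // sampling with replacement collapse in the Set passed to cluster().
    static <E> double sameClusterEstimate(KMeansClusterer<E> clusterer,
                                          List<E> items, E a, E b,
                                          int numRuns, Random random) {
        int bothPresent = 0;
        int sameCluster = 0;
        for (int run = 0; run < numRuns; ++run) {
            Set<E> sample = new HashSet<E>();
            for (int i = 0; i < items.size(); ++i)
                sample.add(items.get(random.nextInt(items.size())));
            if (!sample.contains(a) || !sample.contains(b))
                continue; // pair absent from this resample
            ++bothPresent;
            for (Set<E> cluster : clusterer.cluster(sample, random, null)) {
                if (cluster.contains(a) && cluster.contains(b)) {
                    ++sameCluster;
                    break;
                }
            }
        }
        return bothPresent == 0 ? 0.0 : sameCluster / (double) bothPresent;
    }
}
```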
Degenerate Solutions
In some cases, the iterative approach taken by k-means leads to a solution in which not every cluster is populated. This happens during a step in which no feature vector is closest to a given centroid. This is most likely to happen in highly skewed data in high dimensional spaces. Sometimes, rerunning the clusterer with a different initialization will find a solution with k clusters.
Picking a Good K
The number of clusters, k, may also be varied. In this case, new k-means clustering instances must be created, as each uses a fixed number of clusters.
By varying k, the maximum number of clusters,
the within-cluster scatter may be compared across different choices
of k. Typically, a value of k is chosen
at a knee of the within-cluster scatter scores. There are automatic
ways of performing this selection, though they are heuristically
rather than theoretically motivated.
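A hedged sketch of such a scan follows; because each instance fixes its number of clusters, a new clusterer is constructed per value of k, and the constructor arguments (100 epochs, k-means++, 0.0001 minimum improvement) are illustrative. The scatter helper recomputes centroids from the feature extractor's output, assumed to be a Map from feature name to Number; it is not part of the class's API. LingPipe imports are omitted.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Random;
import java.util.Set;

class PickKSketch {
    // Print within-cluster scatter for k = 1..maxK; look for the "knee".
    static <E> void scanK(FeatureExtractor<E> extractor, Set<E> elements,
                          int maxK, Random random) {
        for (int k = 1; k <= maxK; ++k) {
            KMeansClusterer<E> clusterer =
                new KMeansClusterer<E>(extractor, k, 100, true, 0.0001);
            Set<Set<E>> clustering = clusterer.cluster(elements, random, null);
            System.out.println("k=" + k + " scatter="
                               + scatter(clustering, extractor));
        }
    }

    // Sum of squared distances from each member's feature vector to the
    // centroid of its cluster, treating missing features as zero.
    static <E> double scatter(Set<Set<E>> clustering,
                              FeatureExtractor<E> extractor) {
        double err = 0.0;
        for (Set<E> cluster : clustering) {
            Map<String,Double> centroid = new HashMap<String,Double>();
            for (E e : cluster)
                for (Map.Entry<String,? extends Number> f
                         : extractor.features(e).entrySet())
                    centroid.merge(f.getKey(), f.getValue().doubleValue(),
                                   Double::sum);
            for (Map.Entry<String,Double> c : centroid.entrySet())
                c.setValue(c.getValue() / cluster.size());
            for (E e : cluster) {
                Map<String,? extends Number> v = extractor.features(e);
                Set<String> dims = new HashSet<String>(centroid.keySet());
                dims.addAll(v.keySet());
                for (String dim : dims) {
                    double x = v.containsKey(dim)
                             ? v.get(dim).doubleValue() : 0.0;
                    double y = centroid.containsKey(dim)
                             ? centroid.get(dim) : 0.0;
                    err += (x - y) * (x - y);
                }
            }
        }
        return err;
    }
}
```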
In practice, it is technically possible, though unlikely, for clusters to wind up with no elements in them. This implementation will simply return fewer clusters than the maximum specified in this case.
K-Means++ Initialization
K-means++ is a k-means algorithm that attempts to make a good randomized choice of initial centers. This requires choosing a diverse set of centers, but not overpopulating the initial set of centers with outliers.
K-means++ has reasonable expected performance bounds in theory, and quite good performance in practice.
Suppose we have K clusters and a set X of size at least K. K-means++ chooses a single point c[k] in X as each initial centroid, using the following strategy:
1. Sample the first centroid c[1] randomly from X.

2. For k = 2 to K, sample the next centroid c[k] = x with probability proportional to D(x)², where D(x) is the minimum distance to an existing centroid:

   D(x) = min_{k' < k} d(x,c[k'])

After initialization, k-means++ proceeds just as traditional k-means clustering.
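A minimal sketch of this D²-weighted sampling over dense points, again reusing the squaredDistance sketch above; the implementation's actual initialization code is not shown here.

```java
import java.util.Arrays;
import java.util.Random;

class KmppSketch {
    // Choose numClusters initial centers by k-means++ D^2 sampling.
    static double[][] seed(double[][] points, int numClusters, Random random) {
        double[][] centers = new double[numClusters][];
        centers[0] = points[random.nextInt(points.length)];
        double[] d2 = new double[points.length]; // D(x)^2 per point
        Arrays.fill(d2, Double.POSITIVE_INFINITY);
        for (int k = 1; k < numClusters; ++k) {
            double total = 0.0;
            for (int i = 0; i < points.length; ++i) {
                // distance to the nearest center chosen so far
                d2[i] = Math.min(d2[i],
                        DistanceSketch.squaredDistance(points[i], centers[k-1]));
                total += d2[i];
            }
            // inverse-CDF sample: index i with probability d2[i] / total
            double target = random.nextDouble() * total;
            int chosen = 0;
            double cum = d2[0];
            while (cum < target && chosen < points.length - 1)
                cum += d2[++chosen];
            centers[k] = points[chosen];
        }
        return centers;
    }
}
```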
The good expected behavior of k-means++ arises from choosing the next center in such a way that it is far away from existing centers. Nearby points in some sense pool their behavior, because the chance that one of them will be picked is the sum of the chances that each will be picked. Outliers are, by definition, points x with high values of D(x) that are not near other points.
Relation to Gaussian Mixtures and Expectation/Maximization
The k-means clustering algorithm is implicitly based on a multi-dimensional Gaussian with independent dimensions, with means given by the centroids and equal variances across dimensions. Estimates are carried out by maximum likelihood; that is, no prior, or equivalently the fully uninformative prior, is used. Where k-means differs from standard expectation/maximization (EM) is that k-means reweights the expectations so that the closest centroid gets an expectation of 1.0 and all other centroids get an expectation of 0.0. This approach has been called "winner-take-all" EM in the EM literature.
Implementation Notes
The current implementation conserves some computations versus the brute-force approach by (1) only computing vector products in comparing two vectors, and (2) not recomputing distances if we know the relevant vectors have not changed since the previous epoch.
| Constructor and Description |
|---|
| KMeansClusterer(FeatureExtractor<E> featureExtractor, int numClusters, int maxEpochs, boolean kMeansPlusPlus, double minImprovement): Construct a k-means clusterer with the specified feature extractor, number of clusters, maximum number of epochs, and minimum relative improvement per epoch, using either traditional or k-means++ initialization. |
| Modifier and Type | Method and Description |
|---|---|
| Set<Set<E>> | cluster(Set<? extends E> elementSet): Return a k-means clustering of the specified set of elements using a freshly generated random number generator without intermediate reporting. |
| Set<Set<E>> | cluster(Set<? extends E> elementSet, Random random, Reporter reporter): Return the k-means clustering for the specified set of elements, using the specified random number generator, sending progress reports to the specified reporter. |
| FeatureExtractor<E> | featureExtractor(): Returns the feature extractor for this clusterer. |
| int | maxEpochs(): Returns the maximum number of epochs for this clusterer. |
| double | minRelativeImprovement(): Returns the minimum reduction in relative error required to continue to the next epoch during clustering. |
| int | numClusters(): Returns the maximum number of clusters this clusterer will return. |
| Set<Set<E>> | recluster(Set<Set<E>> initialClustering, Set<E> unclusteredElements, Reporter reporter): Recluster the specified initial clustering, adding in the unclustered elements, reporting progress to the specified reporter. |
public KMeansClusterer(FeatureExtractor<E> featureExtractor, int numClusters, int maxEpochs, boolean kMeansPlusPlus, double minImprovement)
If the kMeansPlusPlus flag is set to true, the
k-means++ initialization strategy is used. If it is set to
false, initialization of each cluster will be handled by a
random shuffling of all elements into a cluster.
If the number of epochs is set to zero, the result will be a random balanced clustering of the specified size.
Parameters:
featureExtractor - Feature extractor for this clusterer.
numClusters - Number of clusters to return.
maxEpochs - Maximum number of epochs during optimization.
kMeansPlusPlus - Set to true to use k-means++ initialization.
minImprovement - Minimum relative improvement in squared distance scatter to keep going to the next epoch.

Throws:
IllegalArgumentException - If the number of clusters is less than 1, if the maximum number of epochs is less than 0, or if the minimum improvement is not a finite, non-negative number.

public FeatureExtractor<E> featureExtractor()
public int numClusters()

Returns the maximum number of clusters this clusterer will return. This is the "k" in "k-means". Clustering fewer elements will result in fewer clusters.
public int maxEpochs()
public Set<Set<E>> cluster(Set<? extends E> elementSet)
This is just a utility method implementing the
Clusterer interface. A call to
cluster(elementSet) produces the same result
as cluster(elementSet,new Random(),null).
See the class documentation above for more information.
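A hedged usage sketch of this method; the anonymous feature extractor assumes the FeatureExtractor interface exposes a single features(E) method returning a Map from feature name to Number, and the constructor arguments (4 clusters, 50 epochs, k-means++, 0.0001 minimum improvement) are illustrative. LingPipe imports are omitted.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ClusterDemo {
    public static void main(String[] args) {
        // Toy extractor: character counts as sparse features.
        FeatureExtractor<String> extractor = new FeatureExtractor<String>() {
            public Map<String,? extends Number> features(String in) {
                Map<String,Integer> feats = new HashMap<String,Integer>();
                for (char c : in.toCharArray())
                    feats.merge("char=" + c, 1, Integer::sum);
                return feats;
            }
        };

        KMeansClusterer<String> clusterer =
            new KMeansClusterer<String>(extractor, 4, 50, true, 0.0001);

        Set<String> elements = new HashSet<String>();
        // ... add strings to cluster ...
        Set<Set<String>> clustering = clusterer.cluster(elements);
    }
}
```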
public Set<Set<E>> cluster(Set<? extends E> elementSet, Random random, Reporter reporter)
The reason this is a separate method is that typical
implementations of Random are not thread safe,
and reports from different clustering runs should rarely
be interleaved to a single reporter.
Using a fixed random number generator (e.g. by using the
same seed for Random) will produce the same
clustering, which can be useful for replicating
tests.
See the class documentation above for more information.
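For example, continuing the sketch above (clusterer and elements are the names introduced there), fixing the seed reproduces the clustering:

```java
// Identical seeds yield identical clusterings (no reporting).
Set<Set<String>> first  = clusterer.cluster(elements, new Random(42L), null);
Set<Set<String>> second = clusterer.cluster(elements, new Random(42L), null);
assert first.equals(second);
```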
Parameters:
elementSet - Set of elements to cluster.
random - Random number generator.
reporter - Reporter to which progress reports are sent, or null if no reporting is required.

public double minRelativeImprovement()
public Set<Set<E>> recluster(Set<Set<E>> initialClustering, Set<E> unclusteredElements, Reporter reporter)
Parameters:
initialClustering - Initialization of clustering.
unclusteredElements - Elements that have not been clustered.
reporter - Reporter to which to send progress reports, or null for no reporting.

Throws:
IllegalArgumentException - If there are empty clusters in the clustering or if an element belongs to more than one cluster.

Copyright © 2016 Alias-i, Inc. All rights reserved.