E - the type of objects being clusteredpublic abstract class Dendrogram<E> extends Object implements Scored
Dendrogram represents the result of a weighted
hierarchical clustering of a set of objects. A dendrogram's cost
is interpreted as the cost of forming the cluster of its members;
this may be a join cost for a bottom-up (agglomerative) clusterer
or a split cost for a top-down (divisive) clusterer. The minimum
cost is 0.0, and each dendrogram's sub-dendrogram's must have a
lower cost than it.
There are two subclasses of dendrogram, leafs and links. A
LeafDendrogram is a single element cluster with a cost of
0.0. A LinkDendrogram is constructed by merging two
sub-dendrograms with a given cost. The resulting link dendrogram then
becomes the parent of the sub-dendrograms.
Dendrograms that are linked point to their parent dendrogram
through the method parent(). The parent is always a link
dendrogram. The method dereference() returns the result
of following the chain of parent links to the final, top-level
dendrogram. All of the members of a given dendrogram will dereference
to the same parent.
Dendrograms return their set of elements through memberSet(). They may also be partitioned into clusterings in
one of two ways. These clusterings are sets of sets that cover the
members of the dendrogram. The method partitionDistance(double) returns all top-level dendrograms
where all links are within the specified score. The method partitionK(int) returns a partition of the specified size. Both
operate by breaking links in the dendrogram. The thresholded
method breaks all links whose score are above a threshold. The
size-based method breaks links in order of decreasing distance
until the specified number of partitions is generated.
The method copheneticCorrelation(Distance) may be used
to evaluate the distances between objects implied by the dendrogram
and the distances in a specified distance function. The distances
implied by the dendrogram are so-called cophenetic distances. The
cophenetic distance between a pair of elements in the member set of
a dendrogram is the distance of the sub-dendrogram linking them.
Given a distance measure and number of clusters, the method
withinClusterScatter(int,Distance) evaluates the within-cluster
scatter, as defined in the class documentation for ClusterScore.
| Modifier and Type | Method and Description |
|---|---|
boolean |
contains(E elt)
Returns
true if this dendrogram contains the
specified object as a leaf. |
double |
copheneticCorrelation(Distance<? super E> distance)
Returns the Pearson correlation coefficient between the
inter-object distances and the cophenetic distances defined by
this dendrogram.
|
Dendrogram<E> |
dereference()
Returns the top-level cluster of which this dendrogram
is a member.
|
abstract Set<E> |
memberSet()
Returns the members of this dendrogram.
|
LinkDendrogram<E> |
parent()
Returns the dendrogram that immediately contains
this dendrogram or
null if this is a top-level
dendrogram. |
Set<Set<E>> |
partitionDistance(double maxDistance)
Returns the partition produced by cutting this dendrogram at
the specified maximum distance.
|
Set<Set<E>> |
partitionK(int numClusters)
Return the partition of this dendrogram with the specified
number of clusters.
|
String |
prettyPrint()
Returns a pretty-printed version of this dendrogram with
entries on a single line and indenting with scores to show the
merges.
|
abstract double |
score()
Returns the proximity between the components of this
dendrogram.
|
int |
size()
Returns the size of this dendrogram measured by the number
of objects.
|
static <E> boolean |
structurallyEquivalent(Dendrogram<E> dendrogram1,
Dendrogram<E> dendrogram2)
Return
true if the specified dendrograms are
structurally equivalent. |
String |
toString()
Returns a string-based representation of this dendrogram.
|
double |
withinClusterScatter(int numClusters,
Distance<? super E> distance)
Returns the within-cluster scatter relative to the specified
distance of the clustering determined by splitting this
dendrogram into the specified number of clusters.
|
public LinkDendrogram<E> parent()
null if this is a top-level
dendrogram.null
if this is a root dendrogram.public Dendrogram<E> dereference()
null parent
dereferences to itself.public int size()
public boolean contains(E elt)
true if this dendrogram contains the
specified object as a leaf.elt - Object to test.true if this dendrogram contains the
specified object as a leaf.public Set<Set<E>> partitionK(int numClusters)
numClusters - Number of clusters in returned partition.IllegalArgumentException - If the specified number of clusters is
less than one or larger than the number of elements.public Set<Set<E>> partitionDistance(double maxDistance)
This operation may be understood in terms of the isomorphism between equivalence relations and partitions. An equivalence relation is a transitive, symmetric, reflexive relation between objects. An equivalence relation determines a partition with clusters corresponding to sets of mutually equivalent objects. For this method, objects are equivalent if the distance between them is less than or equal to the specified maximum distance in the dendrogram.
maxDistance - Maximum dendrogram distance at which
elements are considered to be part of the same cluster.public String prettyPrint()
public String toString()
public abstract Set<E> memberSet()
public abstract double score()
public double withinClusterScatter(int numClusters,
Distance<? super E> distance)
ClusterScore.withinClusterScatter(partitionK(numClusters),distance).
See
ClusterScore.withinClusterScatter(Set,Distance)
for more information on the scatter evaluation metric.
IllegalArgumentException - If the number of clusters is
not between 1 and the size of this dendrogram inclusive.public double copheneticCorrelation(Distance<? super E> distance)
The cophenetic distance between a pair of elements is defined to be the score of the dendrogram which joins the two elements together; that is, they are in distinct sub-dendrograms of the dendrogram.
A cophenetic distance forms what is known as an ultrametric. It forms a usual metric distance satisfying the following requirements:
In addition, an ultrametric satisfies the following:d(x,y) >= 0 d(x,x) = 0 d(x,y) = d(y,x) d(x,y) + d(y,z) >= d(x,z)
d(x,z) << max(d(x,y), d(y,z))
Geometrically, this means that for all triples x,y,z
at least two of d(x,y),d(x,z),d(y,z) are
the same.
Given the cophenetic distance ultrametric from a dendrogram, both single-link and complete-link clusterers will produce the same dendrogram when clustering based on the cophenetic distance.
distance - Underlying distance measure.public static <E> boolean structurallyEquivalent(Dendrogram<E> dendrogram1, Dendrogram<E> dendrogram2)
true if the specified dendrograms are
structurally equivalent. Two dendrograms are structurally
equivalent if and only if they have the same score, and they
are either both leaf dendrograms with the same object, or they
are both link dendrograms with equivalent daughters (in either
order).
Note that this is not the same relation as Object.equals(Object), which is defined by reference, not
by structural equivalence. Nor is it the same relation as
defined by the score comparators.
dendrogram1 - First dendrogram to compare.dendrogram2 - Second dendrogram to compare.true if the two dendrograms are equivalent.Copyright © 2019 Alias-i, Inc.. All rights reserved.