public class KnnClassifier<E> extends Object implements ScoredClassifier<E>, ObjectHandler<Classified<E>>, Compilable, Serializable

Type Parameters:
E - the type of objects being classified
KnnClassifier implements k-nearest-neighbor
classification based on feature extraction and a vector proximity
or distance. K-nearest-neighbor classification is a kind of
memory-based learning in which every training instance is stored
along with its category. To classify an object, the k nearest
training examples to the object being classified are found.
Each of the k nearest neighbors votes for its training category.
The resulting classification scores are the vote totals,
and the votes may optionally be weighted by proximity.
K-nearest-neighbor classifiers are particularly effective for highly irregular classification boundaries where linear classifiers like the perceptron have a hard time discriminating instances. For instance, it's possible to learn checkerboard patterns in 2D space using k-nearest-neighbor classification.
A k-nearest neighbor classifier is constructed using a feature extractor, the number of neighbors k to consider, a vector distance or proximity and a boolean indicator of whether to weight results by proximity or treat the nearest neighbors equally.
Training simply involves storing the feature vector for each
training instance along with its category. The vectors are stored
as instances of SparseFloatVector for
space efficiency. They are constructed using a symbol table for
features based on the specified feature extractor for this
classifier.
Nearness is defined by the proximity or distance function over vectors supplied at construction time. As objects move nearer to one another, their distance decreases and their proximity increases. Distance measures are converted into proximities behind the scenes by inversion, which leaves all proximities positive and finite:
proximity(v1,v2) = 1 / (1 + distance(v1,v2))
This will scale distance functions that return results between 0
and positive infinity to return proximities between 0 and 1.
Warning:
Distance functions used for k-nearest-neighbors classification
should not return negative values; any zero or negative values
will be converted to Double.POSITIVE_INFINITY.
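As a sketch of the inversion above (not LingPipe's actual source; the helper name distanceToProximity is illustrative only):

```java
public class ProximityDemo {
    // Convert a distance in [0, +infinity) to a proximity in (0, 1],
    // per the formula proximity = 1 / (1 + distance).
    static double distanceToProximity(double distance) {
        return 1.0 / (1.0 + distance);
    }

    public static void main(String[] args) {
        System.out.println(distanceToProximity(0.0)); // identical points: prints 1.0
        System.out.println(distanceToProximity(3.0)); // prints 0.25
        System.out.println(distanceToProximity(Double.POSITIVE_INFINITY)); // prints 0.0
    }
}
```

Note how a distance of zero maps to the maximum proximity 1, and infinite distance maps to proximity 0, so larger proximities always mean nearer neighbors.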
Classification involves finding the k nearest neighbors to a query point. Both training instances and instances to be classified are converted to feature mappings using the specified feature extractor, and then encoded as sparse vectors using an implicitly managed feature symbol table.
The first step in classification is simply collecting the k nearest neighbors, that is, the training examples that have the greatest proximity to the example being classified. Given the set of k training examples that are closest to the test point, one of two strategies is used for determining scores. In the simple, unweighted case, the score of a category is simply the number of vectors in the k nearest neighbors with that category. In the weighted case, each vector in the k nearest neighbors contributes its proximity, and the final score is the sum of all proximities.
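The two scoring strategies can be sketched in plain Java (a hypothetical standalone helper, not LingPipe code):

```java
import java.util.HashMap;
import java.util.Map;

public class KnnVotingDemo {
    // Score categories from the k nearest neighbors' categories and proximities.
    // Unweighted: each neighbor contributes one vote to its category.
    // Weighted: each neighbor contributes its proximity to its category.
    static Map<String, Double> scoreNeighbors(String[] categories,
                                              double[] proximities,
                                              boolean weightByProximity) {
        Map<String, Double> scores = new HashMap<>();
        for (int i = 0; i < categories.length; ++i) {
            double vote = weightByProximity ? proximities[i] : 1.0;
            scores.merge(categories[i], vote, Double::sum);
        }
        return scores;
    }

    public static void main(String[] args) {
        String[] cats = { "pos", "pos", "neg" };
        double[] prox = { 0.9, 0.2, 0.8 };
        System.out.println(scoreNeighbors(cats, prox, false).get("pos")); // prints 2.0
        System.out.println(scoreNeighbors(cats, prox, true).get("neg"));  // prints 0.8
    }
}
```

In the unweighted case "pos" wins 2 votes to 1; in the weighted case "neg" scores 0.8 against 0.9 + 0.2 = 1.1 for "pos", so the distant second "pos" neighbor still tips the result but contributes far less than a full vote.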
In most cases, it makes sense to try to optimize for the number of neighbors k using cross-validation or held-out data.
In the weighted case, it sometimes makes sense to take the
maximum number of neighbors k to be very large, potentially even
Integer.MAX_VALUE. This is because the examples are
weighted by proximity, and those far away may have vanishingly
small proximities.
There is no compilation required for k-nearest-neighbors classification, so serialization and compilation produce the same result. The object read back in after serialization or compilation should be identical to the one serialized or compiled.
This is a brute force implementation of k-nearest neighbors in the sense that every training example is multiplied by every object being classified. Some k-nearest-neighbor implementations attempt to efficiently index the training examples, using techniques such as KD trees, so that search for nearest neighbors can be more efficient.
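A brute-force nearest-neighbor search of this kind can be sketched as a linear scan that keeps the k most proximate examples in a min-heap (an illustrative standalone sketch, not LingPipe's implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class BruteForceKnnDemo {
    // Brute-force k-nearest-neighbor search: compare the query against every
    // stored example, keeping the k highest-proximity matches in a min-heap.
    static List<double[]> kNearest(List<double[]> examples, double[] query, int k) {
        // Heap ordered by proximity, smallest first, so the root is the
        // least proximate candidate and is evicted when the heap overflows.
        PriorityQueue<double[]> heap = new PriorityQueue<>(
            (a, b) -> Double.compare(proximity(a, query), proximity(b, query)));
        for (double[] example : examples) {
            heap.offer(example);
            if (heap.size() > k)
                heap.poll(); // drop the farthest of the k+1 candidates
        }
        return new ArrayList<>(heap);
    }

    // proximity = 1 / (1 + Euclidean distance), as described above.
    static double proximity(double[] x, double[] y) {
        double sumSq = 0.0;
        for (int i = 0; i < x.length; ++i)
            sumSq += (x[i] - y[i]) * (x[i] - y[i]);
        return 1.0 / (1.0 + Math.sqrt(sumSq));
    }
}
```

Each query costs O(n·d) for n stored examples of dimension d; this per-query cost is exactly what indexing structures such as KD trees aim to reduce.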
K-nearest-neighbor classification is widely used and well described in several standard texts.
| Constructor and Description |
|---|
| KnnClassifier(FeatureExtractor<? super E> featureExtractor, int k) Construct a k-nearest-neighbor classifier based on the specified feature extractor and using the specified number of neighbors. |
| KnnClassifier(FeatureExtractor<? super E> featureExtractor, int k, Distance<Vector> distance) Construct a k-nearest-neighbor classifier based on the specified feature extractor, specified maximum number of neighbors, and specified distance function. |
| KnnClassifier(FeatureExtractor<? super E> extractor, int k, Proximity<Vector> proximity, boolean weightByProximity) Construct a k-nearest-neighbor classifier based on the specified feature extractor, specified maximum number of neighbors, specified proximity function, and boolean flag indicating whether or not to weight nearest neighbors by proximity during classification. |
| Modifier and Type | Method and Description |
|---|---|
| List<String> | categories() Returns a copy of the current list of categories for this classifier. |
| ScoredClassification | classify(E in) Return the k-nearest-neighbor classification result for the specified input object. |
| void | compileTo(ObjectOutput out) Compiles this k-nearest-neighbor classifier to the specified object output stream. |
| FeatureExtractor<? super E> | featureExtractor() Returns the feature extractor for this KNN classifier. |
| void | handle(Classified<E> classifiedObject) Handle the specified classified object as a training instance. |
| int | k() Returns the number k of neighbors used by this k-nearest-neighbors classifier. |
| Proximity<Vector> | proximity() Returns the proximity measure for this KNN classifier. |
| boolean | weightByProximity() Returns true if responses are weighted by proximity. |
public KnnClassifier(FeatureExtractor<? super E> featureExtractor, int k)

Construct a k-nearest-neighbor classifier based on the specified feature extractor and number of neighbors, with distance measured by EuclideanDistance. The nearest neighbors will not be weighted by proximity, and thus all have equal votes.

Parameters:
featureExtractor - Feature extractor for training and classification instances.
k - Maximum number of neighbors to use during classification.

public KnnClassifier(FeatureExtractor<? super E> featureExtractor, int k, Distance<Vector> distance)

Parameters:
featureExtractor - Feature extractor for training and classification instances.
k - Maximum number of neighbors to use during classification.
distance - Distance function to use to compare examples.

public KnnClassifier(FeatureExtractor<? super E> extractor, int k, Proximity<Vector> proximity, boolean weightByProximity)

Parameters:
extractor - Feature extractor for training and classification instances.
k - Maximum number of neighbors to use during classification.
proximity - Proximity function to compare examples.
weightByProximity - Flag indicating whether to weight neighbors by proximity during classification.

public FeatureExtractor<? super E> featureExtractor()
public Proximity<Vector> proximity()

public List<String> categories()

public boolean weightByProximity()

Returns:
true if responses are weighted by proximity.

public int k()

public void handle(Classified<E> classifiedObject)

Specified by:
handle in interface ObjectHandler<Classified<E>>
Parameters:
classifiedObject - Classified object to use for training.

public ScoredClassification classify(E in)
If this classifier does not weight by proximity, the resulting score for a category will be the number of nearest neighbors of the specified category. That is, it will be a straight vote.
If the classifier does weight by proximity, the resulting
score for a category will be the sum of the proximity scores
for the nearest neighbors of a given category. Instances with
no near neighbors will be scored zero (0). Thus
proximities should be configured to return positive values.
Specified by:
classify in interface BaseClassifier<E>
classify in interface RankedClassifier<E>
classify in interface ScoredClassifier<E>
Parameters:
in - Object to classify.

public void compileTo(ObjectOutput out) throws IOException
This is only a convenience method. It provides exactly the same function as standard serialization.
Specified by:
compileTo in interface Compilable
Parameters:
out - Output stream to which this classifier is written.
Throws:
IOException - If there is an underlying I/O exception during compilation.

Copyright © 2016 Alias-i, Inc. All rights reserved.