K-nearest neighbor classification

Who is a neighbor?

A neighbor is a point whose distance from us is comparatively smaller than that of others. In k-NN, we are interested in finding the k nearest neighbors (the k points with the minimum distances) among all the given points.

k-nearest neighbors with k=3

What does distance mean?

Euclidean distance: The length of a straight line connecting two points in n dimensions.

If p = (p1, p2,…, pn) and q = (q1, q2,…, qn) are two points in n-dimensional Euclidean space, then their Euclidean distance is given by the following Pythagorean formula –

\(d(p,q) = \sqrt{({p_1-q_1})^2 + ({p_2-q_2})^2 + ({p_3-q_3})^2 + … + ({p_n-q_n})^2}\)
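As a quick sanity check, the formula is a one-liner in Python (the function name euclidean_distance is our own choice):

```python
import math

def euclidean_distance(p, q):
    """Length of the straight line connecting points p and q in n dimensions."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```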

Manhattan distance: The sum of the absolute differences between the coordinates of two points –

\(d(p,q) = \sum\limits_{i=1}^{n} |p_i-q_i|\)

The following image shows the difference between the Manhattan and Euclidean distances – the red, blue, and yellow paths are all shortest paths of length 12 in Manhattan distance, while the green line is the unique shortest path, of Euclidean length 6√2 ≈ 8.49.

Euclidean vs Manhattan distance

Another commonly used metric is the Minkowski distance, given below (we write its order as r, rather than the conventional p, to avoid a clash with the point p) –

\(d(p,q) = \left({\sum_{i=1}^{n} {|p_i-q_i|}^r}\right)^{1/r}\)

When r = 1, this is equivalent to the Manhattan distance, and when r = 2, this is equivalent to the Euclidean distance.

If r < 1, the distance between (0,0) and (1,1) would be \(2^{1/r} > 2\), while the distances between (0,0) and (0,1), and between (1,1) and (0,1), would be 1 each. This violates the triangle inequality, so the Minkowski "distance" with r < 1 is not a true metric.
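A brief sketch illustrating the special cases and the r < 1 anomaly (the function name minkowski_distance is our own choice):

```python
def minkowski_distance(p, q, r):
    """Minkowski distance of order r between points p and q."""
    return sum(abs(pi - qi) ** r for pi, qi in zip(p, q)) ** (1 / r)

print(minkowski_distance((0, 0), (3, 4), 1))    # 7.0 -> Manhattan
print(minkowski_distance((0, 0), (3, 4), 2))    # 5.0 -> Euclidean

# r < 1: the direct path (0,0) to (1,1) is longer than going via (0,1),
# whose two legs have length 1 each, so the triangle inequality fails
print(minkowski_distance((0, 0), (1, 1), 0.5))  # 4.0 = 2**(1/0.5) > 2
```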

In tasks such as text or gene classification, we may use other metrics such as the Hamming distance or the Levenshtein (edit) distance. We can also use a correlation coefficient as a distance measure.
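For instance, the Hamming distance between two equal-length sequences simply counts the positions at which they differ; a minimal sketch:

```python
def hamming_distance(s, t):
    """Number of positions at which equal-length sequences s and t differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))

print(hamming_distance("GATTACA", "GACTATA"))  # 2
```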

K-nearest neighbors

Choose the k nearest neighbors of the query point, then make a prediction based on the characteristics of those neighbors.

k-NN is a non-parametric learning technique used for both classification and regression:

  • Classification – output the class to which the majority of the k nearest neighbors belong
  • Regression – output the mean (average) of the values of the k nearest neighbors
    Weighting can be used to give more priority to a nearer neighbor than to one farther away
  • 1/d is a common weighting scheme, where d is the distance to the neighbor
  • It is an instance-based, or lazy, learning technique (a sketch follows this list)
  • We look up information (about the k nearest neighbors) in the dataset only after we are given a query
  • There is no explicit training before the query
  • Skewed data may degrade performance – a more frequent class may dominate the classification simply because it contributes more neighbors
  • The algorithm may be sped up by using advanced neighbor-search structures such as k-d trees and R-trees
  • It is among the simplest of all machine learning algorithms
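The entire procedure fits in a few lines. Here is a minimal Python sketch of a k-NN classifier (all names are our own choices; it uses the Euclidean distance and supports the 1/d weighting described above):

```python
import math
from collections import defaultdict

def euclidean_distance(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(dataset, query, k=3, weighted=False):
    """dataset is a list of (point, label) pairs; returns the predicted label."""
    # Lazy learning: nothing is precomputed - every query scans the whole dataset
    neighbors = sorted(dataset, key=lambda item: euclidean_distance(item[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = euclidean_distance(point, query)
        # With weighting, each neighbor votes with weight 1/d; otherwise one vote each
        votes[label] += 1.0 / d if (weighted and d > 0) else 1.0
    return max(votes, key=votes.get)
```

Sorting the whole dataset costs O(n log n) per query; the k-d trees and R-trees mentioned above exist precisely to avoid this full scan.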

What is a good value for ‘k’?

This depends on the data – generally, a larger value of k mitigates the effect of noise, but it also makes the boundaries between classes less distinct. For two-class problems, k is usually chosen to be odd so that the majority vote cannot end in a tie. One practical way of choosing k is sketched below.
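A common approach is to pick k by cross-validation; a sketch using scikit-learn (assuming the library is available):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try a few candidate values of k and report the mean accuracy of each
for k in (1, 3, 5, 7, 9, 11):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())
```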

Example

We want to predict the species of an unknown Iris flower using k-nearest neighbors with the Euclidean distance, given the following dataset –

Sepal Length | Sepal Width | Petal Length | Petal Width | Species
5.1          | 3.5         | 1.4          | 0.2         | Iris-setosa
4.6          | 3.6         | 1.0          | 0.2         | Iris-setosa
5.9          | 3.0         | 4.2          | 1.5         | Iris-versicolor
5.4          | 3.0         | 4.5          | 1.5         | Iris-versicolor
7.7          | 2.8         | 6.7          | 2.0         | Iris-virginica
7.9          | 3.8         | 6.4          | 2.0         | Iris-virginica

Recall that there is no explicit training in k-NN – for each query, we must go through the entire dataset to predict the species.

Given the following characteristics of an Iris flower –

Sepal Length | Sepal Width | Petal Length | Petal Width
6.1          | 2.9         | 4.7          | 1.4

We first compute its Euclidean distance from each item in the dataset –

Sepal Length | Sepal Width | Petal Length | Petal Width | Species         | Distance
5.1          | 3.5         | 1.4          | 0.2         | Iris-setosa     | 3.700
4.6          | 3.6         | 1.0          | 0.2         | Iris-setosa     | 4.227
5.9          | 3.0         | 4.2          | 1.5         | Iris-versicolor | 0.556
5.4          | 3.0         | 4.5          | 1.5         | Iris-versicolor | 0.741
7.7          | 2.8         | 6.7          | 2.0         | Iris-virginica  | 2.632
7.9          | 3.8         | 6.4          | 2.0         | Iris-virginica  | 2.701

From the above table, we can see that the 3rd, 4th, and 5th rows have the three smallest distances from our query. Of these three neighbors, two belong to the species Iris-versicolor and one belongs to Iris-virginica. So, using k-NN with k = 3, we would classify the queried flower as Iris-versicolor.
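The whole worked example can be reproduced with the sketch given earlier (the euclidean_distance and knn_classify functions are assumed to be in scope; the data literals come straight from the tables above):

```python
dataset = [
    ((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
    ((4.6, 3.6, 1.0, 0.2), "Iris-setosa"),
    ((5.9, 3.0, 4.2, 1.5), "Iris-versicolor"),
    ((5.4, 3.0, 4.5, 1.5), "Iris-versicolor"),
    ((7.7, 2.8, 6.7, 2.0), "Iris-virginica"),
    ((7.9, 3.8, 6.4, 2.0), "Iris-virginica"),
]
query = (6.1, 2.9, 4.7, 1.4)

# Reproduce the distance column of the table above
for point, label in dataset:
    print(label, round(euclidean_distance(point, query), 3))

print(knn_classify(dataset, query, k=3))  # Iris-versicolor
```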