### Who is a neighbor?

A neighbor is a point whose distance from us is comparatively less than that of the others. In k-nn, we are interested in finding the k nearest neighbors (the k points with the minimum distances) among all the given points.

#### What does distance mean?

**Euclidean distance**: The length of the straight line connecting two points in n dimensions.

If \(p = (p_1, p_2, \ldots, p_n)\) and \(q = (q_1, q_2, \ldots, q_n)\) are two points in n-dimensional Euclidean space, then their Euclidean distance is given by the following Pythagorean formula:

\(d(p,q) = \sqrt{(p_1-q_1)^2 + (p_2-q_2)^2 + (p_3-q_3)^2 + \cdots + (p_n-q_n)^2}\)

**Manhattan distance**: The sum of the absolute differences between the coordinates of two points.

\(d(p,q) = \sum\limits_{i=1}^{n} |p_i-q_i|\)
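
As a quick illustration, here is a minimal Python sketch of both distances (the function names are our own):

```python
import math

def euclidean_distance(p, q):
    """Length of the straight line between points p and q."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan_distance(p, q):
    """Sum of the absolute coordinate differences between p and q."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

p, q = (0, 0), (3, 4)
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7
```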

The following image shows the difference between the Manhattan and Euclidean distances: the red, blue, and yellow paths each give a shortest path of length 12 in Manhattan distance, while the green line is the unique shortest Euclidean path, of length \(6\sqrt{2} \approx 8.49\).

Another commonly used distance metric is the **Minkowski distance**, given below:

\(d(p,q) = \left(\sum_{i=1}^{n} |p_i-q_i|^p\right)^{1/p}\)

When p = 1, this is equivalent to the Manhattan distance, and when p = 2, it is equivalent to the Euclidean distance.

If p < 1, however, the formula no longer defines a metric: the distance between (0,0) and (1,1) would be \(2^{1/p} > 2\), while the distance between (0,0) and (0,1), and between (0,1) and (1,1), would be 1 each, so the direct path is longer than the two-leg path through (0,1). This violates the triangle inequality.
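
A minimal sketch of the Minkowski distance, checking the p = 1 and p = 2 special cases and the triangle-inequality violation for p < 1:

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p between points x and y."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

a, b, c = (0, 0), (1, 1), (0, 1)
print(minkowski_distance(a, b, p=1))    # 2.0 (Manhattan)
print(minkowski_distance(a, b, p=2))    # 1.414... (Euclidean)

# With p = 0.5, the direct distance a -> b is 2^(1/0.5) = 4.0, yet the
# two-leg route a -> c -> b costs only 1 + 1 = 2: no triangle inequality.
print(minkowski_distance(a, b, p=0.5))  # 4.0
print(minkowski_distance(a, c, p=0.5) + minkowski_distance(c, b, p=0.5))  # 2.0
```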

In tasks such as text or gene classification, we may use other metrics, like the **Hamming distance** or the **Levenshtein (edit) distance**. We can also use a correlation coefficient as a measure of distance.
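
For instance, a minimal sketch of the Hamming distance between two equal-length sequences (the function name is our own):

```python
def hamming_distance(s, t):
    """Number of positions at which equal-length sequences s and t differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(a != b for a, b in zip(s, t))

print(hamming_distance("GATTACA", "GACTATA"))  # 2
```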

#### K-nearest neighbors

Find the k nearest neighbors of the query point, then make a prediction based on the characteristics of those neighbors.

#### Non-parametric learning technique for classification and regression

- Classification – output the class that is most common among the query's k-nearest neighbors
- Regression – output the mean (average) of the values of the query's k-nearest neighbors

Weighting can be used to give more influence to a nearer neighbor than to one farther away – 1/d is a common weighting scheme, where d is the distance to the neighbor (the sketch after the list below includes this as an option).
- It is an instance-based, or lazy, learning method
- We extract information (about the query's k neighbors) from the dataset only after we are given a query
- There is no explicit training before the query
- Skewed data may degrade performance – a more frequent class may dominate classification simply because of its abundance
- The algorithm may be sped up by using advanced neighbor-search structures – k-d trees, R-trees
- It is among the simplest of all machine learning algorithms
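
A minimal pure-Python sketch of both modes, including the optional 1/d weighting mentioned above (all names here are our own):

```python
import math
from collections import defaultdict

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_nearest(data, query, k):
    """Return the k (distance, label_or_value) pairs nearest to the query."""
    return sorted((euclidean(x, query), y) for x, y in data)[:k]

def knn_classify(data, query, k, weighted=False):
    """Majority vote; with weighted=True each vote counts 1/d instead of 1."""
    votes = defaultdict(float)
    for d, label in k_nearest(data, query, k):
        votes[label] += 1 / d if weighted and d > 0 else 1
    return max(votes, key=votes.get)

def knn_regress(data, query, k):
    """Mean of the values of the k nearest neighbors."""
    return sum(v for _, v in k_nearest(data, query, k)) / k

data = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B")]
print(knn_classify(data, (1.1, 1.0), k=2))  # A
```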

#### What is a good value for ‘k’?

This depends on the data – generally, a larger value of k helps mitigate the effect of noise, but it also makes the boundaries between classes less distinct. In practice, k is often chosen by cross-validation.
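
For example, a common approach – sketched here with scikit-learn, assuming it is available – is to sweep over candidate values of k and keep the one with the best cross-validated accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validated accuracy for each odd k (odd values avoid
# ties between two classes).
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in range(1, 22, 2)
}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```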

#### Example

We want to predict the species of an unknown Iris, given the following dataset, using k-nearest neighbors with the Euclidean distance –

Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa |
5.9 | 3.0 | 4.2 | 1.5 | Iris-versicolor |
5.4 | 3.0 | 4.5 | 1.5 | Iris-versicolor |
7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica |
7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |

**Recall –** there is no explicit training phase in k-nearest neighbors; for each query, we must go through the entire dataset to predict the species.

Given the following characteristics of an Iris flower –

Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) |
---|---|---|---|
6.1 | 2.9 | 4.7 | 1.4 |

We compute its distance from each item in the dataset –

Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) | Species | Distance |
---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 3.700 |
4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa | 4.227 |
5.9 | 3.0 | 4.2 | 1.5 | Iris-versicolor | 0.557 |
5.4 | 3.0 | 4.5 | 1.5 | Iris-versicolor | 0.742 |
7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica | 2.632 |
7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica | 2.702 |

From the above table, we can see that the 3rd, 4th, and 5th rows have the three smallest distances from our query. Of these 3 neighbors, two belong to Iris-versicolor and one belongs to Iris-virginica, so with k = 3, k-nn classifies the queried flower as Iris-versicolor.
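
Putting it all together, a minimal sketch that reproduces the distance table and the k = 3 prediction above (names are our own):

```python
import math
from collections import Counter

dataset = [
    ((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
    ((4.6, 3.6, 1.0, 0.2), "Iris-setosa"),
    ((5.9, 3.0, 4.2, 1.5), "Iris-versicolor"),
    ((5.4, 3.0, 4.5, 1.5), "Iris-versicolor"),
    ((7.7, 2.8, 6.7, 2.0), "Iris-virginica"),
    ((7.9, 3.8, 6.4, 2.0), "Iris-virginica"),
]
query = (6.1, 2.9, 4.7, 1.4)

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance of the query from every row, as in the table above
distances = sorted((euclidean(x, query), label) for x, label in dataset)
for d, label in distances:
    print(f"{label}: {d:.3f}")

# Majority vote among the k = 3 nearest neighbors
k = 3
votes = Counter(label for _, label in distances[:k])
print(votes.most_common(1)[0][0])  # Iris-versicolor
```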