#### Clustering

- The most important unsupervised learning task
- The process of grouping similar items together

#### Types of clustering

- Based on clustering criterion –
  - Monothetic – clustering is done on the basis of a single attribute, e.g., dividing students into 3 clusters – high, average, and low marks – based on test score alone (percentage, CGPA)
  - Polythetic – clustering is done on the basis of several attributes taken together; objects that are 'similar' to each other overall are clustered together
- Based on clustering output –
  - Hard – each item belongs to one and only one cluster – mutually exclusive clustering
  - Soft – each item has a probability of belonging to each of the different clusters (the two outputs are contrasted in the sketch below)
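
As a small illustration of the difference, here is a hypothetical hard assignment and a hypothetical soft assignment for 5 items and 3 clusters – the numbers are made up purely for illustration:

```python
import numpy as np

# Hard clustering: each of 5 items belongs to exactly one of 3 clusters.
hard_labels = np.array([0, 2, 1, 0, 2])

# Soft clustering: each row gives one item's probability of belonging to
# each of the 3 clusters; every row sums to 1. (Values are made up.)
soft_labels = np.array([
    [0.90, 0.07, 0.03],
    [0.10, 0.20, 0.70],
    [0.05, 0.85, 0.10],
    [0.80, 0.15, 0.05],
    [0.02, 0.08, 0.90],
])

# A hard clustering can be recovered from a soft one by taking the
# most probable cluster for each item.
print(soft_labels.argmax(axis=1))   # [0 2 1 0 2]
```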

#### K-Means

Observe the data given below –

We have 12 data points and are told that they belong to 3 clusters.

In k-means, we maintain k different points in the feature space, each representing the mean of the data in its cluster.

We begin the k-means algorithm by guessing k different random points as the initial values of the k means.

The 3 different mean values divide the entire space into 3 regions – each point belongs to the region of its nearest mean –

We update the means of the 3 clusters –

We cluster the data items again using the new means –

We repeat the above two steps until convergence – until the cluster assignments no longer change.

The following diagram illustrates the four steps of the K-means algorithm –
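
These steps translate directly into code. Below is a minimal NumPy sketch of the procedure described above – the function name `k_means` and its parameters are our own choices for illustration, not a library API:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Cluster the rows of X into k groups via the steps described above."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: guess k distinct random data points as the initial means.
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest mean; this divides
        # the feature space into k regions.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once no assignment changes – convergence.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each mean to the average of the points assigned to it
        # (a mean that loses all its points is left where it is).
        for j in range(k):
            if np.any(labels == j):
                means[j] = X[labels == j].mean(axis=0)
    return means, labels
```

On the 12-point example above, `k_means(X, 3)` would return the 3 converged means and a cluster label for each point.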

#### Applications

- **Marketing:** finding groups of customers with similar behavior, given a large database of customer data containing their properties and past buying records
- **Biology:** classification of plants and animals given their features
- **Libraries:** book ordering
- **Insurance:** identifying groups of motor insurance policyholders with a high average claim cost; identifying frauds
- **City-planning:** identifying groups of houses according to their house type, value, and geographical location
- **Earthquake studies:** clustering observed earthquake epicenters to identify dangerous zones
- **WWW:** document classification; clustering weblog data to discover groups of similar access patterns

#### Example

Recall the Iris dataset from K-nearest neighbors –

Serial No. | Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) |
---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 |
2 | 4.6 | 3.6 | 1.0 | 0.2 |
3 | 5.9 | 3.0 | 4.2 | 1.5 |
4 | 5.4 | 3.0 | 4.5 | 1.5 |
5 | 7.7 | 2.8 | 6.7 | 2.0 |
6 | 7.9 | 3.8 | 6.4 | 2.0 |

The difference between this and the original dataset is that the original had a species label associated with each data item – this one doesn't.

But we know that there are 3 species in total. Can we group them into 3 categories?
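
For the computations that follow, the 6 unlabeled data items can be kept in a NumPy array – one row per item, columns in the order of the table above. This is a convenience for the sketches below, not part of the algorithm itself:

```python
import numpy as np

# The 6 unlabeled samples: sepal length, sepal width, petal length, petal width.
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.6, 3.6, 1.0, 0.2],
    [5.9, 3.0, 4.2, 1.5],
    [5.4, 3.0, 4.5, 1.5],
    [7.7, 2.8, 6.7, 2.0],
    [7.9, 3.8, 6.4, 2.0],
])
```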

#### K-means with k=3

Let us try k-means with k = 3.

We need to start with 3 random initial means. Suppose we start with the following –

Mean | Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) |
---|---|---|---|---|
1 | 4.4 | 2.9 | 1.4 | 0.2 |
2 | 6.1 | 2.9 | 4.7 | 1.4 |
3 | 7.2 | 3.2 | 6.0 | 1.8 |

Let us calculate the Euclidean distance of each data item from each of the 3 means –

Serial No. | Mean 1 | Mean 2 | Mean 3 |
---|---|---|---|
1 | 0.9219 | 3.7 | 5.3122 |
2 | 0.83066 | 4.2273 | 5.8719 |
3 | 3.4336 | 0.5567 | 2.2494 |
4 | 3.5085 | 0.7416 | 2.3706 |
5 | 6.4984 | 2.6324 | 0.9695 |
6 | 6.4265 | 2.7018 | 1.0246 |

Clearly, we can see that data items 1 & 2 are nearest to the 1st mean, 3 & 4 to the 2nd, and 5 & 6 to the 3rd. So we have the clusters {{1, 2}, {3, 4}, {5, 6}}. We now recalculate the means using the new clusters. The new means would be –

Mean | Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) |
---|---|---|---|---|
1 | 4.85 | 3.55 | 1.2 | 0.2 |
2 | 5.65 | 3.0 | 4.35 | 1.5 |
3 | 7.8 | 3.3 | 6.55 | 2.0 |
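
The distance table and the updated means can be reproduced with a few lines of NumPy – a sketch assuming the `X` array shown earlier:

```python
import numpy as np

X = np.array([[5.1, 3.5, 1.4, 0.2], [4.6, 3.6, 1.0, 0.2],
              [5.9, 3.0, 4.2, 1.5], [5.4, 3.0, 4.5, 1.5],
              [7.7, 2.8, 6.7, 2.0], [7.9, 3.8, 6.4, 2.0]])
means = np.array([[4.4, 2.9, 1.4, 0.2],
                  [6.1, 2.9, 4.7, 1.4],
                  [7.2, 3.2, 6.0, 1.8]])

# Euclidean distance of every data item from every mean: entry (i, j) is
# the distance of item i+1 from mean j+1, matching the table above.
dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
print(dists.round(4))        # first row: approx. [0.9220 3.7000 5.3122]

# The nearest mean per item gives the clusters {1, 2}, {3, 4}, {5, 6}.
labels = dists.argmin(axis=1)
print(labels)                # [0 0 1 1 2 2]

# Each new mean is the average of the points in its cluster.
new_means = np.array([X[labels == j].mean(axis=0) for j in range(3)])
print(new_means)             # rows: [4.85 3.55 1.2 0.2] etc., as in the table
```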

Now we calculate the distances of the data points again using the new means –

Serial No. | Mean 1 | Mean 2 | Mean 3 |
---|---|---|---|
1 | 0.3240 | 3.3083 | 6.0904 |
2 | 0.3240 | 3.7914 | 6.6613 |
3 | 3.4777 | 0.2915 | 3.0777 |
4 | 3.6311 | 0.2915 | 3.2098 |
5 | 6.4942 | 3.1646 | 0.5315 |
6 | 6.2964 | 3.1867 | 0.5315 |

We see that although the distances from the different means have changed, the nearest mean for each data item stays the same – there is no change in the clustering. Thus, the algorithm has converged and we can terminate it.
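
Putting the whole iteration together for this example – again a sketch, not a library routine – confirms that the assignments stop changing after the first update of the means:

```python
import numpy as np

X = np.array([[5.1, 3.5, 1.4, 0.2], [4.6, 3.6, 1.0, 0.2],
              [5.9, 3.0, 4.2, 1.5], [5.4, 3.0, 4.5, 1.5],
              [7.7, 2.8, 6.7, 2.0], [7.9, 3.8, 6.4, 2.0]])
means = np.array([[4.4, 2.9, 1.4, 0.2],
                  [6.1, 2.9, 4.7, 1.4],
                  [7.2, 3.2, 6.0, 1.8]])

labels = None
while True:
    # Assign every item to its nearest mean.
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)
    # Converged: the assignments did not change.
    if labels is not None and np.array_equal(new_labels, labels):
        break
    labels = new_labels
    # Update each mean to the average of its cluster's points.
    means = np.array([X[labels == j].mean(axis=0) for j in range(3)])

print(labels)   # [0 0 1 1 2 2] – the clustering {1, 2}, {3, 4}, {5, 6}
```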