Clustering
- One of the most important unsupervised learning tasks
- The process of grouping similar items together
Types of clustering –
- Based on clustering criterion –
    - Monothetic – clusters are defined by a single feature; e.g., 3 clusters of students – high, average, and low marks – based on test score alone (percentage, CGPA)
    - Polythetic – clusters are based on overall similarity across all features – objects that are ‘similar’ to each other are grouped together
- Based on clustering output –
    - Hard – each data item belongs to one and only one cluster – mutually exclusive clustering
    - Soft – each data item gets a probability of belonging to each of the different clusters
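The two output styles can be pictured with hypothetical membership values for a single data item across 3 clusters (the numbers below are made up purely for illustration):

```python
# Hypothetical memberships of one data item across 3 clusters (illustrative only).
hard_membership = [0, 1, 0]        # hard: belongs to cluster 2 and nothing else
soft_membership = [0.1, 0.7, 0.2]  # soft: probability of belonging to each cluster

# Hard clustering assigns exactly one cluster...
assert sum(hard_membership) == 1 and all(m in (0, 1) for m in hard_membership)
# ...while soft memberships form a probability distribution over the clusters.
assert abs(sum(soft_membership) - 1.0) < 1e-9
```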
K-Means
Observe the data given below –
We have 12 data points and are told that they belong to 3 clusters.
In k-means, we maintain k different points, each representing the mean of the data in its cluster.
We begin the k-means algorithm by guessing k different random initial points as the initial values of the k means.
The 3 different mean values divide the entire space into 3 regions –
We update the means of the 3 clusters –
We cluster the data-items again using the new means –
We repeat the above two steps until convergence – until the cluster assignments (and hence the means) no longer change.
Following diagram illustrates the four steps of the K-means algorithm –
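The steps above can be sketched in plain Python (a minimal sketch – the function and variable names are our own, and an empty cluster simply keeps its old mean):

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, means=None, max_iter=100):
    # Step 1: guess k random initial points as the initial means.
    if means is None:
        means = random.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest mean (hard clustering).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, means[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each mean as the centroid of its cluster
        # (an empty cluster keeps its previous mean).
        new_means = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else means[i]
            for i, c in enumerate(clusters)
        ]
        # Step 4: stop when the means no longer change (convergence).
        if new_means == means:
            break
        means = new_means
    return means, clusters
```

For example, `kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], 2, means=[(0.0, 0.0), (10.0, 10.0)])` converges to the means `(0.0, 0.5)` and `(10.0, 10.5)`.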
Applications
- Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;
- Biology: classification of plants and animals given their features;
- Libraries: book ordering;
- Insurance: identifying groups of motor insurance policyholders with a high average claim cost; identifying frauds;
- City-planning: identifying groups of houses according to their house type, value and geographical location;
- Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
- WWW: document classification; clustering weblog data to discover groups of similar access patterns.
Example –
Recall the Iris dataset from K-nearest neighbors –
Serial No. | Sepal Length | Sepal Width | Petal Length | Petal Width |
---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 |
2 | 4.6 | 3.6 | 1.0 | 0.2 |
3 | 5.9 | 3.0 | 4.2 | 1.5 |
4 | 5.4 | 3.0 | 4.5 | 1.5 |
5 | 7.7 | 2.8 | 6.7 | 2.0 |
6 | 7.9 | 3.8 | 6.4 | 2.0 |
The difference between this and the original dataset is that in the original dataset we had a species label associated with each data item – in this dataset we don’t.
But we know that there are 3 species in total. Can we group them into 3 categories?
K-means with k=3
Let us try with k-means where k=3.
We need to start with 3 random initial means – suppose we start with the following –
Serial No. | Sepal Length | Sepal Width | Petal Length | Petal Width |
---|---|---|---|---|
1 | 4.4 | 2.9 | 1.4 | 0.2 |
2 | 6.1 | 2.9 | 4.7 | 1.4 |
3 | 7.2 | 3.2 | 6.0 | 1.8 |
Let us calculate the Euclidean distance of each data item from each mean –
Serial No. | Mean 1 | Mean 2 | Mean 3 |
---|---|---|---|
1 | 0.9219 | 3.7 | 5.3122 |
2 | 0.83066 | 4.2273 | 5.8719 |
3 | 3.4336 | 0.5567 | 2.2494 |
4 | 3.5085 | 0.7416 | 2.3706 |
5 | 6.4984 | 2.6324 | 0.9695 |
6 | 6.4265 | 2.7018 | 1.0246 |
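These are ordinary Euclidean distances over the four features. For example, the first entry (data item 1 vs. mean 1) can be checked by hand:

```python
import math

item_1 = (5.1, 3.5, 1.4, 0.2)  # data item 1
mean_1 = (4.4, 2.9, 1.4, 0.2)  # initial mean 1

# Euclidean distance: square root of the sum of squared feature differences.
dist = math.sqrt(sum((x - m) ** 2 for x, m in zip(item_1, mean_1)))
# sqrt(0.7^2 + 0.6^2 + 0 + 0) = sqrt(0.85) ≈ 0.9219, as in the table.
```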
Clearly, we can see that the data-items 1 & 2 are nearest to the 1st mean, 3 & 4 are nearest to the 2nd, and 5 & 6 are nearest to the last one. So, we have the clusters – {{1, 2}, {3, 4}, {5, 6}}. We now recalculate the means using the new clusters. The new means would be –
Serial No. | Sepal Length | Sepal Width | Petal Length | Petal Width |
---|---|---|---|---|
1 | 4.85 | 3.55 | 1.2 | 0.2 |
2 | 5.65 | 3.0 | 4.35 | 1.5 |
3 | 7.8 | 3.3 | 6.55 | 2.0 |
Now we calculate the distance of the data points again using the new means –
Serial No. | Mean 1 | Mean 2 | Mean 3 |
---|---|---|---|
1 | 0.3240 | 3.3083 | 6.0904 |
2 | 0.3240 | 3.7914 | 6.6613 |
3 | 3.4778 | 0.2915 | 3.0777 |
4 | 3.6311 | 0.2915 | 3.2098 |
5 | 6.4942 | 3.1647 | 0.5315 |
6 | 6.2964 | 3.1867 | 0.5315 |
We see that although the distances from the different means have changed, the nearest mean for each data item stays the same – there is no change in the clustering. Thus, the algorithm has converged and we can terminate it.
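The whole worked example can be reproduced with a few lines of NumPy (a sketch assuming NumPy is available; labels 0/1/2 correspond to means 1/2/3):

```python
import numpy as np

# The six unlabeled iris rows from the example.
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.6, 3.6, 1.0, 0.2],
    [5.9, 3.0, 4.2, 1.5],
    [5.4, 3.0, 4.5, 1.5],
    [7.7, 2.8, 6.7, 2.0],
    [7.9, 3.8, 6.4, 2.0],
])
# The three initial means chosen above.
means = np.array([
    [4.4, 2.9, 1.4, 0.2],
    [6.1, 2.9, 4.7, 1.4],
    [7.2, 3.2, 6.0, 1.8],
])

while True:
    # Distance from every point to every mean, then nearest-mean labels.
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each mean as the centroid of its cluster
    # (no cluster goes empty in this example).
    new_means = np.array([X[labels == k].mean(axis=0) for k in range(3)])
    if np.allclose(new_means, means):  # converged: means no longer change
        break
    means = new_means

print(labels)  # → [0 0 1 1 2 2], i.e. clusters {1, 2}, {3, 4}, {5, 6}
```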