K-means clustering algorithm

Clustering

  • One of the most important unsupervised learning tasks
  • The process of grouping similar items together

Types of clustering –

  • Based on clustering criterion –
    • Monothetic – cluster membership is decided by a single attribute
      • e.g., 3 clusters of students – high, average, and low marks – based on test score alone (percentage, CGPA)
    • Polythetic – cluster membership is decided by overall similarity across many attributes
      • objects that are ‘similar’ to each other are clustered together
  • Based on clustering output –
    • Hard – each data item belongs to one and only one cluster – mutually exclusive clustering
    • Soft – each data item has a probability of belonging to each of the clusters – e.g., an item may belong to cluster 1 with probability 0.7, cluster 2 with 0.2, and cluster 3 with 0.1

K-Means

Observe the data given below –

[figure: the 12 data points]

We have 12 data points and are told that they belong to 3 clusters.

In k-means, we maintain k different points, each of which represents the mean of the data items in its cluster.

We begin the k-means algorithm by guessing k random points as the initial values of the k means.

[figure: 3 random initial means]

The 3 different mean values divide the entire space into 3 regions – each point falls in the region of its nearest mean –

[figure: the space divided into 3 regions]

We update the mean of each of the 3 clusters to the average of the points assigned to it –

[figure: the updated means]

We cluster the data-items again using the new means –

[figure: data items re-clustered with the new means]

We repeat the above two steps until convergence – until the cluster assignments (and hence the means) no longer change.

The following diagram illustrates the four steps of the k-means algorithm –

[figure: the four steps of the k-means algorithm]
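The same procedure can be written out compactly. Below is a minimal sketch in Python/numpy (our own illustration – the function and variable names are not from any particular library):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array of data points, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # step 1: pick k distinct data points at random as the initial means
    means = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iters):
        # step 2: assign every point to its nearest mean (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # shape (n, k)
        new_labels = dists.argmin(axis=1)
        # step 4: converged when no assignment changes
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # step 3: move each mean to the average of the points assigned to it
        # (note: empty clusters are not handled in this sketch)
        means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return means, labels
```

Checking the assignments (rather than the means) for convergence matches the stopping rule described above.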

Applications

  • Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;
  • Biology: classification of plants and animals given their features;
  • Libraries: book ordering;
  • Insurance: identifying groups of motor insurance policyholders with a high average claim cost; identifying frauds;
  • City-planning: identifying groups of houses according to their house type, value and geographical location;
  • Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
  • WWW: document classification; clustering weblog data to discover groups of similar access patterns.

Example –

Recall the Iris dataset from K-nearest neighbors –

Serial No.   Sepal Length   Sepal Width   Petal Length   Petal Width
1            5.1            3.5           1.4            0.2
2            4.6            3.6           1.0            0.2
3            5.9            3.0           4.2            1.5
4            5.4            3.0           4.5            1.5
5            7.7            2.8           6.7            2.0
6            7.9            3.8           6.4            2.0

The difference between this and the original dataset is that the original had a species label associated with each data item – here we have no labels at all.

But we know that there are 3 species in total. Can we group them into 3 categories?

K-means with k=3

Let us try k-means with k=3.

We need to start with 3 random means. Suppose we start with the following –

Mean No.   Sepal Length   Sepal Width   Petal Length   Petal Width
1          4.4            2.9           1.4            0.2
2          6.1            2.9           4.7            1.4
3          7.2            3.2           6.0            1.8

Let us calculate the Euclidean distance, d(x, y) = √Σᵢ(xᵢ − yᵢ)², of each data item from each of the 3 means –

Serial No.   Mean 1    Mean 2   Mean 3
1            0.9219    3.7      5.3122
2            0.83066   4.2273   5.8719
3            3.4336    0.5567   2.2494
4            3.5085    0.7416   2.3706
5            6.4984    2.6324   0.9695
6            6.4265    2.7018   1.0246
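As a quick sanity check, a few lines of numpy (our own sketch, not part of the original worked example) reproduce this distance table:

```python
import numpy as np

# the 6 data items and the 3 initial random means from the tables above
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.6, 3.6, 1.0, 0.2],
              [5.9, 3.0, 4.2, 1.5],
              [5.4, 3.0, 4.5, 1.5],
              [7.7, 2.8, 6.7, 2.0],
              [7.9, 3.8, 6.4, 2.0]])
means = np.array([[4.4, 2.9, 1.4, 0.2],
                  [6.1, 2.9, 4.7, 1.4],
                  [7.2, 3.2, 6.0, 1.8]])

# Euclidean distance of every item from every mean: a (6, 3) matrix
dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
print(dists.round(4))        # the table above, up to rounding
print(dists.argmin(axis=1))  # nearest mean per item: [0 0 1 1 2 2]
```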

Clearly, data items 1 & 2 are nearest to the 1st mean, 3 & 4 to the 2nd, and 5 & 6 to the 3rd. So we have the clusters {{1, 2}, {3, 4}, {5, 6}}. We now recalculate the means using these clusters. The new means are –

Mean No.   Sepal Length   Sepal Width   Petal Length   Petal Width
1          4.85           3.55          1.2            0.2
2          5.65           3.0           4.35           1.5
3          7.8            3.3           6.55           2.0
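Each new mean is simply the coordinate-wise average of the items in its cluster. Continuing the numpy snippet above:

```python
clusters = [[0, 1], [2, 3], [4, 5]]  # row indices of the items in each cluster
new_means = np.array([X[idx].mean(axis=0) for idx in clusters])
print(new_means)
# [[4.85 3.55 1.2  0.2 ]
#  [5.65 3.   4.35 1.5 ]
#  [7.8  3.3  6.55 2.  ]]
```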

Now we calculate the distances of the data items again using the new means –

Serial No.   Mean 1   Mean 2   Mean 3
1            0.3240   3.3083   6.0904
2            0.3240   3.7914   6.6613
3            3.4777   0.2915   3.0777
4            3.6311   0.2915   3.2098
5            6.4942   3.1646   0.5315
6            6.2964   3.1867   0.5315

We see that although the distances from the different means have changed, the nearest mean for each data item stays the same – there is no change in the clustering. Thus, the algorithm has converged and we can terminate it.
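In practice one would use a library implementation instead of iterating by hand – for example, scikit-learn's KMeans (a usage sketch; the cluster numbering it returns is arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[5.1, 3.5, 1.4, 0.2], [4.6, 3.6, 1.0, 0.2],
              [5.9, 3.0, 4.2, 1.5], [5.4, 3.0, 4.5, 1.5],
              [7.7, 2.8, 6.7, 2.0], [7.9, 3.8, 6.4, 2.0]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)           # three groups, e.g. [1 1 0 0 2 2] - labels may be permuted
print(km.cluster_centers_)  # the converged means
```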