Implement K-means clustering algorithm using Python and Scikit-learn

In this tutorial, we implement the k-means clustering algorithm twice: first in core Python, and then using Scikit-learn.

What is k-means?

Read "K-means clustering algorithm" for an introduction and a solved example.

Using core Python

Here we use a sample of six items from the Iris dataset and three random initial means. We run the k-means algorithm for 5 iterations, updating the means at the end of each iteration.

The output list contains the cluster assignments obtained at the end of each iteration. Its last element, output[-1], contains the clusters assigned at the end of the final iteration.

import math

# Six samples from the Iris dataset:
# sepal length, sepal width, petal length, petal width
dataset = [[5.1, 3.5, 1.4, 0.2],
           [4.6, 3.6, 1.0, 0.2],
           [5.9, 3.0, 4.2, 1.5],
           [5.4, 3.0, 4.5, 1.5],
           [7.7, 2.8, 6.7, 2.0],
           [7.9, 3.8, 6.4, 2.0]]
k = 3  # number of clusters
n = 5  # number of iterations
# Initial means, one per cluster
means = [[4.4, 2.9, 1.4, 0.2],
         [6.1, 2.9, 4.7, 1.4],
         [7.2, 3.2, 6.0, 1.8]]

output = []
for _ in range(n):
    # Assignment step: assign each data item to its nearest mean
    iteration_output = []
    for dataitem in dataset:
        distance_list = []
        for m in range(k):
            distance = 0
            for i in range(len(dataitem)):
                distance += (dataitem[i] - means[m][i]) ** 2
            distance_list.append(math.sqrt(distance))
        # Index of the smallest distance is the cluster label
        iteration_output.append(distance_list.index(min(distance_list)))
    output.append(iteration_output)

    # Update step: recompute each mean over all features of its members
    new_means = []
    for i in range(k):
        members = [dataset[j] for j in range(len(dataset))
                   if iteration_output[j] == i]
        if members:
            new_means.append([sum(col) / len(members) for col in zip(*members)])
        else:
            # Keep the old mean if no item was assigned to this cluster
            new_means.append(means[i])
    means = new_means

print(output[-1])  # [0, 0, 1, 1, 2, 2]
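
The same assignment and update steps can also be written with NumPy array operations. The following is only a sketch of one possible variant, not a replacement for the listing above: the function name kmeans_numpy and the max_iter parameter are made up for this illustration, and the loop stops early once the assignments no longer change, which the original loop does not do.

```python
import numpy as np

def kmeans_numpy(data, initial_means, max_iter=5):
    """Run k-means, stopping early once the assignments stop changing."""
    X = np.asarray(data, dtype=float)
    means = np.asarray(initial_means, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Euclidean distance from every sample to every mean, shape (n_samples, k)
        distances = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged, so the means would not move either
        labels = new_labels
        for c in range(len(means)):
            members = X[labels == c]
            if len(members) > 0:  # keep the old mean for an empty cluster
                means[c] = members.mean(axis=0)
    return labels.tolist(), means

dataset = [[5.1, 3.5, 1.4, 0.2],
           [4.6, 3.6, 1.0, 0.2],
           [5.9, 3.0, 4.2, 1.5],
           [5.4, 3.0, 4.5, 1.5],
           [7.7, 2.8, 6.7, 2.0],
           [7.9, 3.8, 6.4, 2.0]]
means = [[4.4, 2.9, 1.4, 0.2],
         [6.1, 2.9, 4.7, 1.4],
         [7.2, 3.2, 6.0, 1.8]]

labels, final_means = kmeans_numpy(dataset, means)
print(labels)
print(final_means)
```

On this six-item sample the assignments settle after the first iteration, so the early exit triggers on the second pass.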

Using Scikit-learn

We load the Iris dataset using Pandas. Then we use Scikit-learn to group the dataset into three clusters. Finally, we use Seaborn to plot the original dataset and the clusters we obtained.
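
Before moving to the full dataset, it may help to see the KMeans API on the six-item sample from the previous section. This is a minimal sketch under stated assumptions: the variable names samples and initial_means are chosen for this illustration, and the starting means are passed through the init parameter with n_init=1 so that Scikit-learn starts from the same means as the core Python version.

```python
import numpy as np
from sklearn.cluster import KMeans

# The six-item sample and the starting means used in the core Python section
samples = np.array([[5.1, 3.5, 1.4, 0.2],
                    [4.6, 3.6, 1.0, 0.2],
                    [5.9, 3.0, 4.2, 1.5],
                    [5.4, 3.0, 4.5, 1.5],
                    [7.7, 2.8, 6.7, 2.0],
                    [7.9, 3.8, 6.4, 2.0]])
initial_means = np.array([[4.4, 2.9, 1.4, 0.2],
                          [6.1, 2.9, 4.7, 1.4],
                          [7.2, 3.2, 6.0, 1.8]])

# n_init=1 because the starting centers are given explicitly
model = KMeans(n_clusters=3, init=initial_means, n_init=1)
labels = model.fit_predict(samples)

print(labels.tolist())         # cluster index assigned to each sample
print(model.cluster_centers_)  # final means, one row per cluster
```

The code for the full dataset follows.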

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

k = 3
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# iris.csv must contain the four feature columns and a 'species' column
df = pd.read_csv('iris.csv')
X = df.loc[:, features].copy()
y_true = df.loc[:, 'species']

# Cluster the samples into k groups; n_init is set explicitly because
# its default differs between scikit-learn versions
model = KMeans(n_clusters=k, n_init=10)
y_pred = pd.Series(model.fit_predict(X), index=X.index)

X['species'] = y_true
# Store labels as strings so they can share a column with 'centers'
X['species_pred'] = y_pred.astype(str)

# Add the cluster centers as extra rows so they show up in the plots
clusters = pd.DataFrame(model.cluster_centers_, columns=features)
clusters['species'] = 'centers'
clusters['species_pred'] = 'centers'

# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
out = pd.concat([X, clusters], ignore_index=True)

sns.pairplot(out.drop(['species_pred'], axis=1), hue="species")
sns.pairplot(out.drop(['species'], axis=1), hue="species_pred")
plt.show()
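
Besides comparing the two plots by eye, the agreement between the clusters and the true species can be summarised numerically. The sketch below is one possible way to do this and rests on a few assumptions: it loads the copy of Iris bundled with Scikit-learn (load_iris) instead of iris.csv so that it runs on its own, it fixes random_state=0 only to make the run repeatable, and it uses a cross-tabulation plus the adjusted Rand index as the comparison.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

# Assumption: the bundled Iris data stands in for iris.csv
iris = load_iris()
X = iris.data
y_true = pd.Series(iris.target_names[iris.target], name="species")

model = KMeans(n_clusters=3, n_init=10, random_state=0)
y_pred = pd.Series(model.fit_predict(X), name="cluster")

# Rows are true species, columns are cluster indices
table = pd.crosstab(y_true, y_pred)
print(table)

# 1.0 means identical groupings; values near 0 mean chance-level agreement
score = adjusted_rand_score(y_true, y_pred)
print(score)
```

Cluster indices are arbitrary, so the table should be read row by row: a species is well separated when almost all of its samples fall into a single column.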