## Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

Here we will consider k-means clustering, which partitions the objects into k clusters. Each cluster is represented by a centroid (the mean of its members), and each observation belongs to the cluster whose centroid is closest to it.
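
To make the centroid idea concrete, here is a minimal NumPy sketch of the two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its points). The toy data, the starting centroids, and the helper names `assign_to_nearest` and `update_centroids` are made up for illustration; this is not how scikit-learn implements `KMeans` internally, and it does not handle clusters that end up empty.

```python
import numpy as np

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its closest centroid (Euclidean)."""
    # distances[i, j] = distance from point i to centroid j
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def update_centroids(points, assignments, k):
    """Recompute each centroid as the mean of the points assigned to it."""
    # Assumes every cluster has at least one member
    return np.array([points[assignments == j].mean(axis=0) for j in range(k)])

# Toy data (illustrative only): two well-separated groups in 2-D
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])  # arbitrary starting guesses

for _ in range(5):  # a few iterations are plenty for this toy example
    assignments = assign_to_nearest(points, centroids)
    centroids = update_centroids(points, assignments, k=2)

print(assignments)
print(centroids)
```

In practice the loop would stop once the assignments no longer change rather than after a fixed number of iterations.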

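For this problem we will work with the handwritten digits dataset that ships with scikit-learn: 1,797 images of 8×8 pixels, each flattened into 64 features.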

In [None]:
# Import the digits data set
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape


In [None]:
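# Fit k-means with 10 clusters (one per digit) and predict a cluster for every image
from sklearn.cluster import KMeans

# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape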

In [None]:
# Visualize the 10 cluster centers as 8x8 images
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

In [None]:
# Cluster ids are arbitrary, so relabel each cluster with the most common
# true digit among its members
import numpy as np
from scipy.stats import mode

labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]
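
The relabeling step can be easier to follow on a tiny example. The sketch below uses made-up arrays (`toy_clusters`, `toy_truth`) and swaps `scipy.stats.mode` for `np.bincount(...).argmax()`, which also returns the most frequent value for non-negative integer labels; it illustrates the idea and is not a replacement for the cell above.

```python
import numpy as np

# Hypothetical toy data: cluster ids from a clustering run and the true classes
toy_clusters = np.array([0, 0, 0, 1, 1, 2, 2, 2])
toy_truth = np.array([7, 7, 3, 3, 3, 5, 5, 5])

toy_labels = np.zeros_like(toy_clusters)
for cluster_id in np.unique(toy_clusters):
    mask = toy_clusters == cluster_id
    # bincount counts how often each integer occurs; argmax picks the most frequent
    toy_labels[mask] = np.bincount(toy_truth[mask]).argmax()

print(toy_labels)
```

Every member of a cluster receives the majority label, so the one point in cluster 0 whose true class is 3 is relabeled as 7 and would count as an error in the accuracy score.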

In [None]:
# Compute the fraction of images whose relabeled cluster matches the true digit
from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)

In [None]:
# Visualize the confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
mat = confusion_matrix(digits.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');