Get nearest point to centroid, scikit-learn?

I tried the answer above, but it gave me duplicates in the result. It finds the closest data point regardless of the clustering results, so it can return points from the same cluster more than once.

If you want to find the closest data point within the cluster that each center belongs to, try this.

With this solution the returned data points all come from different clusters, and the number of returned points is the same as the number of clusters.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# assume the total number of data is 100
all_data = [ i for i in range(100) ]
tf_matrix = np.random.random((100, 100))

# set your own number of clusters
num_clusters = 2

m_km = KMeans(n_clusters=num_clusters)  
m_km.fit(tf_matrix)
m_clusters = m_km.labels_.tolist()

centers = np.array(m_km.cluster_centers_)

closest_data = []
for i in range(num_clusters):
    center_vec = centers[i]
    data_idx_within_i_cluster = [ idx for idx, clu_num in enumerate(m_clusters) if clu_num == i ]

    one_cluster_tf_matrix = np.zeros((len(data_idx_within_i_cluster), centers.shape[1]))
    for row_num, data_idx in enumerate(data_idx_within_i_cluster):
        one_row = tf_matrix[data_idx]
        one_cluster_tf_matrix[row_num] = one_row

    closest, _ = pairwise_distances_argmin_min(center_vec.reshape(1, -1), one_cluster_tf_matrix)
    closest_idx_in_one_cluster_tf_matrix = closest[0]
    closest_data_row_num = data_idx_within_i_cluster[closest_idx_in_one_cluster_tf_matrix]
    data_id = all_data[closest_data_row_num]

    closest_data.append(data_id)

closest_data = list(set(closest_data))

assert len(closest_data) == num_clusters
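
The same per-cluster search can also be written more compactly with NumPy masks. This is just a sketch, reusing the tf_matrix, m_km and num_clusters defined above:

closest_data = []
for i in range(num_clusters):
    # row indices of the points assigned to cluster i
    members = np.where(m_km.labels_ == i)[0]
    # nearest member of cluster i to its own center
    closest, _ = pairwise_distances_argmin_min(
        m_km.cluster_centers_[i].reshape(1, -1), tf_matrix[members])
    closest_data.append(members[closest[0]])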

This is not the medoid, but here's something you can try:

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from sklearn.metrics import pairwise_distances_argmin_min
>>> X = np.random.randn(10, 4)
>>> km = KMeans(n_clusters=2).fit(X)
>>> closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
>>> closest
array([0, 8])

The array closest contains the index of the point in X that is closest to each centroid. So X[0] is the closest point in X to centroid 0, and X[8] is the closest to centroid 1.
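
If you need the points themselves rather than their indices, you can index X with that array directly (continuing the same session; outputs not shown here):

>>> X[closest]            # the nearest point to each centroid, one row per centroid
>>> km.labels_[closest]   # the cluster each of those points is assigned to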


What you are trying to achieve is basically vector quantization, but in "reverse". SciPy has a highly optimized function for that, much faster than the other methods mentioned. The output is the same as with pairwise_distances_argmin_min().

    from scipy.cluster.vq import vq

    # centroids: array of shape (n_clusters, n_features) with your centroids
    # points:    array of shape (n_points, n_features) with your data points

    closest, distances = vq(centroids, points)
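
For concreteness, here is a minimal runnable sketch of that call. The names points and centroids are placeholders matching the comments above, and the data and cluster count are made up:

    import numpy as np
    from scipy.cluster.vq import vq
    from sklearn.cluster import KMeans

    points = np.random.randn(1000, 4)                    # your data points
    centroids = KMeans(n_clusters=5).fit(points).cluster_centers_

    # for each centroid: index of the nearest point in `points` and its distance
    closest, distances = vq(centroids, points)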

The big difference shows up when you run it on very large arrays. I ran it with 100,000+ points and 65,000+ centroids, and this method turned out to be about 4 times faster than scikit-learn's pairwise_distances_argmin_min(), as shown below:

    import time

    start_time = time.time()
    cl2, dst2 = vq(centroids, points)
    print("--- %s seconds ---" % (time.time() - start_time))
    --- 32.13545227050781 seconds ---

    start_time = time.time()
    cl2, dst2 = pairwise_distances_argmin_min(centroids, points)
    print("--- %s seconds ---" % (time.time() - start_time))
    --- 131.21064710617065 seconds ---