Find partial membership with KMeans clustering algorithm

You should be able to use Accord.NET to get the "centroids" of the clusters that the K-means algorithm finds. Those are essentially the centres of the individual clusters. You should then be able to calculate the distance between your new data point and each of the centroids to see which of the centroids are close to your point. (The Decide method returns only the single closest cluster.)

I have not tried this, but it seems that KMeans exposes Clusters, which is a KMeansClusterCollection and has the Centroids property (see the docs). It also exposes the Distance property, which returns the function used to calculate the distance between data points.

Using these, you should be able to compare the distance of your data point with the centroids of all the clusters and decide how close the point is to individual clusters.
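
To make the comparison concrete, here is a minimal sketch in plain F#. It does not call Accord.NET at all: the centroids array is hypothetical (in practice you would read the values from the fitted model), and the euclidean and distancesToCentroids helpers are names made up for this illustration.

```fsharp
// Hypothetical centroids; in practice these would come from the fitted model.
let centroids =
    [| [| 0.0; 0.0 |]
       [| 3.0; 4.0 |]
       [| 10.0; 0.0 |] |]

// Plain Euclidean distance between two points of equal dimension.
let euclidean (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + (x - y) * (x - y)) 0.0 a b
    |> sqrt

// Distance from a point to every centroid, paired with the cluster index
// and sorted so that the closest cluster comes first.
let distancesToCentroids (point: float[]) =
    centroids
    |> Array.mapi (fun i c -> i, euclidean point c)
    |> Array.sortBy snd

let ranked = distancesToCentroids [| 3.0; 0.0 |]
```

Each entry of ranked is a (cluster index, distance) pair, so you can see not only the closest cluster but also how far behind the runners-up are.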

Implementing k-means from scratch is not that hard (there is a nice post from Mathias Brandewinder on this), but it seems that Accord.NET exposes all the information that you need in this particular case, so perhaps that's all you need (getting all the details right in a custom implementation is always the hardest part...).


As mentioned by Tomas, Accord.NET already gives you many of the building blocks. In particular, calling clusterModel.Scores gives you the (negative) distances to the cluster centroids (see the source code).

From the negative distances, you can compute an approximate class membership score by taking exponentials, similar to what you would do to compute a Gaussian PDF. In F#, that would look like:

open System

// Scores returns the negative distances between each point
// and each cluster centroid
let negDistances = clusterModel.Scores vals
// Compute an estimated cluster assignment score
let clusterMembership =
    negDistances
    |> Array.map (fun distances ->
        // Take the exponential of the (negative) distances,
        // as in computing a Gaussian PDF
        let expDist = distances |> Array.map Math.Exp
        let total = Array.sum expDist
        // Normalise so that the scores for each point sum to 1
        expDist
        |> Array.map (fun d -> d / total)
    )
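
One practical wrinkle with the exponentials: when a point is far from every centroid, each Math.Exp call can underflow to zero and the division then produces NaN. Below is a minimal sketch of a common workaround, assuming plain float arrays of scores. The membership function and the sample numbers are made up for illustration and are not part of Accord.NET.

```fsharp
open System

// Turns one row of scores (negative distances) into weights that sum to 1.
// Subtracting the largest score first leaves the ratios unchanged,
// but keeps the largest exponential at exactly 1 so the total is never zero.
let membership (scores: float[]) =
    let maxScore = Array.max scores
    let exps = scores |> Array.map (fun s -> Math.Exp(s - maxScore))
    let total = Array.sum exps
    exps |> Array.map (fun e -> e / total)

// Made-up scores for one point against three clusters.
let weights = membership [| -1.0; -2.0; -800.0 |]

// A point far from both clusters: the naive version would divide zero by zero.
let farAway = membership [| -1000.0; -1001.0 |]
```

The result is the same as the straightforward version whenever that version is well defined, so it can be dropped into the Array.map above without changing the interpretation of the scores.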

There are a couple of caveats here:

  • Standard KMeans in Accord uses Euclidean distances, meaning that each direction carries the same weight. Depending on the nature of your data, this may not lead to reasonable results (picture two clusters, each shaped like a long cigar).
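  • The above class membership calculation does not take cluster covariance into account either. To be closer to the truth, you would have to compute the Mahalanobis distance (which uses each cluster's covariance matrix), exponentiate, and then scale by the inverse square root of the determinant of the covariance matrix, as in a multivariate Gaussian PDF. This will fail for singleton clusters, whose covariance is degenerate.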
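
To illustrate the two caveats, here is a rough sketch of a covariance-aware membership score in plain F#. It makes a loud simplifying assumption: each cluster's covariance is diagonal (one variance per dimension), so the variance-scaled squared distance and the determinant are easy to compute. The Cluster type, the density and memberships functions, and all the numbers are hypothetical and not part of Accord.NET.

```fsharp
open System

// A cluster summarised by its centroid and per-dimension variances
// (a diagonal covariance matrix; a simplifying assumption for this sketch).
type Cluster = { Mean: float[]; Variance: float[] }

// Gaussian density of a point under one cluster.
let density (cluster: Cluster) (point: float[]) =
    // Squared distance with each dimension scaled by its variance
    // (the squared Mahalanobis distance for a diagonal covariance).
    let scaledSq =
        Array.map3 (fun x m v -> (x - m) * (x - m) / v) point cluster.Mean cluster.Variance
        |> Array.sum
    // The determinant of a diagonal matrix is the product of its entries.
    let det = cluster.Variance |> Array.fold (fun acc v -> acc * v) 1.0
    let k = float point.Length
    Math.Exp(-0.5 * scaledSq) / sqrt (Math.Pow(2.0 * Math.PI, k) * det)

// Normalise the densities so that they sum to 1 across clusters.
let memberships (clusters: Cluster[]) (point: float[]) =
    let densities = clusters |> Array.map (fun c -> density c point)
    let total = Array.sum densities
    densities |> Array.map (fun d -> d / total)

// Two made-up "cigar" clusters: one stretched along x, one along y.
let clusters =
    [| { Mean = [| 0.0; 0.0 |]; Variance = [| 25.0; 0.25 |] }
       { Mean = [| 4.0; 0.0 |]; Variance = [| 0.25; 25.0 |] } |]

// The point (3, 0) is nearer to the second centroid in Euclidean terms,
// but lies along the long axis of the first cluster.
let result = memberships clusters [| 3.0; 0.0 |]
```

In this example the first cluster gets the larger share even though its centroid is farther away, which is the cigar effect described above. A cluster with zero variance in any dimension (for instance a singleton) makes the division blow up, so real code would need a guard or a regularisation term.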

Regarding your third question: I would not re-implement. It may seem easy initially, but there are usually plenty of corner cases and stability issues that you only run into after some time.