Which algorithm and what combination of hyper-parameters will be the best to cluster this data?

If the clustering doesn't work, always try to improve the preprocessing first. Algorithms such as k-means are very sensitive to feature scaling, so the scaling needs to be chosen carefully.
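To illustrate why scaling matters for distance-based methods such as k-means, here is a minimal sketch using only the Python standard library. The toy data (age in years, income in dollars) and the helper `standardize` are made up for illustration; z-score standardization is just one possible choice of scaling, not necessarily the right one for your data.

```python
from math import dist
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column: subtract the column mean, divide by the column std."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

# Hypothetical toy data: (age in years, income in dollars).
rows = [(25, 50_000), (60, 50_500), (26, 52_000), (45, 90_000)]
scaled = standardize(rows)

# Which of points 1 and 2 is closer to point 0 under Euclidean distance?
raw_nearest = min((1, 2), key=lambda j: dist(rows[0], rows[j]))
scaled_nearest = min((1, 2), key=lambda j: dist(scaled[0], scaled[j]))

print("nearest to point 0 on raw data:   ", raw_nearest)
print("nearest to point 0 on scaled data:", scaled_nearest)
```

On the raw data the income column dominates the distance, so the 60-year-old looks closest to the 25-year-old. After standardization the 26-year-old is closest. k-means is built on exactly these distances, so it would see different clusters in the two cases.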

A Gaussian mixture model (GMM) is clearly your first choice here. It may be worth trying out different implementations: R's Mclust is very slow, and scikit-learn's GMM is sometimes unstable. ELKI is a bit harder to get started with, but its EM implementation has usually given me the best results.
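If you start with scikit-learn, a rough sketch could look like the one below. It assumes that several restarts (`n_init`) help with the instability mentioned above and that BIC is an acceptable criterion for choosing the number of components; both are my assumptions, not something established for your data. The helper `fit_gmm_by_bic` and the synthetic two-blob data are made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_by_bic(X, max_components=5, n_init=5, seed=0):
    """Fit GMMs with 1..max_components components and keep the lowest-BIC model."""
    best_model, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        model = GaussianMixture(
            n_components=k,
            covariance_type="full",
            n_init=n_init,        # several restarts; the best run is kept
            random_state=seed,
        )
        model.fit(X)
        bic = model.bic(X)        # lower BIC is better
        if bic < best_bic:
            best_model, best_bic = model, bic
    return best_model

# Synthetic stand-in for real data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(200, 2)),
    rng.normal([10.0, 10.0], 0.5, size=(200, 2)),
])

model = fit_gmm_by_bic(X)
labels = model.predict(X)
print("chosen number of components:", model.n_components)
```

Remember to apply whatever scaling you settled on before fitting. It is also worth comparing `covariance_type` settings ("full", "tied", "diag", "spherical"), since that choice changes which cluster shapes the model can represent.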

Apart from GMM, it is likely worth trying out correlation clustering. These algorithms assume that each cluster lies on some manifold (e.g., a line). Examples include ORCLUS, LMCLUS, CASH, and 4C. In my opinion, though, these mostly work on synthetic toy data.

