Cluster Analysis in ELKI
Version information: Updated for ELKI 0.7.0
Overview
ELKI contains a wide variety of clustering algorithms. These can be roughly divided into the following families:
 Hierarchical agglomerative clustering (e.g. HAC, AGNES, SLINK)
 Kmeans clustering family (e.g. Lloyd, Forgy, MacQueen, Elkan, Hamerly, Philips, PAM, KMedians)
 Mixture modeling family (Gaussian Mixture Modeling GMM, EM with different models)
 Density based clustering (e.g. DBSCAN, OPTICS, HDBSCAN*, MeanShift)
 Matrixoptimization algorithms (Affinity Propagation)
 Subspace clustering algorithms (axisparallel subspaces only, e.g. PROCLUS, SUBCLU, P3C)
 Correlation clustering algorithms (arbitrarily oriented, e.g. CASH, 4C, LMCLUS, ORCLUS)
 Uncertain data clustering (e.g. UKMeans, FDBSCAN, Consensus)
 Biclustering algorithms (Cheng and Church)
Recommendations
Hierarchical clustering
For singlelinkage, SLINK is the fastest algorithm (Quadratic runtime with small constant factors, linear memory).
For completelinkage, CLINK is fast but appears to give worse results than the others.
For other linkages, the Anderberg is usually the best choice we currently offer.
Kmeans
The classic Lloyd kmeans is not worth trying. The Hamerly, Elkan, and MacQueen (0.7.1: Compare and Sort) algorithms appear to always outperform the textbook algorithm.
Xmeans can be useful to choose the parameter k.
DBSCAN and OPTICS
Always use an index with DBSCAN. Recommeded is to use the SimpleCoverTree index, which works for most data sets and requires no other parameters except the distance function.
MinPts should be chosen larger than the data set dimensionality (e.g. 2*dim). For 2dimensional data, the absolute minimum to use is 4, but values such as 10 or 20 may be appropriate for noisy data.
If you do now have prior knowledge about a good value for Epsilon, you can try the following:

try OPTICS, with epsilon set to an upper bound (lower = faster) and look for a height where there are nice valleys visible

try OPTICSXi, with xi set to e.g. 0.01 (1% decrease)

use KNNDistancesSampler with k = MinPts  1. Look in the bottom left area of the resulting plot, for a “knee”. As a rule of thumb, 10% of core points may be enough, unless you see a clear knee further to the right.