Benchmarking with ELKI

Setup

For benchmarking, do NOT set -verbose, as verbose output often comes at a cost in computation time, and verbosity varies a lot from algorithm to algorithm. Instead, use -time to get timing information for the actual algorithms. To disable automatic evaluation, use -evaluator NoAutomaticEvaluation.
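For example, a benchmark run of DBSCAN from the command line might look like the following sketch. The bundle jar name, the input file, and the DBSCAN parameter values are placeholders of our own choosing; only -time and -evaluator NoAutomaticEvaluation are the flags discussed above:

```shell
# Illustrative invocation; adjust jar name, input file, and parameters to your setup.
java -jar elki-bundle.jar KDDCLIApplication \
  -dbc.in data/aloi-8d.csv.gz \
  -algorithm clustering.DBSCAN \
  -dbscan.epsilon 0.01 -dbscan.minpts 20 \
  -time \
  -evaluator NoAutomaticEvaluation
```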

Fair benchmarking

ELKI, much like any other system (and Java-based systems in particular), comes with a certain performance cost compared to highly optimized code. An optimized C implementation should be able to beat any ELKI algorithm easily. Runtime results obtained from entirely independent implementations must therefore be considered misleading.

When comparing different algorithms, you must therefore benchmark implementations within the same framework, at a comparable level of optimization, and with the same index support.

Many “reference” algorithms such as DBSCAN benefit immensely from index structures in low-dimensional spaces and on large data sets!

Therefore, we strongly encourage you to implement all comparison methods in ELKI, too; this is essentially the only way to ensure a fair comparison. If you have found a bottleneck, send us a patch to improve it for everyone!
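Note that ELKI's -time option measures only the actual algorithm step. If you time Java code yourself, be aware that JIT compilation and garbage collection easily distort naive wall-clock measurements. A minimal sketch of a more careful harness (our own illustration, not part of the ELKI API): warm up first, then take the median of several measured runs:

```java
import java.util.Arrays;

public class Bench {
    // Runs the task a few times unmeasured so the JIT can compile the hot
    // path, then reports the median of several measured runs (the median is
    // more robust than the mean against GC pauses and OS jitter).
    static long medianRuntimeNanos(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] times = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            times[i] = System.nanoTime() - start;
        }
        Arrays.sort(times);
        return times[runs / 2];
    }

    public static void main(String[] args) {
        double[] data = new double[1_000_000];
        Arrays.fill(data, 0.5);
        long nanos = medianRuntimeNanos(() -> {
            double sum = 0;
            for (double v : data) {
                sum += v;
            }
        }, 10, 11);
        System.out.println("median runtime: " + nanos + " ns");
    }
}
```

For serious measurements, a dedicated tool such as JMH handles these pitfalls far more thoroughly than this sketch.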

Of course we try to make ELKI as fast as possible (but we will not sacrifice extensibility or code readability for this).

We also strongly encourage reviewers to pay attention to fair benchmarks. Unfortunately, many articles fall short when it comes to fair experiments.

A study on runtime evaluation discusses potential problems in empirical efficiency evaluations, and compares standard algorithms in ELKI with other implementations. As a side result, this study shows that the ELKI implementations are very competitive in terms of efficiency:

Hans-Peter Kriegel, Erich Schubert, Arthur Zimek:
The (black) art of runtime evaluation: Are we comparing algorithms or implementations?
Knowledge and Information Systems 2016, DOI: 10.1007/s10115-016-1004-2

Performance of ELKI versions

If you are not convinced, here is an empirical example of how large the differences between implementations can be. For this, we compare ELKI with ELKI, just using different versions.

For this benchmark, we use the ALOI image data set, which has 110,250 objects in just 8 dimensions (so the algorithms, distance functions, and index structures can be expected to work well). We tested several workloads, each with and without an R*-tree index. For this experiment, all R*-trees were loaded incrementally (bulk loading is faster, but was not yet available in all versions).

The test system is a 2.67 GHz Intel Xeon X5650 (single-threaded, memory limit 32 GB); all runtimes are in seconds.

Task                     ELKI 0.1  ELKI 0.2  ELKI 0.3  ELKI 0.4  ELKI 0.5.0  ELKI 0.5.5  ELKI 0.6.0  ELKI 0.6.5
Load only                     n/a      13.0      15.4       3.2         4.0         4.0         2.7         2.3
Load into R*-Tree             n/a      21.9      22.0       9.5         8.6         8.6         3.3         2.7
10NN queries each             n/a    2077.2    1579.2     657.5       656.2       584.9       546.3       443.0
10NN with R*-Tree             n/a     290.9     126.0     111.0        72.2        71.5        22.0        20.0
Single-Link clustering        n/a    5099.3    4662.1    1353.7      1383.3       875.4       621.8       613.3
DBSCAN clustering          3140.8    2368.4    1651.6     669.1       610.6       603.2       478.3       443.3
DBSCAN with R*-Tree           n/a     303.4     116.7     111.4        87.7        91.3        35.0        33.7
OPTICS clustering          3133.1    2395.3    1884.8     704.5       586.4       546.4       511.4       445.3
OPTICS with R*-Tree           n/a     448.6     235.4     179.9       142.3       141.5        69.0        56.8
Local Outlier Factor          n/a    2189.0    1643.6     710.8       570.9       605.2       545.0       442.5
LOF with R*-Tree              n/a     321.4     146.4     135.8        93.0        90.8        38.4        35.7
k-means Lloyd k=100         249.7     125.6      81.2      77.4        60.2        57.3        76.7        86.5

As you can see, ELKI 0.3 had a slightly more expensive parser, but its algorithms were already faster than in ELKI 0.2. ELKI 0.4 brought major performance gains, which we attribute to removing Java's auto-boxing and unboxing in various places. ELKI 0.5 and 0.6 used improved data structures. For ELKI 0.6.5, many optimizations were low-level ones, in the parser and in specialized code paths for Euclidean distance. Note that k-means is randomized; we observe a standard deviation of about ±20 seconds, and we did not enable the improved seeding methods available in later versions.

However, we cannot remove all boxing and unboxing in Java without losing much of the generality of ELKI. We expect significant further performance gains to be possible with a low-level C implementation (which is, however, not of scientific interest to us). Therefore, benchmarking ELKI against implementations in C (or R, or Matlab, or any other language with high-performance mathematical libraries embedded) is not sound!
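To illustrate the boxing cost mentioned above, compare summing boxed Double values with summing a primitive double[]. This is a generic illustration of the language mechanism, not ELKI code:

```java
import java.util.ArrayList;
import java.util.List;

public class BoxingDemo {
    // Summing boxed Double values: every element is a separate heap object,
    // and each loop iteration implicitly unboxes it.
    static double sumBoxed(List<Double> values) {
        double sum = 0;
        for (Double v : values) {
            sum += v; // implicit unboxing on every iteration
        }
        return sum;
    }

    // Summing a primitive array: contiguous memory, no object headers,
    // no indirection; this is what the de-boxed code paths rely on.
    static double sumPrimitive(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        List<Double> boxed = new ArrayList<>(n);
        double[] primitive = new double[n];
        for (int i = 0; i < n; i++) {
            boxed.add(1.0); // autoboxing wraps each value in a Double object
            primitive[i] = 1.0;
        }
        System.out.println(sumBoxed(boxed) == sumPrimitive(primitive)); // true
    }
}
```

Both variants compute the same sum; the boxed version simply pays for a million object references, which is the kind of overhead that generic Java code cannot always avoid.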

ELKI in comparison to other software

We advocate not comparing algorithms across different frameworks, because implementation details can make a huge difference. For example, k-means from the R “flexclust” package appears to be about half as fast as R's native kmeans. In the most extreme case in this benchmark, the same algorithm is 280x faster in one implementation than in the other (LOF in ELKI vs. LOF in “Data mining with R”). So from a scientific point of view, even a performance difference of two orders of magnitude can be explained by implementation differences alone.

The same test system: 2.67 GHz Intel Xeon X5650 (single-threaded, memory limit 32 GB); all runtimes in seconds.

Task                    ELKI 0.6.0  WEKA 3.7.5
Load only                     2.74  3.6 (filter)
Load into R*-Tree             3.33  n/a
10NN queries each           546.32  n/a
10NN with R*-Tree            21.96  n/a
Single-Link clustering      621.80  (too large)
DBSCAN clustering           478.28  5090.19
DBSCAN with R*-Tree          35.02  n/a
OPTICS clustering           511.43  4667.42 (incomplete)
OPTICS with R*-Tree          69.01  n/a
EM clustering                84.65  835.56
Local Outlier Factor        545.02  2611.60
LOF with R*-Tree             38.39  n/a
k-means Lloyd k=100          76.70  372.49 (with “-fast”)

Single-Link clustering could not be run in Weka or R. Weka crashes with an IllegalArgumentException due to an integer overflow; R detects the overflow and fails with “vector size specified is too large”. This is not surprising, as both use an implementation that needs quadratic memory and cubic time. The implementation in ELKI is the SLINK algorithm, which needs only linear memory and quadratic time, and is thus expected to be significantly faster on a large data set such as this. (But hierarchical clustering is not sensible for large data sets anyway.) R fails with the same error on EM, because it pre-clusters the data with hierarchical clustering, and the OPTICS implementation in Weka is incomplete: it does not extract clusters from the plot. k-means results have a high variance, with a standard deviation of more than 15 seconds for ELKI; depending on the random seed, it may converge quickly.
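The scale of the problem behind these failures is easy to verify: with 110,250 objects, a full pairwise distance matrix has more entries than a 32-bit int can count, and storing it as doubles would need far more than the 32 GB memory limit. A quick sketch of the arithmetic:

```java
public class MatrixSize {
    // Entries of a full n x n pairwise distance matrix, in 64-bit arithmetic.
    static long matrixEntries(int n) {
        return (long) n * n;
    }

    // Memory in GiB for storing those entries as doubles (8 bytes each).
    static double matrixGiB(int n) {
        return matrixEntries(n) * 8.0 / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        int n = 110_250; // size of the ALOI data set used above
        System.out.println("entries: " + matrixEntries(n)); // 12155062500
        // The same product in plain 32-bit int arithmetic silently overflows
        // to a negative number, the failure mode Weka ran into:
        System.out.println("int n*n: " + (n * n));
        System.out.printf("memory:  %.1f GiB%n", matrixGiB(n)); // 90.6
    }
}
```

At roughly 90 GiB for the matrix alone, any algorithm that materializes all pairwise distances is out of reach on this system, while SLINK's linear memory footprint is not.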

The DBSCAN and OPTICS runtimes in Weka decreased 8x with extension version 1.0.3, simply by removing unnecessary safety checks. ELKI's DBSCAN has become 5x faster across versions. Do not benchmark runtimes of code that you did not profile and optimize to the same extent: the results will be meaningless!

More examples of performance changes across versions can be found in this study:

Hans-Peter Kriegel, Erich Schubert, Arthur Zimek:
The (black) art of runtime evaluation: Are we comparing algorithms or implementations?
Knowledge and Information Systems 2016, DOI: 10.1007/s10115-016-1004-2