ELKI: Environment for Developing KDD-Applications Supported by Index-Structures
ELKI is open-source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. To achieve high performance and scalability, ELKI offers data index structures such as the R*-tree that can provide major performance gains. ELKI is designed to be easy to extend for researchers and students in this domain, and welcomes contributions of additional methods. ELKI aims to provide a large collection of highly parameterizable algorithms, to allow easy and fair evaluation and benchmarking of algorithms.
Data mining research leads to many algorithms for similar tasks.
A fair and useful comparison of these algorithms is difficult for several reasons:
- Implementations of comparison partners are not at hand.
- If implementations by different authors are provided, an efficiency evaluation is biased: it measures how much effort each author put into efficient programming rather than the merits of the algorithms.
On the other hand, efficient data management tools such as index structures can have a considerable impact on data mining tasks and are therefore useful for a broad variety of algorithms.
In ELKI, data mining algorithms and data management tasks are separated, so that each can be evaluated independently. This separation makes ELKI unique among data mining frameworks like Weka or RapidMiner and frameworks for index structures like GiST. At the same time, ELKI is open to arbitrary data types, distance or similarity measures, and file formats. The fundamental approach is the independence of file parsers or database connections, data types, distances, distance functions, and data mining algorithms. Helper classes, e.g., for algebraic or analytic computations, are available to all algorithms on equal terms.
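To make the idea of this separation concrete, here is a minimal Java sketch in which the algorithm code depends only on a distance abstraction, so the distance function can be exchanged without touching the algorithm. All names (`Distance`, `Euclidean`, `Manhattan`, `KnnOutlierSketch`) are hypothetical and chosen for illustration; this is not ELKI's actual API.

```java
import java.util.Arrays;

/** Hypothetical distance abstraction for illustration; NOT ELKI's API. */
interface Distance {
  double between(double[] a, double[] b);
}

/** Euclidean (L2) distance. */
final class Euclidean implements Distance {
  public double between(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}

/** Manhattan (L1) distance. */
final class Manhattan implements Distance {
  public double between(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += Math.abs(a[i] - b[i]);
    }
    return sum;
  }
}

/** The algorithm only sees the Distance interface, never a concrete class. */
final class KnnOutlierSketch {
  /** Distance from data[query] to its k-th nearest other point (k >= 1). */
  static double knnDistance(double[][] data, int query, int k, Distance dist) {
    double[] d = new double[data.length - 1];
    int n = 0;
    for (int i = 0; i < data.length; i++) {
      if (i != query) {
        d[n++] = dist.between(data[query], data[i]);
      }
    }
    Arrays.sort(d);
    return d[k - 1];
  }

  public static void main(String[] args) {
    double[][] data = { {0, 0}, {3, 4}, {6, 8}, {1, 1} };
    // Same algorithm, two different distance functions.
    System.out.println(knnDistance(data, 0, 2, new Euclidean())); // 5.0
    System.out.println(knnDistance(data, 0, 2, new Manhattan())); // 7.0
  }
}
```

Because the distance is passed in as a parameter, comparing two distance functions means changing one argument, and any difference in the result can be attributed to the distance rather than to the surrounding code.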
With the development and publication of ELKI, we humbly hope to be of service to the data mining and database research community. The framework is free for scientific usage (“free” as in “open source”, see ELKI license for details). If you use ELKI in scientific publications, we would appreciate credit in the form of a citation of the appropriate publication (see ELKI publications), that is, the publication related to the release of ELKI you were using.
The people behind ELKI are documented on the team page.
The ELKI web page: Tutorials, HowTos, Documentation
A basic tutorial example will show you how to run k-Means and EM clustering with ELKI.
The most important documentation pages are: Tutorial, JavaDoc, FAQ, InputFormat, DataTypes, DistanceFunctions, DataSets, Development, Parameterization, Visualization, Benchmarking, and the list of Algorithms and publications implemented in ELKI.
Getting ELKI: Download and Citation Policy
There is a list of publications that accompany the ELKI releases. When using ELKI in your scientific work, you should cite the publication corresponding to the ELKI release you are using, to give credit. This also helps to improve the repeatability of your experiments. We would also appreciate it if you contributed your algorithm to ELKI to allow others to reproduce your results and compare with your algorithm (which in turn will likely get you citations). We try to document every publication used for implementing ELKI: the page related publications lists over 220 publications that we used for implementing ELKI, and is generated from annotations in the source code. We also list publications that used or cited ELKI, see references.
ELKI is compiled using Maven. The compilation process is explained here.
Information on ELKI APIs and coding styles is collected at the development starting page. Please contribute!
Efficiency Benchmarking with ELKI
ELKI is fast (see some of our benchmark results), but the focus lies on a broad coverage of algorithms and variations. We discourage cross-platform benchmarking, because it is easy to produce misleading results by comparing apples and oranges. For fair comparability, you should implement all algorithms within ELKI and use the same APIs. We have also observed that the JDK version has a large impact on runtime performance. To make your results reproducible, please cite the publication of the version you used.
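As a rough illustration of comparing variants under identical conditions, the following Java sketch times two implementations of the same task in the same JVM, on the same data, with untimed warm-up runs before measuring. The class `SameJvmBenchmark` and its methods are hypothetical helpers written for this example and are not part of ELKI; a serious benchmark would need more care (for example, a dedicated harness and repeated JVM launches).

```java
/** Hypothetical same-JVM timing sketch for illustration; not part of ELKI. */
final class SameJvmBenchmark {
  /** Stores results so the measured work is less likely to be optimized away. */
  static double sink;

  /** Runs the task `warmup` times untimed, then returns the fastest of `runs` timed runs in ns. */
  static long bestOfNanos(Runnable task, int warmup, int runs) {
    for (int i = 0; i < warmup; i++) {
      task.run();
    }
    long best = Long.MAX_VALUE;
    for (int i = 0; i < runs; i++) {
      long start = System.nanoTime();
      task.run();
      best = Math.min(best, System.nanoTime() - start);
    }
    return best;
  }

  /** Variant A: plain summation. */
  static double sumPlain(double[] x) {
    double s = 0;
    for (double v : x) {
      s += v;
    }
    return s;
  }

  /** Variant B: Kahan (compensated) summation. */
  static double sumKahan(double[] x) {
    double s = 0, c = 0;
    for (double v : x) {
      double y = v - c;
      double t = s + y;
      c = (t - s) - y;
      s = t;
    }
    return s;
  }

  public static void main(String[] args) {
    double[] data = new double[1_000_000];
    for (int i = 0; i < data.length; i++) {
      data[i] = i * 0.5;
    }
    // Both variants see the same input, the same JVM, and the same warm-up policy.
    long plain = bestOfNanos(() -> sink = sumPlain(data), 5, 10);
    long kahan = bestOfNanos(() -> sink = sumKahan(data), 5, 10);
    System.out.println("JDK " + System.getProperty("java.version")
        + ": plain " + plain + " ns, Kahan " + kahan + " ns");
  }
}
```

Printing the JDK version alongside the timings records one of the factors that affects runtime, which helps when results are reported.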
Bug Reports and Contact
We appreciate any comments, suggestions, and code contributions.
You can contact the core development team by e-mail.
Design Goals
- Extensibility: ELKI has a very modular design. We want to allow arbitrary combinations of data types, distance functions, algorithms, input formats, index structures, and evaluation methods.
- Contributions: ELKI grows only as fast as people contribute. By having a modular design that allows small contributions such as single distance functions and single algorithms, we can have students and external contributors participate in the progress of ELKI.
- Completeness: for an exhaustive comparison of methods, we aim at covering as much published and credited work as we can.
- Fairness: It is easy to do an unfair comparison by badly implementing a competitor. We try to implement every method as well as we can, and by publishing the source code we allow for external improvements. We try to add all proposed improvements, such as index structures for faster range and kNN queries.
- Performance: the modular architecture of ELKI allows optimized versions of algorithms and index structures for acceleration.
- Development Progress: ELKI is changing with every release. To accommodate new features and enhance performance, API breakages are unavoidable. We hope to get a stable API with the 1.0 release, but we are not there yet.
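The fairness and performance goals above rest on the idea that an index can accelerate a query without changing the algorithm that issues it. The following Java sketch illustrates this with a deliberately simple one-dimensional range count: a linear scan and a sorted-array "index" implement the same interface, and a density test in the style of a core-point check works with either. All names (`RangeCounter`, `LinearScan`, `SortedIndex`, `CorePointSketch`) are hypothetical and do not correspond to ELKI's API or to its R*-tree implementation.

```java
import java.util.Arrays;

/** Hypothetical range-query abstraction for illustration; NOT ELKI's API. */
interface RangeCounter {
  /** Number of stored values v with lo <= v <= hi. */
  int countBetween(double lo, double hi);
}

/** Baseline: inspects every value, O(n) per query. */
final class LinearScan implements RangeCounter {
  private final double[] values;

  LinearScan(double[] values) {
    this.values = values.clone();
  }

  public int countBetween(double lo, double hi) {
    int count = 0;
    for (double v : values) {
      if (v >= lo && v <= hi) {
        count++;
      }
    }
    return count;
  }
}

/** Simple "index": sorted copy plus binary search, O(log n) per query. */
final class SortedIndex implements RangeCounter {
  private final double[] sorted;

  SortedIndex(double[] values) {
    sorted = values.clone();
    Arrays.sort(sorted);
  }

  public int countBetween(double lo, double hi) {
    return firstIndex(hi, true) - firstIndex(lo, false);
  }

  /** First i with sorted[i] > key (strict) or sorted[i] >= key (not strict). */
  private int firstIndex(double key, boolean strict) {
    int lo = 0, hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      boolean goRight = strict ? sorted[mid] <= key : sorted[mid] < key;
      if (goRight) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }
}

/** Algorithm code: unaware of which RangeCounter it is given. */
final class CorePointSketch {
  static boolean isCore(RangeCounter q, double center, double eps, int minPts) {
    return q.countBetween(center - eps, center + eps) >= minPts;
  }

  public static void main(String[] args) {
    double[] data = { 1.0, 1.2, 1.1, 5.0, 9.0, 1.3 };
    RangeCounter scan = new LinearScan(data);
    RangeCounter index = new SortedIndex(data);
    System.out.println(isCore(scan, 1.1, 0.25, 3) + " " + isCore(index, 1.1, 0.25, 3));
    System.out.println(isCore(scan, 5.0, 0.25, 3) + " " + isCore(index, 5.0, 0.25, 3));
  }
}
```

Since both implementations return identical counts, any runtime difference between them reflects the data management layer alone, which is the kind of independent evaluation the design aims for.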