Using Indexes for accelerated algorithms
Many data mining algorithms can be accelerated by using appropriate index structures. Which index is appropriate largely depends on the type of query that the algorithm performs against the database.
Common types of queries that can be accelerated include:
- k nearest neighbor queries KNNSearcher
- range queries RangeSearcher
- reverse k nearest neighbor queries RKNNSearcher
- priority search PrioritySearcher
- similarity range search SimilarityQuery
- window queries (currently not yet implemented, similar to range queries)
Not every index can accelerated each query equally well (or at all). In particular reverse kNN queries need highly specialized index structures.
ELKI 0.8.0 will, in many cases, automatically create a suitable index for you. For low-dimensional data and Minkowski-type distances, it will prefer a k-d-tree. For other metrics, it will usually prefer a vantage point (aka VP-tree, ball tree).
The R-tree family is a very well established index structure. With algorithms such as Sort-Tile-Recursive the tree can very efficiently bulk-loaded, while the R*-tree tries to keep the tree efficient while performing modifications to it.
R-trees are very flexible, and can accelerate any distance function for wich a reasonable point to rectangle minimum distance can be defined. In ELKI, any class implementing the SpatialPrimitiveDistance can be used. This includes in particular Euclidean and other Minkowski norms, but to some extend also cosine distance can be accelerated. In contrast to M-trees below, the index supports all of these distances at the same time.
Using R-trees in ELKI is simple, you just need to enable the RStarTreeFactory via the parameters:
-db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory -treeindex.pagesize 4096 -spatial.bulkstrategy SortTileRecursiveBulkSplit
The optimal page size is data set and use dependant, in particular the dimensionality and the average number of objects to return in each query play an important role. Since most users are using a static data set, using a bulk load such as SortTileRecursiveBulkSplit is recommended.
M-Trees, also known as Ball-tree, are specialized trees. They can be used with any distance function that is metrical, i.e. that satisfies the triangle inequality. Futhermore, an M-tree needs to know the distance function at construction time, it cannot be queried with arbitrary distances.
To use an M-Tree, again set the
-db.index parameter to the appropriate factory class MTreeFactory:
-db.index tree.metrical.mtreevariants.mtree.MTreeFactory -treeindex.pagesize 4096 -mtree.distancefunction EuclideanDistance
Note that you need to set the same distance function that you are using in your algorithm later on, and that the distance function needs to be metrical. The current implementation of M-Trees in ELKI support dynamic insertion, but no deletions. Please contribute!
ELKI includes lightweight k-d-trees, as in-memory index structures only. k-d-trees only support few distance functions that have one-dimensional deltas as lower bounds.
See Distances for the full list of available distance functions. This table only lists distance functions where the indexing support is somewhat analyzed. Note that metric trees such as the M-tree, cover tree, and vantage point tree (aka Ball tree) must be built with this distance function.
|Distance Function||R-Tree||M-Tree / Covertree / VP-Tree||VA-File||k-d-tree||LSH|
|N||Not possible (e.g. not metric)|
|?||Unknown or not implemented|
|*||Only in certain situations (additional constraints, e.g. metric)|