Using Indexes for accelerated algorithms

Many data mining algorithms can be accelerated by using appropriate index structures. Which index is appropriate largely depends on the type of query that the algorithm performs against the database.

Common types of queries that can be accelerated include:

Not every index can accelerated each query equally well (or at all). In particular reverse kNN queries need highly specialized index structures.

Automatic indexing

ELKI 0.8.0 will, in many cases, automatically create a suitable index for you. For low-dimensional data and Minkowski-type distances, it will prefer a k-d-tree. For other metrics, it will usually prefer a vantage point (aka VP-tree, ball tree).

R-Trees

The R-tree family is a very well established index structure. With algorithms such as Sort-Tile-Recursive the tree can very efficiently bulk-loaded, while the R*-tree tries to keep the tree efficient while performing modifications to it.

R-trees are very flexible, and can accelerate any distance function for wich a reasonable point to rectangle minimum distance can be defined. In ELKI, any class implementing the SpatialPrimitiveDistance can be used. This includes in particular Euclidean and other Minkowski norms, but to some extend also cosine distance can be accelerated. In contrast to M-trees below, the index supports all of these distances at the same time.

Using R-trees in ELKI is simple, you just need to enable the RStarTreeFactory via the parameters:

-db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory
-treeindex.pagesize 4096
-spatial.bulkstrategy SortTileRecursiveBulkSplit

The optimal page size is data set and use dependant, in particular the dimensionality and the average number of objects to return in each query play an important role. Since most users are using a static data set, using a bulk load such as SortTileRecursiveBulkSplit is recommended.

M-Trees

M-Trees, also known as Ball-tree, are specialized trees. They can be used with any distance function that is metrical, i.e. that satisfies the triangle inequality. Futhermore, an M-tree needs to know the distance function at construction time, it cannot be queried with arbitrary distances.

To use an M-Tree, again set the -db.index parameter to the appropriate factory class MTreeFactory:

-db.index tree.metrical.mtreevariants.mtree.MTreeFactory
-treeindex.pagesize 4096
-mtree.distancefunction EuclideanDistance

Note that you need to set the same distance function that you are using in your algorithm later on, and that the distance function needs to be metrical. The current implementation of M-Trees in ELKI support dynamic insertion, but no deletions. Please contribute!

k-d-Trees

ELKI includes lightweight k-d-trees, as in-memory index structures only. k-d-trees only support few distance functions that have one-dimensional deltas as lower bounds.

Compatibility matrix

See Distances for the full list of available distance functions. This table only lists distance functions where the indexing support is somewhat analyzed. Note that metric trees such as the M-tree, cover tree, and vantage point tree (aka Ball tree) must be built with this distance function.

Distance Function R-Tree M-Tree / Covertree / VP-Tree VA-File k-d-tree LSH
EuclideanDistance Y Y Y Y Y
ManhattanDistance Y Y Y Y Y
LPNormDistance Y * Y Y *
MaximumDistance Y Y ? Y ?
MinimumDistance Y N ? ? ?
SquaredEuclideanDistance Y N ? Y ?
ArcCosineDistance Y Y ? ? ?
CosineDistance Y N ? ? ?
SqrtCosineDistance Y Y ? ? ?
CanberraDistance Y Y ? ? ?
LatLngDistance Y Y ? ? ?
LngLatDistance Y Y ? ? ?
HistogramIntersectionDistance Y Y ? ? ?
Flag Meaning
Y Supported
N Not possible (e.g. not metric)
? Unknown or not implemented
* Only in certain situations (additional constraints, e.g. metric)