Sparse Data in ELKI

While ELKI has some support for sparse data, not everything will work with it - but a lot does.

Many algorithms are implemented in a data-agnostic way. For example, many outlier detection algorithms only need a distance function, and hence will work well with sparse data. Similarly, DBSCAN will work - but standard k-means does not, because it requires the data to be a d-dimensional vector space where means are computed.

Some index structures such as R-Trees are known to not scale very well to high-dimensional data, and sparse data data is high-dimensional. Others, such as covertrees and VP-trees only need a metric, and do not care about the underyling data representation at all.

Loading sparse data

While for dense vectors there is the common format of CSV files (or whitespace separated, as in Gnuplot), for sparse vectors there are multiple more or less compatible formats. You will have to choose the appropriate input format first (or write your own parser!)

Memory representation

ELKI 0.8.0 includes two types of true sparse vectors, SparseFloatVector and SparseDoubleVector that vary obvious in memory and precision. Further releases may also include different variants because of memory and performance differences. Some parsers can also produce bit vectors and dense vectors, if needed.

Data filtering

ELKI parsers try to operate streaming. As such, sparse vectors will vary in dimensionality, and not form a vector field. Some algorithms and distance functions expect all vectors to have the same dimensionality. The filter SparseVectorFieldFilter will trivially set the maximum dimensionality by re-scanning the data set in memory.

For text data, a common normalization of data is TF-IDF. The filter InverseDocumentFrequencyNormalization will normalize vectors by their IDF accordingly.

Distance functions

While most distance functions will accept any number vector field, there are a number of distance functions implemented that are optimized for sparse vectors and allow the dimensionality of the two vectors to vary. SparseEuclideanDistance for example is faster for sparse vectors and does not have this restriction. We appreciate contributions of additional optimized implementations to ELKI!

Two views of sparse data

Depending on the context, sparse vectors can be interpreted in two different ways that are equally reasonable.

If you want to have better control of the handling of missing attributes, you may consider to implement your own distance function with explicit handling for missing values.