Synthetic Data for Shared-Nearest-Neighbors

These data sets were originally created for the publication:

M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

Variant         Size
all-relevant    10d 20d 40d 80d 160d 320d 640d
10-relevant     10d 20d 40d 80d 160d 320d 640d
cyc-relevant    10d 20d 40d 80d 160d 320d 640d
half-relevant   10d 20d 40d 80d 160d 320d 640d
all-dependent   10d 20d 40d 80d 160d 320d 640d
10-dependent    10d 20d 40d 80d 160d 320d 640d

All sizes are derived from the 640-dimensional version by keeping the first n dimensions.
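
The derivation described above can be sketched in a few lines of Python. This is only an illustration and not the tooling used to produce the files: the function names keep_first_dimensions and derive_subsets are hypothetical, and the points are assumed to be already parsed into lists of 640 numeric coordinates.

```python
from typing import Dict, List, Sequence

# Dimensionalities listed in the table of variants.
SIZES = [10, 20, 40, 80, 160, 320, 640]


def keep_first_dimensions(points: Sequence[Sequence[float]], n: int) -> List[List[float]]:
    """Return a copy of every point truncated to its first n coordinates."""
    if n < 1:
        raise ValueError("n must be at least 1")
    truncated = []
    for point in points:
        if len(point) < n:
            raise ValueError("point has fewer than n dimensions")
        truncated.append(list(point[:n]))
    return truncated


def derive_subsets(points: Sequence[Sequence[float]]) -> Dict[int, List[List[float]]]:
    """Build one truncated copy of the full data set per entry in SIZES."""
    return {n: keep_first_dimensions(points, n) for n in SIZES}
```

Because every smaller version is a prefix of the 640-dimensional one, each point keeps the same values in the dimensions that the versions have in common.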

Data Generator Specifications

These data sets were generated with the data generator included in ELKI, albeit with an older version of ELKI that, for example, used a different random number generator. The following XML data specifications were used:

all-relevant, cyc-relevant, half-relevant

Then only the first 10, 20, … dimensions were retained to produce the subsets of each dimensionality.

Simplified versions of the all-relevant data set

The following versions of the all-relevant data set (not used in the article) have been simplified by scaling down the cluster standard deviations, which makes the clusters easier to separate and easier to index:

Rescaled standard deviations of all-relevant (for use with the ELKI data generator): 0.75 0.50 0.25 0.10
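
To illustrate what rescaling the standard deviations does, the following Python sketch draws a Gaussian cluster whose per-dimension standard deviations are multiplied by one of the factors above. It is not the ELKI data generator and does not reproduce the published files: the function sample_cluster, the seed handling, and the example mean and standard deviation values are all assumptions made for illustration.

```python
import random
from typing import List, Sequence

# Scaling factors listed above for the simplified all-relevant versions.
FACTORS = [0.75, 0.50, 0.25, 0.10]


def sample_cluster(mean: Sequence[float], stddev: Sequence[float], size: int,
                   scale: float = 1.0, seed: int = 0) -> List[List[float]]:
    """Draw `size` points from an axis-parallel Gaussian cluster.

    Each per-dimension standard deviation is multiplied by `scale`,
    so a smaller scale yields a tighter cluster around the same mean.
    """
    rng = random.Random(seed)
    return [
        [m + scale * s * rng.gauss(0.0, 1.0) for m, s in zip(mean, stddev)]
        for _ in range(size)
    ]


# Hypothetical cluster parameters, chosen only for this example.
example_mean = [0.5, 0.5]
example_stddev = [0.1, 0.2]
clusters = {f: sample_cluster(example_mean, example_stddev, 100, scale=f) for f in FACTORS}
```

With a fixed seed, every point in a rescaled cluster lies on the line between the cluster mean and the corresponding unscaled point, which is why the clusters overlap less as the factor shrinks.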