Synthetic Data for Shared-Nearest-Neighbors
These data sets were originally created for the publication:
M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.
Variant | Size (dimensionality) | | | | | | |
---|---|---|---|---|---|---|---|
all-relevant | 10d | 20d | 40d | 80d | 160d | 320d | 640d |
10-relevant | 10d | 20d | 40d | 80d | 160d | 320d | 640d |
cyc-relevant | 10d | 20d | 40d | 80d | 160d | 320d | 640d |
half-relevant | 10d | 20d | 40d | 80d | 160d | 320d | 640d |
all-dependent | 10d | 20d | 40d | 80d | 160d | 320d | 640d |
10-dependent | 10d | 20d | 40d | 80d | 160d | 320d | 640d |
Every size is derived from the 640-dimensional version by keeping only its first n dimensions.
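The truncation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used to build the files: the function name `keep_first_dimensions` is hypothetical, the rows are assumed to hold numeric attributes only (any label column in the real files would need separate handling), and the tiny generated table merely stands in for the 640-dimensional data.

```python
def keep_first_dimensions(rows, n):
    """Return new rows that contain only the first n attributes of each row."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for row in rows:
        if len(row) < n:
            raise ValueError("a row has fewer than n dimensions")
    return [list(row[:n]) for row in rows]


# Stand-in for the 640-dimensional data: 3 points with predictable values.
full = [[point * 1000 + dim for dim in range(640)] for point in range(3)]

# One subset per dimensionality listed in the table.
sizes = (10, 20, 40, 80, 160, 320, 640)
subsets = {n: keep_first_dimensions(full, n) for n in sizes}
```

Because every subset is a prefix of the same rows, the 10-dimensional version is also the first ten dimensions of the 20-dimensional one, and so on up the table.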
Data Generator Specifications
These data sets were generated with the data generator included in ELKI. An older version of ELKI was used, which, for example, had a different random number generator. The following XML data specifications were used:
all-relevant, cyc-relevant, half-relevant
Then only the first 10, 20, … dimensions were retained to produce the subset for each dimensionality.
Simplified versions of the all-relevant data set:
The following versions of the all-relevant data set (not used in the article) have been simplified by scaling the cluster standard deviations, which makes the clusters easier to separate and easier to index:
Rescaled standard deviations of all-relevant (for use with the ELKI data generator): 0.75, 0.50, 0.25, 0.10
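To see why a smaller standard deviation makes clusters easier to separate, the following sketch shrinks a one-dimensional cluster toward its mean. It is an illustration only: the simplified data sets come from the ELKI data generator, not from post-processing like this. The helper `rescale_about_mean` and the sample values are hypothetical, and reading the listed values 0.75, 0.50, 0.25, 0.10 as multiplicative factors is an assumption.

```python
import math
import statistics


def rescale_about_mean(values, factor):
    """Move each value toward the cluster mean so the spread is multiplied by factor."""
    center = statistics.fmean(values)
    return [center + factor * (v - center) for v in values]


# Hypothetical one-dimensional cluster.
cluster = [1.0, 2.0, 4.0, 7.0]
original_std = statistics.pstdev(cluster)

# Assumption: the listed values act as multiplicative scale factors.
for factor in (0.75, 0.50, 0.25, 0.10):
    shrunk = rescale_about_mean(cluster, factor)
    # The mean is unchanged and the standard deviation scales by the factor.
    assert math.isclose(statistics.fmean(shrunk), statistics.fmean(cluster))
    assert math.isclose(statistics.pstdev(shrunk), factor * original_std)
```

The cluster centers stay where they are while the spread shrinks, so neighboring clusters overlap less as the factor decreases.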