Package elki.clustering.subspace
Class DOC
- java.lang.Object
-
- elki.clustering.subspace.DOC
-
- All Implemented Interfaces:
Algorithm
,ClusteringAlgorithm<Clustering<SubspaceModel>>
,SubspaceClusteringAlgorithm<SubspaceModel>
- Direct Known Subclasses:
FastDOC
@Title("DOC: Density-based Optimal projective Clustering") @Reference(authors="C. M. Procopiuc, M. Jones, P. K. Agarwal, T. M. Murali", title="A Monte Carlo algorithm for fast projective clustering", booktitle="Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD \'02)", url="https://doi.org/10.1145/564691.564739", bibkey="DBLP:conf/sigmod/ProcopiucJAM02") public class DOC extends java.lang.Object implements SubspaceClusteringAlgorithm<SubspaceModel>
DOC is a sampling based subspace clustering algorithm.Reference:
C. M. Procopiuc, M. Jones, P. K. Agarwal, T. M. Murali
A Monte Carlo algorithm for fast projective clustering
In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD '02).- Since:
- 0.6.0
- Author:
- Florian Nuecke
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
DOC.Par
Parameterization class.-
Nested classes/interfaces inherited from interface elki.Algorithm
Algorithm.Utils
-
-
Field Summary
Fields Modifier and Type Field Description protected double
alpha
Relative density threshold parameter alpha.protected double
beta
Balancing parameter for importance of points vs. dimensionsprivate static Logging
LOG
The logger for this class.protected RandomFactory
rnd
Randomizer used internally for sampling points.protected double
w
Half width parameter.
-
Constructor Summary
Constructors Constructor Description DOC(double alpha, double beta, double w, RandomFactory random)
Constructor.
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description protected double
computeClusterQuality(int clusterSize, int numRelevantDimensions)
Computes the quality of a cluster based on its size and number of relevant attributes, as described via the μ-function from the paper.protected boolean
dimensionIsRelevant(int dimension, Relation<? extends NumberVector> relation, DBIDs points)
Utility method to test if a given dimension is relevant as determined via a set of reference points (i.e. if the variance along the attribute is lower than the threshold).protected DBIDs
findNeighbors(DBIDRef q, long[] nD, ArrayModifiableDBIDs S, Relation<? extends NumberVector> relation)
Find the neighbors of point q in the given subspaceTypeInformation[]
getInputTypeRestriction()
Get the input type restriction used for negotiating the data query.protected Cluster<SubspaceModel>
makeCluster(Relation<? extends NumberVector> relation, DBIDs C, long[] D)
Utility method to create a subspace cluster from a list of DBIDs and the relevant attributes.Clustering<SubspaceModel>
run(Relation<? extends NumberVector> relation)
Performs the DOC or FastDOC (as configured) algorithm.protected Cluster<SubspaceModel>
runDOC(Relation<? extends NumberVector> relation, ArrayModifiableDBIDs S, int d, int n, int m, int r, int minClusterSize)
Performs a single run of DOC, finding a single cluster.-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface elki.clustering.ClusteringAlgorithm
autorun
-
-
-
-
Field Detail
-
LOG
private static final Logging LOG
The logger for this class.
-
alpha
protected double alpha
Relative density threshold parameter alpha.
-
beta
protected double beta
Balancing parameter for importance of points vs. dimensions
-
w
protected double w
Half width parameter.
-
rnd
protected RandomFactory rnd
Randomizer used internally for sampling points.
-
-
Constructor Detail
-
DOC
public DOC(double alpha, double beta, double w, RandomFactory random)
Constructor.- Parameters:
alpha
- α relative density threshold.beta
- β balancing parameter for size vs. dimensionality.w
- half width parameter.random
- Random factory
-
-
Method Detail
-
getInputTypeRestriction
public TypeInformation[] getInputTypeRestriction()
Description copied from interface:Algorithm
Get the input type restriction used for negotiating the data query.- Specified by:
getInputTypeRestriction
in interfaceAlgorithm
- Returns:
- Type restriction
-
run
public Clustering<SubspaceModel> run(Relation<? extends NumberVector> relation)
Performs the DOC or FastDOC (as configured) algorithm.This will run exhaustively, i.e. run DOC until no clusters are found anymore / the database size has shrunk below the threshold for minimum cluster size.
- Parameters:
relation
- Data relation
-
runDOC
protected Cluster<SubspaceModel> runDOC(Relation<? extends NumberVector> relation, ArrayModifiableDBIDs S, int d, int n, int m, int r, int minClusterSize)
Performs a single run of DOC, finding a single cluster.- Parameters:
relation
- used to get actual values for DBIDs.S
- The set of points we're working on.d
- Dimensionality of the data set we're currently working on.r
- Size of random samples.m
- Number of inner iterations (per seed point).n
- Number of outer iterations (seed points).minClusterSize
- Minimum size a cluster must have to be accepted.- Returns:
- a cluster, if one is found, else
null
.
-
findNeighbors
protected DBIDs findNeighbors(DBIDRef q, long[] nD, ArrayModifiableDBIDs S, Relation<? extends NumberVector> relation)
Find the neighbors of point q in the given subspace- Parameters:
q
- Query pointnD
- Subspace maskS
- Remaining data pointsrelation
- Data relation- Returns:
- Neighbors
-
dimensionIsRelevant
protected boolean dimensionIsRelevant(int dimension, Relation<? extends NumberVector> relation, DBIDs points)
Utility method to test if a given dimension is relevant as determined via a set of reference points (i.e. if the variance along the attribute is lower than the threshold).- Parameters:
dimension
- the dimension to test.relation
- used to get actual values for DBIDs.points
- the points to test.- Returns:
true
if the dimension is relevant.
-
makeCluster
protected Cluster<SubspaceModel> makeCluster(Relation<? extends NumberVector> relation, DBIDs C, long[] D)
Utility method to create a subspace cluster from a list of DBIDs and the relevant attributes.- Parameters:
relation
- to compute a centroid.C
- the cluster points.D
- the relevant dimensions.- Returns:
- an object representing the subspace cluster.
-
computeClusterQuality
protected double computeClusterQuality(int clusterSize, int numRelevantDimensions)
Computes the quality of a cluster based on its size and number of relevant attributes, as described via the μ-function from the paper.- Parameters:
clusterSize
- the size of the cluster.numRelevantDimensions
- the number of dimensions relevant to the cluster.- Returns:
- a quality measure (only use this to compare the quality to that other clusters).
-
-