Package elki.clustering.subspace
Class DOC
- java.lang.Object
-
- elki.clustering.subspace.DOC
-
- All Implemented Interfaces:
Algorithm,ClusteringAlgorithm<Clustering<SubspaceModel>>,SubspaceClusteringAlgorithm<SubspaceModel>
- Direct Known Subclasses:
FastDOC
@Title("DOC: Density-based Optimal projective Clustering") @Reference(authors="C. M. Procopiuc, M. Jones, P. K. Agarwal, T. M. Murali", title="A Monte Carlo algorithm for fast projective clustering", booktitle="Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD \'02)", url="https://doi.org/10.1145/564691.564739", bibkey="DBLP:conf/sigmod/ProcopiucJAM02") public class DOC extends java.lang.Object implements SubspaceClusteringAlgorithm<SubspaceModel>
DOC is a sampling based subspace clustering algorithm.Reference:
C. M. Procopiuc, M. Jones, P. K. Agarwal, T. M. Murali
A Monte Carlo algorithm for fast projective clustering
In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD '02).- Since:
- 0.6.0
- Author:
- Florian Nuecke
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static classDOC.ParParameterization class.-
Nested classes/interfaces inherited from interface elki.Algorithm
Algorithm.Utils
-
-
Field Summary
Fields Modifier and Type Field Description protected doublealphaRelative density threshold parameter alpha.protected doublebetaBalancing parameter for importance of points vs. dimensionsprivate static LoggingLOGThe logger for this class.protected RandomFactoryrndRandomizer used internally for sampling points.protected doublewHalf width parameter.
-
Constructor Summary
Constructors Constructor Description DOC(double alpha, double beta, double w, RandomFactory random)Constructor.
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description protected doublecomputeClusterQuality(int clusterSize, int numRelevantDimensions)Computes the quality of a cluster based on its size and number of relevant attributes, as described via the μ-function from the paper.protected booleandimensionIsRelevant(int dimension, Relation<? extends NumberVector> relation, DBIDs points)Utility method to test if a given dimension is relevant as determined via a set of reference points (i.e. if the variance along the attribute is lower than the threshold).protected DBIDsfindNeighbors(DBIDRef q, long[] nD, ArrayModifiableDBIDs S, Relation<? extends NumberVector> relation)Find the neighbors of point q in the given subspaceTypeInformation[]getInputTypeRestriction()Get the input type restriction used for negotiating the data query.protected Cluster<SubspaceModel>makeCluster(Relation<? extends NumberVector> relation, DBIDs C, long[] D)Utility method to create a subspace cluster from a list of DBIDs and the relevant attributes.Clustering<SubspaceModel>run(Relation<? extends NumberVector> relation)Performs the DOC or FastDOC (as configured) algorithm.protected Cluster<SubspaceModel>runDOC(Relation<? extends NumberVector> relation, ArrayModifiableDBIDs S, int d, int n, int m, int r, int minClusterSize)Performs a single run of DOC, finding a single cluster.-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface elki.clustering.ClusteringAlgorithm
autorun
-
-
-
-
Field Detail
-
LOG
private static final Logging LOG
The logger for this class.
-
alpha
protected double alpha
Relative density threshold parameter alpha.
-
beta
protected double beta
Balancing parameter for importance of points vs. dimensions
-
w
protected double w
Half width parameter.
-
rnd
protected RandomFactory rnd
Randomizer used internally for sampling points.
-
-
Constructor Detail
-
DOC
public DOC(double alpha, double beta, double w, RandomFactory random)Constructor.- Parameters:
alpha- α relative density threshold.beta- β balancing parameter for size vs. dimensionality.w- half width parameter.random- Random factory
-
-
Method Detail
-
getInputTypeRestriction
public TypeInformation[] getInputTypeRestriction()
Description copied from interface:AlgorithmGet the input type restriction used for negotiating the data query.- Specified by:
getInputTypeRestrictionin interfaceAlgorithm- Returns:
- Type restriction
-
run
public Clustering<SubspaceModel> run(Relation<? extends NumberVector> relation)
Performs the DOC or FastDOC (as configured) algorithm.This will run exhaustively, i.e. run DOC until no clusters are found anymore / the database size has shrunk below the threshold for minimum cluster size.
- Parameters:
relation- Data relation
-
runDOC
protected Cluster<SubspaceModel> runDOC(Relation<? extends NumberVector> relation, ArrayModifiableDBIDs S, int d, int n, int m, int r, int minClusterSize)
Performs a single run of DOC, finding a single cluster.- Parameters:
relation- used to get actual values for DBIDs.S- The set of points we're working on.d- Dimensionality of the data set we're currently working on.r- Size of random samples.m- Number of inner iterations (per seed point).n- Number of outer iterations (seed points).minClusterSize- Minimum size a cluster must have to be accepted.- Returns:
- a cluster, if one is found, else
null.
-
findNeighbors
protected DBIDs findNeighbors(DBIDRef q, long[] nD, ArrayModifiableDBIDs S, Relation<? extends NumberVector> relation)
Find the neighbors of point q in the given subspace- Parameters:
q- Query pointnD- Subspace maskS- Remaining data pointsrelation- Data relation- Returns:
- Neighbors
-
dimensionIsRelevant
protected boolean dimensionIsRelevant(int dimension, Relation<? extends NumberVector> relation, DBIDs points)Utility method to test if a given dimension is relevant as determined via a set of reference points (i.e. if the variance along the attribute is lower than the threshold).- Parameters:
dimension- the dimension to test.relation- used to get actual values for DBIDs.points- the points to test.- Returns:
trueif the dimension is relevant.
-
makeCluster
protected Cluster<SubspaceModel> makeCluster(Relation<? extends NumberVector> relation, DBIDs C, long[] D)
Utility method to create a subspace cluster from a list of DBIDs and the relevant attributes.- Parameters:
relation- to compute a centroid.C- the cluster points.D- the relevant dimensions.- Returns:
- an object representing the subspace cluster.
-
computeClusterQuality
protected double computeClusterQuality(int clusterSize, int numRelevantDimensions)Computes the quality of a cluster based on its size and number of relevant attributes, as described via the μ-function from the paper.- Parameters:
clusterSize- the size of the cluster.numRelevantDimensions- the number of dimensions relevant to the cluster.- Returns:
- a quality measure (only use this to compare the quality to that other clusters).
-
-