Class DOC

    • Field Detail

      • LOG

        private static final Logging LOG
        The logger for this class.
      • alpha

        protected double alpha
        Relative density threshold parameter alpha.
      • beta

        protected double beta
        Balancing parameter for the importance of points vs. dimensions.
      • w

        protected double w
        Half width parameter.
      • rnd

        protected RandomFactory rnd
        Randomizer used internally for sampling points.
    • Constructor Detail

      • DOC

        public DOC(double alpha,
                   double beta,
                   double w,
                   RandomFactory random)
        alpha - α relative density threshold.
        beta - β balancing parameter for size vs. dimensionality.
        w - half width parameter.
        random - Random factory used for sampling points.
    • Method Detail

      • getInputTypeRestriction

        public TypeInformation[] getInputTypeRestriction()
        Description copied from interface: Algorithm
        Get the input type restriction used for negotiating the data query.
        Specified by:
        getInputTypeRestriction in interface Algorithm
        Type restriction
      • run

        public Clustering<SubspaceModel> run(Relation<? extends NumberVector> relation)
        Performs the DOC or FastDOC (as configured) algorithm.

        This runs exhaustively: DOC is repeated until no further cluster is found or the number of remaining points has dropped below the minimum cluster size.

        relation - Data relation
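
        The exhaustive loop described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the class's code: ExhaustiveRunSketch, runExhaustively and the findOne callback (a stand-in for a single DOC run that returns null when nothing is found) are hypothetical names, and points are represented by integer ids.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch, not part of the documented API.
final class ExhaustiveRunSketch {
  /**
   * Repeatedly extracts one cluster from the remaining points until none is
   * found or too few points remain. findOne is an assumed stand-in for a
   * single cluster search; it returns null when no cluster is found.
   */
  static List<List<Integer>> runExhaustively(List<Integer> points, int minClusterSize,
      Function<List<Integer>, List<Integer>> findOne) {
    List<Integer> remaining = new ArrayList<>(points); // work on a copy
    List<List<Integer>> clusters = new ArrayList<>();
    while (remaining.size() > minClusterSize) {
      List<Integer> cluster = findOne.apply(remaining);
      if (cluster == null || cluster.isEmpty()) {
        break; // no further cluster found
      }
      clusters.add(cluster);
      remaining.removeAll(cluster); // clustered points are not reused
    }
    return clusters;
  }
}
```

        The input list is copied first, so the caller's data is left untouched.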
      • runDOC

        protected Cluster<SubspaceModel> runDOC(Relation<? extends NumberVector> relation,
                                                ArrayModifiableDBIDs S,
                                                int d,
                                                int n,
                                                int m,
                                                int r,
                                                int minClusterSize)
        Performs a single run of DOC, finding a single cluster.
        relation - used to get actual values for DBIDs.
        S - The set of points we're working on.
        d - Dimensionality of the data set we're currently working on.
        n - Number of outer iterations (seed points).
        m - Number of inner iterations (per seed point).
        r - Size of random samples.
        minClusterSize - Minimum size a cluster must have to be accepted.
        A cluster if one is found, otherwise null.
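
        The text above does not say how n, m and r are chosen. As a rough sketch, assuming the formulas usually quoted for DOC (r = log(2d) / log(1/(2β)), n = 2/α, m = (2/α)^r · ln 4) and truncation to integers, they could be derived as follows. DocIterationSketch and its method names are hypothetical, and the class itself may compute these values differently.

```java
// Hypothetical sketch; the formulas are an assumption, not taken from this page.
final class DocIterationSketch {
  /** Sample size r = log(2d) / log(1 / (2 * beta)), truncated; expects 0 < beta < 0.5. */
  static int sampleSize(int d, double beta) {
    return (int) Math.abs(Math.log(2.0 * d) / Math.log(1.0 / (2.0 * beta)));
  }

  /** Number of outer iterations (seed points) n = 2 / alpha, truncated. */
  static int outerIterations(double alpha) {
    return (int) (2.0 / alpha);
  }

  /** Number of inner iterations per seed point m = (2 / alpha)^r * ln 4, truncated. */
  static int innerIterations(double alpha, int r) {
    return (int) (Math.pow(2.0 / alpha, r) * Math.log(4.0));
  }
}
```

        Since m grows exponentially in r, even a moderate dimensionality can lead to a large number of inner iterations.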
      • findNeighbors

        protected DBIDs findNeighbors(DBIDRef q,
                                      long[] nD,
                                      ArrayModifiableDBIDs S,
                                      Relation<? extends NumberVector> relation)
        Find the neighbors of point q in the given subspace.
        q - Query point
        nD - Subspace mask
        S - Remaining data points
        relation - Data relation
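
        To illustrate what a neighbor search restricted to a subspace mask might look like, here is a minimal sketch on plain arrays. It assumes that a neighbor is a point lying within the half width w of q in every dimension selected by the mask (bound treated as inclusive), and that the long[] mask stores 64 dimensions per word. SubspaceNeighborsSketch and its methods are hypothetical, and points are identified by their row index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the documented API.
final class SubspaceNeighborsSketch {
  /** True if bit dim is set in the mask (64 dimensions per long word). */
  static boolean isSet(long[] mask, int dim) {
    return (mask[dim >>> 6] & (1L << (dim & 63))) != 0;
  }

  /** Row indexes of all points within w of q in every masked dimension. */
  static List<Integer> findNeighbors(double[] q, long[] mask, double[][] data, double w) {
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < data.length; i++) {
      boolean inside = true;
      for (int dim = 0; dim < q.length && inside; dim++) {
        // Dimensions outside the mask are ignored entirely.
        if (isSet(mask, dim) && Math.abs(data[i][dim] - q[dim]) > w) {
          inside = false;
        }
      }
      if (inside) {
        result.add(i);
      }
    }
    return result;
  }
}
```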
      • dimensionIsRelevant

        protected boolean dimensionIsRelevant(int dimension,
                                              Relation<? extends NumberVector> relation,
                                              DBIDs points)
        Utility method to test if a given dimension is relevant as determined via a set of reference points (i.e. if the variance along the attribute is lower than the threshold).
        dimension - the dimension to test.
        relation - used to get actual values for DBIDs.
        points - the points to test.
        true if the dimension is relevant.
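
        The description does not spell out the threshold test. One simple reading, assumed here and not confirmed by the text above, is that the spread (maximum minus minimum) of the reference points along the dimension must not exceed the half width w. RelevanceSketch is a hypothetical name, and points are rows of a plain array selected by index.

```java
// Hypothetical sketch; the spread test is an assumed interpretation.
final class RelevanceSketch {
  /** True if the selected points span at most w along the given dimension. */
  static boolean dimensionIsRelevant(int dimension, double[][] data, int[] points, double w) {
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    for (int id : points) {
      double v = data[id][dimension];
      min = Math.min(min, v);
      max = Math.max(max, v);
      if (max - min > w) {
        return false; // spread already too large, stop early
      }
    }
    return true;
  }
}
```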
      • makeCluster

        protected Cluster<SubspaceModel> makeCluster(Relation<? extends NumberVector> relation,
                                                     DBIDs C,
                                                     long[] D)
        Utility method to create a subspace cluster from a list of DBIDs and the relevant attributes.
        relation - to compute a centroid.
        C - the cluster points.
        D - the relevant dimensions.
        an object representing the subspace cluster.
      • computeClusterQuality

        protected double computeClusterQuality(int clusterSize,
                                               int numRelevantDimensions)
        Computes the quality of a cluster based on its size and number of relevant attributes, as described via the μ-function from the paper.
        clusterSize - the size of the cluster.
        numRelevantDimensions - the number of dimensions relevant to the cluster.
        a quality measure (only use this to compare the quality to that of other clusters).
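
        As a small worked example, assuming the μ-function has the form usually given for DOC, μ(a, b) = a · (1/β)^b with a the cluster size and b the number of relevant dimensions, the quality could be computed as below. QualitySketch is a hypothetical name, and β is passed explicitly here rather than read from the beta field.

```java
// Hypothetical sketch; the formula is the commonly cited form of the mu-function.
final class QualitySketch {
  /** mu(a, b) = a * (1 / beta)^b; larger values indicate a better cluster. */
  static double quality(int clusterSize, int numRelevantDimensions, double beta) {
    return clusterSize * Math.pow(1.0 / beta, numRelevantDimensions);
  }
}
```

        With β = 0.25, a cluster of 10 points in 2 relevant dimensions scores 10 · 4² = 160 and thus beats a cluster of 30 points in 1 relevant dimension, which scores 30 · 4 = 120. This shows how a smaller β favors dimensionality over size.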