Class CASH

  • All Implemented Interfaces:
    Algorithm, ClusteringAlgorithm<Clustering<Model>>

    @Title("CASH: Robust clustering in arbitrarily oriented subspaces")
    @Description("Subspace clustering algorithm based on the Hough transform.")
    @Reference(authors="Elke Achtert, Christian B\u00f6hm, J\u00f6rn David, Peer Kr\u00f6ger, Arthur Zimek",
               title="Robust clustering in arbitrarily oriented subspaces",
               booktitle="Proc. 8th SIAM Int. Conf. on Data Mining (SDM\'08)",
               url="https://doi.org/10.1137/1.9781611972788.69",
               bibkey="DBLP:conf/sdm/AchtertBDKZ08")
    public class CASH
    extends java.lang.Object
    implements ClusteringAlgorithm<Clustering<Model>>
    The CASH algorithm is a subspace clustering algorithm based on the Hough transform.

    Reference:

    Elke Achtert, Christian Böhm, Jörn David, Peer Kröger, Arthur Zimek
    Robust clustering in arbitrarily oriented subspaces.
    In Proc. 8th SIAM Int. Conf. on Data Mining (SDM'08)

    Since:
    0.1
    Author:
    Elke Achtert
    • Field Detail

      • LOG

        private static final Logging LOG
        The logger for this class.
      • minPts

        protected int minPts
        Threshold for minimum number of points in a cluster
      • maxLevel

        protected int maxLevel
        Maximum level for splitting the hypercube.
      • minDim

        protected int minDim
        Minimum dimensionality of the subspaces to be found
      • jitter

        protected double jitter
        Maximum jitter for distance values.
      • adjust

        protected boolean adjust
        Apply adjustment heuristic for interval choosing.
      • noiseDim

        private int noiseDim
        Holds the dimensionality for noise.
      • processedIDs

        private ModifiableDBIDs processedIDs
        Holds a set of processed ids.
    • Constructor Detail

      • CASH

        public CASH​(int minPts,
                    int maxLevel,
                    int minDim,
                    double jitter,
                    boolean adjust)
        Constructor.
        Parameters:
        minPts - MinPts parameter
        maxLevel - Maximum level
        minDim - Minimum dimensionality
        jitter - Jitter
        adjust - Adjust
    • Method Detail

      • doRun

        private Clustering<Model> doRun​(Relation<ParameterizationFunction> relation,
                                        FiniteProgress progress)
        Runs the CASH algorithm on the specified database, this method is recursively called until only noise is left.
        Parameters:
        relation - the Relation to run the CASH algorithm on
        progress - the progress object for verbose messages
        Returns:
        a mapping of subspace dimensionalities to clusters
      • dimensionality

        private static int dimensionality​(Relation<ParameterizationFunction> relation)
        Get the dimensionality of a vector field.
        Parameters:
        relation - Relation
        Returns:
        Dimensionality
      • initHeap

        private void initHeap​(ObjectHeap<CASHInterval> heap,
                              Relation<ParameterizationFunction> relation,
                              int dim,
                              DBIDs ids)
        Initializes the heap with the root intervals.
        Parameters:
        heap - the heap to be initialized
        relation - the database storing the parameterization functions
        dim - the dimensionality of the database
        ids - the ids of the database
      • buildDB

        private MaterializedRelation<ParameterizationFunction> buildDB​(int dim,
                                                                       double[][] basis,
                                                                       DBIDs ids,
                                                                       Relation<ParameterizationFunction> relation)
        Builds a dim-1 dimensional database where the objects are projected into the specified subspace.
        Parameters:
        dim - the dimensionality of the database
        basis - the basis defining the subspace
        ids - the ids for the new database
        relation - the database storing the parameterization functions
        Returns:
        a dim-1 dimensional database where the objects are projected into the specified subspace
      • project

        private ParameterizationFunction project​(double[][] basis,
                                                 ParameterizationFunction f)
        Projects the specified parameterization function into the subspace described by the given basis.
        Parameters:
        basis - the basis defining he subspace
        f - the parameterization function to be projected
        Returns:
        the projected parameterization function
      • determineBasis

        private double[][] determineBasis​(double[] alpha)
        Determines a basis defining a subspace described by the specified alpha values.
        Parameters:
        alpha - the alpha values
        Returns:
        a basis defining a subspace described by the specified alpha values
      • determineNextIntervalAtMaxLevel

        private CASHInterval determineNextIntervalAtMaxLevel​(ObjectHeap<CASHInterval> heap)
        Determines the next ''best'' interval at maximum level, i.e. the next interval containing the most unprocessed objects.
        Parameters:
        heap - the heap storing the intervals
        Returns:
        the next ''best'' interval at maximum level
      • doDetermineNextIntervalAtMaxLevel

        private CASHInterval doDetermineNextIntervalAtMaxLevel​(ObjectHeap<CASHInterval> heap)
        Recursive helper method to determine the next ''best'' interval at maximum level, i.e. the next interval containing the most unprocessed objects
        Parameters:
        heap - the heap storing the intervals
        Returns:
        the next ''best'' interval at maximum level
      • determineMinMaxDistance

        private double[] determineMinMaxDistance​(Relation<ParameterizationFunction> relation,
                                                 int dimensionality)
        Determines the minimum and maximum function value of all parameterization functions stored in the specified database.
        Parameters:
        relation - the database containing the parameterization functions.
        dimensionality - the dimensionality of the database
        Returns:
        an array containing the minimum and maximum function value of all parameterization functions stored in the specified database
      • runDerivator

        private double[][] runDerivator​(Relation<ParameterizationFunction> relation,
                                        int dim,
                                        CASHInterval interval,
                                        ModifiableDBIDs outids)
        Runs the derivator on the specified interval and assigns all points having a distance less then the standard deviation of the derivator model to the model to this model.
        Parameters:
        relation - the database containing the parameterization functions
        interval - the interval to build the model
        dim - the dimensionality of the database
        outids - an empty set to assign the ids
        Returns:
        a basis of the found subspace
      • runDerivator

        private LinearEquationSystem runDerivator​(Relation<ParameterizationFunction> relation,
                                                  int dimensionality,
                                                  DBIDs ids)
        Runs the derivator on the specified interval and assigns all points having a distance less then the standard deviation of the derivator model to the model to this model.
        Parameters:
        relation - the database containing the parameterization functions
        ids - the ids to build the model
        dimensionality - the dimensionality of the subspace
        Returns:
        a basis of the found subspace
      • buildDerivatorDB

        private Relation<DoubleVector> buildDerivatorDB​(Relation<ParameterizationFunction> relation,
                                                        DBIDs ids)
        Builds a database for the derivator consisting of the ids in the specified interval.
        Parameters:
        relation - the database storing the parameterization functions
        ids - the ids to build the database from
        Returns:
        a database for the derivator consisting of the ids in the specified interval
      • getInputTypeRestriction

        public TypeInformation[] getInputTypeRestriction()
        Description copied from interface: Algorithm
        Get the input type restriction used for negotiating the data query.
        Specified by:
        getInputTypeRestriction in interface Algorithm
        Returns:
        Type restriction