Class LMCLUS

  • All Implemented Interfaces:
    Algorithm, ClusteringAlgorithm<Clustering<Model>>

    @Reference(authors="R. Haralick, R. Harpaz",
               title="Linear manifold clustering in high dimensional spaces by stochastic search",
               booktitle="Pattern Recognition volume 40, Issue 10",
               url="https://doi.org/10.1016/j.patcog.2007.01.020",
               bibkey="DBLP:journals/pr/HaralickH07")
    public class LMCLUS
    extends java.lang.Object
    implements ClusteringAlgorithm<Clustering<Model>>
    Linear manifold clustering in high dimensional spaces by stochastic search.

    Reference:

    R. Haralick, R. Harpaz
    Linear manifold clustering in high dimensional spaces by stochastic search
    In: Pattern Recognition volume 40, Issue 10

    Implementation note: the LMCLUS algorithm seems to lack good stopping criterions. We can't entirely reproduce the good results from the original publication, in particular not on noisy data. But the questionable parts are as in the original publication, associated thesis and published source code. The minimum cluster size however can serve as a hidden stopping criterion.

    Since:
    0.5.0
    Author:
    Ernst Waas, Erich Schubert
    • Field Detail

      • LOG

        private static final Logging LOG
        The logger for this class.
      • LOG_NOT_FROM_ONE_CLUSTER_PROBABILITY

        private static final double LOG_NOT_FROM_ONE_CLUSTER_PROBABILITY
        Epsilon
      • sensitivityThreshold

        private final double sensitivityThreshold
        The current threshold value calculated by the findSeperation Method.
      • maxLMDim

        private final int maxLMDim
        Maximum cluster dimensionality
      • minsize

        private final int minsize
        Minimum cluster size
      • samplingLevel

        private final int samplingLevel
        Number of sampling rounds to find a good split
    • Constructor Detail

      • LMCLUS

        public LMCLUS​(int maxdim,
                      int minsize,
                      int samplingLevel,
                      double sensitivityThreshold,
                      RandomFactory rnd)
        Constructor.
        Parameters:
        maxdim - Maximum dimensionality
        minsize - Minimum cluster size
        samplingLevel - Sampling level
        sensitivityThreshold - Threshold
        rnd - Random factory
    • Method Detail

      • getInputTypeRestriction

        public TypeInformation[] getInputTypeRestriction()
        Description copied from interface: Algorithm
        Get the input type restriction used for negotiating the data query.
        Specified by:
        getInputTypeRestriction in interface Algorithm
        Returns:
        Type restriction
      • run

        public Clustering<Model> run​(Relation<? extends NumberVector> relation)
        The main LMCLUS (Linear manifold clustering algorithm) is processed in this method.

        The algorithm samples random linear manifolds and tries to find clusters in it. It calculates a distance histogram searches for a threshold and partitions the points in two groups the ones in the cluster and everything else. Then the best fitting linear manifold is searched and registered as a cluster. The process is started over until all points are clustered. The last cluster should contain all the outliers (or the whole data if no clusters have been found.)

        Parameters:
        relation - Relation
        Returns:
        Clustering result
      • deviation

        private double deviation​(double[] delta,
                                 double[][] beta)
        Deviation from a manifold described by beta.
        Parameters:
        delta - Delta from origin vector
        beta - Manifold
        Returns:
        Deviation score
      • findSeparation

        private LMCLUS.Separation findSeparation​(Relation<? extends NumberVector> relation,
                                                 DBIDs currentids,
                                                 int dimension,
                                                 java.util.Random r)
        This method samples a number of linear manifolds an tries to determine which the one with the best cluster is.
         A number of sample points according to the dimension of the linear manifold are taken.
         The basis (B) and the origin(o) of the manifold are calculated.
         A distance histogram using  the distance function ||x-o|| -||B^t*(x-o)|| is generated.
         The best threshold is searched using the elevate threshold function.
         The overall goodness of the threshold is determined.
         The process is redone until a specific number of samples is taken.
         
        Parameters:
        relation - The vector relation
        currentids - Current DBIDs
        dimension - the dimension of the linear manifold to sample.
        r - Random generator
        Returns:
        the overall goodness of the separation. The values origin basis and threshold are returned indirectly over class variables.
      • generateOrthonormalBasis

        private double[][] generateOrthonormalBasis​(java.util.List<double[]> vectors)
        This Method generates an orthonormal basis from a set of Vectors. It uses the established Gram-Schmidt algorithm for orthonormalisation:
         u_1 = v_1
         u_k = v_k -proj_u1(v_k)...proj_u(k-1)(v_k);
        
         Where proj_u(v) = <v,u>/<u,u> *u
         
        Parameters:
        vectors - The set of vectors to generate the orthonormal basis from
        Returns:
        the orthonormal basis generated by this method.
        Throws:
        java.lang.RuntimeException - if the given vectors are not linear independent.
      • findAndEvaluateThreshold

        private double[] findAndEvaluateThreshold​(DoubleDynamicHistogram histogram)
        Evaluate the histogram to find a suitable threshold
        Parameters:
        histogram - Histogram to evaluate
        Returns:
        Position and goodness