Class BetulaGMM

  • All Implemented Interfaces:
    Algorithm, ClusteringAlgorithm<Clustering<EMModel>>
    Direct Known Subclasses:
    BetulaGMMWeighted

    @Reference(authors="Andreas Lang and Erich Schubert",
               title="BETULA: Fast Clustering of Large Data with Improved BIRCH CF-Trees",
               booktitle="Information Systems",
               url="https://doi.org/10.1016/j.is.2021.101918",
               bibkey="DBLP:journals/is/LangS22")
    public class BetulaGMM
    extends java.lang.Object
    implements ClusteringAlgorithm<Clustering<EMModel>>
    Clustering by expectation maximization (EM-Algorithm), also known as Gaussian Mixture Modeling (GMM), with optional MAP regularization. This version uses the BIRCH cluster feature centers only for responsibility estimation; the CF variances are only used for computing the models.

    Reference:

    Andreas Lang and Erich Schubert
    BETULA: Fast Clustering of Large Data with Improved BIRCH CF-Trees
    Information Systems

    Since:
    0.8.0
    Author:
    Andreas Lang
    • Field Detail

      • LOG

        private static final Logging LOG
        Class logger.
      • k

        int k
        Number of cluster centers to initialize.
      • delta

        private double delta
        Delta parameter: convergence threshold on the change in log-likelihood between iterations.
      • maxiter

        int maxiter
        Maximum number of iterations.
      • prior

        private double prior
        Prior to enable MAP estimation (use 0 for MLE)
      • soft

        private boolean soft
        Retain soft assignments.
      • MIN_LOGLIKELIHOOD

        protected static final double MIN_LOGLIKELIHOOD
        Minimum log-likelihood, used as a finite floor to avoid -infinity.
        See Also:
        Constant Field Values
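
        As a minimal sketch of why such a floor is useful: the logarithm of a zero density is -infinity, which would poison any sum of log-likelihoods. The class name LogLikelihoodFloor, the method safeLog, and the floor value used below are illustrative assumptions, not part of ELKI; the actual constant is the one listed under Constant Field Values.

```java
/** Hypothetical helper (not part of ELKI): clamp log densities to a finite floor. */
final class LogLikelihoodFloor {
  // Assumed value, for illustration only.
  static final double MIN_LOGLIKELIHOOD = -100000.;

  /** Returns log(density), but never less than MIN_LOGLIKELIHOOD. */
  static double safeLog(double density) {
    double v = Math.log(density);
    // Math.log(0) is -Infinity and Math.log of a negative value is NaN;
    // the comparison is false in both cases, so the floor is returned.
    return v > MIN_LOGLIKELIHOOD ? v : MIN_LOGLIKELIHOOD;
  }

  public static void main(String[] args) {
    System.out.println(safeLog(0.5)); // ordinary case: plain logarithm
    System.out.println(safeLog(0.0)); // degenerate case: the floor value
  }
}
```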
    • Constructor Detail

      • BetulaGMM

        public BetulaGMM(CFTree.Factory<?> cffactory,
                         double delta,
                         int k,
                         int maxiter,
                         boolean soft,
                         BetulaClusterModelFactory<?> initialization,
                         double prior)
        Constructor.
        Parameters:
        cffactory - CFTree factory
        delta - Delta parameter
        k - Number of clusters
        maxiter - Maximum number of iterations
        soft - Return soft clustering results
        initialization - Initialization method
        prior - MAP prior
    • Method Detail

      • getInputTypeRestriction

        public TypeInformation[] getInputTypeRestriction()
        Description copied from interface: Algorithm
        Get the input type restriction used for negotiating the data query.
        Specified by:
        getInputTypeRestriction in interface Algorithm
        Returns:
        Type restriction
      • isSoft

        private boolean isSoft()
      • assignProbabilitiesToInstances

        public double assignProbabilitiesToInstances(java.util.ArrayList<? extends ClusterFeature> cfs,
                                                     java.util.List<? extends BetulaClusterModel> models,
                                                     java.util.Map<ClusterFeature,double[]> probClusterIGivenX)
        Assigns the current probability values to the cluster features and computes the expectation value of the current mixture of distributions.

        Computed as the sum of the logarithms of the prior probability of each instance.

        Parameters:
        cfs - the cluster features to evaluate
        models - Cluster models
        probClusterIGivenX - Output storage for cluster probabilities
        Returns:
        the expectation value of the current mixture of distributions
      • assignProbabilitiesToInstances

        public double assignProbabilitiesToInstances(Relation<? extends NumberVector> relation,
                                                     java.util.List<? extends BetulaClusterModel> models,
                                                     WritableDataStore<double[]> probClusterIGivenX)
        Assigns the current probability values to the instances in the database and computes the expectation value of the current mixture of distributions.

        Computed as the sum of the logarithms of the prior probability of each instance.

        Parameters:
        relation - the database used for assignment to instances
        models - Cluster models
        probClusterIGivenX - Output storage for cluster probabilities
        Returns:
        the expectation value of the current mixture of distributions
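
        The responsibility assignment described for both overloads can be sketched as follows. This is a simplified illustration, not the ELKI implementation: it assumes that each row already holds, per cluster, the log of (cluster weight times density), and it normalizes the row with the log-sum-exp trick. The class EStepSketch and its methods are hypothetical names; how ELKI aggregates the per-instance values (for example, weighting by cluster feature size) is not shown here.

```java
/** Hypothetical E-step sketch (not the ELKI code). */
final class EStepSketch {
  /**
   * Turns per-cluster log joint densities into responsibilities, in place.
   * Returns the log-likelihood of the instance, i.e. log(sum_k exp(logJoint[k])).
   */
  static double normalize(double[] logJoint) {
    double max = Double.NEGATIVE_INFINITY;
    for (double v : logJoint) {
      max = Math.max(max, v);
    }
    if (max == Double.NEGATIVE_INFINITY) {
      // Every cluster has zero density: fall back to uniform responsibilities.
      for (int i = 0; i < logJoint.length; i++) {
        logJoint[i] = 1.0 / logJoint.length;
      }
      return max;
    }
    // Subtracting the maximum keeps exp() from underflowing to zero.
    double sum = 0.0;
    for (double v : logJoint) {
      sum += Math.exp(v - max);
    }
    double logSum = max + Math.log(sum);
    for (int i = 0; i < logJoint.length; i++) {
      logJoint[i] = Math.exp(logJoint[i] - logSum);
    }
    return logSum;
  }

  /** Normalizes every row and returns the sum of the per-instance log-likelihoods. */
  static double logLikelihood(double[][] rows) {
    double total = 0.0;
    for (double[] row : rows) {
      total += normalize(row);
    }
    return total;
  }
}
```

        After the call, each row sums to one and plays the role of the probClusterIGivenX entry for that instance.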
      • recomputeCovarianceMatrices

        public void recomputeCovarianceMatrices(java.util.ArrayList<? extends ClusterFeature> cfs,
                                                java.util.Map<ClusterFeature,double[]> probClusterIGivenX,
                                                java.util.List<? extends BetulaClusterModel> models,
                                                double prior,
                                                int n)
        Recompute the covariance matrices.
        Parameters:
        cfs - Cluster features to evaluate
        probClusterIGivenX - Object probabilities
        models - Cluster models to update
        prior - MAP prior (use 0 for MLE)
        n - data set size
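
        To illustrate how a model can be re-estimated from cluster features rather than from individual points, the following one-dimensional sketch combines per-CF summaries (size, mean, variance) using the law of total variance. It is an assumption-laden simplification, not the ELKI implementation: it handles a single cluster and a single dimension, treats each CF variance as a population variance, and omits the MAP prior entirely. The class CfMStepSketch and its update method are hypothetical names.

```java
/** Hypothetical one-dimensional M-step sketch over cluster features (not the ELKI code). */
final class CfMStepSketch {
  /**
   * @param n    number of points summarized by each cluster feature
   * @param mean mean of each cluster feature
   * @param var  population variance within each cluster feature
   * @param resp responsibility of the cluster for each cluster feature
   * @return array {total weight, mean, variance} of the updated cluster
   */
  static double[] update(double[] n, double[] mean, double[] var, double[] resp) {
    double wsum = 0.0, m = 0.0;
    for (int i = 0; i < n.length; i++) {
      double w = resp[i] * n[i]; // each CF counts as n[i] points
      wsum += w;
      m += w * mean[i];
    }
    m /= wsum;
    double v = 0.0;
    for (int i = 0; i < n.length; i++) {
      double w = resp[i] * n[i];
      double d = mean[i] - m;
      // Within-CF variance plus the squared offset of the CF mean.
      v += w * (var[i] + d * d);
    }
    v /= wsum;
    return new double[] { wsum, m, v };
  }
}
```

        Dropping the var[i] term would treat every cluster feature as a single point at its center and underestimate the spread, which is why the CF variances enter the model computation.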