Class EM<O,​M extends MeanModel>

  • Type Parameters:
    O - object type to analyze
    M - model type to produce
    All Implemented Interfaces:
    Algorithm, ClusteringAlgorithm<Clustering<M>>

    @Title("EM-Clustering: Clustering by Expectation Maximization")
    @Description("Cluster data via Gaussian mixture modeling and the EM algorithm")
    @Reference(authors="A. P. Dempster, N. M. Laird, D. B. Rubin",title="Maximum Likelihood from Incomplete Data via the EM algorithm",booktitle="Journal of the Royal Statistical Society, Series B, 39(1)",url="http://www.jstor.org/stable/2984875",bibkey="journals/jroyastatsocise2/DempsterLR77") @Reference(title="Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering",authors="C. Fraley, A. E. Raftery",booktitle="J. Classification 24(2)",url="https://doi.org/10.1007/s00357-007-0004-5",bibkey="DBLP:journals/classification/FraleyR07")
    @Priority(200)
    public class EM<O,​M extends MeanModel>
    extends java.lang.Object
    implements ClusteringAlgorithm<Clustering<M>>
    Clustering by expectation maximization (EM-Algorithm), also known as Gaussian Mixture Modeling (GMM), with optional MAP regularization.

    Reference:

    A. P. Dempster, N. M. Laird, D. B. Rubin:
    Maximum Likelihood from Incomplete Data via the EM algorithm.
    Journal of the Royal Statistical Society, Series B, 39(1), 1977, pp. 1-31

    The MAP estimation is derived from

    C. Fraley and A. E. Raftery
    Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering
    J. Classification 24(2)

    Since:
    0.1
    Author:
    Arthur Zimek, Erich Schubert
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  EM.Par<O,​M extends MeanModel>
      Parameterization class.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected double delta
      Delta parameter
      protected int k
      Number of clusters
      private static java.lang.String KEY
      Key for statistics logging.
      private static Logging LOG
      The logger for this class.
      protected int maxiter
      Maximum number of iterations to allow
      protected EMClusterModelFactory<? super O,​M> mfactory
      Factory for producing the initial cluster model.
      protected static double MIN_LOGLIKELIHOOD
      Minimum loglikelihood to avoid -infinity.
      protected int miniter
      Minimum number of iterations to do
      protected double prior
      Prior to enable MAP estimation (use 0 for MLE)
      protected boolean soft
      Retain soft assignments.
      static SimpleTypeInformation<double[]> SOFT_TYPE
      Soft assignment result type.
    • Constructor Summary

      Constructors 
      Constructor Description
      EM​(int k, double delta, EMClusterModelFactory<? super O,​M> mfactory)
      Constructor.
      EM​(int k, double delta, EMClusterModelFactory<? super O,​M> mfactory, int maxiter, boolean soft)
      Constructor.
      EM​(int k, double delta, EMClusterModelFactory<? super O,​M> mfactory, int maxiter, double prior, boolean soft)
      Constructor.
      EM​(int k, double delta, EMClusterModelFactory<? super O,​M> mfactory, int miniter, int maxiter, double prior, boolean soft)
      Constructor.
    • Field Detail

      • LOG

        private static final Logging LOG
        The logger for this class.
      • KEY

        private static final java.lang.String KEY
        Key for statistics logging.
      • k

        protected int k
        Number of clusters
      • delta

        protected double delta
        Delta parameter
      • miniter

        protected int miniter
        Minimum number of iterations to do
      • maxiter

        protected int maxiter
        Maximum number of iterations to allow
      • prior

        protected double prior
        Prior to enable MAP estimation (use 0 for MLE)
      • soft

        protected boolean soft
        Retain soft assignments.
      • MIN_LOGLIKELIHOOD

        protected static final double MIN_LOGLIKELIHOOD
        Minimum loglikelihood to avoid -infinity.
        See Also:
        Constant Field Values
    • Constructor Detail

      • EM

        public EM​(int k,
                  double delta,
                  EMClusterModelFactory<? super O,​M> mfactory)
        Constructor.
        Parameters:
        k - k parameter
        delta - delta parameter
        mfactory - EM cluster model factory
      • EM

        public EM​(int k,
                  double delta,
                  EMClusterModelFactory<? super O,​M> mfactory,
                  int maxiter,
                  boolean soft)
        Constructor.
        Parameters:
        k - k parameter
        delta - delta parameter
        mfactory - EM cluster model factory
        maxiter - Maximum number of iterations
        soft - Include soft assignments
      • EM

        public EM​(int k,
                  double delta,
                  EMClusterModelFactory<? super O,​M> mfactory,
                  int maxiter,
                  double prior,
                  boolean soft)
        Constructor.
        Parameters:
        k - k parameter
        delta - delta parameter
        mfactory - EM cluster model factory
        maxiter - Maximum number of iterations
        prior - MAP prior
        soft - Include soft assignments
      • EM

        public EM​(int k,
                  double delta,
                  EMClusterModelFactory<? super O,​M> mfactory,
                  int miniter,
                  int maxiter,
                  double prior,
                  boolean soft)
        Constructor.
        Parameters:
        k - k parameter
        delta - delta parameter
        mfactory - EM cluster model factory
        miniter - Minimum number of iterations
        maxiter - Maximum number of iterations
        prior - MAP prior
        soft - Include soft assignments
    • Method Detail

      • getInputTypeRestriction

        public TypeInformation[] getInputTypeRestriction()
        Description copied from interface: Algorithm
        Get the input type restriction used for negotiating the data query.
        Specified by:
        getInputTypeRestriction in interface Algorithm
        Returns:
        Type restriction
      • run

        public Clustering<M> run​(Relation<O> relation)
        Performs the EM clustering algorithm on the given database.

        Finally a hard clustering is provided where each clusters gets assigned the points exhibiting the highest probability to belong to this cluster. But still, the database objects hold associated the complete probability-vector for all models.

        Parameters:
        relation - Relation
        Returns:
        Clustering result
      • recomputeCovarianceMatrices

        public static <O> void recomputeCovarianceMatrices​(Relation<? extends O> relation,
                                                           WritableDataStore<double[]> probClusterIGivenX,
                                                           java.util.List<? extends EMClusterModel<? super O,​?>> models,
                                                           double prior)
        Recompute the covariance matrixes.
        Type Parameters:
        O - Object type
        Parameters:
        relation - Vector data
        probClusterIGivenX - Object probabilities
        models - Cluster models to update
        prior - MAP prior (use 0 for MLE)
      • assignProbabilitiesToInstances

        public static <O> double assignProbabilitiesToInstances​(Relation<? extends O> relation,
                                                                java.util.List<? extends EMClusterModel<? super O,​?>> models,
                                                                WritableDataStore<double[]> probClusterIGivenX,
                                                                WritableDoubleDataStore loglikelihoods)
        Assigns the current probability values to the instances in the database and compute the expectation value of the current mixture of distributions.

        Computed as the sum of the logarithms of the prior probability of each instance.

        Type Parameters:
        O - Object type
        Parameters:
        relation - the database used for assignment to instances
        models - Cluster models
        probClusterIGivenX - Output storage for cluster probabilities
        loglikelihoods - Per-object log likelihood, for EM Outlier; may be null if not used
        Returns:
        the expectation value of the current mixture of distributions
      • logSumExp

        public static double logSumExp​(double[] x)
        Compute log(sum(exp(x_i)), with attention to numerical issues.
        Parameters:
        x - Input
        Returns:
        Result
      • logSumExp

        protected static double logSumExp​(double a,
                                          double b)
        Compute log(exp(a)+exp(b)), with attention to numerical issues.
        Parameters:
        a - Input 1
        b - Input 2
        Returns:
        Result