Package elki.clustering.em
Class EM<O,M extends MeanModel>
- java.lang.Object
-
- elki.clustering.em.EM<O,M>
-
- Type Parameters:
O
- object type to analyzeM
- model type to produce
- All Implemented Interfaces:
Algorithm
,ClusteringAlgorithm<Clustering<M>>
@Title("EM-Clustering: Clustering by Expectation Maximization") @Description("Cluster data via Gaussian mixture modeling and the EM algorithm") @Reference(authors="A. P. Dempster, N. M. Laird, D. B. Rubin",title="Maximum Likelihood from Incomplete Data via the EM algorithm",booktitle="Journal of the Royal Statistical Society, Series B, 39(1)",url="http://www.jstor.org/stable/2984875",bibkey="journals/jroyastatsocise2/DempsterLR77") @Reference(title="Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering",authors="C. Fraley, A. E. Raftery",booktitle="J. Classification 24(2)",url="https://doi.org/10.1007/s00357-007-0004-5",bibkey="DBLP:journals/classification/FraleyR07") @Priority(200) public class EM<O,M extends MeanModel> extends java.lang.Object implements ClusteringAlgorithm<Clustering<M>>
Clustering by expectation maximization (EM-Algorithm), also known as Gaussian Mixture Modeling (GMM), with optional MAP regularization.Reference:
A. P. Dempster, N. M. Laird, D. B. Rubin:
Maximum Likelihood from Incomplete Data via the EM algorithm.
Journal of the Royal Statistical Society, Series B, 39(1), 1977, pp. 1-31The MAP estimation is derived from
C. Fraley and A. E. Raftery
Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering
J. Classification 24(2)- Since:
- 0.1
- Author:
- Arthur Zimek, Erich Schubert
-
-
Field Summary
Fields Modifier and Type Field Description protected double
delta
Delta parameterprotected int
k
Number of clustersprivate static java.lang.String
KEY
Key for statistics logging.private static Logging
LOG
The logger for this class.protected int
maxiter
Maximum number of iterations to allowprotected EMClusterModelFactory<? super O,M>
mfactory
Factory for producing the initial cluster model.protected static double
MIN_LOGLIKELIHOOD
Minimum loglikelihood to avoid -infinity.protected int
miniter
Minimum number of iterations to doprotected double
prior
Prior to enable MAP estimation (use 0 for MLE)protected boolean
soft
Retain soft assignments.static SimpleTypeInformation<double[]>
SOFT_TYPE
Soft assignment result type.
-
Constructor Summary
Constructors Constructor Description EM(int k, double delta, EMClusterModelFactory<? super O,M> mfactory)
Constructor.EM(int k, double delta, EMClusterModelFactory<? super O,M> mfactory, int maxiter, boolean soft)
Constructor.EM(int k, double delta, EMClusterModelFactory<? super O,M> mfactory, int maxiter, double prior, boolean soft)
Constructor.EM(int k, double delta, EMClusterModelFactory<? super O,M> mfactory, int miniter, int maxiter, double prior, boolean soft)
Constructor.
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description static <O> double
assignProbabilitiesToInstances(Relation<? extends O> relation, java.util.List<? extends EMClusterModel<? super O,?>> models, WritableDataStore<double[]> probClusterIGivenX, WritableDoubleDataStore loglikelihoods)
Assigns the current probability values to the instances in the database and compute the expectation value of the current mixture of distributions.TypeInformation[]
getInputTypeRestriction()
Get the input type restriction used for negotiating the data query.static double
logSumExp(double[] x)
Compute log(sum(exp(x_i)), with attention to numerical issues.protected static double
logSumExp(double a, double b)
Compute log(exp(a)+exp(b)), with attention to numerical issues.static <O> void
recomputeCovarianceMatrices(Relation<? extends O> relation, WritableDataStore<double[]> probClusterIGivenX, java.util.List<? extends EMClusterModel<? super O,?>> models, double prior)
Recompute the covariance matrixes.Clustering<M>
run(Relation<O> relation)
Performs the EM clustering algorithm on the given database.-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface elki.clustering.ClusteringAlgorithm
autorun
-
-
-
-
Field Detail
-
LOG
private static final Logging LOG
The logger for this class.
-
KEY
private static final java.lang.String KEY
Key for statistics logging.
-
k
protected int k
Number of clusters
-
delta
protected double delta
Delta parameter
-
mfactory
protected EMClusterModelFactory<? super O,M extends MeanModel> mfactory
Factory for producing the initial cluster model.
-
miniter
protected int miniter
Minimum number of iterations to do
-
maxiter
protected int maxiter
Maximum number of iterations to allow
-
prior
protected double prior
Prior to enable MAP estimation (use 0 for MLE)
-
soft
protected boolean soft
Retain soft assignments.
-
MIN_LOGLIKELIHOOD
protected static final double MIN_LOGLIKELIHOOD
Minimum loglikelihood to avoid -infinity.- See Also:
- Constant Field Values
-
SOFT_TYPE
public static final SimpleTypeInformation<double[]> SOFT_TYPE
Soft assignment result type.
-
-
Constructor Detail
-
EM
public EM(int k, double delta, EMClusterModelFactory<? super O,M> mfactory)
Constructor.- Parameters:
k
- k parameterdelta
- delta parametermfactory
- EM cluster model factory
-
EM
public EM(int k, double delta, EMClusterModelFactory<? super O,M> mfactory, int maxiter, boolean soft)
Constructor.- Parameters:
k
- k parameterdelta
- delta parametermfactory
- EM cluster model factorymaxiter
- Maximum number of iterationssoft
- Include soft assignments
-
EM
public EM(int k, double delta, EMClusterModelFactory<? super O,M> mfactory, int maxiter, double prior, boolean soft)
Constructor.- Parameters:
k
- k parameterdelta
- delta parametermfactory
- EM cluster model factorymaxiter
- Maximum number of iterationsprior
- MAP priorsoft
- Include soft assignments
-
EM
public EM(int k, double delta, EMClusterModelFactory<? super O,M> mfactory, int miniter, int maxiter, double prior, boolean soft)
Constructor.- Parameters:
k
- k parameterdelta
- delta parametermfactory
- EM cluster model factoryminiter
- Minimum number of iterationsmaxiter
- Maximum number of iterationsprior
- MAP priorsoft
- Include soft assignments
-
-
Method Detail
-
getInputTypeRestriction
public TypeInformation[] getInputTypeRestriction()
Description copied from interface:Algorithm
Get the input type restriction used for negotiating the data query.- Specified by:
getInputTypeRestriction
in interfaceAlgorithm
- Returns:
- Type restriction
-
run
public Clustering<M> run(Relation<O> relation)
Performs the EM clustering algorithm on the given database.Finally a hard clustering is provided where each clusters gets assigned the points exhibiting the highest probability to belong to this cluster. But still, the database objects hold associated the complete probability-vector for all models.
- Parameters:
relation
- Relation- Returns:
- Clustering result
-
recomputeCovarianceMatrices
public static <O> void recomputeCovarianceMatrices(Relation<? extends O> relation, WritableDataStore<double[]> probClusterIGivenX, java.util.List<? extends EMClusterModel<? super O,?>> models, double prior)
Recompute the covariance matrixes.- Type Parameters:
O
- Object type- Parameters:
relation
- Vector dataprobClusterIGivenX
- Object probabilitiesmodels
- Cluster models to updateprior
- MAP prior (use 0 for MLE)
-
assignProbabilitiesToInstances
public static <O> double assignProbabilitiesToInstances(Relation<? extends O> relation, java.util.List<? extends EMClusterModel<? super O,?>> models, WritableDataStore<double[]> probClusterIGivenX, WritableDoubleDataStore loglikelihoods)
Assigns the current probability values to the instances in the database and compute the expectation value of the current mixture of distributions.Computed as the sum of the logarithms of the prior probability of each instance.
- Type Parameters:
O
- Object type- Parameters:
relation
- the database used for assignment to instancesmodels
- Cluster modelsprobClusterIGivenX
- Output storage for cluster probabilitiesloglikelihoods
- Per-object log likelihood, for EM Outlier; may benull
if not used- Returns:
- the expectation value of the current mixture of distributions
-
logSumExp
public static double logSumExp(double[] x)
Compute log(sum(exp(x_i)), with attention to numerical issues.- Parameters:
x
- Input- Returns:
- Result
-
logSumExp
protected static double logSumExp(double a, double b)
Compute log(exp(a)+exp(b)), with attention to numerical issues.- Parameters:
a
- Input 1b
- Input 2- Returns:
- Result
-
-