Class BIRCHLloydKMeans

  • All Implemented Interfaces:
    Algorithm, ClusteringAlgorithm<Clustering<MeanModel>>

    @Reference(authors="T. Zhang, R. Ramakrishnan, M. Livny",title="BIRCH: An Efficient Data Clustering Method for Very Large Databases",booktitle="Proc. 1996 ACM SIGMOD International Conference on Management of Data",url="https://doi.org/10.1145/233269.233324",bibkey="DBLP:conf/sigmod/ZhangRL96") @Reference(authors="T. Zhang, R. Ramakrishnan, M. Livny",title="BIRCH: A New Data Clustering Algorithm and Its Applications",booktitle="Data Min. Knowl. Discovery",url="https://doi.org/10.1023/A:1009783824328",bibkey="DBLP:journals/datamine/ZhangRL97")
    public class BIRCHLloydKMeans
    extends java.lang.Object
    implements ClusteringAlgorithm<Clustering<MeanModel>>
    BIRCH-based clustering algorithm that simply treats the leafs of the CFTree as clusters.

    References:

    T. Zhang, R. Ramakrishnan, M. Livny
    BIRCH: An Efficient Data Clustering Method for Very Large Databases Proc. 1996 ACM SIGMOD International Conference on Management of Data

    T. Zhang, R. Ramakrishnan, M. Livny
    BIRCH: A New Data Clustering Algorithm and Its Applications Data. Min. Knowl. Discovery

    Since:
    0.7.5
    Author:
    Erich Schubert
    • Field Detail

      • LOG

        private static final Logging LOG
        Class logger.
      • k

        int k
        Number of cluster centers to initialize.
      • maxiter

        int maxiter
        Maximum number of iterations.
    • Constructor Detail

      • BIRCHLloydKMeans

        public BIRCHLloydKMeans​(CFTree.Factory cffactory,
                                int k,
                                int maxiter,
                                BIRCHKMeansPlusPlus initialization)
        Constructor.
        Parameters:
        cffactory - CFTree factory
        k - Number of clusters
        maxiter - Maximum number of iterations
        initialization - Initialization method for k-means
    • Method Detail

      • getInputTypeRestriction

        public TypeInformation[] getInputTypeRestriction()
        Description copied from interface: Algorithm
        Get the input type restriction used for negotiating the data query.
        Specified by:
        getInputTypeRestriction in interface Algorithm
        Returns:
        Type restriction
      • kmeans

        private double[][] kmeans​(double[][] cfmeans,
                                  ClusteringFeature[] cfs,
                                  int[] assignment,
                                  int[] weights)
        Perform k-means clustering.
        Parameters:
        cfmeans - Cluster feature means
        cfs - Cluster features
        assignment - Cluster assignment of each CF
        weights - Cluster weight output
        Returns:
        Cluster means
      • means

        private double[][] means​(int[] assignment,
                                 double[][] means,
                                 ClusteringFeature[] cfs,
                                 int[] weights)
        Calculate means of clusters.
        Parameters:
        assignment - Cluster assignment
        means - Means of clusters
        cfs - Clustering features
        Returns:
        Means of clusters.
      • assignToNearestCluster

        private int assignToNearestCluster​(int[] assignment,
                                           double[][] means,
                                           double[][] cfmeans,
                                           ClusteringFeature[] cfs,
                                           int[] weights)
        Assign each element to nearest cluster.
        Parameters:
        assignment - Current cluster assignment
        means - k-means cluster means
        cfmeans - Cluster feature means
        cfs - Cluster features
        weights - Cluster weights (output)
        Returns:
        Number of reassigned elements
      • distance

        protected double distance​(NumberVector x,
                                  double[] y)
        Compute a distance (and count the distance computations).
        Parameters:
        x - First object
        y - Second object
        Returns:
        Distance
      • distance

        protected double distance​(double[] x,
                                  double[] y)
        Compute a distance (and count the distance computations).
        Parameters:
        x - First object
        y - Second object
        Returns:
        Distance
      • calculateVariances

        private double[] calculateVariances​(int[] assignment,
                                            double[][] means,
                                            ClusteringFeature[] cfs,
                                            int[] weights)
        Calculate variance of clusters based on clustering features.

        The result is only correct after updating the means!

        Parameters:
        assignment - Cluster assignment of CFs
        means - Cluster means
        cfs - CF leaves
        weights - Cluster weights
        Returns:
        Per-cluster variances