Class CFTree<L extends ClusterFeature>


  • @Reference(authors="T. Zhang, R. Ramakrishnan, M. Livny",title="BIRCH: An Efficient Data Clustering Method for Very Large Databases",booktitle="Proc. 1996 ACM SIGMOD International Conference on Management of Data",url="https://doi.org/10.1145/233269.233324",bibkey="DBLP:conf/sigmod/ZhangRL96") @Reference(authors="T. Zhang, R. Ramakrishnan, M. Livny",title="BIRCH: A New Data Clustering Algorithm and Its Applications",booktitle="Data Min. Knowl. Discovery",url="https://doi.org/10.1023/A:1009783824328",bibkey="DBLP:journals/datamine/ZhangRL97") @Reference(authors="Andreas Lang and Erich Schubert",title="BETULA: Numerically Stable CF-Trees for BIRCH Clustering",booktitle="Int. Conf on Similarity Search and Applications",url="https://doi.org/10.1007/978-3-030-60936-8_22",bibkey="DBLP:conf/sisap/LangS20") @Reference(authors="Andreas Lang and Erich Schubert",title="BETULA: Fast Clustering of Large Data with Improved BIRCH CF-Trees",booktitle="Information Systems",url="https://doi.org/10.1016/j.is.2021.101918",bibkey="DBLP:journals/is/LangS22")
    public class CFTree<L extends ClusterFeature>
    extends java.lang.Object
    Partial implementation of the CFTree as used by BIRCH and BETULA.

    Important differences:

    1. Leaf nodes and directory nodes have the same capacity
    2. Condensing and memory limits are not implemented
    3. Merging refinement (merge-resplit) is not implemented
    Because we want to be able to track the cluster assignments of all data points easily, we need to store the point IDs, and it is not possible to implement the originally proposed page size management at the same time.

    Condensing and merging refinement are possible, and improvements to this code are welcome - please send a pull request!

    References:

    T. Zhang, R. Ramakrishnan, M. Livny
    BIRCH: An Efficient Data Clustering Method for Very Large Databases Proc. 1996 ACM SIGMOD International Conference on Management of Data

    T. Zhang, R. Ramakrishnan, M. Livny
    BIRCH: A New Data Clustering Algorithm and Its Applications
    Data Min. Knowl. Discovery

    Andreas Lang and Erich Schubert
    BETULA: Numerically Stable CF-Trees for BIRCH Clustering
    Int. Conf on Similarity Search and Applications 2020

    Andreas Lang and Erich Schubert
    BETULA: Fast Clustering of Large Data with Improved BIRCH CF-Trees
    Information Systems

    Since:
    0.8.0
    Author:
    Erich Schubert, Andreas Lang
    • Field Detail

      • LOG

        public static final Logging LOG
        Class logger.
      • thresholdsq

        double thresholdsq
        Squared maximum radius threshold of a clustering feature.
      • capacity

        int capacity
        Capacity of a node.
      • leaves

        int leaves
        Leaf node counter.
      • maxleaves

        int maxleaves
        Maximum number of leaves allowed
      • rebuildstat

        int rebuildstat
        Number of tree rebuilds
      • diststat

        long diststat
        Number of distance calculations
      • absstat

        long absstat
        Number ob absorption calculations
    • Constructor Detail

      • CFTree

        public CFTree​(ClusterFeature.Factory<L> factory,
                      CFDistance dist,
                      CFDistance abs,
                      double threshold,
                      int capacity,
                      CFTree.Threshold tCriterium,
                      int maxleaves,
                      boolean storeIds)
        Constructor.
        Parameters:
        factory - Cluster feature factory
        dist - Distance function to choose nearest
        abs - Absorption criterion
        threshold - Distance threshold
        capacity - Maximum number of leaves
        tCriterium - threshold adjustment rule
        maxleaves - Maximum number of leaves
        storeIds - Store object ids
    • Method Detail

      • insert

        public void insert​(NumberVector nv,
                           DBIDRef dbid)
        Insert a data point into the tree.
        Parameters:
        nv - Object data
        dbid - Object id
      • rebuildTree

        protected void rebuildTree()
        Rebuild the CFTree to condense it to approximately half the size.
      • estimateThreshold

        private void estimateThreshold​(CFNode<L> current,
                                       java.util.ArrayList<L> cfs,
                                       double[] thresholds)
      • insert

        private CFNode<L> insert​(CFNode<L> node,
                                 NumberVector nv,
                                 DBIDRef dbid)
        Recursive insertion.
        Parameters:
        node - Current node
        nv - Object data
        dbid - Object id
        Returns:
        New sibling, if the node was split.
      • split

        private CFNode<L> split​(CFNode<L> node,
                                AsClusterFeature newchild)
        Split an overfull node.
        Parameters:
        node - Node to split
        newchild - Additional child
        Returns:
        New sibling of node
      • insert

        private CFNode<L> insert​(CFNode<L> node,
                                 AsClusterFeature nleaf)
        Recursive insertion.
        Parameters:
        node - Current node
        nleaf - Leaf entry to add.
        Returns:
        New sibling, if the node was split.
      • leafIterator

        public CFTree.LeafIterator<L> leafIterator()
        Get an iterator over the leaf nodes.
        Returns:
        Leaf node iterator.
      • getLeaves

        public java.util.ArrayList<L> getLeaves()
        Extract the leaves of the tree.
        Returns:
        Leaves
      • printDebug

        protected static java.lang.StringBuilder printDebug​(java.lang.StringBuilder buf,
                                                            ClusterFeature n,
                                                            int d)
        Utility function for debugging.
        Parameters:
        buf - Output buffer
        n - Current node
        d - Depth
        Returns:
        Output buffer
      • sqdistance

        private double sqdistance​(ClusterFeature cf1,
                                  ClusterFeature cf2)
        Updates statistics and calculates distance between two Cluster Features based on selected criteria.
        Parameters:
        cf1 - First Cluster Feature
        cf2 - Second Cluster Feature
        Returns:
        Distance
      • sqdistance

        private double sqdistance​(NumberVector nv,
                                  ClusterFeature cf)
        Updates statistics and calculates distance between a Number Vector and a Cluster Feature based on selected criteria.
        Parameters:
        nv - Number Vector
        cf - Cluster Feature
        Returns:
        Distance
      • sqabsorption

        private double sqabsorption​(ClusterFeature cf1,
                                    ClusterFeature cf2)
        Updates statistics and calculates distance between two Cluster Features based on selected criteria.
        Parameters:
        cf1 - First Cluster Feature
        cf2 - Second Cluster Feature
        Returns:
        Distance
      • sqabsorption

        private double sqabsorption​(NumberVector nv,
                                    ClusterFeature cf)
        Updates statistics and calculates distance between a Number Vector and a Cluster Feature based on selected criteria.
        Parameters:
        nv - Number Vector
        cf - Cluster Feature
        Returns:
        Distance
      • numLeaves

        public int numLeaves()
        Get the number of leaves in the tree.
        Returns:
        number of leaves
      • getRoot

        public CFNode<L> getRoot()
        Get the trees root node.
        Returns:
        root node
      • getCapacity

        public int getCapacity()
        Get the tree capacity
        Returns:
        tree capacity
      • getDBIDs

        public DBIDs getDBIDs​(ClusterFeature cf)
        Get the DBIDs of a cluster feature (if stored).
        Parameters:
        cf - Cluster feature
        Returns:
        Object ids