Class CFTree<L extends ClusterFeature>
- java.lang.Object
-
- elki.index.tree.betula.CFTree<L>
-
@Reference(authors="T. Zhang, R. Ramakrishnan, M. Livny",title="BIRCH: An Efficient Data Clustering Method for Very Large Databases",booktitle="Proc. 1996 ACM SIGMOD International Conference on Management of Data",url="https://doi.org/10.1145/233269.233324",bibkey="DBLP:conf/sigmod/ZhangRL96") @Reference(authors="T. Zhang, R. Ramakrishnan, M. Livny",title="BIRCH: A New Data Clustering Algorithm and Its Applications",booktitle="Data Min. Knowl. Discovery",url="https://doi.org/10.1023/A:1009783824328",bibkey="DBLP:journals/datamine/ZhangRL97") @Reference(authors="Andreas Lang and Erich Schubert",title="BETULA: Numerically Stable CF-Trees for BIRCH Clustering",booktitle="Int. Conf on Similarity Search and Applications",url="https://doi.org/10.1007/978-3-030-60936-8_22",bibkey="DBLP:conf/sisap/LangS20") @Reference(authors="Andreas Lang and Erich Schubert",title="BETULA: Fast Clustering of Large Data with Improved BIRCH CF-Trees",booktitle="Information Systems",url="https://doi.org/10.1016/j.is.2021.101918",bibkey="DBLP:journals/is/LangS22") public class CFTree<L extends ClusterFeature> extends java.lang.Object
Partial implementation of the CFTree as used by BIRCH and BETULA.Important differences:
- Leaf nodes and directory nodes have the same capacity
- Condensing and memory limits are not implemented
- Merging refinement (merge-resplit) is not implemented
Condensing and merging refinement are possible, and improvements to this code are welcome - please send a pull request!
References:
T. Zhang, R. Ramakrishnan, M. Livny
BIRCH: An Efficient Data Clustering Method for Very Large Databases Proc. 1996 ACM SIGMOD International Conference on Management of DataT. Zhang, R. Ramakrishnan, M. Livny
BIRCH: A New Data Clustering Algorithm and Its Applications
Data Min. Knowl. DiscoveryAndreas Lang and Erich Schubert
BETULA: Numerically Stable CF-Trees for BIRCH Clustering
Int. Conf on Similarity Search and Applications 2020Andreas Lang and Erich Schubert
BETULA: Fast Clustering of Large Data with Improved BIRCH CF-Trees
Information Systems- Since:
- 0.8.0
- Author:
- Erich Schubert, Andreas Lang
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
CFTree.Factory<L extends ClusterFeature>
CF-Tree Factory.static class
CFTree.LeafIterator<L extends ClusterFeature>
Iterator over leaf nodes.static class
CFTree.Threshold
Threshold update strategy.
-
Field Summary
Fields Modifier and Type Field Description (package private) CFDistance
abs
Absorption criterion(package private) long
absstat
Number ob absorption calculations(package private) int
capacity
Capacity of a node.(package private) CFDistance
dist
Cluster distance(package private) long
diststat
Number of distance calculations(package private) ClusterFeature.Factory<L>
factory
Cluster feature factory(package private) java.util.Map<ClusterFeature,ArrayModifiableDBIDs>
idmap
Stored leaf entry to dbid relation(package private) int
leaves
Leaf node counter.static Logging
LOG
Class logger.(package private) int
maxleaves
Maximum number of leaves allowed(package private) int
rebuildstat
Number of tree rebuilds(package private) CFNode<L>
root
Current root node.(package private) CFTree.Threshold
tCriterium
Threshold heuristic strategy.(package private) double
thresholdsq
Squared maximum radius threshold of a clustering feature.
-
Constructor Summary
Constructors Constructor Description CFTree(ClusterFeature.Factory<L> factory, CFDistance dist, CFDistance abs, double threshold, int capacity, CFTree.Threshold tCriterium, int maxleaves, boolean storeIds)
Constructor.
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description private void
estimateThreshold(CFNode<L> current, java.util.ArrayList<L> cfs, double[] thresholds)
L
findLeaf(NumberVector nv)
Find the leaf of a cluster, to get the final cluster assignment.private L
findLeaf(CFNode<L> node, NumberVector nv)
Find the leaf of a cluster, to get the final cluster assignment.int
getCapacity()
Get the tree capacityDBIDs
getDBIDs(ClusterFeature cf)
Get the DBIDs of a cluster feature (if stored).java.util.ArrayList<L>
getLeaves()
Extract the leaves of the tree.CFNode<L>
getRoot()
Get the trees root node.void
insert(NumberVector nv, DBIDRef dbid)
Insert a data point into the tree.private CFNode<L>
insert(CFNode<L> node, NumberVector nv, DBIDRef dbid)
Recursive insertion.private CFNode<L>
insert(CFNode<L> node, AsClusterFeature nleaf)
Recursive insertion.CFTree.LeafIterator<L>
leafIterator()
Get an iterator over the leaf nodes.int
numLeaves()
Get the number of leaves in the tree.protected static java.lang.StringBuilder
printDebug(java.lang.StringBuilder buf, ClusterFeature n, int d)
Utility function for debugging.protected void
rebuildTree()
Rebuild the CFTree to condense it to approximately half the size.private CFNode<L>
split(CFNode<L> node, AsClusterFeature newchild)
Split an overfull node.private double
sqabsorption(NumberVector nv, ClusterFeature cf)
Updates statistics and calculates distance between a Number Vector and a Cluster Feature based on selected criteria.private double
sqabsorption(ClusterFeature cf1, ClusterFeature cf2)
Updates statistics and calculates distance between two Cluster Features based on selected criteria.private double
sqdistance(NumberVector nv, ClusterFeature cf)
Updates statistics and calculates distance between a Number Vector and a Cluster Feature based on selected criteria.private double
sqdistance(ClusterFeature cf1, ClusterFeature cf2)
Updates statistics and calculates distance between two Cluster Features based on selected criteria.
-
-
-
Field Detail
-
LOG
public static final Logging LOG
Class logger.
-
factory
ClusterFeature.Factory<L extends ClusterFeature> factory
Cluster feature factory
-
dist
CFDistance dist
Cluster distance
-
abs
CFDistance abs
Absorption criterion
-
tCriterium
CFTree.Threshold tCriterium
Threshold heuristic strategy.
-
thresholdsq
double thresholdsq
Squared maximum radius threshold of a clustering feature.
-
capacity
int capacity
Capacity of a node.
-
root
CFNode<L extends ClusterFeature> root
Current root node.
-
leaves
int leaves
Leaf node counter.
-
maxleaves
int maxleaves
Maximum number of leaves allowed
-
rebuildstat
int rebuildstat
Number of tree rebuilds
-
diststat
long diststat
Number of distance calculations
-
absstat
long absstat
Number ob absorption calculations
-
idmap
java.util.Map<ClusterFeature,ArrayModifiableDBIDs> idmap
Stored leaf entry to dbid relation
-
-
Constructor Detail
-
CFTree
public CFTree(ClusterFeature.Factory<L> factory, CFDistance dist, CFDistance abs, double threshold, int capacity, CFTree.Threshold tCriterium, int maxleaves, boolean storeIds)
Constructor.- Parameters:
factory
- Cluster feature factorydist
- Distance function to choose nearestabs
- Absorption criterionthreshold
- Distance thresholdcapacity
- Maximum number of leavestCriterium
- threshold adjustment rulemaxleaves
- Maximum number of leavesstoreIds
- Store object ids
-
-
Method Detail
-
insert
public void insert(NumberVector nv, DBIDRef dbid)
Insert a data point into the tree.- Parameters:
nv
- Object datadbid
- Object id
-
rebuildTree
protected void rebuildTree()
Rebuild the CFTree to condense it to approximately half the size.
-
estimateThreshold
private void estimateThreshold(CFNode<L> current, java.util.ArrayList<L> cfs, double[] thresholds)
-
insert
private CFNode<L> insert(CFNode<L> node, NumberVector nv, DBIDRef dbid)
Recursive insertion.- Parameters:
node
- Current nodenv
- Object datadbid
- Object id- Returns:
- New sibling, if the node was split.
-
findLeaf
public L findLeaf(NumberVector nv)
Find the leaf of a cluster, to get the final cluster assignment.In contrast to
insert(elki.data.NumberVector, elki.database.ids.DBIDRef)
, this does not modify the tree.- Parameters:
nv
- Object data- Returns:
- Leaf this vector should be assigned to
-
findLeaf
private L findLeaf(CFNode<L> node, NumberVector nv)
Find the leaf of a cluster, to get the final cluster assignment.In contrast to
insert(elki.data.NumberVector, elki.database.ids.DBIDRef)
, this does not modify the tree.TODO: allow "outliers"?
- Parameters:
node
- Current nodenv
- Object data- Returns:
- Leaf this vector should be assigned to
-
split
private CFNode<L> split(CFNode<L> node, AsClusterFeature newchild)
Split an overfull node.- Parameters:
node
- Node to splitnewchild
- Additional child- Returns:
- New sibling of
node
-
insert
private CFNode<L> insert(CFNode<L> node, AsClusterFeature nleaf)
Recursive insertion.- Parameters:
node
- Current nodenleaf
- Leaf entry to add.- Returns:
- New sibling, if the node was split.
-
leafIterator
public CFTree.LeafIterator<L> leafIterator()
Get an iterator over the leaf nodes.- Returns:
- Leaf node iterator.
-
getLeaves
public java.util.ArrayList<L> getLeaves()
Extract the leaves of the tree.- Returns:
- Leaves
-
printDebug
protected static java.lang.StringBuilder printDebug(java.lang.StringBuilder buf, ClusterFeature n, int d)
Utility function for debugging.- Parameters:
buf
- Output buffern
- Current noded
- Depth- Returns:
- Output buffer
-
sqdistance
private double sqdistance(ClusterFeature cf1, ClusterFeature cf2)
Updates statistics and calculates distance between two Cluster Features based on selected criteria.- Parameters:
cf1
- First Cluster Featurecf2
- Second Cluster Feature- Returns:
- Distance
-
sqdistance
private double sqdistance(NumberVector nv, ClusterFeature cf)
Updates statistics and calculates distance between a Number Vector and a Cluster Feature based on selected criteria.- Parameters:
nv
- Number Vectorcf
- Cluster Feature- Returns:
- Distance
-
sqabsorption
private double sqabsorption(ClusterFeature cf1, ClusterFeature cf2)
Updates statistics and calculates distance between two Cluster Features based on selected criteria.- Parameters:
cf1
- First Cluster Featurecf2
- Second Cluster Feature- Returns:
- Distance
-
sqabsorption
private double sqabsorption(NumberVector nv, ClusterFeature cf)
Updates statistics and calculates distance between a Number Vector and a Cluster Feature based on selected criteria.- Parameters:
nv
- Number Vectorcf
- Cluster Feature- Returns:
- Distance
-
numLeaves
public int numLeaves()
Get the number of leaves in the tree.- Returns:
- number of leaves
-
getCapacity
public int getCapacity()
Get the tree capacity- Returns:
- tree capacity
-
getDBIDs
public DBIDs getDBIDs(ClusterFeature cf)
Get the DBIDs of a cluster feature (if stored).- Parameters:
cf
- Cluster feature- Returns:
- Object ids
-
-