Package elki.clustering.correlation
Class CASH
- java.lang.Object
-
- elki.clustering.correlation.CASH
-
- All Implemented Interfaces:
Algorithm
,ClusteringAlgorithm<Clustering<Model>>
@Title("CASH: Robust clustering in arbitrarily oriented subspaces") @Description("Subspace clustering algorithm based on the Hough transform.") @Reference(authors="Elke Achtert, Christian B\u00f6hm, J\u00f6rn David, Peer Kr\u00f6ger, Arthur Zimek", title="Robust clustering in arbitrarily oriented subspaces", booktitle="Proc. 8th SIAM Int. Conf. on Data Mining (SDM\'08)", url="https://doi.org/10.1137/1.9781611972788.69", bibkey="DBLP:conf/sdm/AchtertBDKZ08") public class CASH extends java.lang.Object implements ClusteringAlgorithm<Clustering<Model>>
The CASH algorithm is a subspace clustering algorithm based on the Hough transform.Reference:
Elke Achtert, Christian Böhm, Jörn David, Peer Kröger, Arthur Zimek
Robust clustering in arbitrarily oriented subspaces.
In Proc. 8th SIAM Int. Conf. on Data Mining (SDM'08)- Since:
- 0.1
- Author:
- Elke Achtert
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
CASH.Par
Parameterization class.-
Nested classes/interfaces inherited from interface elki.Algorithm
Algorithm.Utils
-
-
Field Summary
Fields Modifier and Type Field Description protected boolean
adjust
Apply adjustment heuristic for interval choosing.private Relation<ParameterizationFunction>
fulldatabase
The entire relation.protected double
jitter
Maximum jitter for distance values.private static Logging
LOG
The logger for this class.protected int
maxLevel
Maximum level for splitting the hypercube.protected int
minDim
Minimum dimensionality of the subspaces to be foundprotected int
minPts
Threshold for minimum number of points in a clusterprivate int
noiseDim
Holds the dimensionality for noise.private ModifiableDBIDs
processedIDs
Holds a set of processed ids.
-
Constructor Summary
Constructors Constructor Description CASH(int minPts, int maxLevel, int minDim, double jitter, boolean adjust)
Constructor.
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description private MaterializedRelation<ParameterizationFunction>
buildDB(int dim, double[][] basis, DBIDs ids, Relation<ParameterizationFunction> relation)
Builds a dim-1 dimensional database where the objects are projected into the specified subspace.private Relation<DoubleVector>
buildDerivatorDB(Relation<ParameterizationFunction> relation, DBIDs ids)
Builds a database for the derivator consisting of the ids in the specified interval.private double[][]
determineBasis(double[] alpha)
Determines a basis defining a subspace described by the specified alpha values.private double[]
determineMinMaxDistance(Relation<ParameterizationFunction> relation, int dimensionality)
Determines the minimum and maximum function value of all parameterization functions stored in the specified database.private CASHInterval
determineNextIntervalAtMaxLevel(ObjectHeap<CASHInterval> heap)
Determines the next ''best'' interval at maximum level, i.e. the next interval containing the most unprocessed objects.private static int
dimensionality(Relation<ParameterizationFunction> relation)
Get the dimensionality of a vector field.private CASHInterval
doDetermineNextIntervalAtMaxLevel(ObjectHeap<CASHInterval> heap)
Recursive helper method to determine the next ''best'' interval at maximum level, i.e. the next interval containing the most unprocessed objectsprivate Clustering<Model>
doRun(Relation<ParameterizationFunction> relation, FiniteProgress progress)
Runs the CASH algorithm on the specified database, this method is recursively called until only noise is left.TypeInformation[]
getInputTypeRestriction()
Get the input type restriction used for negotiating the data query.private void
initHeap(ObjectHeap<CASHInterval> heap, Relation<ParameterizationFunction> relation, int dim, DBIDs ids)
Initializes the heap with the root intervals.private Relation<ParameterizationFunction>
preprocess(Relation<? extends NumberVector> vrel)
Preprocess the dataset, precomputing the parameterization functions.private ParameterizationFunction
project(double[][] basis, ParameterizationFunction f)
Projects the specified parameterization function into the subspace described by the given basis.Clustering<Model>
run(Relation<? extends NumberVector> rel)
Run CASH on the relation.private double[][]
runDerivator(Relation<ParameterizationFunction> relation, int dim, CASHInterval interval, ModifiableDBIDs outids)
Runs the derivator on the specified interval and assigns all points having a distance less then the standard deviation of the derivator model to the model to this model.private LinearEquationSystem
runDerivator(Relation<ParameterizationFunction> relation, int dimensionality, DBIDs ids)
Runs the derivator on the specified interval and assigns all points having a distance less then the standard deviation of the derivator model to the model to this model.-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface elki.clustering.ClusteringAlgorithm
autorun
-
-
-
-
Field Detail
-
LOG
private static final Logging LOG
The logger for this class.
-
minPts
protected int minPts
Threshold for minimum number of points in a cluster
-
maxLevel
protected int maxLevel
Maximum level for splitting the hypercube.
-
minDim
protected int minDim
Minimum dimensionality of the subspaces to be found
-
jitter
protected double jitter
Maximum jitter for distance values.
-
adjust
protected boolean adjust
Apply adjustment heuristic for interval choosing.
-
noiseDim
private int noiseDim
Holds the dimensionality for noise.
-
processedIDs
private ModifiableDBIDs processedIDs
Holds a set of processed ids.
-
fulldatabase
private Relation<ParameterizationFunction> fulldatabase
The entire relation.
-
-
Method Detail
-
run
public Clustering<Model> run(Relation<? extends NumberVector> rel)
Run CASH on the relation.- Parameters:
rel
- Relation- Returns:
- Clustering result
-
preprocess
private Relation<ParameterizationFunction> preprocess(Relation<? extends NumberVector> vrel)
Preprocess the dataset, precomputing the parameterization functions.- Parameters:
vrel
- Vector relation- Returns:
- Preprocessed relation
-
doRun
private Clustering<Model> doRun(Relation<ParameterizationFunction> relation, FiniteProgress progress)
Runs the CASH algorithm on the specified database, this method is recursively called until only noise is left.- Parameters:
relation
- the Relation to run the CASH algorithm onprogress
- the progress object for verbose messages- Returns:
- a mapping of subspace dimensionalities to clusters
-
dimensionality
private static int dimensionality(Relation<ParameterizationFunction> relation)
Get the dimensionality of a vector field.- Parameters:
relation
- Relation- Returns:
- Dimensionality
-
initHeap
private void initHeap(ObjectHeap<CASHInterval> heap, Relation<ParameterizationFunction> relation, int dim, DBIDs ids)
Initializes the heap with the root intervals.- Parameters:
heap
- the heap to be initializedrelation
- the database storing the parameterization functionsdim
- the dimensionality of the databaseids
- the ids of the database
-
buildDB
private MaterializedRelation<ParameterizationFunction> buildDB(int dim, double[][] basis, DBIDs ids, Relation<ParameterizationFunction> relation)
Builds a dim-1 dimensional database where the objects are projected into the specified subspace.- Parameters:
dim
- the dimensionality of the databasebasis
- the basis defining the subspaceids
- the ids for the new databaserelation
- the database storing the parameterization functions- Returns:
- a dim-1 dimensional database where the objects are projected into the specified subspace
-
project
private ParameterizationFunction project(double[][] basis, ParameterizationFunction f)
Projects the specified parameterization function into the subspace described by the given basis.- Parameters:
basis
- the basis defining he subspacef
- the parameterization function to be projected- Returns:
- the projected parameterization function
-
determineBasis
private double[][] determineBasis(double[] alpha)
Determines a basis defining a subspace described by the specified alpha values.- Parameters:
alpha
- the alpha values- Returns:
- a basis defining a subspace described by the specified alpha values
-
determineNextIntervalAtMaxLevel
private CASHInterval determineNextIntervalAtMaxLevel(ObjectHeap<CASHInterval> heap)
Determines the next ''best'' interval at maximum level, i.e. the next interval containing the most unprocessed objects.- Parameters:
heap
- the heap storing the intervals- Returns:
- the next ''best'' interval at maximum level
-
doDetermineNextIntervalAtMaxLevel
private CASHInterval doDetermineNextIntervalAtMaxLevel(ObjectHeap<CASHInterval> heap)
Recursive helper method to determine the next ''best'' interval at maximum level, i.e. the next interval containing the most unprocessed objects- Parameters:
heap
- the heap storing the intervals- Returns:
- the next ''best'' interval at maximum level
-
determineMinMaxDistance
private double[] determineMinMaxDistance(Relation<ParameterizationFunction> relation, int dimensionality)
Determines the minimum and maximum function value of all parameterization functions stored in the specified database.- Parameters:
relation
- the database containing the parameterization functions.dimensionality
- the dimensionality of the database- Returns:
- an array containing the minimum and maximum function value of all parameterization functions stored in the specified database
-
runDerivator
private double[][] runDerivator(Relation<ParameterizationFunction> relation, int dim, CASHInterval interval, ModifiableDBIDs outids)
Runs the derivator on the specified interval and assigns all points having a distance less then the standard deviation of the derivator model to the model to this model.- Parameters:
relation
- the database containing the parameterization functionsinterval
- the interval to build the modeldim
- the dimensionality of the databaseoutids
- an empty set to assign the ids- Returns:
- a basis of the found subspace
-
runDerivator
private LinearEquationSystem runDerivator(Relation<ParameterizationFunction> relation, int dimensionality, DBIDs ids)
Runs the derivator on the specified interval and assigns all points having a distance less then the standard deviation of the derivator model to the model to this model.- Parameters:
relation
- the database containing the parameterization functionsids
- the ids to build the modeldimensionality
- the dimensionality of the subspace- Returns:
- a basis of the found subspace
-
buildDerivatorDB
private Relation<DoubleVector> buildDerivatorDB(Relation<ParameterizationFunction> relation, DBIDs ids)
Builds a database for the derivator consisting of the ids in the specified interval.- Parameters:
relation
- the database storing the parameterization functionsids
- the ids to build the database from- Returns:
- a database for the derivator consisting of the ids in the specified interval
-
getInputTypeRestriction
public TypeInformation[] getInputTypeRestriction()
Description copied from interface:Algorithm
Get the input type restriction used for negotiating the data query.- Specified by:
getInputTypeRestriction
in interfaceAlgorithm
- Returns:
- Type restriction
-
-