Package elki.clustering.subspace
Class P3C
- java.lang.Object
-
- elki.clustering.subspace.P3C
-
- All Implemented Interfaces:
Algorithm
,ClusteringAlgorithm<Clustering<SubspaceModel>>
,SubspaceClusteringAlgorithm<SubspaceModel>
@Title("P3C: A Robust Projected Clustering Algorithm.") @Reference(authors="Gabriela Moise, J\u00f6rg Sander, Martin Ester", title="P3C: A Robust Projected Clustering Algorithm", booktitle="Proc. Sixth International Conference on Data Mining (ICDM \'06)", url="https://doi.org/10.1109/ICDM.2006.123", bibkey="DBLP:conf/icdm/MoiseSE06") @Priority(190) public class P3C extends java.lang.Object implements SubspaceClusteringAlgorithm<SubspaceModel>
P3C: A Robust Projected Clustering Algorithm.Reference:
Gabriela Moise, Jörg Sander, Martin Ester
P3C: A Robust Projected Clustering Algorithm
In: Proc. Sixth International Conference on Data Mining (ICDM '06)This is not a complete implementation of P3C, but good enough for most users. Improvements are welcome. The most obviously missing step is section 3.5 of P3C, where the cluster subspaces are refined.
- Since:
- 0.6.0
- Author:
- Florian Nuecke, Erich Schubert
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description private static class
P3C.ClusterCandidate
This class is used to represent potential clusters.static class
P3C.Par
Parameterization class.private static class
P3C.Signature
P3C Cluster signature.-
Nested classes/interfaces inherited from interface elki.Algorithm
Algorithm.Utils
-
-
Field Summary
Fields Modifier and Type Field Description protected double
alpha
Alpha threshold for testing.protected double
emDelta
Threshold when to stop EM iterations.private static Logging
LOG
The logger for this class.protected int
maxEmIterations
Maximum number of iterations for the EM step.protected int
minClusterSize
Minimum cluster size for noise flagging.protected double
poissonThreshold
Parameter for the Poisson test threshold.
-
Constructor Summary
Constructors Constructor Description P3C(double alpha, double poissonThreshold, int maxEmIterations, double emDelta, int minClusterSize)
Constructor.
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description private void
assignUnassigned(Relation<? extends NumberVector> relation, WritableDataStore<double[]> probClusterIGivenX, java.util.List<MultivariateGaussianModel> models, ModifiableDBIDs unassigned)
Assign unassigned objects to best candidate based on shortest Mahalanobis distance.private int
chiSquaredUniformTest(SetDBIDs[] parts, long[] marked, int card)
Performs a ChiSquared test to determine whether an attribute has a uniform distribution.private void
computeFuzzyMembership(Relation<? extends NumberVector> relation, java.util.ArrayList<P3C.Signature> clusterCores, ModifiableDBIDs unassigned, WritableDataStore<double[]> probClusterIGivenX, java.util.List<MultivariateGaussianModel> models, int dim)
Computes a fuzzy membership with the weights based on which cluster cores each data point is part of.private java.util.ArrayList<P3C.Signature>
constructOneSignatures(SetDBIDs[][] partitions, long[][] markers)
Construct the 1-signatures by merging adjacent dense bins.private void
findOutliers(Relation<? extends NumberVector> relation, java.util.List<MultivariateGaussianModel> models, java.util.ArrayList<P3C.ClusterCandidate> clusterCandidates, ModifiableDBIDs noise)
Performs outlier detection by testing the Mahalanobis distance of each point in a cluster against the critical value of the ChiSquared distribution with as many degrees of freedom as the cluster has relevant attributes.TypeInformation[]
getInputTypeRestriction()
Get the input type restriction used for negotiating the data query.private java.util.ArrayList<P3C.ClusterCandidate>
hardClustering(WritableDataStore<double[]> probClusterIGivenX, java.util.List<P3C.Signature> clusterCores, DBIDs dbids)
Creates a hard clustering from the specified soft membership matrix.private java.util.ArrayList<P3C.Signature>
mergeClusterCores(int binCount, java.util.ArrayList<P3C.Signature> signatures)
Merge 1-signatures into p-signatures.protected P3C.Signature
mergeSignatures(P3C.Signature first, P3C.Signature second, int numBins)
Generates a merged signature of this and another one, where the other signature must be a 1-signature.private SetDBIDs[][]
partitionData(Relation<? extends NumberVector> relation, int bins)
Partition the data set intobins
bins in each dimension independently.private java.util.ArrayList<P3C.Signature>
pruneRedundantClusterCores(java.util.ArrayList<P3C.Signature> clusterCores)
Prune redundant cluster coresClustering<SubspaceModel>
run(Relation<? extends NumberVector> relation)
Performs the P3C algorithm on the given Database.protected HashSetModifiableDBIDs
unionDBIDs(DBIDs[] parts, int start, int end)
Compute the union of multiple DBID sets.-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface elki.clustering.ClusteringAlgorithm
autorun
-
-
-
-
Field Detail
-
LOG
private static final Logging LOG
The logger for this class.
-
poissonThreshold
protected double poissonThreshold
Parameter for the Poisson test threshold.
-
maxEmIterations
protected int maxEmIterations
Maximum number of iterations for the EM step.
-
emDelta
protected double emDelta
Threshold when to stop EM iterations.
-
minClusterSize
protected int minClusterSize
Minimum cluster size for noise flagging. (Not existing in the original publication).
-
alpha
protected double alpha
Alpha threshold for testing.
-
-
Constructor Detail
-
P3C
public P3C(double alpha, double poissonThreshold, int maxEmIterations, double emDelta, int minClusterSize)
Constructor.- Parameters:
alpha
- ChiSquared test thresholdpoissonThreshold
- Poisson test thresholdmaxEmIterations
- Maximum number of EM iterationsemDelta
- EM stopping thresholdminClusterSize
- Minimum cluster size
-
-
Method Detail
-
getInputTypeRestriction
public TypeInformation[] getInputTypeRestriction()
Description copied from interface:Algorithm
Get the input type restriction used for negotiating the data query.- Specified by:
getInputTypeRestriction
in interfaceAlgorithm
- Returns:
- Type restriction
-
run
public Clustering<SubspaceModel> run(Relation<? extends NumberVector> relation)
Performs the P3C algorithm on the given Database.- Parameters:
relation
- Input data
-
constructOneSignatures
private java.util.ArrayList<P3C.Signature> constructOneSignatures(SetDBIDs[][] partitions, long[][] markers)
Construct the 1-signatures by merging adjacent dense bins.- Parameters:
partitions
- Initial partitions.markers
- Markers for dense partitions.- Returns:
- 1-signatures
-
mergeClusterCores
private java.util.ArrayList<P3C.Signature> mergeClusterCores(int binCount, java.util.ArrayList<P3C.Signature> signatures)
Merge 1-signatures into p-signatures.- Parameters:
binCount
- Number of bins in each dimension.signatures
- 1-signatures- Returns:
- p-signatures
-
pruneRedundantClusterCores
private java.util.ArrayList<P3C.Signature> pruneRedundantClusterCores(java.util.ArrayList<P3C.Signature> clusterCores)
Prune redundant cluster cores- Parameters:
clusterCores
- Cluster cores- Returns:
- Pruned cluster cores
-
partitionData
private SetDBIDs[][] partitionData(Relation<? extends NumberVector> relation, int bins)
Partition the data set intobins
bins in each dimension independently.This can be used to construct a grid approximation of the data using O(d n) memory.
When a dimension is found to be constant, it will not be partitioned, but instead the corresponding array will be set to
null
.- Parameters:
relation
- Data relation to partitionbins
- Number of bins- Returns:
- Partitions of each dimension.
-
unionDBIDs
protected HashSetModifiableDBIDs unionDBIDs(DBIDs[] parts, int start, int end)
Compute the union of multiple DBID sets.- Parameters:
parts
- Parts arraystart
- Array start indexend
- Array end index (exclusive)- Returns:
- Union
-
chiSquaredUniformTest
private int chiSquaredUniformTest(SetDBIDs[] parts, long[] marked, int card)
Performs a ChiSquared test to determine whether an attribute has a uniform distribution.- Parameters:
parts
- Data partitions.marked
- the marked bins that should be ignored.card
- Cardinality- Returns:
- Position of maximum, or -1 when uniform.
-
computeFuzzyMembership
private void computeFuzzyMembership(Relation<? extends NumberVector> relation, java.util.ArrayList<P3C.Signature> clusterCores, ModifiableDBIDs unassigned, WritableDataStore<double[]> probClusterIGivenX, java.util.List<MultivariateGaussianModel> models, int dim)
Computes a fuzzy membership with the weights based on which cluster cores each data point is part of.- Parameters:
relation
- Data relationclusterCores
- the cluster cores.unassigned
- set to which to add unassigned points.probClusterIGivenX
- Membership probabilities.models
- Cluster models.dim
- Dimensionality
-
assignUnassigned
private void assignUnassigned(Relation<? extends NumberVector> relation, WritableDataStore<double[]> probClusterIGivenX, java.util.List<MultivariateGaussianModel> models, ModifiableDBIDs unassigned)
Assign unassigned objects to best candidate based on shortest Mahalanobis distance.- Parameters:
relation
- Data relationprobClusterIGivenX
- fuzzy membership matrix.models
- Cluster models.unassigned
- the list of points not yet assigned.
-
hardClustering
private java.util.ArrayList<P3C.ClusterCandidate> hardClustering(WritableDataStore<double[]> probClusterIGivenX, java.util.List<P3C.Signature> clusterCores, DBIDs dbids)
Creates a hard clustering from the specified soft membership matrix.- Parameters:
probClusterIGivenX
- the membership matrix.dbids
- mapping matrix row to DBID.- Returns:
- a hard clustering based on the matrix.
-
findOutliers
private void findOutliers(Relation<? extends NumberVector> relation, java.util.List<MultivariateGaussianModel> models, java.util.ArrayList<P3C.ClusterCandidate> clusterCandidates, ModifiableDBIDs noise)
Performs outlier detection by testing the Mahalanobis distance of each point in a cluster against the critical value of the ChiSquared distribution with as many degrees of freedom as the cluster has relevant attributes.- Parameters:
relation
- Data relationmodels
- Cluster modelsclusterCandidates
- the list of clusters to check.noise
- the set to which to add points deemed outliers.
-
mergeSignatures
protected P3C.Signature mergeSignatures(P3C.Signature first, P3C.Signature second, int numBins)
Generates a merged signature of this and another one, where the other signature must be a 1-signature.- Parameters:
first
- First signature.second
- Second signature, must be a 1-signature.numBins
- Number of bins per dimension.- Returns:
- the merged signature, or null if the merge failed.
-
-