Class P3C

  • All Implemented Interfaces:
    Algorithm, ClusteringAlgorithm<Clustering<SubspaceModel>>, SubspaceClusteringAlgorithm<SubspaceModel>

    @Title("P3C: A Robust Projected Clustering Algorithm.")
    @Reference(authors="Gabriela Moise, J\u00f6rg Sander, Martin Ester",
               title="P3C: A Robust Projected Clustering Algorithm",
               booktitle="Proc. Sixth International Conference on Data Mining (ICDM \'06)",
               url="https://doi.org/10.1109/ICDM.2006.123",
               bibkey="DBLP:conf/icdm/MoiseSE06")
    @Priority(190)
    public class P3C
    extends java.lang.Object
    implements SubspaceClusteringAlgorithm<SubspaceModel>
    P3C: A Robust Projected Clustering Algorithm.

    Reference:
    Gabriela Moise, Jörg Sander, Martin Ester
    P3C: A Robust Projected Clustering Algorithm
    In: Proc. Sixth International Conference on Data Mining (ICDM '06)

    This is not a complete implementation of P3C, but good enough for most users. Improvements are welcome. The most obviously missing step is section 3.5 of P3C, where the cluster subspaces are refined.

    Since:
    0.6.0
    Author:
    Florian Nuecke, Erich Schubert
    • Field Detail

      • LOG

        private static final Logging LOG
        The logger for this class.
      • poissonThreshold

        protected double poissonThreshold
        Parameter for the Poisson test threshold.
      • maxEmIterations

        protected int maxEmIterations
        Maximum number of iterations for the EM step.
      • emDelta

        protected double emDelta
        Threshold when to stop EM iterations.
      • minClusterSize

        protected int minClusterSize
        Minimum cluster size for noise flagging. (Not existing in the original publication).
      • alpha

        protected double alpha
        Alpha threshold for testing.
    • Constructor Detail

      • P3C

        public P3C​(double alpha,
                   double poissonThreshold,
                   int maxEmIterations,
                   double emDelta,
                   int minClusterSize)
        Constructor.
        Parameters:
        alpha - ChiSquared test threshold
        poissonThreshold - Poisson test threshold
        maxEmIterations - Maximum number of EM iterations
        emDelta - EM stopping threshold
        minClusterSize - Minimum cluster size
    • Method Detail

      • getInputTypeRestriction

        public TypeInformation[] getInputTypeRestriction()
        Description copied from interface: Algorithm
        Get the input type restriction used for negotiating the data query.
        Specified by:
        getInputTypeRestriction in interface Algorithm
        Returns:
        Type restriction
      • constructOneSignatures

        private java.util.ArrayList<P3C.Signature> constructOneSignatures​(SetDBIDs[][] partitions,
                                                                          long[][] markers)
        Construct the 1-signatures by merging adjacent dense bins.
        Parameters:
        partitions - Initial partitions.
        markers - Markers for dense partitions.
        Returns:
        1-signatures
      • mergeClusterCores

        private java.util.ArrayList<P3C.Signature> mergeClusterCores​(int binCount,
                                                                     java.util.ArrayList<P3C.Signature> signatures)
        Merge 1-signatures into p-signatures.
        Parameters:
        binCount - Number of bins in each dimension.
        signatures - 1-signatures
        Returns:
        p-signatures
      • pruneRedundantClusterCores

        private java.util.ArrayList<P3C.Signature> pruneRedundantClusterCores​(java.util.ArrayList<P3C.Signature> clusterCores)
        Prune redundant cluster cores
        Parameters:
        clusterCores - Cluster cores
        Returns:
        Pruned cluster cores
      • partitionData

        private SetDBIDs[][] partitionData​(Relation<? extends NumberVector> relation,
                                           int bins)
        Partition the data set into bins bins in each dimension independently.

        This can be used to construct a grid approximation of the data using O(d n) memory.

        When a dimension is found to be constant, it will not be partitioned, but instead the corresponding array will be set to null.

        Parameters:
        relation - Data relation to partition
        bins - Number of bins
        Returns:
        Partitions of each dimension.
      • unionDBIDs

        protected HashSetModifiableDBIDs unionDBIDs​(DBIDs[] parts,
                                                    int start,
                                                    int end)
        Compute the union of multiple DBID sets.
        Parameters:
        parts - Parts array
        start - Array start index
        end - Array end index (exclusive)
        Returns:
        Union
      • chiSquaredUniformTest

        private int chiSquaredUniformTest​(SetDBIDs[] parts,
                                          long[] marked,
                                          int card)
        Performs a ChiSquared test to determine whether an attribute has a uniform distribution.
        Parameters:
        parts - Data partitions.
        marked - the marked bins that should be ignored.
        card - Cardinality
        Returns:
        Position of maximum, or -1 when uniform.
      • computeFuzzyMembership

        private void computeFuzzyMembership​(Relation<? extends NumberVector> relation,
                                            java.util.ArrayList<P3C.Signature> clusterCores,
                                            ModifiableDBIDs unassigned,
                                            WritableDataStore<double[]> probClusterIGivenX,
                                            java.util.List<MultivariateGaussianModel> models,
                                            int dim)
        Computes a fuzzy membership with the weights based on which cluster cores each data point is part of.
        Parameters:
        relation - Data relation
        clusterCores - the cluster cores.
        unassigned - set to which to add unassigned points.
        probClusterIGivenX - Membership probabilities.
        models - Cluster models.
        dim - Dimensionality
      • assignUnassigned

        private void assignUnassigned​(Relation<? extends NumberVector> relation,
                                      WritableDataStore<double[]> probClusterIGivenX,
                                      java.util.List<MultivariateGaussianModel> models,
                                      ModifiableDBIDs unassigned)
        Assign unassigned objects to best candidate based on shortest Mahalanobis distance.
        Parameters:
        relation - Data relation
        probClusterIGivenX - fuzzy membership matrix.
        models - Cluster models.
        unassigned - the list of points not yet assigned.
      • hardClustering

        private java.util.ArrayList<P3C.ClusterCandidate> hardClustering​(WritableDataStore<double[]> probClusterIGivenX,
                                                                         java.util.List<P3C.Signature> clusterCores,
                                                                         DBIDs dbids)
        Creates a hard clustering from the specified soft membership matrix.
        Parameters:
        probClusterIGivenX - the membership matrix.
        dbids - mapping matrix row to DBID.
        Returns:
        a hard clustering based on the matrix.
      • findOutliers

        private void findOutliers​(Relation<? extends NumberVector> relation,
                                  java.util.List<MultivariateGaussianModel> models,
                                  java.util.ArrayList<P3C.ClusterCandidate> clusterCandidates,
                                  ModifiableDBIDs noise)
        Performs outlier detection by testing the Mahalanobis distance of each point in a cluster against the critical value of the ChiSquared distribution with as many degrees of freedom as the cluster has relevant attributes.
        Parameters:
        relation - Data relation
        models - Cluster models
        clusterCandidates - the list of clusters to check.
        noise - the set to which to add points deemed outliers.
      • mergeSignatures

        protected P3C.Signature mergeSignatures​(P3C.Signature first,
                                                P3C.Signature second,
                                                int numBins)
        Generates a merged signature of this and another one, where the other signature must be a 1-signature.
        Parameters:
        first - First signature.
        second - Second signature, must be a 1-signature.
        numBins - Number of bins per dimension.
        Returns:
        the merged signature, or null if the merge failed.