Class P3C

  • All Implemented Interfaces:
    Algorithm, ClusteringAlgorithm<Clustering<SubspaceModel>>, SubspaceClusteringAlgorithm<SubspaceModel>

    @Title("P3C: A Robust Projected Clustering Algorithm.")
    @Reference(authors="Gabriela Moise, J\u00f6rg Sander, Martin Ester",
               title="P3C: A Robust Projected Clustering Algorithm",
               booktitle="Proc. Sixth International Conference on Data Mining (ICDM \'06)",
    public class P3C
    extends java.lang.Object
    implements SubspaceClusteringAlgorithm<SubspaceModel>
    P3C: A Robust Projected Clustering Algorithm.

    Gabriela Moise, Jörg Sander, Martin Ester
    P3C: A Robust Projected Clustering Algorithm
    In: Proc. Sixth International Conference on Data Mining (ICDM '06)

    This is not a complete implementation of P3C, but good enough for most users. Improvements are welcome. The most obviously missing step is section 3.5 of P3C, where the cluster subspaces are refined.

    Florian Nuecke, Erich Schubert
    • Field Detail

      • LOG

        private static final Logging LOG
        The logger for this class.
      • poissonThreshold

        protected double poissonThreshold
        Parameter for the Poisson test threshold.
      • maxEmIterations

        protected int maxEmIterations
        Maximum number of iterations for the EM step.
      • emDelta

        protected double emDelta
        Threshold when to stop EM iterations.
      • minClusterSize

        protected int minClusterSize
        Minimum cluster size for noise flagging. (Not existing in the original publication).
      • alpha

        protected double alpha
        Alpha threshold for testing.
    • Constructor Detail

      • P3C

        public P3C​(double alpha,
                   double poissonThreshold,
                   int maxEmIterations,
                   double emDelta,
                   int minClusterSize)
        alpha - ChiSquared test threshold
        poissonThreshold - Poisson test threshold
        maxEmIterations - Maximum number of EM iterations
        emDelta - EM stopping threshold
        minClusterSize - Minimum cluster size
    • Method Detail

      • getInputTypeRestriction

        public TypeInformation[] getInputTypeRestriction()
        Description copied from interface: Algorithm
        Get the input type restriction used for negotiating the data query.
        Specified by:
        getInputTypeRestriction in interface Algorithm
        Type restriction
      • constructOneSignatures

        private java.util.ArrayList<P3C.Signature> constructOneSignatures​(SetDBIDs[][] partitions,
                                                                          long[][] markers)
        Construct the 1-signatures by merging adjacent dense bins.
        partitions - Initial partitions.
        markers - Markers for dense partitions.
      • mergeClusterCores

        private java.util.ArrayList<P3C.Signature> mergeClusterCores​(int binCount,
                                                                     java.util.ArrayList<P3C.Signature> signatures)
        Merge 1-signatures into p-signatures.
        binCount - Number of bins in each dimension.
        signatures - 1-signatures
      • pruneRedundantClusterCores

        private java.util.ArrayList<P3C.Signature> pruneRedundantClusterCores​(java.util.ArrayList<P3C.Signature> clusterCores)
        Prune redundant cluster cores
        clusterCores - Cluster cores
        Pruned cluster cores
      • partitionData

        private SetDBIDs[][] partitionData​(Relation<? extends NumberVector> relation,
                                           int bins)
        Partition the data set into bins bins in each dimension independently.

        This can be used to construct a grid approximation of the data using O(d n) memory.

        When a dimension is found to be constant, it will not be partitioned, but instead the corresponding array will be set to null.

        relation - Data relation to partition
        bins - Number of bins
        Partitions of each dimension.
      • unionDBIDs

        protected HashSetModifiableDBIDs unionDBIDs​(DBIDs[] parts,
                                                    int start,
                                                    int end)
        Compute the union of multiple DBID sets.
        parts - Parts array
        start - Array start index
        end - Array end index (exclusive)
      • chiSquaredUniformTest

        private int chiSquaredUniformTest​(SetDBIDs[] parts,
                                          long[] marked,
                                          int card)
        Performs a ChiSquared test to determine whether an attribute has a uniform distribution.
        parts - Data partitions.
        marked - the marked bins that should be ignored.
        card - Cardinality
        Position of maximum, or -1 when uniform.
      • computeFuzzyMembership

        private void computeFuzzyMembership​(Relation<? extends NumberVector> relation,
                                            java.util.ArrayList<P3C.Signature> clusterCores,
                                            ModifiableDBIDs unassigned,
                                            WritableDataStore<double[]> probClusterIGivenX,
                                            java.util.List<MultivariateGaussianModel> models,
                                            int dim)
        Computes a fuzzy membership with the weights based on which cluster cores each data point is part of.
        relation - Data relation
        clusterCores - the cluster cores.
        unassigned - set to which to add unassigned points.
        probClusterIGivenX - Membership probabilities.
        models - Cluster models.
        dim - Dimensionality
      • assignUnassigned

        private void assignUnassigned​(Relation<? extends NumberVector> relation,
                                      WritableDataStore<double[]> probClusterIGivenX,
                                      java.util.List<MultivariateGaussianModel> models,
                                      ModifiableDBIDs unassigned)
        Assign unassigned objects to best candidate based on shortest Mahalanobis distance.
        relation - Data relation
        probClusterIGivenX - fuzzy membership matrix.
        models - Cluster models.
        unassigned - the list of points not yet assigned.
      • hardClustering

        private java.util.ArrayList<P3C.ClusterCandidate> hardClustering​(WritableDataStore<double[]> probClusterIGivenX,
                                                                         java.util.List<P3C.Signature> clusterCores,
                                                                         DBIDs dbids)
        Creates a hard clustering from the specified soft membership matrix.
        probClusterIGivenX - the membership matrix.
        dbids - mapping matrix row to DBID.
        a hard clustering based on the matrix.
      • findOutliers

        private void findOutliers​(Relation<? extends NumberVector> relation,
                                  java.util.List<MultivariateGaussianModel> models,
                                  java.util.ArrayList<P3C.ClusterCandidate> clusterCandidates,
                                  ModifiableDBIDs noise)
        Performs outlier detection by testing the Mahalanobis distance of each point in a cluster against the critical value of the ChiSquared distribution with as many degrees of freedom as the cluster has relevant attributes.
        relation - Data relation
        models - Cluster models
        clusterCandidates - the list of clusters to check.
        noise - the set to which to add points deemed outliers.
      • mergeSignatures

        protected P3C.Signature mergeSignatures​(P3C.Signature first,
                                                P3C.Signature second,
                                                int numBins)
        Generates a merged signature of this and another one, where the other signature must be a 1-signature.
        first - First signature.
        second - Second signature, must be a 1-signature.
        numBins - Number of bins per dimension.
        the merged signature, or null if the merge failed.