Class CLARA<V>

  • Type Parameters:
    V - Data type
    All Implemented Interfaces:
    Algorithm, ClusteringAlgorithm<Clustering<MedoidModel>>, KMedoidsClustering<V>

    @Reference(authors="L. Kaufman, P. J. Rousseeuw",title="Clustering Large Data Sets",booktitle="Pattern Recognition in Practice",url="https://doi.org/10.1016/B978-0-444-87877-9.50039-X",bibkey="doi:10.1016/B978-0-444-87877-9.50039-X")
    @Reference(authors="L. Kaufman, P. J. Rousseeuw",title="Clustering Large Applications (Program CLARA)",booktitle="Finding Groups in Data: An Introduction to Cluster Analysis",url="https://doi.org/10.1002/9780470316801.ch3",bibkey="doi:10.1002/9780470316801.ch3")
    public class CLARA<V>
    extends PAM<V>
    Clustering Large Applications (CLARA) is a clustering method for large data sets. It is based on PAM (partitioning around medoids), which it applies to random samples of the data instead of the full data set.
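
    The sampling idea can be illustrated with a minimal, self-contained sketch. It is not the ELKI implementation: the class ClaraSketch, its methods, the 1-dimensional data, and the greedy swap heuristic standing in for PAM are all assumptions made for illustration. It also assumes the sample size is at least k.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical, simplified CLARA sketch on 1-d points; not ELKI code. */
final class ClaraSketch {
  /** Sum of distances from each object in ids to its nearest medoid. */
  static double cost(double[] data, List<Integer> ids, int[] medoids) {
    double sum = 0;
    for (int i : ids) {
      double best = Double.POSITIVE_INFINITY;
      for (int m : medoids) {
        best = Math.min(best, Math.abs(data[i] - data[m]));
      }
      sum += best;
    }
    return sum;
  }

  /** Simplified stand-in for PAM: start with the first k sample objects,
   *  then accept any medoid swap that lowers the cost on the sample. */
  static int[] pamOnSample(double[] data, List<Integer> sample, int k, int maxiter) {
    int[] med = new int[k];
    for (int j = 0; j < k; j++) {
      med[j] = sample.get(j);
    }
    double best = cost(data, sample, med);
    for (int it = 0; it < maxiter; it++) {
      boolean improved = false;
      for (int j = 0; j < k; j++) {
        for (int cand : sample) {
          int old = med[j];
          med[j] = cand;
          double c = cost(data, sample, med);
          if (c < best) {
            best = c;
            improved = true;
          } else {
            med[j] = old; // undo the swap
          }
        }
      }
      if (!improved) {
        break;
      }
    }
    return med;
  }

  /** Draw numsamples samples, cluster each, keep the medoids that are best on ALL data. */
  static int[] clara(double[] data, int k, int numsamples, int samplesize, int maxiter, long seed) {
    Random rnd = new Random(seed);
    List<Integer> all = new ArrayList<>();
    for (int i = 0; i < data.length; i++) {
      all.add(i);
    }
    int[] bestMed = null;
    double bestCost = Double.POSITIVE_INFINITY;
    for (int s = 0; s < numsamples; s++) {
      List<Integer> shuffled = new ArrayList<>(all);
      Collections.shuffle(shuffled, rnd);
      List<Integer> sample = shuffled.subList(0, Math.min(samplesize, shuffled.size()));
      int[] med = pamOnSample(data, sample, k, maxiter);
      double c = cost(data, all, med); // evaluate on the full data set
      if (c < bestCost) {
        bestCost = c;
        bestMed = med;
      }
    }
    return bestMed;
  }
}
```

    The key design point is that the expensive medoid search only sees the sample, while the quality of each candidate solution is judged on the complete data set.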

    TODO: use a triangular distance matrix, rather than a hash-map based cache, for a bit better performance and less memory.
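
    As a rough illustration of what the TODO refers to, the following sketch stores pairwise distances of n sample objects in a flat array holding only the strict lower triangle. The class TriangularDistanceCache and its methods are hypothetical and not part of ELKI; objects are assumed to be addressed by integer offsets 0..n-1 within the sample, and n is assumed small enough that n * (n - 1) / 2 fits into an int.

```java
import java.util.Arrays;

/** Hypothetical triangular distance cache; not ELKI's implementation. */
final class TriangularDistanceCache {
  /** Strict lower triangle, stored row by row: (1,0), (2,0), (2,1), (3,0), ... */
  private final double[] values;

  TriangularDistanceCache(int n) {
    values = new double[n * (n - 1) / 2];
    Arrays.fill(values, Double.NaN); // NaN marks "not computed yet"
  }

  /** Position of the unordered pair (i, j), i != j, in the flat array. */
  static int offset(int i, int j) {
    int hi = Math.max(i, j);
    int lo = Math.min(i, j);
    return hi * (hi - 1) / 2 + lo;
  }

  /** Cached distance, 0 for i == j, or NaN if the pair was not stored yet. */
  double get(int i, int j) {
    return i == j ? 0. : values[offset(i, j)];
  }

  /** Store a symmetric distance; the diagonal is implicit and ignored. */
  void put(int i, int j, double dist) {
    if (i != j) {
      values[offset(i, j)] = dist;
    }
  }
}
```

    Compared to a hash map keyed by object pairs, such an array avoids boxing and hashing overhead, at the price of allocating space for all pairs up front.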

    Reference:

    L. Kaufman, P. J. Rousseeuw
    Clustering Large Data Sets
    Pattern Recognition in Practice

    L. Kaufman, P. J. Rousseeuw
    Clustering Large Applications (Program CLARA)
    Finding Groups in Data: An Introduction to Cluster Analysis

    Since:
    0.7.0
    Author:
    Erich Schubert
    • Field Detail

      • LOG

        private static final Logging LOG
        Class logger.
      • sampling

        double sampling
        Sampling rate. Values less than 1 are interpreted as a relative value (a fraction of the data set size); other values are interpreted as an absolute sample size.
      • numsamples

        int numsamples
        Number of samples to draw (i.e. iterations).
      • keepmed

        boolean keepmed
        Keep the previous medoids in the sample (see page 145).
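
        The following sketch shows one way the sampling field could be turned into a concrete sample size. The helper SampleSizeSketch.sampleSize is hypothetical; truncating to an integer, capping at the data set size, and treating a value of exactly 1 as absolute are assumptions, and the actual implementation may differ in these details.

```java
/** Hypothetical helper; not part of ELKI. */
final class SampleSizeSketch {
  /**
   * Resolve a sampling parameter to a number of objects.
   *
   * @param sampling relative rate (below 1) or absolute size
   * @param n size of the data set
   * @return sample size, at most n
   */
  static int sampleSize(double sampling, int n) {
    // Assumption: values below 1 are a fraction of n, others an absolute count.
    double size = sampling < 1 ? sampling * n : sampling;
    // Assumption: never sample more objects than exist; truncate to int.
    return (int) Math.min(n, size);
  }
}
```

        For example, with 1000 objects a sampling value of 0.25 yields 250 objects, while a value of 80 yields 80 objects.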
    • Constructor Detail

      • CLARA

        public CLARA(Distance<? super V> distance,
                     int k,
                     int maxiter,
                     KMedoidsInitialization<V> initializer,
                     int numsamples,
                     double sampling,
                     boolean keepmed,
                     RandomFactory random)
        Constructor.
        Parameters:
        distance - Distance function to use
        k - Number of clusters to produce
        maxiter - Maximum number of iterations
        initializer - Initialization function
        numsamples - Number of samples (sampling iterations)
        sampling - Sampling rate (absolute or relative)
        keepmed - Keep the previous medoids in the next sample
        random - Random generator
    • Method Detail

      • run

        public Clustering<MedoidModel> run(Relation<V> relation,
                                           int k,
                                           DistanceQuery<? super V> distQ)
        Description copied from interface: KMedoidsClustering
        Run k-medoids clustering with a given distance query.
        Not a very elegant API, but needed for some types of nested k-medoids.
        Specified by:
        run in interface KMedoidsClustering<V>
        Overrides:
        run in class PAM<V>
        Parameters:
        relation - relation to use
        k - Number of clusters
        distQ - Distance query to use
        Returns:
        Clustering result
      • randomSample

        static DBIDs randomSample(DBIDs ids,
                                  int samplesize,
                                  java.util.Random rnd,
                                  DBIDs previous)
        Draw a random sample of the desired size.
        Parameters:
        ids - IDs to sample from
        samplesize - Sample size
        rnd - Random generator
        previous - Previous medoids to always include in the sample.
        Returns:
        Sample
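
        The idea of always including the previous medoids can be sketched in plain Java as follows. This is an illustration only and does not use the ELKI DBIDs API: the class RandomSampleSketch, the use of integer ids, and the choice to count the previous medoids towards the sample size are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Hypothetical sketch with integer ids; not ELKI code. */
final class RandomSampleSketch {
  /**
   * Draw a random sample that always contains the previous medoids.
   *
   * @param ids ids to sample from (assumed distinct)
   * @param samplesize desired total sample size
   * @param rnd random generator
   * @param previous previous medoids to include, or null
   * @return sample of distinct ids
   */
  static Set<Integer> randomSample(List<Integer> ids, int samplesize, Random rnd, List<Integer> previous) {
    Set<Integer> sample = new LinkedHashSet<>();
    if (previous != null) {
      sample.addAll(previous); // previous medoids are always kept
    }
    List<Integer> shuffled = new ArrayList<>(ids);
    Collections.shuffle(shuffled, rnd);
    for (int id : shuffled) {
      if (sample.size() >= samplesize) {
        break;
      }
      sample.add(id); // duplicates of previous medoids are ignored by the set
    }
    return sample;
  }
}
```

        Filling the set until it reaches the desired size handles the case where a randomly drawn id coincides with a previous medoid, which would otherwise leave the sample too small.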
      • assignRemainingToNearestCluster

        protected static double assignRemainingToNearestCluster(ArrayDBIDs means,
                                                                DBIDs ids,
                                                                DBIDs rids,
                                                                WritableIntegerDataStore assignment,
                                                                DistanceQuery<?> distQ)
        Assigns each object that is not part of the already assigned sample to its nearest cluster center, storing the cluster number in the assignment, and returns the sum of the resulting distances.
        Parameters:
        means - Cluster centers (for CLARA, the medoids)
        ids - Object ids
        rids - Sample that was already assigned
        assignment - cluster assignment
        distQ - distance query
        Returns:
        Sum of distances.
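
        A minimal sketch of this assignment step, using plain arrays instead of the ELKI data structures. The class AssignRemainingSketch, the 1-dimensional data, and the absolute difference as distance are assumptions for illustration; the sketch sums distances only over the objects it assigns.

```java
import java.util.Set;

/** Hypothetical sketch on 1-d points; not ELKI code. */
final class AssignRemainingSketch {
  /**
   * Assign every object outside the sample to its nearest medoid.
   *
   * @param data 1-d data points
   * @param medoids indices of the medoids in data
   * @param sampled indices that were already assigned (skipped here)
   * @param assignment output: cluster number per object
   * @return sum of distances of the newly assigned objects
   */
  static double assignRemaining(double[] data, int[] medoids, Set<Integer> sampled, int[] assignment) {
    double sum = 0;
    for (int i = 0; i < data.length; i++) {
      if (sampled.contains(i)) {
        continue; // already assigned while clustering the sample
      }
      int best = 0;
      double bestDist = Double.POSITIVE_INFINITY;
      for (int m = 0; m < medoids.length; m++) {
        double d = Math.abs(data[i] - data[medoids[m]]);
        if (d < bestDist) {
          bestDist = d;
          best = m;
        }
      }
      assignment[i] = best;
      sum += bestDist;
    }
    return sum;
  }
}
```

        Skipping the sampled objects avoids recomputing assignments that the clustering of the sample has already produced.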