Class WardLinkage

  • All Implemented Interfaces:
    GeometricLinkage, Linkage

    @Reference(authors="J. H. Ward Jr.",title="Hierarchical grouping to optimize an objective function",booktitle="Journal of the American statistical association 58.301",url="https://doi.org/10.1080/01621459.1963.10500845",bibkey="doi:10.1080/01621459.1963.10500845") @Reference(authors="D. Wishart",title="256. Note: An Algorithm for Hierarchical Classifications",booktitle="Biometrics 25(1)",url="https://doi.org/10.2307/2528688",bibkey="doi:10.2307/2528688")
    @Alias({"ward","MISSQ"})
    @Priority(101)
    public class WardLinkage
    extends java.lang.Object
    implements GeometricLinkage
    Ward's method clustering method.

    This criterion minimizes the increase of squared errors, and should be used with squared Euclidean distance. Usually, ELKI will try to automatically square distances when you combine this with Euclidean distance. For performance reasons, the direct use of squared distances is preferable!

    The distance of two clusters in this method is: \[ d_{\text{Ward}}(A,B):=\text{SSE}(A\cup B)-\text{SSE}(A)-\text{SSE}(B) \] where the sum of squared errors is defined as: \[ \text{SSE}(X):=\sum\nolimits_{x\in X} (x-\mu_X)^2 \qquad \text{with } \mu_X=\tfrac{1}{|X|}\sum\nolimits_{x\in X} X \] This objective can be rewritten to \[ d_{\text{Ward}}(A,B):=\tfrac{|A|\cdot|B|}{|A|+|B|} ||\mu_A-\mu_B||^2 = \tfrac{1}{1/|A|+1/|B|} ||\mu_A-\mu_B||^2 \]

    For Lance-Williams, we can then obtain the following recursive definition: \[d_{\text{Ward}}(A\cup B,C)=\tfrac{|A|+|C|}{|A|+|B|+|C|} d(A,C) + \tfrac{|B|+|C|}{|A|+|B|+|C|} d(B,C) - \tfrac{|C|}{|A|+|B|+|C|} d(A,B)\]

    These transformations rely on properties of the L2-norm, so they cannot be used with arbitrary metrics, unless they are equivalent to the L2-norm in some transformed space.

    Because the resulting distances are squared, when used with a non-squared distance, ELKI implementations will apply the square root before returning the final result. This is statistically somewhat questionable, but usually yields more interpretable distances that — roughly — correspond to the increase in standard deviation. With ELKI, you can get both behavior: Either choose squared Euclidean distance, or regular Euclidean distance.

    This method is also referred to as "minimize increase of sum of squares" (MISSQ) by Podani.

    Reference:

    J. H. Ward Jr.
    Hierarchical grouping to optimize an objective function
    Journal of the American statistical association 58.301

    The formulation using Lance-Williams equations is due to:

    D. Wishart
    256. Note: An Algorithm for Hierarchical Classifications
    Biometrics 25(1)

    Since:
    0.6.0
    Author:
    Erich Schubert
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  WardLinkage.Par
      Class parameterizer.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static WardLinkage STATIC
      Static instance of class.
    • Constructor Summary

      Constructors 
      Constructor Description
      WardLinkage()
      Deprecated.
      use the static instance STATIC instead.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      double combine​(int sizex, double dx, int sizey, double dy, int sizej, double dxy)
      Compute combined linkage for two clusters.
      double distance​(double[] x, int sizex, double[] y, int sizey)
      Distance of two aggregated clusters.
      double initial​(double d, boolean issquare)
      Initialization of the distance matrix.
      double[] merge​(double[] x, int sizex, double[] y, int sizey)
      Merge the aggregated vectors.
      double restore​(double d, boolean issquare)
      Restore a distance to the original scale.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • STATIC

        public static final WardLinkage STATIC
        Static instance of class.
    • Constructor Detail

      • WardLinkage

        @Deprecated
        public WardLinkage()
        Deprecated.
        use the static instance STATIC instead.
        Constructor.
    • Method Detail

      • initial

        public double initial​(double d,
                              boolean issquare)
        Description copied from interface: Linkage
        Initialization of the distance matrix.
        Specified by:
        initial in interface Linkage
        Parameters:
        d - Distance
        issquare - Flag to indicate the input values are already squared
        Returns:
        Initial value
      • restore

        public double restore​(double d,
                              boolean issquare)
        Description copied from interface: Linkage
        Restore a distance to the original scale.
        Specified by:
        restore in interface Linkage
        Parameters:
        d - Distance
        issquare - Flag to indicate the input values were already squared
        Returns:
        Initial value
      • combine

        public double combine​(int sizex,
                              double dx,
                              int sizey,
                              double dy,
                              int sizej,
                              double dxy)
        Description copied from interface: Linkage
        Compute combined linkage for two clusters.
        Specified by:
        combine in interface Linkage
        Parameters:
        sizex - Size of first cluster x before merging
        dx - Distance of cluster x to j before merging
        sizey - Size of second cluster y before merging
        dy - Distance of cluster y to j before merging
        sizej - Size of candidate cluster j
        dxy - Distance between clusters x and y before merging
        Returns:
        Combined distance
      • merge

        public double[] merge​(double[] x,
                              int sizex,
                              double[] y,
                              int sizey)
        Description copied from interface: GeometricLinkage
        Merge the aggregated vectors.
        Specified by:
        merge in interface GeometricLinkage
        Parameters:
        x - Center of the first cluster
        sizex - Weight of the first cluster
        y - Center of the second cluster
        sizey - Weight of the second cluster
        Returns:
        Combined vector
      • distance

        public double distance​(double[] x,
                               int sizex,
                               double[] y,
                               int sizey)
        Description copied from interface: GeometricLinkage
        Distance of two aggregated clusters.
        Specified by:
        distance in interface GeometricLinkage
        Parameters:
        x - Center of the first cluster
        sizex - Weight of the first cluster
        y - Center of the second cluster
        sizey - Weight of the second cluster
        Returns:
        Distance