Class Entropy


  • @Reference(authors="M. Meil\u0103",title="Comparing clusterings by the variation of information",booktitle="Learning theory and kernel machines",url="https://doi.org/10.1007/978-3-540-45167-9_14",bibkey="DBLP:conf/colt/Meila03") @Reference(authors="X. V. Nguyen, J. Epps, J. Bailey",title="Information theoretic measures for clusterings comparison: is a correction for chance necessary?",booktitle="Proc. 26th Ann. Int. Conf. on Machine Learning (ICML \'09)",url="https://doi.org/10.1145/1553374.1553511",bibkey="DBLP:conf/icml/NguyenEB09") @Reference(authors="S. Romano, J. Bailey, X. V. Nguyen, K. Verspoor",title="Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance",booktitle="Proc. 31th Int. Conf. on Machine Learning (ICML 2014)",url="http://proceedings.mlr.press/v32/romano14.html",bibkey="DBLP:conf/icml/RomanoBNV14")
    public class Entropy
    extends java.lang.Object
    Entropy based measures, implemented using natural logarithms.

    Computing the expected mutual information is well optimized, but it nevertheless takes O(n) logarithm computations. For n up to 10000 we perform this more expensive evaluation; for larger values it is (currently) not computed, since in theory the expected value tends to 0 as n grows anyway.

    FIXME: this measure does not yet support noise clusters.
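    As a rough, self-contained illustration (this is not the class's API; all names below are made up, and the real implementation works on the contingency table with the logarithm caches described under Method Detail), the base quantities can be computed from a row-by-column contingency table with natural logarithms as follows:

        // Minimal sketch: rows of the table are the clusters of the first
        // clustering, columns those of the second. Illustrative only.
        public class EntropySketch {
          public static void main(String[] args) {
            int[][] contingency = { { 5, 1 }, { 2, 4 } }; // hypothetical counts
            int r = contingency.length, c = contingency[0].length, n = 0;
            int[] rowSums = new int[r];
            int[] colSums = new int[c];
            for (int i = 0; i < r; i++) {
              for (int j = 0; j < c; j++) {
                rowSums[i] += contingency[i][j];
                colSums[j] += contingency[i][j];
                n += contingency[i][j]; // total number of points N
              }
            }
            double h1 = 0, h2 = 0, hj = 0;
            for (int i = 0; i < r; i++) {
              for (int j = 0; j < c; j++) {
                double p = contingency[i][j] / (double) n;
                hj -= p > 0 ? p * Math.log(p) : 0; // joint entropy H(X,Y)
              }
              double p = rowSums[i] / (double) n;
              h1 -= p > 0 ? p * Math.log(p) : 0; // entropy of first clustering
            }
            for (int j = 0; j < c; j++) {
              double p = colSums[j] / (double) n;
              h2 -= p > 0 ? p * Math.log(p) : 0; // entropy of second clustering
            }
            double mi = h1 + h2 - hj; // mutual information I(X;Y)
            double vi = hj - mi;      // variation of information VI = H(X,Y) - I
            System.out.println("H1=" + h1 + " H2=" + h2 + " I=" + mi + " VI=" + vi);
          }
        }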

    Key References:

    M. Meilă
    Comparing clusterings by the variation of information
    Learning theory and kernel machines

    X. V. Nguyen, J. Epps, J. Bailey
    Information theoretic measures for clusterings comparison: is a correction for chance necessary?
    Proc. 26th Ann. Int. Conf. on Machine Learning (ICML '09)

    S. Romano, J. Bailey, X. V. Nguyen, K. Verspoor
    Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance
    Proc. 31st Int. Conf. on Machine Learning (ICML 2014)

    Since:
    0.5.0
    Author:
    Sascha Goldhofer, Erich Schubert
    • Method Summary

      Modifier and Type Method Description
      double adjustedArithmeticMI()
      Get the adjusted mutual information using the arithmetic version.
      double adjustedGeometricMI()
      Get the adjusted mutual information using the geometric version.
      double adjustedJointMI()
      Get the adjusted mutual information using the joint version.
      double adjustedMaxMI()
      Get the adjusted mutual information using the max version.
      double adjustedMinMI()
      Get the adjusted mutual information using the min version.
      double arithmeticNMI()
      Get the arithmetic averaged normalized mutual information.
      private static double computeEntropyFirst​(int[][] contingency, int r, int c, double byn, double mlogn, double[] logs)
      Compute entropy of first clustering.
      private static double computeEntropySecond​(int[][] contingency, int r, int c, double byn, double mlogn, double[] logs)
      Compute entropy of second clustering.
      private void computeMIFull​(int[][] contingency, int r, int c, int n, int m, double byn, double mlogn, double[] logs)
      Full computation of mutual information measures, including AMI/EMI.
      private void computeMILarge​(int[][] contingency, int r, int c, double byn, double mlogn)
      Compute mutual information measures, but skip expensive computation of AMI/EMI for large data sets, where they do not differ much.
      double conditionalEntropyFirst()
      Get the conditional entropy of the first clustering (not normalized, 0 = equal).
      double conditionalEntropySecond()
      Get the conditional entropy of the second clustering (not normalized, 0 = equal).
      double entropyFirst()
      Get the entropy of the first clustering (not normalized, 0 = equal).
      double entropyJoint()
      Get the joint entropy of both clusterings (not normalized, 0 = equal).
      double entropyPowers()
      Get the Powers entropy (normalized, 0 = equal): Powers = 1 - NMI_Sum.
      double entropySecond()
      Get the entropy of the second clustering (not normalized, 0 = equal).
      double expectedMutualInformation()
      Get the expected mutual information.
      double geometricNMI()
      Get the geometric mean normalized mutual information (using the square root).
      double jointNMI()
      Get the joint-normalized mutual information.
      private static double lfac​(int i, double[] lfac)
      Get log(fac(i)) from the cache.
      private static double log​(int i, double[] logs)
      Cached access to log values.
      private static int maxClusterSize​(int[][] contingency, int r, int c)
      Get the maximum cluster size of a contingency table.
      double maxNMI()
      Get the max-normalized mutual information.
      double minNMI()
      Get the min-normalized mutual information.
      double mutualInformation()
      Get the mutual information (not normalized, larger values are good).
      double normalizedInformationDistance()
      Get the normalized information distance (normalized, small values are good).
      double normalizedVariationOfInformation()
      Get the normalized variation of information (normalized, small values are good).
      double upperBoundMI()
      Get an upper bound for the mutual information (for scaling).
      double upperBoundVI()
      Get an upper bound for the VI (for scaling).
      double variationOfInformation()
      Get the variation of information (not normalized, small values are good).
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • entropyFirst

        protected double entropyFirst
        Entropy in first
      • entropySecond

        protected double entropySecond
        Entropy in second
      • entropyJoint

        protected double entropyJoint
        Joint entropy
      • mutualInformation

        protected double mutualInformation
        Mutual information (computed directly)
      • variationOfInformation

        protected double variationOfInformation
        Variation of information (computed directly)
      • expectedMutualInformation

        protected double expectedMutualInformation
        Expected mutual information
      • vibound

        protected double vibound
        log(n), for bounding VI
    • Method Detail

      • computeMILarge

        private void computeMILarge​(int[][] contingency,
                                    int r,
                                    int c,
                                    double byn,
                                    double mlogn)
        Compute mutual information measures, but skip expensive computation of AMI/EMI for large data sets, where they do not differ much.
        Parameters:
        contingency - Contingency table
        r - Rows
        c - Columns
        byn - 1/N factor
        mlogn - -log(N)
      • computeMIFull

        private void computeMIFull​(int[][] contingency,
                                   int r,
                                   int c,
                                   int n,
                                   int m,
                                   double byn,
                                   double mlogn,
                                   double[] logs)
        Full computation of mutual information measures, including AMI/EMI.
        Parameters:
        contingency - Contingency table
        r - Rows
        c - Columns
        n - Total size
        m - Maximum cluster size
        byn - 1/N factor
        mlogn - -log(N)
        logs - Logarithm cache
      • log

        private static double log​(int i,
                                  double[] logs)
        Cached access to log values.
        Parameters:
        i - Parameter
        logs - Logarithm cache
        Returns:
        Logarithm
      • lfac

        private static double lfac​(int i,
                                   double[] lfac)
        Get log(fac(i)) from the cache.
        Parameters:
        i - Factorial
        lfac - Cache
        Returns:
        Cached or computed value.
      • maxClusterSize

        private static int maxClusterSize​(int[][] contingency,
                                          int r,
                                          int c)
        Get the maximum cluster size of a contingency table.
        Parameters:
        contingency - Contingency table
        r - Rows
        c - Columns
        Returns:
        Maximum
      • computeEntropyFirst

        private static double computeEntropyFirst​(int[][] contingency,
                                                  int r,
                                                  int c,
                                                  double byn,
                                                  double mlogn,
                                                  double[] logs)
        Compute entropy of first clustering.
        Parameters:
        contingency - Contingency table
        r - Rows
        c - Columns
        byn - 1 / N
        mlogn - -log(N)
        logs - log value cache
        Returns:
        entropy of first clustering
      • computeEntropySecond

        private static double computeEntropySecond​(int[][] contingency,
                                                   int r,
                                                   int c,
                                                   double byn,
                                                   double mlogn,
                                                   double[] logs)
        Compute entropy of second clustering.
        Parameters:
        contingency - Contingency table
        r - Rows
        c - Columns
        byn - 1 / N
        mlogn - -log(N)
        logs - log value cache
        Returns:
        entropy of second clustering
      • entropyFirst

        public double entropyFirst()
        Get the entropy of the first clustering (not normalized, 0 = equal).
        Returns:
        Entropy of first clustering
      • entropySecond

        public double entropySecond()
        Get the entropy of the second clustering (not normalized, 0 = equal).
        Returns:
        Entropy of second clustering
      • entropyJoint

        public double entropyJoint()
        Get the joint entropy of both clusterings (not normalized, 0 = equal).
        Returns:
        Joint entropy of both clusterings
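        In terms of the cell frequencies p_ij = n_ij / N of the contingency table, this is the standard definition (restated here for orientation):

            H(X,Y) = - \sum_{i,j} p_{ij} \ln p_{ij}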
      • conditionalEntropyFirst

        public double conditionalEntropyFirst()
        Get the conditional entropy of the first clustering (not normalized, 0 = equal).
        Returns:
        Conditional entropy of first clustering
      • conditionalEntropySecond

        public double conditionalEntropySecond()
        Get the conditional entropy of the second clustering (not normalized, 0 = equal).
        Returns:
        Conditional entropy of second clustering
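        Both conditional entropies follow from the joint and marginal entropies via the standard identities:

            H(X|Y) = H(X,Y) - H(Y),   H(Y|X) = H(X,Y) - H(X)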
      • entropyPowers

        public double entropyPowers()
        Get the Powers entropy (normalized, 0 = equal): Powers = 1 - NMI_Sum.
        Returns:
        Powers
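        Assuming NMI_Sum denotes the arithmetic-mean normalization returned by arithmeticNMI(), this amounts to:

            Powers = 1 - 2 I(X;Y) / (H(X) + H(Y))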
      • mutualInformation

        public double mutualInformation()
        Get the mutual information (not normalized, larger values are good). Beware of interpretability issues, and prefer an adjusted or at least normalized version instead.
        Returns:
        Mutual information
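        It is related to the entropy getters above by the standard identity:

            I(X;Y) = H(X) + H(Y) - H(X,Y)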
      • upperBoundMI

        public double upperBoundMI()
        Get an upper bound for the mutual information (for scaling).
        Returns:
        Upper bound, using the individual entropies
      • jointNMI

        @Reference(authors="Y. Y. Yao",
                   title="Information-Theoretic Measures for Knowledge Discovery and Data Mining",
                   booktitle="Entropy Measures, Maximum Entropy Principle and Emerging Applications",
                   url="https://doi.org/10.1007/978-3-540-36212-8_6",
                   bibkey="doi:10.1007/978-3-540-36212-8_6")
        public double jointNMI()
        Get the joint-normalized mutual information. Values close to 1 are good.

        Reference:

        Y. Y. Yao
        Information-Theoretic Measures for Knowledge Discovery and Data Mining
        Entropy Measures, Maximum Entropy Principle and Emerging Applications

        Returns:
        Joint Normalized Mutual information
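        The normalizer here is the joint entropy:

            NMI_joint = I(X;Y) / H(X,Y)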
      • minNMI

        @Reference(authors="Tarald O. Kv\u00e5lseth",
                   title="Entropy and Correlation: Some Comments",
                   booktitle="IEEE Trans. Systems, Man, and Cybernetics 17(3)",
                   url="https://doi.org/10.1109/TSMC.1987.4309069",
                   bibkey="DBLP:journals/tsmc/Kvalseth87")
        public double minNMI()
        Get the min-normalized mutual information. Values close to 1 are good.

        Reference:

        T. O. Kvålseth
        Entropy and Correlation: Some Comments
        IEEE Trans. Systems, Man, and Cybernetics 17(3)

        Returns:
        Min Normalized Mutual information
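        The normalizer here is the smaller of the two marginal entropies:

            NMI_min = I(X;Y) / min(H(X), H(Y))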
      • maxNMI

        @Reference(authors="Tarald O. Kv\u00e5lseth",
                   title="Entropy and Correlation: Some Comments",
                   booktitle="IEEE Trans. Systems, Man, and Cybernetics 17(3)",
                   url="https://doi.org/10.1109/TSMC.1987.4309069",
                   bibkey="DBLP:journals/tsmc/Kvalseth87")
        public double maxNMI()
        Get the max-normalized mutual information. Values close to 1 are good.

        Reference:

        T. O. Kvålseth
        Entropy and Correlation: Some Comments
        IEEE Trans. Systems, Man, and Cybernetics 17(3)

        Returns:
        Max Normalized Mutual information
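        The normalizer here is the larger of the two marginal entropies:

            NMI_max = I(X;Y) / max(H(X), H(Y))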
      • arithmeticNMI

        @Reference(authors="Tarald O. Kv\u00e5lseth",title="Entropy and Correlation: Some Comments",booktitle="IEEE Trans. Systems, Man, and Cybernetics 17(3)",url="https://doi.org/10.1109/TSMC.1987.4309069",bibkey="DBLP:journals/tsmc/Kvalseth87") @Reference(authors="A. Rosenberg, J. Hirschberg",title="V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure",booktitle="EMNLP-CoNLL 2007",url="https://www.aclweb.org/anthology/D07-1043/",bibkey="DBLP:conf/emnlp/RosenbergH07")
        public double arithmeticNMI()
        Get the arithmetic averaged normalized mutual information. Values close to 1 are good. This is equivalent to the V-Measure for beta=1.

        Reference:

        T. O. Kvålseth
        Entropy and Correlation: Some Comments
        IEEE Trans. Systems, Man, and Cybernetics 17(3)

        A. Rosenberg, J. Hirschberg
        V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure
        EMNLP-CoNLL 2007

        Returns:
        Sum Normalized Mutual information
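        The normalizer is the arithmetic mean of the two marginal entropies, to which the V-Measure reduces for beta=1:

            NMI_arith = 2 I(X;Y) / (H(X) + H(Y))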
      • geometricNMI

        @Reference(authors="A. Strehl, J. Ghosh",
                   title="Cluster Ensembles -A Knowledge Reuse Framework for Combining Multiple Partitions",
                   booktitle="J. Mach. Learn. Res. 3",
                   url="http://jmlr.org/papers/v3/strehl02a.html",
                   bibkey="DBLP:journals/jmlr/StrehlG02")
        public double geometricNMI()
        Get the geometric mean normalized mutual information (using the square root). Values close to 1 are good. This was proposed, for example, by:

        A. Strehl, J. Ghosh
        Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions
        J. Mach. Learn. Res. 3

        Returns:
        Sqrt Normalized Mutual information
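        The normalizer is the geometric mean of the two marginal entropies:

            NMI_geom = I(X;Y) / sqrt(H(X) * H(Y))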
      • variationOfInformation

        public double variationOfInformation()
        Get the variation of information (not normalized, small values are good).
        Returns:
        Variation of information
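        Equivalently, in terms of the conditional entropies and the mutual information:

            VI = H(X|Y) + H(Y|X) = H(X) + H(Y) - 2 I(X;Y)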
      • upperBoundVI

        public double upperBoundVI()
        Get an upper bound for the VI (for scaling).
        Returns:
        Upper bound, based on the number of points and clusters.
      • normalizedVariationOfInformation

        public double normalizedVariationOfInformation()
        Get the normalized variation of information (normalized, small values are good). This is 1-jointNMI().
        Returns:
        Normalized Variation of information
      • normalizedInformationDistance

        public double normalizedInformationDistance()
        Get the normalized information distance (normalized, small values are good). This is 1-maxNMI().
        Returns:
        Normalized information distance
      • expectedMutualInformation

        public double expectedMutualInformation()
        Get the expected mutual information.
        Returns:
        Expected mutual information
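        Following the Nguyen, Epps, and Bailey reference above (which also motivates the log-factorial cache lfac), the expectation under random permutation of the cluster labels is:

            E[I] = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{n_{ij}=\max(1,\, a_i+b_j-N)}^{\min(a_i,\, b_j)}
                   \frac{n_{ij}}{N} \ln\frac{N n_{ij}}{a_i b_j}
                   \cdot \frac{a_i! \, b_j! \, (N-a_i)! \, (N-b_j)!}{N! \, n_{ij}! \, (a_i-n_{ij})! \, (b_j-n_{ij})! \, (N-a_i-b_j+n_{ij})!}

        where a_i and b_j are the row and column sums of the contingency table.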
      • adjustedJointMI

        public double adjustedJointMI()
        Get the adjusted mutual information using the joint version.
        Returns:
        Adjusted mutual information
      • adjustedArithmeticMI

        public double adjustedArithmeticMI()
        Get the adjusted mutual information using the arithmetic version.
        Returns:
        Adjusted mutual information
      • adjustedGeometricMI

        public double adjustedGeometricMI()
        Get the adjusted mutual information using the geometric version.
        Returns:
        Adjusted mutual information
      • adjustedMinMI

        public double adjustedMinMI()
        Get the adjusted mutual information using the min version.
        Returns:
        Adjusted mutual information
      • adjustedMaxMI

        public double adjustedMaxMI()
        Get the adjusted mutual information using the max version.
        Returns:
        Adjusted mutual information
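        All five adjusted variants share the chance-corrected form of the Nguyen, Epps, and Bailey reference, differing only in the normalizer U of the matching NMI variant (here U = max(H(X), H(Y))):

            AMI = (I(X;Y) - E[I]) / (U - E[I])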