Writing a custom distance function
Version information: Updated for ELKI 0.8.0
For many real-world applications, a domain expert may be able to define a domain-specific distance function. For the following, let us assume we are working with 2D data, and the domain expert has decided that an appropriate distance function is dx*dx+abs(dy)
, so taking the difference on the x axis to the square and the y axis linearly.
Basic distance function
Most distances are defined on real-number vectors and return double values. There is a convenient abstract class for this that we can use: AbstractNumberVectorDistance. Let’s start a new class for this, and see what Eclipse generates for us:
package tutorial.distancefunction;
import elki.data.NumberVector;
import distance.AbstractNumberVectorDistance;
public class TutorialDistance extends AbstractNumberVectorDistance {
@Override
public double distance(NumberVector o1, NumberVector o2) {
// TODO Auto-generated method stub
return 0;
}
}
Now lets implement the distance method:
package tutorial.distancefunction;
import elki.data.NumberVector;
import elki.distance.AbstractNumberVectorDistance;
public class TutorialDistanceFunction extends AbstractNumberVectorDistanceFunction {
@Override
public double distance(NumberVector o1, NumberVector o2) {
double dx = o1.doubleValue(0) - o2.doubleValue(0);
double dy = o1.doubleValue(1) - o2.doubleValue(1);
return dx * dx + Math.abs(dy);
}
}
We can now already test this distance function! Yes - we do not need to do more. (If you compiled ELKI and do not use a .jar file, you should now have this class in the dropdown menu. Otherwise, you might need to type in the name of the class.
Fine tuning
Now this domain specific distance function makes only sense for 2-dimensional data. So we will now specify this, so that ELKI does not try to use it with higher dimensional relations. For this, we need to override the method getInputTypeRestriction
.
public class TutorialDistanceFunction extends AbstractNumberVectorDistanceFunction {
// ... as above ...
@Override
public SimpleTypeInformation<? super NumberVector> getInputTypeRestriction() {
return TypeUtil.NUMBER_VECTOR_FIELD_2D;
// shortcut for:
// return VectorFieldTypeInformation.typeRequest(NumberVector.class, 2, 2);
}
}
We now also override the method makeOptions
to configure the variable ps
:
With this statement, we specify three requirements for the input data:
- The vectors must be a vector field (i.e. have the same dimensionality)
- The input data must be NumberVectors (of arbirary type: Float, Double, Integer…)
- The dimensionality must be exactly 2.
If this distance function were metrical, we would also override isMetric()
to contain return true
(this distance function however is not metrical).
Making it parameterizable
In order to make the distance function parameterizable, we write some additional lines. You can read more here: Parameterization
This class already satistfied the parameterizable API: it had an implicit public and parameterless constructor, and can thus be instantiated by the UI. However, if we want to actually have some parameters in the class and a different constructor, we need to help the UI with the parameterization. For this, a static
, public
, inner class called Parameterizer
is used.
So here is a more complex variation of Lp norms where we can specify a different “p” for each dimension.
package tutorial.distancefunction;
import elki.data.NumberVector;
import elki.distance.distancefunction.AbstractNumberVectorDistance;
public class MultiLPNorm extends AbstractNumberVectorDistance {
/**
* The exponents
*/
double[] ps;
/**
* Normalization factor (count(ps)/sum(ps))
*/
double pinv;
/**
* Constructor.
*
* @param ps The exponents
*/
public MultiLPNorm(double[] ps) {
super();
double sum = 0.0;
for(int dim = 0; dim < ps.length; dim++) {
assert (ps[dim] >= 0) : "Negative exponents are not allowed.";
sum += ps[dim];
}
assert (sum > 0) : "At least one exponent should be different from 0!";
this.ps = ps;
this.pinv = ps.length / sum;
}
@Override
public double distance(NumberVector o1, NumberVector o2) {
assert o1.getDimensionality() == ps.length : "Inappropriate dimensionality!";
assert o2.getDimensionality() == ps.length : "Inappropriate dimensionality!";
double sum = 0.0;
for(int dim = 0; dim < ps.length; dim++) {
if(ps[dim] > 0) {
final double delta = Math.abs(o1.doubleValue(dim) - o2.doubleValue(dim));
sum += Math.pow(delta, ps[dim]);
}
}
return Math.pow(sum, pinv);
}
@Override
public SimpleTypeInformation<? super NumberVector> getInputTypeRestriction() {
return VectorFieldTypeInformation.typeRequest(NumberVector.class, ps.length, ps.length);
}
}
If you want, you can think about when this function will be metrical (for example when all ps are constant and >= 1) and implement isMetric()
accordingly.
However, when you try to select this class in the ELKI UI, you will see this error:
Error instantiating class - no usable public constructor.
So we need to add a Parameterization helper next. The generated stub looks like this:
public static class Par implements Parameterizer {
@Override
public Object make() {
// TODO Auto-generated method stub
return null;
}
}
Make sure that you define the class as public static
. Now you must change the return type to your actual class (MultiLPNorm in this case), so it will now look like this:
public static class Par implements Parameterizer {
double[] ps;
// ... we still need to initialize "ps"!
@Override
public MultiLPNorm make() {
return new MultiLPNorm(ps);
}
}
In order to setup the parameters, we have to override the configure
method, and add our options there. Parameterization consists of multiple parts:
- Define a public static OptionID for the parameter (so it can be referenced from other classes!)
- Create an option parameter. Here we need a list of doubles, which is parsed by DoubleListParameter.
- Get the options value from the config object using
grab
. If the value is unavailable, an error will automatically reported, since this parameter was not optional. (Do not throw an exception, so multiple errors can be reported!)
public static class Par implements Parameterizer {
/**
* Option ID for the exponents
*/
public static final OptionID EXPONENTS_ID = new OptionID("multinorm.ps",
"The exponents to use for this distance function");
/**
* P exponents
*/
double[] ps;
@Override
public void configure(Parameterization config) {
new DoubleListParameter(EXPONENTS_ID).grab(config, x -> ps = x);
}
@Override
public MultiLPNorm make() {
return new MultiLPNorm(ps);
}
}