In this tutorial, we want to implement a new outlier detection method. The outlier score used in this example is the standard deviation of the distances to the k nearest neighbors. Inliers are expected to have a low standard deviation, outliers a higher one (note: in reality, it probably is not that easy, but this is good enough for this tutorial).
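To make the definition concrete, here is a small self-contained Java sketch of this score. It uses plain arrays and 1-dimensional data instead of ELKI's data structures; the class and method names are made up for illustration:

```java
import java.util.Arrays;

// Toy illustration (not ELKI code): the outlier score of a point is the
// standard deviation of the distances to its k nearest neighbors.
public class KnnStddevDemo {

    // Standard deviation of the k smallest distances from data[i] to the other points.
    public static double score(double[] data, int i, int k) {
        double[] dist = new double[data.length - 1];
        int j = 0;
        for (int o = 0; o < data.length; o++) {
            if (o != i) {
                dist[j++] = Math.abs(data[o] - data[i]);
            }
        }
        Arrays.sort(dist); // brute-force "kNN query"
        double mean = 0;
        for (int n = 0; n < k; n++) {
            mean += dist[n];
        }
        mean /= k;
        double var = 0;
        for (int n = 0; n < k; n++) {
            double d = dist[n] - mean;
            var += d * d;
        }
        return Math.sqrt(var / k);
    }

    public static void main(String[] args) {
        // A small cluster plus one far-away point; the far point scores higher.
        double[] data = { 1.0, 1.1, 0.9, 1.05, 10.0 };
        System.out.println("inlier:  " + score(data, 0, 3));
        System.out.println("outlier: " + score(data, 4, 3));
    }
}
```

Even on this toy data the separation is not dramatic, which matches the caveat above: this score is a teaching device, not a production-quality detector.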
The two key APIs in ELKI are the Algorithm interface (and the associated abstract classes and specializations) and the OutlierResult classes for output.
The auto-generated code
Again we start with a stub class implementing the interface OutlierAlgorithm.
This creates only a single method in the stub, which is used for the autorun mechanism later.
We do not get a predefined run() method, because we can choose between different signatures,
which are then invoked via introspection.
Completing the stub
We add a generic type parameter in this example: O is the object type that we process.
Since this depends on the distance function, we do not make many assumptions;
it is simply a placeholder.
We add a public constructor that takes a distance function and the neighborhood parameter k.
Note that the distance function may take any supertype of O.
For example, O might be DoubleVector, but our distance function is defined on NumberVector.
We fill out the getInputTypeRestriction method to request input data that matches our distance function
(we return an array, because an algorithm could require more than one input).
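Putting these pieces together, the completed stub could be sketched as follows. Note that TypeInformation, Distance, and OutlierAlgorithm here are simplified stand-ins so the sketch compiles on its own; the real ELKI interfaces differ in detail, and the class name StddevOutlierStub is made up:

```java
// Simplified stand-ins for the ELKI interfaces (not the real types).
interface TypeInformation {}

interface Distance<O> {
    double distance(O a, O b);
    TypeInformation getInputTypeRestriction();
}

interface OutlierAlgorithm {
    TypeInformation[] getInputTypeRestriction();
}

// O is the object type we process; the distance may be defined on any supertype of O.
public class StddevOutlierStub<O> implements OutlierAlgorithm {
    private final Distance<? super O> distance;
    private final int k;

    public StddevOutlierStub(Distance<? super O> distance, int k) {
        this.distance = distance;
        this.k = k;
    }

    @Override
    public TypeInformation[] getInputTypeRestriction() {
        // Request input data matching the distance function. We return an array,
        // because an algorithm could require more than one input relation.
        return new TypeInformation[] { distance.getInputTypeRestriction() };
    }

    // Example wiring: a distance defined on Number used by an algorithm on Double.
    public static StddevOutlierStub<Double> example() {
        final TypeInformation numberType = new TypeInformation() {};
        Distance<Number> dist = new Distance<Number>() {
            public double distance(Number a, Number b) {
                return Math.abs(a.doubleValue() - b.doubleValue());
            }
            public TypeInformation getInputTypeRestriction() {
                return numberType;
            }
        };
        return new StddevOutlierStub<Double>(dist, 3); // Number is a supertype of Double
    }

    public static void main(String[] args) {
        System.out.println(StddevOutlierStub.example().getInputTypeRestriction().length);
    }
}
```

The `Distance<? super O>` bound is what allows a distance defined on NumberVector to be used with O = DoubleVector, as described above.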
Implementing the run method
Now we need to implement the main algorithm.
From the Algorithm interface,
we inherit a “magic” autorun function. This function uses introspection to find a
suitable run method, looking for signatures that take the database and/or the relations to process as parameters.
We need to implement only one of these signatures; the choice is up to us.
The version that takes just the relation will save us some manual work, so we'll use the shortest one.
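To illustrate the mechanism, here is a toy reimplementation of such a lookup (this is not ELKI's actual dispatch code; AutorunDemo, MyAlgorithm, and Relation are made-up names). The autorun scans the class for a public run method whose parameters it can supply:

```java
import java.lang.reflect.Method;

// Toy demonstration of finding a run method via introspection.
public class AutorunDemo {
    public static class Relation {}

    public static class MyAlgorithm {
        public String run(Relation relation) {
            return "ran on relation";
        }
    }

    // Look for a public method named "run" that accepts the given relation.
    public static String autorun(Object algorithm, Relation relation) {
        for (Method m : algorithm.getClass().getMethods()) {
            if (m.getName().equals("run") && m.getParameterCount() == 1
                    && m.getParameterTypes()[0].isAssignableFrom(relation.getClass())) {
                try {
                    return (String) m.invoke(algorithm, relation);
                } catch (ReflectiveOperationException e) {
                    throw new RuntimeException(e);
                }
            }
        }
        throw new IllegalStateException("no suitable run method found");
    }

    public static void main(String[] args) {
        System.out.println(autorun(new MyAlgorithm(), new Relation()));
    }
}
```

The benefit of this design is that the algorithm class only declares the parameters it actually needs, and the framework supplies them.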
We start with a stub that outlines the general flow.
First we initialize the kNN query.
Note that the database may choose to use an optimized kNN query here,
which is why it needs to know the distance function and value of k in advance.
Then we set up a data storage for double values, process the individual elements, and finally wrap the result in the expected API. Note that the outlier result API consists of two parts: metadata on the score distribution (including minimum and maximum values) and a relation containing the actual scores (which essentially is just our data store).
Finally, we fill in the actual outlier detection algorithm.
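The flow described above can be sketched in self-contained plain Java; arrays and a small result class stand in for ELKI's data store, kNN query, and OutlierResult classes, and all names here are made up:

```java
import java.util.Arrays;

// Self-contained sketch of the run method's flow (plain Java stand-ins,
// not the actual ELKI data store / kNN query / OutlierResult classes).
public class StddevOutlierRun {

    // Stand-in for the outlier result: score metadata plus the per-object scores.
    public static final class Result {
        public final double min, max;       // score distribution metadata
        public final double[] scores;       // stand-in for the score relation
        Result(double min, double max, double[] scores) {
            this.min = min; this.max = max; this.scores = scores;
        }
    }

    // Stand-in kNN "query": brute force in one dimension.
    static double[] kNearestDistances(double[] data, int i, int k) {
        double[] dist = new double[data.length];
        for (int o = 0; o < data.length; o++) {
            dist[o] = Math.abs(data[o] - data[i]);
        }
        Arrays.sort(dist); // dist[0] == 0 is the query point itself
        return Arrays.copyOfRange(dist, 1, k + 1);
    }

    public static Result run(double[] data, int k) {
        double[] scores = new double[data.length]; // "double storage"
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < data.length; i++) {
            double[] knn = kNearestDistances(data, i, k);
            // Score: standard deviation of the k nearest neighbor distances.
            double mean = 0;
            for (double d : knn) mean += d;
            mean /= k;
            double var = 0;
            for (double d : knn) var += (d - mean) * (d - mean);
            double score = Math.sqrt(var / k);
            scores[i] = score;
            min = Math.min(min, score);
            max = Math.max(max, score);
        }
        // Wrap the scores together with their min/max metadata,
        // analogous to building an outlier result.
        return new Result(min, max, scores);
    }

    public static void main(String[] args) {
        Result r = run(new double[] { 1.0, 1.1, 0.9, 1.05, 10.0 }, 3);
        System.out.println(Arrays.toString(r.scores));
    }
}
```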
Adding a parameterizer
Right now, we can invoke the algorithm from Java, but we also want to be able to use the GUI and the command line interface. For this, we need to implement the parameterization API, i.e., add a Parameterizer. This is a public static inner class named Par (otherwise it will not be found!). We start from the stub obtained by extracting the superclass parameterizer.
We again need to customize this stub slightly: restrict the distance function type, change the return type, and override makeOptions.
There is not much left to do. The distance function is already parameterized by the superclass; we only need to add a parameter for k.
Note that we enforce k > 1 in the parameterization API, as the 1 nearest neighbor will usually be the query object itself. As you can see, the parameterizer serves to provide a common parameterization interface and then produces the actual Java instance: it connects the UIs to the actual Java code.
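The shape of such a parameterizer can be sketched as follows. This uses a plain string map instead of ELKI's parameterization classes (OptionID, IntParameter, constraints), and all names here, including the parameter key "algorithm.k", are invented for illustration:

```java
import java.util.Map;

// Sketch of a parameterizer: read and validate parameters, then build the instance.
// Simplified stand-in for ELKI's parameterization API, not the real thing.
public class ParameterizerSketch {

    // Stand-in for the algorithm class we want to construct.
    public static final class StddevOutlier {
        public final int k;
        StddevOutlier(int k) { this.k = k; }
    }

    // Stand-in "Par" inner class.
    public static class Par {
        int k;

        // Plays the role of makeOptions: read and validate the parameters.
        void makeOptions(Map<String, String> params) {
            k = Integer.parseInt(params.getOrDefault("algorithm.k", "2"));
            if (k <= 1) { // enforce k > 1: the 1NN is usually the point itself
                throw new IllegalArgumentException("k must be > 1");
            }
        }

        // Produce the actual Java instance.
        StddevOutlier make() {
            return new StddevOutlier(k);
        }
    }

    // Convenience wrapper: parameterize and construct in one step.
    public static int configuredK(String value) {
        Par par = new Par();
        par.makeOptions(Map.of("algorithm.k", value));
        return par.make().k;
    }

    public static void main(String[] args) {
        System.out.println(configuredK("5"));
    }
}
```

The separation mirrors the real design: the UI only ever talks to the parameterizer, and the algorithm's constructor stays an ordinary Java constructor.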