Same-size k-Means Variation
Version information: ELKI 0.8.0
In this tutorial, we will create a k-means variation that produces clusters of the same size.
The basic idea of the algorithm is as follows:
Initialization:
- Compute the desired cluster size, n/k.
- Initialize means, preferably with k-means++
- Order points by the distance to their nearest cluster minus distance to the farthest cluster (= biggest benefit of best over worst assignment)
- Assign points to their preferred cluster until this cluster is full, then resort remaining objects, without taking the full cluster into account anymore
This initialization is not optimal, in particular for the last cluster - feel free to contribute an improvement to this tutorial! But it will do as an initialization method.
Additional data structure: for every cluster we keep a list of objects willing to leave the cluster. This makes it easier to find transfer candidates. Whenever we have an element that wants to enter a cluster, the top element wanting to leave will be considered for switching. (It may be worth a try to consider more than the top element!)
Iteration:
- Compute current cluster means
- For each object, compute the distances to the cluster means
- Sort elements based on the delta between the current assignment and the best possible alternative assignment.
- For each element by priority:
  - For each other cluster, by element gain, unless already moved:
    - If there is an element wanting to leave the other cluster and this swap yields an improvement, swap the two elements
    - If the element can be moved without violating size constraints, move it
  - If the element was not changed, add to outgoing transfer list.
- If no more transfers were done (or max iteration threshold was reached), terminate
Convergence can be proven the same way as with k-means: any transfer must decrease variance (and would also happen in k-means).
Room for improvement: better logic for finding optimal transfers, e.g., by solving triangles: object 1 from A to B, object 2 from B to C, object 3 from C to A also preserves sizes!
Auto-generated code
Since this is a K-means variant, we can use the base class AbstractKMeans. Generating the standard methods, adding a logger and choosing the generics yields the following code blueprint.
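A rough sketch of such a blueprint is given below; the exact imports, package paths and constructor signature depend on the ELKI version, so treat it as an illustration rather than the definitive code:

```java
// Sketch of the generated class blueprint (signatures are assumptions and may
// differ slightly between ELKI versions).
public class SameSizeKMeans<V extends NumberVector> extends AbstractKMeans<V, MeanModel> {
  /** Class logger. */
  private static final Logging LOG = Logging.getLogger(SameSizeKMeans.class);

  /** Constructor, narrowing the distance function to number vector distances. */
  public SameSizeKMeans(NumberVectorDistance<? super V> distance, int k, int maxiter,
      KMeansInitialization initializer) {
    super(distance, k, maxiter, initializer);
  }

  @Override
  protected Logging getLogger() {
    return LOG;
  }
}
```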
For the generics, we want to allow arbitrary vector types: V extends NumberVector. The cluster model will be a simple MeanModel (using a KMeansModel instead would trigger Voronoi cell visualization, which is no longer appropriate: our size-constrained clusters are not Voronoi cells). The reason why we have to choose these generics is that the abstract class AbstractKMeans is also used, e.g., for KMedians and similar algorithms with looser constraints than KMeans or our new variation.

For the constructor, we narrow down the distance function to a NumberVectorDistance<? super V>. This would only happen automatically if we added yet another generic parameter to AbstractKMeans. This narrower distance function has two properties important to us: A) it can be computed on arbitrary number vectors (including the means we will be computing) and B) it returns a double value. In case you are wondering: yes, ELKI does contain distance functions that are not defined on number vectors and do not return double values.
The main method stub
Note that no run method was generated automatically. This is due to a limitation in Java interfaces, and to ELKI allowing multi-relational algorithms. Our algorithm, however, is just single-relational, so we can use a very simple run method:
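A minimal stub might look like this (the body will be developed over the rest of the tutorial):

```java
// Single-relation entry point; the actual logic is filled in throughout the tutorial.
public Clustering<MeanModel> run(Relation<V> relation) {
  // TODO: initialization, initial assignment, refinement, result wrapping
  throw new UnsupportedOperationException("not implemented yet");
}
```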
In contrast to the outlier tutorial, we do not have to implement getInputTypeRestriction() - we inherit an appropriate implementation from AbstractKMeans (which actually uses the restriction given by the distance function).
Object metadata
To find the optimal assignment and evaluate transfers, we need to keep some data on the objects ready. We will need some extra memory for this, on the order of O(n*k). In particular, we want to track the distances from each object to each cluster center, but also which cluster the object is closest to, and which would be a better choice. For this, we need a simple utility inner class, which we call Meta:
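A sketch of this class could look as follows (the field names match those used in the remainder of the tutorial):

```java
// Per-object metadata: distances to all cluster means plus the two most relevant
// cluster indices.
private static class Meta {
  /** Distance of this object to each of the k cluster means. */
  double[] dists;
  /** Cluster the object is currently assigned to (initially: the nearest cluster). */
  int primary;
  /** Best alternative cluster (during initialization: the farthest cluster). */
  int secondary;

  Meta(int k) {
    dists = new double[k];
  }
}
```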
To make the code more readable, we also add two utility functions to this inner class. The first, priority, computes the priority we use in the initialization phase: the difference between the worst and the best assignment. The second, gain, computes the expected gain from moving the object from the current assignment to a different cluster.
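They could be implemented roughly like this (the sign conventions are an assumption of this sketch; they only have to be used consistently when sorting):

```java
// Difference between the preferred and the least preferred assignment; the more
// negative, the more this object benefits from getting its best cluster.
double priority() {
  return dists[primary] - dists[secondary];
}

// Expected gain of moving this object from its current cluster to cluster i
// (positive if cluster i is closer than the current assignment).
double gain(int i) {
  return dists[primary] - dists[i];
}
```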
And of course we need to build an initial data storage. First, we need to cast the distance function (unless we add yet another generic parameter to the parent class!). Then we get a storage from the data storage layer. Depending on the database size and properties, this can be anything like an array (the default case for static data sets), a hash map (the default for dynamic data sets) or a B-tree (in the future, for data sets that don’t fit into main memory).
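For example (the field name distance for the inherited distance function is an assumption of this sketch; whether the cast is actually necessary depends on the ELKI version):

```java
// Narrow the distance function to number vector distances, as discussed above.
final NumberVectorDistance<? super V> df = (NumberVectorDistance<? super V>) distance;
// Temporary, frequently accessed per-object storage for the Meta records.
WritableDataStore<Meta> metas = DataStoreUtil.makeStorage(relation.getDBIDs(),
    DataStoreFactory.HINT_HOT | DataStoreFactory.HINT_TEMP, Meta.class);
```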
We then iterate over all objects. for(DBIDIter id = relation.iterDBIDs(); id.valid(); id.advance()) { is the usual iterator construct in ELKI. This needs less memory than the usual Java Iterator classes for reasons outlined here. We then compute the distance to each mean, and track the smallest (in c.primary) and the largest (in c.secondary) such distance, to be used in the initialization phase.
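Putting this together, the initialization of the metadata could look roughly like this (means is assumed to be a double[][] holding one row per cluster mean):

```java
// Compute all object-to-mean distances and remember the nearest (primary) and
// farthest (secondary) cluster for each object.
for(DBIDIter id = relation.iterDBIDs(); id.valid(); id.advance()) {
  V v = relation.get(id);
  Meta c = new Meta(k);
  for(int i = 0; i < k; i++) {
    c.dists[i] = df.distance(v, DoubleVector.wrap(means[i]));
    if(c.dists[i] < c.dists[c.primary]) {
      c.primary = i; // nearest cluster seen so far
    }
    if(c.dists[i] > c.dists[c.secondary]) {
      c.secondary = i; // farthest cluster seen so far
    }
  }
  metas.put(id, c);
}
```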
Performing the initial assignment
For the initial assignment, we first need to sort the data set. Technically, we are just sorting references to the database objects, which can conveniently be done with an ArrayModifiableDBIDs. This class is similar to an ArrayList<DBID>, but again uses significantly less memory, and it offers additional functionality. For sorting we need a comparator, which will compare objects by the priority method we defined above.
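For instance (a sketch; the sort direction is chosen so that objects with the largest benefit of their best over their worst assignment come first):

```java
// Copy the object ids into a sortable array and sort by initialization priority.
ArrayModifiableDBIDs tids = DBIDUtil.newArray(relation.getDBIDs());
Comparator<DBIDRef> comp = new Comparator<DBIDRef>() {
  @Override
  public int compare(DBIDRef o1, DBIDRef o2) {
    // Ascending: the most negative priority (= largest benefit) comes first.
    return Double.compare(metas.get(o1).priority(), metas.get(o2).priority());
  }
};
tids.sort(comp);
```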
For the actual initialization, we may need to re-sort multiple times. We use the variable start to track which elements we have already processed, and only sort the elements after this position. This time we need a more advanced iterator: DBIDArrayIter can not only move forward, but also move backward and seek to arbitrary positions in the array. We use this to seek to the starting position of the current iteration.
We now assign the next element to its preferred cluster. The cluster cannot be full yet, which we will actually ensure below. If you feel uncomfortable, you can easily add an assertion. However, after adding an element, the cluster might now be full. In this case, we want to prevent any further objects from being assigned to this cluster, and also re-sort the data set.
As we can seek back to our current position start, we can just use this iterator to go through all remaining objects and update their preferred cluster assignment, removing the references to the now full cluster. The continue taking us to the next iteration (and thus updating the sorting and restarting at start) is, strictly speaking, not necessary: the iterator is at the end of the data set now anyway.
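A condensed sketch of this initialization loop is shown below. It assumes clusters is a List<ModifiableDBIDs> with one (initially empty) member per cluster and size is the maximum cluster size (n/k, rounded up); the boolean[] full marker is an implementation detail of this sketch.

```java
// Assign objects to their preferred clusters, re-sorting whenever a cluster fills up.
boolean[] full = new boolean[k];
DBIDArrayIter id = tids.iter();
for(int start = 0; start < tids.size();) {
  tids.sort(start, tids.size(), comp); // only re-sort the not yet assigned part
  for(id.seek(start); id.valid(); id.advance(), start++) {
    Meta c = metas.get(id);
    ModifiableDBIDs cluster = clusters.get(c.primary);
    cluster.add(id); // the preferred cluster cannot be full yet
    if(cluster.size() < size) {
      continue; // still room, keep assigning in the current order
    }
    // The cluster just became full: remove it from the preferences of the remaining
    // objects, then restart the outer loop to re-sort them.
    full[c.primary] = true;
    for(id.advance(), start++; id.valid(); id.advance()) {
      Meta r = metas.get(id);
      if(!full[r.primary]) {
        continue; // preference still valid
      }
      r.primary = -1; // recompute the nearest cluster that still has room
      for(int i = 0; i < k; i++) {
        if(!full[i] && (r.primary < 0 || r.dists[i] < r.dists[r.primary])) {
          r.primary = i;
        }
      }
      metas.put(id, r);
    }
    break;
  }
}
```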
Note that we ensured that Meta.primary contains the cluster number the object is currently assigned to. We do not ensure a particular value for Meta.secondary. We could discard the tids array at this point, but we will just re-use it in the next part.
Updating the means
For updating the means, we are just going to use the method provided by the abstract super class:
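Under the conventions of this sketch (means as a double[][], clusters as a List<ModifiableDBIDs>), this is a one-line call; the exact signature of the inherited helper varies between ELKI versions:

```java
// Recompute the cluster means from the current assignment.
means = means(clusters, means, relation);
```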
We also need to update the distances from each object to the current (possibly changed!) means. We will do this in a helper function. In addition, we also track which mean would be the best alternate assignment. So after this run, Meta.primary still is the cluster the object is currently assigned to, while Meta.secondary now is the best other choice.
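A possible shape for this helper, consistent with the data structures introduced above:

```java
// Recompute all distances and, for each object, the best alternative cluster.
protected void updateDistances(Relation<V> relation, double[][] means,
    WritableDataStore<Meta> metas, NumberVectorDistance<? super V> df) {
  for(DBIDIter id = relation.iterDBIDs(); id.valid(); id.advance()) {
    Meta c = metas.get(id);
    V v = relation.get(id);
    c.secondary = -1;
    for(int i = 0; i < means.length; i++) {
      c.dists[i] = df.distance(v, DoubleVector.wrap(means[i]));
      if(i != c.primary && (c.secondary < 0 || c.dists[i] < c.dists[c.secondary])) {
        c.secondary = i; // best alternative to the current assignment
      }
    }
    metas.put(id, c);
  }
}
```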
The refinement procedure
This brings us finally to the core of our algorithm: the refinement procedure, which performs Lloyd-Forgy-style iterations to improve the result, but in contrast to these ensures that the cluster sizes remain within the given bounds. First of all, we compute these bounds, in the form of a lower and an upper limit; they differ if the number of objects is not divisible by k.
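For example:

```java
// Lower and upper bound on the cluster sizes (they only differ if n is not divisible by k).
final int minsize = relation.size() / k;           // floor(n / k)
final int maxsize = (relation.size() + k - 1) / k; // ceil(n / k)
```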
We also need a different comparator than before, which now uses the Meta.priority method we wrote earlier.
We also need a second comparator. The code will follow below; it is not particularly tricky, but sorts an int[] array of cluster numbers [0..k-1] by increasing distance to the cluster means.
In the idea description of the algorithm we already mentioned transfer lists. We also need to initialize these:
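For example, the transfer lists could simply be an array of ArrayModifiableDBIDs (a sketch):

```java
// One "willing to leave" list per cluster.
ArrayModifiableDBIDs[] transfers = new ArrayModifiableDBIDs[k];
for(int i = 0; i < k; i++) {
  transfers[i] = DBIDUtil.newArray();
}
```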
and can now finally start with the main iteration loop. If maxiter is set to -1, we will be performing an infinite number of iterations, otherwise at most maxiter iterations. We then call the helper function above to update the distances and sort the objects by their Meta.priority. The active counter will be used to detect convergence.
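A skeleton of this loop, under the assumptions used so far, could look like this (the per-object transfer logic is shown after the next few paragraphs):

```java
// Outer refinement loop: iterate until convergence or until maxiter is reached.
for(int iter = 0; maxiter < 0 || iter < maxiter; iter++) {
  updateDistances(relation, means, metas, df);
  // Handle the objects with the largest potential improvement first (descending priority).
  tids.sort((o1, o2) -> Double.compare(metas.get(o2).priority(), metas.get(o1).priority()));
  int active = 0; // number of transfers performed in this iteration
  // ... per-object transfer logic goes here (see below), incrementing "active",
  // then clear the transfer lists and log the progress ...
  if(active == 0) {
    break; // no more transfers possible: the algorithm has converged
  }
  means = means(clusters, means, relation); // update the means for the next iteration
}
```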
We now iterate over the data set and get the metadata. We then use the second comparator to sort the cluster numbers in preferences by their distance to the current object, and iterate over each target cluster.
First of all, we try to find a waiting transfer partner. The actual transfer will be done by the method transfer explained below. Also note that we see yet another iterator variant: DBIDMIter is a modifiable iterator - it also has a remove method, which we need this time.
Even if we do not have a transfer partner, we might be able to move the object: if the number of elements is not divisible by k, it may be possible to move single objects instead of swapping two objects.
Finally, if the object could not be transferred but isn’t in the optimal cluster, we store it in the appropriate transfer list:
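The following sketch consolidates the per-object logic of the last three paragraphs. It assumes preferences is an int[] containing the cluster numbers 0..k-1 and pc is an instance of the PreferenceComparator shown further below; the transfer helper is also shown below.

```java
// For each object, try (1) a swap with a waiting partner, (2) a plain move, and
// otherwise (3) register the object on the outgoing transfer list.
for(DBIDArrayIter id = tids.iter(); id.valid(); id.advance()) {
  Meta c = metas.get(id);
  IntegerArrayQuickSort.sort(preferences, pc.select(c)); // nearest clusters first
  final int cur = c.primary;
  for(int j = 0; j < k; j++) {
    final int i = preferences[j];
    if(i == cur || c.gain(i) <= 0) {
      continue; // not a strict improvement
    }
    boolean moved = false;
    // (1) Try to swap with an object that wants to leave cluster i:
    for(DBIDMIter other = transfers[i].iter(); other.valid(); other.advance()) {
      Meta oc = metas.get(other);
      if(oc.primary != i) {
        other.remove(); // stale entry: this object was already moved earlier
        continue;
      }
      if(c.gain(i) + oc.gain(cur) > 0) { // the swap reduces the total variance
        transfer(metas, oc, clusters.get(i), clusters.get(cur), other, cur);
        transfer(metas, c, clusters.get(cur), clusters.get(i), id, i);
        other.remove(); // no longer waiting to leave cluster i
        active += 2;
        moved = true;
        break;
      }
    }
    // (2) Otherwise move the object alone, if the size constraints permit it:
    if(!moved && clusters.get(i).size() < maxsize && clusters.get(cur).size() > minsize) {
      transfer(metas, c, clusters.get(cur), clusters.get(i), id, i);
      active++;
      moved = true;
    }
    if(moved) {
      break;
    }
  }
  // (3) Could not be moved, but a different cluster would be better: remember that
  // this object is willing to leave its current cluster.
  if(c.primary == cur && c.gain(c.secondary) > 0) {
    transfers[cur].add(id);
  }
}
```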
Someone might come up with a good way of finding additional transfers. But otherwise, we need to clear these lists for the next iteration. We can also log the progress of our algorithm.
Finally, if no transfers were done we can break the main iteration loop. If however any change was performed, we need to update the means again.
In the code above, two parts were not yet introduced. The first is the comparator for sorting the cluster preferences. This is fairly standard code, except that we are using an IntegerComparator instead of a Comparator<Integer> from the Java collections framework. By avoiding Integer objects and using primitive ints instead, we reduce memory use and improve performance. This is not crucial here, but we might as well use it. The select method also serves the purpose that we can reuse this comparator for different objects instead of creating a fresh one for each Meta.
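A sketch of this comparator:

```java
// Sorts cluster numbers by increasing distance to the currently selected object.
private class PreferenceComparator implements IntegerComparator {
  Meta c = null;

  @Override
  public int compare(int o1, int o2) {
    return Double.compare(c.dists[o1], c.dists[o2]); // nearer clusters first
  }

  // Reuse the same comparator instance for different Meta objects.
  public PreferenceComparator select(Meta c) {
    this.c = c;
    return this;
  }
}
```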
and the second is the transfer function to update the data structures:
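It could look roughly like this (the parameter order is an assumption of this sketch):

```java
// Move object "id" from cluster "src" to cluster "dst" and update its metadata.
protected void transfer(WritableDataStore<Meta> metas, Meta meta,
    ModifiableDBIDs src, ModifiableDBIDs dst, DBIDRef id, int dstnum) {
  src.remove(id);
  dst.add(id);
  meta.primary = dstnum; // the object is now assigned to the target cluster
  metas.put(id, meta);
}
```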
Implementing the algorithm outline
We have now seen all the main parts of the algorithm, and only need to put these pieces together into the final run method:
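A condensed sketch of how the pieces could fit together; the helper names initializeMeta, initialAssignment, refine and wrapResult are placeholders of this sketch for the code fragments developed above:

```java
public Clustering<MeanModel> run(Relation<V> relation) {
  // Generate the initial means, e.g. with k-means++.
  double[][] means = initializer.chooseInitialMeans(relation, k, distance);
  // Set up the per-object metadata and the size-constrained initial assignment.
  WritableDataStore<Meta> metas = initializeMeta(relation, means);
  List<ModifiableDBIDs> clusters = initialAssignment(relation, metas);
  // Refine the assignment until convergence (or until maxiter is reached).
  means = refine(relation, means, clusters, metas);
  // Package everything as a Clustering result (shown below).
  return wrapResult(clusters, means);
}
```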
Don’t mind the odd signature of initializer.chooseInitialMeans. The reason for this is that we want to allow randomly generated means, samples from the database, optimized means as produced by k-means++, etc. - for now, we only need them to be “numerical vectors of some kind”, which is what this type says.
The main thing we have not seen yet is how to package the final result for the ELKI framework. As we want the visualization and evaluation procedures to be able to use our result, we need to wrap it in the appropriate classes. The outer class is Clustering, and represents the result as a whole. We then need to produce the individual Cluster objects. In ELKI, each cluster has a Model, in our case the model will store the cluster mean.
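A sketch of this wrapping, assuming clusters and means as above:

```java
// Package the clusters and their means as an ELKI Clustering result.
Clustering<MeanModel> result = new Clustering<>();
Metadata.of(result).setLongName("Same-Size k-Means Clustering");
for(int i = 0; i < clusters.size(); i++) {
  MeanModel model = new MeanModel(means[i]);
  result.addToplevelCluster(new Cluster<>(clusters.get(i), model));
}
return result;
```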
Adding a parameterizer
Now, in order to run the algorithm, we could call the constructor above, produce a database and so on. Or we can let ELKI handle all the setup and, for example, use the MiniGUI or the command line. For this, we need to formally specify the parameters. Here, we have four parameters: the desired number of clusters k, the maximum number of iterations maxiter, the initialization method initializer and the distance function distance.
To specify parameters, we need to override the method makeOptions. First we will get the distance function. While the k-means algorithm is actually only designed for squared Euclidean distance (or Euclidean distance), some users might want to try different distances (be careful with this, though: if the mean does not reduce the distances, the algorithm might no longer converge!).
The next parameter is the number of clusters. It must be greater than 1 to make sense, so we add a constraint.
Then we need an initialization method. As we rely on the initialization to be good, we set the default to be the k-means++ initialization method.
Finally, we have the optional parameter with the maximum number of iterations. As default value we use -1, so it iterates until convergence.
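A rough sketch of such a makeOptions method is shown below. The parameterization API and the exact names of the option ID constants and default classes differ between ELKI versions, so all identifiers here should be treated as placeholders; the fields k, maxiter, initializer and distance are assumed to be declared in the parameterizer class.

```java
@Override
protected void makeOptions(Parameterization config) {
  super.makeOptions(config);
  // Distance function; squared Euclidean is the natural default for k-means.
  ObjectParameter<NumberVectorDistance<? super V>> distP = new ObjectParameter<>(
      DISTANCE_FUNCTION_ID, NumberVectorDistance.class, SquaredEuclideanDistance.class);
  if(config.grab(distP)) {
    distance = distP.instantiateClass(config);
  }
  // Number of clusters; must be greater than 1.
  IntParameter kP = new IntParameter(KMeans.K_ID);
  kP.addConstraint(CommonConstraints.GREATER_THAN_ONE_INT);
  if(config.grab(kP)) {
    k = kP.intValue();
  }
  // Initialization method, defaulting to k-means++.
  ObjectParameter<KMeansInitialization> initP = new ObjectParameter<>(
      KMeans.INIT_ID, KMeansInitialization.class, KMeansPlusPlus.class);
  if(config.grab(initP)) {
    initializer = initP.instantiateClass(config);
  }
  // Maximum number of iterations; -1 iterates until convergence.
  IntParameter maxiterP = new IntParameter(KMeans.MAXITER_ID, -1);
  if(config.grab(maxiterP)) {
    maxiter = maxiterP.intValue();
  }
}
```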
Now we need to map these parameters to the constructor, which is straightforward:
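For example (constructor argument order as in the blueprint above):

```java
@Override
protected SameSizeKMeans<V> makeInstance() {
  return new SameSizeKMeans<>(distance, k, maxiter, initializer);
}
```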
Putting it all together
The complete source can be found here: SameSizeKMeans.java. It is included in the elki-tutorial addon module for you to try out.
Running the algorithm
We can now launch the MiniGUI and run the algorithm from the user interface:
You can see the parameters we just defined show up: k must be set, while the others have default values.
Running it on the mouse data set with k=3 will produce a result like this:
while setting k=5 gives a more intuitive result, in particular if we merge the inner three clusters.
You can clearly see that we achieved our objective of producing equally sized clusters. And we never promised that this tutorial algorithm would be useful in practice on a real data set; in reality, clusters will rarely all have exactly the same size. But if you happen to find a situation where this algorithm works well for you, we would be happy to hear about it!