Posted by: zyxo | July 21, 2010

Clustering with decision trees


[Image: decision tree branch, via Wikipedia]

There are basically two kinds of data mining algorithms: supervised and unsupervised.

Supervised: the training data set contains the outcome (category) for each observation. The algorithm has to learn how to interpret the data so that it can predict that outcome. The quality of the model is judged by the percentage of instances where its prediction is correct. This is mostly called classification.
Unsupervised: no categories are known. The algorithm has to do its best to create acceptable categories. The quality of the model is judged, more or less arbitrarily, by someone based on its usability. This is mostly called clustering.

One of the typical families of classification algorithms is decision trees. They perform classification, hence they are supervised.

So how can we use a supervised algorithm to produce an unsupervised result: categories/clusters out of nothing?
This seems highly contradictory, but nevertheless, believe me, it is possible.

All you have to do is to give the algorithm a target that

  • has nothing to do with the final clusters it will find
  • but that nevertheless leads it, one way or another, to these clusters.

Let us start with any dataset, for example customer data. No categories or clusters are defined. With such a dataset a decision tree cannot do anything to find clusters, because you need to feed it a target category that is already known.
The trick is to add a second group of observations to the data set; the target then indicates whether an observation comes from the original group or from the second group.

How will this lead to clustering?
First of all, we need to focus on the very definition of clustering: finding density differences in a multivariate space.
We can accomplish this by comparing the initial group of observations with an artificial group (see the sketch after this list):

  • with exactly the same variables
  • with, for each variable, the same range (the same minimal and maximal value)
  • but where the observations are spread homogeneously over that range (they are drawn from a uniform distribution).
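
To make this concrete, here is a minimal sketch of how such an artificial comparison group could be generated, assuming numpy and pandas and a purely numeric feature table (the function name is just illustrative):

```python
# Build the artificial comparison group: same variables, same per-variable
# range, but drawn from a uniform distribution instead of the real one.
import numpy as np
import pandas as pd

def make_uniform_background(original: pd.DataFrame, random_state: int = 0) -> pd.DataFrame:
    """For every (numeric) variable, draw as many values as there are original
    observations, uniformly between that variable's observed minimum and maximum."""
    rng = np.random.default_rng(random_state)
    return pd.DataFrame({
        col: rng.uniform(original[col].min(), original[col].max(), size=len(original))
        for col in original.columns
    })
```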

What will happen?
Your decision tree algorithm will look for density differences between the original group and the artificial, homogeneous group of observations. Since there are by definition no density differences in our artificial group, it will find the regions with the highest and lowest densities in the original group of observations. QED: this is just what we wanted; the final leaves of our decision tree are the clusters we were looking for.
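
To illustrate, here is a minimal sketch of that step, assuming scikit-learn's DecisionTreeClassifier (the function name and the choice of max_leaf_nodes are my own assumptions, not part of the method itself):

```python
# Label the real data 1 and the artificial uniform data 0, fit a decision
# tree on the combined set, and read the tree's leaves as clusters.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def tree_clusters(original: pd.DataFrame, background: pd.DataFrame,
                  max_leaf_nodes: int = 8) -> np.ndarray:
    X = pd.concat([original, background], ignore_index=True)
    y = np.r_[np.ones(len(original)), np.zeros(len(background))]  # real vs. artificial

    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(X, y)

    # Each original observation's "cluster" is the id of the leaf it falls into.
    return tree.apply(original)
```

Each leaf id can then be read as a cluster label; leaves where the tree mostly predicts "artificial" correspond to the sparse regions of the original data.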

One final remark
This is by no means my own idea! The first time I heard about it was in Madrid in 2005 at the Salford Systems Data Mining conference. The approach by M. Golovnya was a bit different from the one I described above, but the whole idea was basically the same.
The second time was only recently, when I stumbled upon a 2000 paper by Bing Liu, Yiyuan Xia, and Philip S. Yu ("Clustering Through Decision Tree Construction"). They give a splendid description of the method.


Responses

  1. Because of problems with deployment as SQL (I found it too much work getting K-means running as SQL). I first built a K-means model on a small sample (100k customers) outside of the data warehouse, then tried to scale up to the full customer base. I had quite a bit of success building separate neural network models, each one predicting cluster allocation.

    In my case I had 8 clusters, so I built 8 NN models, each predicting ‘0.0 No’ or ‘1.0 Yes’ with one cluster as my target. The 8 NN models were much easier to implement as SQL and run on a data warehouse (at the time using Clementine).

    The results were promising, but I didn’t get perfect allocation. Most of the data was numeric, and I’ve found NN works well (compared to decision trees), but trees are just as easy to implement as CASE statements if you want to deploy as SQL etc.
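
As a rough sketch of the one-model-per-cluster idea described in this response, assuming scikit-learn's MLPClassifier rather than Clementine (function and variable names are illustrative, not the commenter's actual setup):

```python
# Fit one binary neural network per cluster, each predicting
# "belongs to this cluster: yes/no", so every model can be deployed separately.
import numpy as np
from sklearn.neural_network import MLPClassifier

def fit_per_cluster_models(X: np.ndarray, cluster_labels: np.ndarray) -> dict:
    models = {}
    for cluster in np.unique(cluster_labels):
        y_binary = (cluster_labels == cluster).astype(int)  # 1 = this cluster, 0 = any other
        model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
        model.fit(X, y_binary)
        models[cluster] = model
    return models
```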

