Posted by: zyxo | April 24, 2009

Why clustering is difficult


Is clustering difficult?
You just take your data, run it through a clustering algorithm like k-means, and you have your result …

Of course you could do that, but what will the quality of the result be?
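Just to show how little effort the mechanical part takes, here is a rough sketch using scikit-learn on some invented numbers (the library and the data are only illustrative):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # 1000 invented "customers", 5 numeric variables

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))            # four cluster sizes -- but do they mean anything?

A few lines of work and you get four clusters, whatever the data look like. That is exactly the problem.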

For a good clustering you have to resolve three problems:
1. which clustering algorithm to use
2. which definition of distance to use
3. how to choose your clusters

1. The choice of clustering algorithm is, in my opinion, the easiest of the three. I will not go into a taxonomy of possible clustering algorithms; you can find them everywhere.

2. The first hard problem is finding a good definition and calculation of distance. Clustering is based on distances (maximizing distances between clusters, minimizing distances within clusters).
I am not talking about geographical locations here; that is too simple, since in that case distances are … well … distances: miles, kilometers or whatever.
But try to define a distance between two customers based on, for example, 500 variables: continuous variables like age, account balances and time since last purchase, plus a handful of categorical variables like gender, the type of environment they live in, whether they are married or not, etc.

What, then, is the distance measure?

With the continuous variables you could calculate a Euclidean distance after converting all (standardized) variables to principal components, which are orthogonal. But what is the business meaning of such a distance?
Does a difference of one standard deviation along variable X (e.g. total purchase amount during the last month) have the same value for the business as a comparable difference along variable Y (e.g. age)?

The same problem arises with categorical variables. You can simply count the number (or proportion) of non-matching categorical variables. But is the difference between married and not married as important for your business as the difference between man and woman?
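To make this concrete, a rough sketch of the two ingredients for a single pair of customers: a Euclidean distance on standardized continuous variables and a mismatch proportion on the categorical ones. All values are invented.

import numpy as np

# two customers, three standardized continuous variables (e.g. age, balance, recency)
cont_a = np.array([ 0.3, -1.2,  0.8])
cont_b = np.array([-0.5,  0.4,  0.8])
d_cont = np.linalg.norm(cont_a - cont_b)     # Euclidean distance

# three categorical variables (e.g. gender, married, urban/rural)
cat_a = np.array(["F", "yes", "urban"])
cat_b = np.array(["M", "yes", "rural"])
d_cat = (cat_a != cat_b).mean()              # proportion of mismatches

print(d_cont, d_cat)                         # how to combine the two is a business question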

The bulk of the hard labour comes at this stage: if you want to deliver a good clustering, you first have to talk for many hours and days with your business people to determine
1) which variables are relevant to the clustering (what do they want to use the clusters for?) and which to discard;
2) what weight to give each selected variable. Variable X can be three or ten times more important for your business than variable Y, and you should take this into account.

Only then can you go to the next stage: calculating the distances.
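As an example of what that stage can look like, here is a rough, Gower-style sketch that folds the business weights into the distance calculation. The weights and variables are of course hypothetical; in practice they come straight out of those discussions with the business.

import numpy as np

def customer_distance(x_cont, y_cont, x_cat, y_cat, w_cont, w_cat):
    # weighted mix of absolute differences (standardized continuous variables)
    # and mismatches (categorical variables), normalized by the total weight
    d = np.sum(w_cont * np.abs(x_cont - y_cont)) + np.sum(w_cat * (x_cat != y_cat))
    return d / (np.sum(w_cont) + np.sum(w_cat))

# say the business decided: purchase amount counts 3x more than age,
# and marital status 2x more than gender
w_cont = np.array([3.0, 1.0])                # purchase_amount, age (standardized)
w_cat  = np.array([1.0, 2.0])                # gender, married

X_cont = np.array([[ 0.8, -0.2], [-1.1, 0.5], [ 0.9, 1.3]])
X_cat  = np.array([["F", "yes"], ["M", "yes"], ["F", "no"]])

n = len(X_cont)
D = np.zeros((n, n))                         # the pairwise distance matrix
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = customer_distance(X_cont[i], X_cont[j],
                                              X_cat[i], X_cat[j], w_cont, w_cat)
print(D)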

Then comes the easy part: choosing and using the clustering algorithm. Based on the characteristics of the algorithms and the types of clusterings they are known to produce, you should be able to make a decent choice.

The second really difficult part is selecting which result to choose.
Will you be satisfied with only one clustering? I recommend using different samples of your data to check whether the calculated clusters are stable (a rough sketch of such a check follows after this list). Do you get a similar result each time? Great! Then you have to verify with your business people whether the result makes sense:
– Is there any business logic that explains the clusters? (If you did a good job selecting and weighting the variables, this should be no problem!)
– Is the number of clusters too big? Too small? Considering merging two adjacent clusters is a good option (thanks to Ned Kumar for pointing this out).
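One possible way to do that stability check, sketched on invented data: cluster a handful of random samples, assign a common hold-out set with each fitted model, and compare the assignments with the adjusted Rand index (which ignores how the cluster labels are numbered).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))               # stand-in for the real customer data
holdout, pool = X[:500], X[500:]

assignments = []
for seed in range(5):
    idx = rng.choice(len(pool), size=800, replace=False)   # a random sample
    km = KMeans(n_clusters=4, n_init=10, random_state=seed).fit(pool[idx])
    assignments.append(km.predict(holdout))  # label the same hold-out customers

for i in range(len(assignments)):
    for j in range(i + 1, len(assignments)):
        print(i, j, adjusted_rand_score(assignments[i], assignments[j]))
# scores near 1: stable clusters; scores near 0: probably just random variation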

But what if not? What if you end up with 15 totally different clusterings from 15 random samples? This simply means that there are no clusters in your world, and the “clusters” you found are only the product of random variation.

In that case there is one simple solution left: a) calculate the distance matrix, b) run a multidimensional scaling, c) plot the result on some charts and finally d) let your business user choose where to cut.
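A rough sketch of that fallback, assuming you already have a distance matrix D like the weighted one above, using scikit-learn's MDS with a precomputed dissimilarity and matplotlib for the plot (the random matrix here only stands in for the real one):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
A = rng.random((100, 100))                   # stand-in for the real distance matrix
D = (A + A.T) / 2                            # make it symmetric
np.fill_diagonal(D, 0.0)                     # zero self-distances

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)

plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.title("MDS of the customer distance matrix")
plt.show()                                   # let the business user decide where to cut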

Did you like this post? Then you might be interested in the following:
Oversampling or undersampling?
data mining with decision trees: what they never tell you
The top-10 data mining mistakes
Mining highly imbalanced data sets with logistic regressions
