Clustering seems easy : you throw your data into a clustering algorithm (like the popular k-means clustering) and see what comes out of it.
FORGET IT !
What is clustering ? Here is one definition (picked ad random from a google search : “Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions.”
So it’s about clusters, “related groups”. Well, not related groups, I suppose, but groups containing related individuals. Generally we see rather explanations like “putting observations in groups so that the distances within the groups are as small as possible and the distances between the groups as large as possible.”
In other words : the groups come into existence by playing around with distances. And that’s what it’s all about : distances.
Distances seem easy : if you work with continuous variables you calculate the Euclidian distance, if you work with categorical variables you can calculate a jaccard distance.
With a bit of standardizing and creativity you can even mix the two.
But that is not what this post is about.
An example will make it more clear.
Suppose you want to cluster the customers of a bank. You will use variables like account balances, number of investment transactions, credit card transactions, mortgage loan balance etc.
It is perfectly possible to combine them into one euclidian distance. But does a difference of 1,000$ between the current account balances of two customers represent the same distance as for example a difference of 15 credit card transactions during the previous month ? How do you compare these two ? How do you decide that the one is more important than the other ?
Eureka ! Standardisation.
Indeed, you can standardize them, so the distributions of the variables become comparable.
But even then. Does a difference of one standard deviation (std) between current account balances represent — in the eyes of the business — the same as one std difference between number of credit card transactions ? Perhaps the bank profit increase due to a current account balance increase of one std is only half than that of a one std increase in number of credit card transactions! In that case it may be appropriate to for example divide the first measure by 2
In order to know that you have to talk to the business people for each and every comparison between two variables.
To be perfectly good the result must be a matrix of pairwise comparisons where each row and each column represents one of your variables. Realise that in some cases this matrix can be huge, as well as the amount of time you will have to spend with you business expert to discuss each pair of variables to come up with some importance ratio, or distance ratio.
And after that you have to make sure that the whole matrix is somewhat consistent. Because if your business expert is sure that variable A is twice as important as variable B and variable C is only half as important as variable A, but still, variable B is far more important than variable C, well, than “Houston we have a problem” and you will have to negotiate.
So indeed, clustering may be rather easy, but getting a good view on the appropriate distance definitions is the hard part.