Thursday, April 02, 2009

Why normalization matters with K-Means

A question about K-means clustering in Clementine was posted here. I thought I knew the answer, but took the opportunity to prove it to myself.

I took the KDD-Cup 98 data and looked at just four fields: AGE, NUMCHILD, TARGET_D (the amount the recaptured lapsed donors gave), and LASTGIFT. I used only four to keep the problem simple, and chose variables with relatively large differences in mean values (where normalization might matter). Another complication is that the two monetary variables are both severely positively skewed.

The following image shows the results of two clustering runs with the Clementine K-Means algorithm: the first on raw data, the second on normalized data. The normalization consisted of log transforms (for TARGET_D and LASTGIFT) followed by z-scores for all four fields (the log-transformed fields, AGE, and NUMCHILD). I used the default of 5 clusters.

Here are the results in tabular form. Note that I'm reporting unnormalized values for the "normalized" clusters even though the actual clusters were formed by the normalized values. This is purely for comparative purposes.
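That two-step recipe (log transforms for the skewed monetary fields, then z-scores across the board) can be sketched as follows. This is a minimal illustration in Python/NumPy on made-up data, not the actual KDD-Cup 98 file or Clementine's implementation:

```python
import numpy as np

def normalize(targ_d, lastgift, age, numchild):
    """Log-transform the skewed monetary fields, then z-score all four."""
    cols = [np.log1p(targ_d),   # log1p handles zero gift amounts
            np.log1p(lastgift),
            np.asarray(age, dtype=float),
            np.asarray(numchild, dtype=float)]
    # z-score: subtract the mean, divide by the standard deviation
    return [(c - c.mean()) / c.std() for c in cols]

# toy data, shaped roughly like the real fields (not the donor file)
rng = np.random.default_rng(0)
targ_d = rng.exponential(20, 500)      # positively skewed, like TARGET_D
lastgift = rng.exponential(15, 500)
age = rng.normal(60, 12, 500)
numchild = rng.integers(0, 4, 500)

z = normalize(targ_d, lastgift, age, numchild)
for c in z:
    print(round(c.mean(), 6), round(c.std(), 6))  # each field now ~0 mean, ~1 sd
```

After this step all four fields live on comparable scales, so no single field dominates the Euclidean distances.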

Note that:
1) the results are different, as measured by counts in each cluster
2) the unnormalized clusters are dominated by TARGET_D and LASTGIFT--one cluster contains the large values and the remaining have little variance.
3) AGE and NUMCHILD have some similar breakouts (40s with more children and 40s with fewer children for example).

So, to answer the original question: K-Means in Clementine does not normalize the data. Since Euclidean distance is used, the clusters will be influenced strongly by the magnitudes of the variables, especially by outliers. Normalizing removes this bias. Whether that removal is desirable, however, depends on what one wants to find: if one wants a particular variable to influence the clusters more, one can achieve exactly that by increasing its relative magnitude.
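The scale effect is easy to demonstrate on synthetic data. In this sketch (using scikit-learn's KMeans rather than Clementine; the data are invented) two genuine groups differ only in a small-magnitude variable, while a large-magnitude variable carries no group structure at all:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# two true groups, separated only in the small-scale variable
small = np.concatenate([rng.normal(0, 0.5, 200), rng.normal(5, 0.5, 200)])
large = rng.normal(1000, 300, 400)   # big magnitude, no group structure
X = np.column_stack([small, large])
truth = np.repeat([0, 1], 200)

raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

def agreement(labels):
    m = (labels == truth).mean()
    return max(m, 1 - m)  # cluster labels are arbitrary; take the better matching

print(agreement(raw), agreement(scaled))
```

On raw data the clusters split along the large-magnitude noise variable and agree with the true groups only at chance level; after z-scoring, the recovery is essentially perfect.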

One last issue that I didn't explore here is the effect of correlated variables (LASTGIFT and TARGET_D, to some degree, in this data). It seems to me that correlated variables will artificially bias the clusters toward natural groupings of those variables, though I have never proved the extent of this bias in a controlled way (maybe someone can point to a paper that shows this clearly).


Dean Abbott said...

FYI: I left a comment on this topic at the original site with some specifics on implementation of normalization in Clementine.

Will Dwinnell said...

This is a good point, Dean. I have wondered how much difference it might make to use different normalizations. The standard, of course, is subtract-and-divide to get zero-mean and unit-standard deviation, but some distributions are poorly characterized by these summaries. I have used subtract-and-divide to get zero-median and unit-IQR, but there are plenty of other options.
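The two normalizations mentioned here can be put side by side; a quick sketch on invented data with one outlier (NumPy; the function names are just for illustration):

```python
import numpy as np

def zscore(x):
    """Zero-mean, unit-standard-deviation scaling."""
    return (x - x.mean()) / x.std()

def robust_scale(x):
    """Zero-median, unit-IQR scaling: less sensitive to outliers."""
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])  # one big outlier
print(zscore(x).round(2))        # the outlier inflates the sd, squashing the rest
print(robust_scale(x).round(2))  # the bulk of the data keeps its spread
```

With a heavy-tailed field the z-scored version compresses the non-outlying values into a narrow band, while the median/IQR version preserves their relative spacing.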

Ted Dunning ... apparently Bayesian said...

Highly correlated variables make it difficult to use k-means directly. Essentially what happens is that you have the same sort of scaling problem as you are discussing, but aligned at some angle relative to the original axes. Getting rid of that problem requires not just normalizing each axis independently, but using the correlation matrix to realign and scale the axes.

Here is some R code that illustrates this. First take some data that needs normalization like you describe:

u = rnorm(200)
v = c(rnorm(100), 10 + rnorm(100))  # two groups, separated only in v
plot(10*u, v, type='p')             # scale up u to exaggerate the axis mismatch

This gives us horizontally elongated clusters that can be handled well using per variable normalization.

But if we take a correlated case:

x = 10*u - v; y = 10*u + v  # rotate the scale mismatch 45 degrees off-axis
plot(x, y, type='p')

Then we get elongated clusters at a 45-degree angle. This can be very difficult to deal with using k-means, because the correlation in the different clusters can easily be different. To deal with this, it is really better to use a more advanced clustering method such as a Gaussian mixture model or spectral clustering. Non-parametric Bayesian approaches to mixture modeling can be particularly useful, since you don't have to commit to a set number of clusters.
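Ted's R example can be pushed one step further to show the fix he describes: per-variable scaling leaves the 45-degree elongation untouched, while rotating by the eigenvectors of the pooled covariance matrix and rescaling (whitening) realigns the axes. A sketch, assuming NumPy and scikit-learn rather than R:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
u = rng.normal(size=400)
v = np.concatenate([rng.normal(size=200), 10 + rng.normal(size=200)])
X = np.column_stack([10*u - v, 10*u + v])   # correlated, clusters at 45 degrees
truth = np.repeat([0, 1], 200)

def agreement(labels):
    m = (labels == truth).mean()
    return max(m, 1 - m)  # cluster labels are arbitrary

# per-variable z-scoring leaves the 45-degree elongation intact
per_axis = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

# whitening: rotate onto the covariance eigenvectors, then scale each axis
Xc = X - X.mean(axis=0)
w, V = np.linalg.eigh(np.cov(Xc.T))
white = Xc @ V / np.sqrt(w)
whitened = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(white)

print(agreement(per_axis), agreement(whitened))
```

Per-axis normalization leaves k-means splitting along the elongated (and uninformative) diagonal, while the whitened run recovers the two true groups. Note this uses the pooled covariance; as Ted says, when the within-cluster correlations differ, a mixture model is the cleaner tool.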

Venki said...

Hi friends,
I have written a blog on Data Preprocessing emphasizing its significance through a simple example for Normalization. Please do visit and leave your comments.

nvasil said...

This is a deeper problem. Most of the time the Euclidean metric is not the correct one; a weighted Euclidean metric is the right one. There are some pretty good methods, like k-means with metric learning.

Also, you can find a fast implementation of k-means here.
It has a weighted metric (although it is still under development). It was much faster than Weka/Matlab.

Dean Abbott said...

nvasil: Thanks for the comment and link; I will be looking it up. The biggest problem for me is that I usually use commercial tools, and if they don't have the particular option of weighted Euclidean or Mahalanobis distance, unless I can fake it (for example with Mahalanobis by pre-transforming by the covariance matrix), I'm out of luck. I guess that is an argument for using Matlab, right Will? :)
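The "faking it" mentioned above works because Mahalanobis distance in the original space equals Euclidean distance after transforming the data by the inverse Cholesky factor of the covariance matrix. A quick numerical check of that identity (NumPy, invented data; not any particular commercial tool):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))  # correlated data
S = np.cov(X.T)                                          # sample covariance

# pre-transform: Z = X * L^-T where S = L L^T (Cholesky factorization)
L = np.linalg.cholesky(S)
Z = np.linalg.solve(L, X.T).T

a, b = X[0], X[1]
d = a - b
d_maha = np.sqrt(d @ np.linalg.inv(S) @ d)   # Mahalanobis in the original space
d_eucl = np.linalg.norm(Z[0] - Z[1])         # plain Euclidean after transforming

print(round(d_maha, 6), round(d_eucl, 6))    # the two distances agree
```

So any tool that only offers Euclidean k-means can be coaxed into Mahalanobis-style clustering by running it on the pre-transformed data.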