Thursday, April 02, 2009

Why normalization matters with K-Means

A question about K-means clustering in Clementine was posted here. I thought I knew the answer, but took the opportunity to prove it to myself.

I took the KDD-Cup 98 data and just looked at four fields: Age, NumChild, TARGET_D (the amount the recaptured lapsed donors gave) and LASTGIFT. I took only four to make the problem simpler, and chose variables that had relatively large differences in mean values (where normalization might matter). Also, another problem with the two monetary variables is that they are both skewed positively (severely so).

The following image shows the results of two clustering runs: the first with raw data, the second with normalized data using the Clementine K-Means algorithm. The normalization consisted of log transforms (for TARGET_D and LASTGIFT) and z-scores for all (the log transformed fields, AGE and NUMCHILD). I used the default of 5 clusters.

Here are the results in tabular form. Note that I'm reporting unnormalized values for the "normalized" clusters even though the actual clusters were formed by the normalized values. This is purely for comparative purposes.

Note that:
1) the results are different, as measure by counts in each cluster
2) the unnormalized clusters are dominated by TARGET_D and LASTGIFT--one cluster contains the large values and the remaining have little variance.
3) AGE and NUMCHILD have some similar breakouts (40s with more children and 40s with fewer children for example).

So, the conclusion is (to answer the original question) K-Means in Clementine does not normalize the data. Since Euclidean distance is used, the clusters will be influenced strongly by the magnitudes of the variables, especially by outliers. Normalizing removes this bias. However, whether or not one desires this removal of bias depends on what one wants to find: sometimes if one would want a variable to influence the clusters more, one could manipulate the clusters precisely in this way, by increasing the relative magnitude of these fields.

One last issue that I didn't explore here, is the effects of correlated variables (LASTGIFT and TARGET_D to some degree here). It seems to me that correlated variables will artificially bias the clusters toward natural groupings of those variables, though I have never proved the extent of this bias in a controlled way (maybe someone can point to a paper that shows this clearly).


Dean Abbott said...

FYI: I left a comment on this topic at the original site with some specifics on implementation of normalization in Clementine.

Will Dwinnell said...

This is a good point, Dean. I have wondered how much difference it might make to use different normalizations. The standard, of course, is subtract-and-divide to get zero-mean and unit-standard deviation, but some distributions are poorly characterized by these summaries. I have used subtract-and-divide to get zero-median and unit-IQR, but there are plenty of other options.

Ted Dunning ... apparently Bayesian said...

Highly correlated variables make it difficult to use k-means directly. Essentially what happens is that you have the same sort of scaling problem as you are discussing, but aligned at some angle relative to the original axes. Getting right of that problem requires not just normalizing each axis independently, but using an correlation matrix to realign and scale the axes.

Here is some R code that illustrates this. First take some data that needs normalization like you describe:

u = rnorm(200)
v = c(rnorm(100), 10+rnorm(100))
plot(10*u, v, type='p')

This gives us horizontally elongated clusters that can be handled well using per variable normalization.

But if we take a correlated case:

x = 10*u-v;y = 10*u + v
plot(x, y, type='p')

Then we get elongated clusters at a 45 degree angle. This can be very difficult to deal with using k-means because the correlation in the different clusters can easily be different. To deal with this, it is really better to use a more advanced clustering such as a Gaussian mixture model or spectral clustering. Non-parametric Bayesian approaches to the mixture modeling can be particularly useful since you don't have to commit to a set number of clusters.

Unknown said...

Hi friends,
I have written a blog on Data Preprocessing emphasizing its significance through a simple example for Normalization. Please do visit and leave your comments.

nvasil said...

This is a deeper problem. Most of the time euclidean metric is not the correct. The weighted eclidean metric is the right one. There are some pretty good methods like Kmeans with metric learning.

Also you can find a fast implementation of kmeans here
It has weighted metric (although it is still under development). It was much faster than weka/matlab

Dean Abbott said...

nvasil: Thanks for the comment and link; I will be looking it up. The biggest problem for me is that I usually use commercial tools, and if they don't have the particular option of weighted Euclidean or Mahalanobis distance, unless I can fake it (for example with Mahalanobis by pre-transforming by the covariance matrix), I'm out of luck. I guess that is an argument for using Matlab, right Will? :)

Anonymous said...

Just posted this reply to a related question about k-means clustering on Data Science Central, and thought it would be a good addition here (, where the question was: "How is 'k' determined in k-means clustering (using FASTCLUS)?"

If there is not an operational definition for the number of clusters, yes, you have to figure this out yourself. You can use an algorithm to figure it out, but how do you know the algorithm is trading off the # clusters vs. compactness the way you want?

You have to have some idea of what you want, of course, but usually in my consulting engagements where k was unknown we would do the following.

1) interpret the clusters
there are two ways to interpret clusters. First, we compute the mean values of all the input variables to get the gist of where the clusters are centered. (normalizing the input variables can greatly influence the formation of clusters and these mean values. I have a blog post on this topic here (

the second way is to compute how the clusters differ from one another. You can compute the mean values of every variable in the clusters, but it could be that all the variables except one have the same mean for every cluster--it's just one variable that is really responsible for driving the formation of the clusters. But how do you easily find these differences, especially when you have perhaps dozens of input variables?
You can eyeball it, but that can be tedious and was to get wrong. I prefer to find this algorithmically. How? By using decision trees to predict the cluster label from the same inputs. (after all, the one thing that stood in our way of doing supervised learning in the first place was that we didn't have labels for the data. now that we have clusters, we have record labels!) The tree doesn't have to be perfect, just get the gist of the differences for you to understand the key differences between clusters.

2) overlay the clusters with another measure of interest
If you have another important variable that is important, even if that variable was not included in the cluster analysis, if you compute it's mean value (or IQR) for each cluster, you can get a sense for what the clusters may mean operationally. For example, if you are computing clusters of customers, you can overlay demographics on top of the clusters (age, income, home value, etc.). Or, when I built fraud related clusters where we had so few adjudicated fraud cases that we couldn't build supervised learning models, we can still overlay the fraud label even for the relatively few cases we have to get a sense for which clusters those fraudulent transactions landed in.

So there are huge differences between selecting the number of clusters based on operational concerns vs. numeric concerns. In the latter case, cluster compactness and separation may not be the most important aspects to consider. (though sometimes they may be too!).