Thursday, April 02, 2009

Why normalization matters with K-Means

A question about K-means clustering in Clementine was posted here. I thought I knew the answer, but took the opportunity to prove it to myself.

I took the KDD-Cup 98 data and looked at just four fields: AGE, NUMCHILD, TARGET_D (the amount the recaptured lapsed donors gave), and LASTGIFT. I limited it to four to keep the problem simple, and chose variables with relatively large differences in mean values (where normalization might matter). The two monetary variables pose an additional problem: both are severely positively skewed.

The following image shows the results of two clustering runs with the Clementine K-Means algorithm: the first on raw data, the second on normalized data. The normalization consisted of log transforms for TARGET_D and LASTGIFT, followed by z-scores for all four fields (the two log-transformed fields plus AGE and NUMCHILD). I used the default of 5 clusters.
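As a minimal sketch of that preprocessing outside Clementine (the field values below are made up, and the log(x+1) guard for zero-valued gifts is my assumption, not necessarily what was done here):

```python
import math
import statistics

def log_zscore(values, log_first=False):
    """Optionally log-transform, then z-score (zero mean, unit sd)."""
    if log_first:
        values = [math.log(v + 1) for v in values]  # +1 guards zero gifts
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

# Made-up values standing in for the KDD-Cup 98 fields
lastgift = [5.0, 10.0, 25.0, 200.0]   # skewed, like LASTGIFT
age = [45.0, 52.0, 38.0, 61.0]

norm_gift = log_zscore(lastgift, log_first=True)  # log, then z-score
norm_age = log_zscore(age)                        # z-score only
```

After this, every field is on a comparable scale, so no single field dominates the Euclidean distance.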

Here are the results in tabular form. Note that I'm reporting unnormalized values for the "normalized" clusters even though the actual clusters were formed by the normalized values. This is purely for comparative purposes.

Note that:
1) the results are different, as measured by the counts in each cluster;
2) the unnormalized clusters are dominated by TARGET_D and LASTGIFT: one cluster contains the large values and the rest have little variance;
3) AGE and NUMCHILD have some similar breakouts (for example, 40s with more children and 40s with fewer children).

So, to answer the original question: K-Means in Clementine does not normalize the data. Since Euclidean distance is used, the clusters are influenced strongly by the magnitudes of the variables, especially by outliers. Normalizing removes this bias. Whether you want the bias removed, however, depends on what you want to find: if you want a variable to influence the clusters more, you can manipulate the clusters in precisely this way, by increasing that field's relative magnitude.
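To see why magnitudes dominate, a toy distance calculation (with made-up donor values) makes the point:

```python
import math

# Two donors: close in AGE, far apart in LASTGIFT (made-up values)
a = {"AGE": 45, "LASTGIFT": 10.0}
b = {"AGE": 50, "LASTGIFT": 200.0}

def euclid(p, q, fields):
    """Plain Euclidean distance over the named fields."""
    return math.sqrt(sum((p[f] - q[f]) ** 2 for f in fields))

d = euclid(a, b, ["AGE", "LASTGIFT"])
# AGE contributes (50-45)^2 = 25; LASTGIFT contributes (200-10)^2 = 36100.
# The monetary field accounts for essentially all of the distance, so
# unnormalized clusters form almost entirely along LASTGIFT.
```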

One last issue that I didn't explore here, is the effects of correlated variables (LASTGIFT and TARGET_D to some degree here). It seems to me that correlated variables will artificially bias the clusters toward natural groupings of those variables, though I have never proved the extent of this bias in a controlled way (maybe someone can point to a paper that shows this clearly).


Dean Abbott said...

FYI: I left a comment on this topic at the original site with some specifics on implementation of normalization in Clementine.

Will Dwinnell said...

This is a good point, Dean. I have wondered how much difference it might make to use different normalizations. The standard, of course, is subtract-and-divide to get zero-mean and unit-standard deviation, but some distributions are poorly characterized by these summaries. I have used subtract-and-divide to get zero-median and unit-IQR, but there are plenty of other options.
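A sketch of the two options Will mentions (the gift values are made up; `statistics.quantiles` supplies the quartiles):

```python
import statistics

def zscore(xs):
    """Classic subtract-and-divide: zero mean, unit standard deviation."""
    mu, sd = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]

def robust_scale(xs):
    """Zero median, unit interquartile range: less swayed by outliers."""
    q1, med, q3 = statistics.quantiles(xs, n=4)
    return [(x - med) / (q3 - q1) for x in xs]

gifts = [5, 8, 10, 12, 15, 18, 20, 25, 30, 500]  # one extreme gift
z = zscore(gifts)        # the outlier inflates the mean and sd,
r = robust_scale(gifts)  # but barely moves the median and IQR
```

For a heavily skewed field like LASTGIFT, the median/IQR version keeps the bulk of the data spread out instead of compressed by the outlier.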

Ted Dunning ... apparently Bayesian said...

Highly correlated variables make it difficult to use k-means directly. Essentially what happens is that you have the same sort of scaling problem as you are discussing, but aligned at some angle relative to the original axes. Getting rid of that problem requires not just normalizing each axis independently, but using a correlation matrix to realign and scale the axes.

Here is some R code that illustrates this. First take some data that needs normalization like you describe:

u = rnorm(200)                      # shared horizontal coordinate
v = c(rnorm(100), 10 + rnorm(100))  # two clusters separated vertically
plot(10 * u, v, type = 'p')         # stretched 10x horizontally

This gives us horizontally elongated clusters that can be handled well using per variable normalization.

But if we take a correlated case:

x = 10 * u - v
y = 10 * u + v          # the same clusters, rotated 45 degrees
plot(x, y, type = 'p')

Then we get elongated clusters at a 45 degree angle. This can be very difficult to deal with using k-means because the correlation in the different clusters can easily be different. To deal with this, it is really better to use a more advanced clustering such as a Gaussian mixture model or spectral clustering. Non-parametric Bayesian approaches to the mixture modeling can be particularly useful since you don't have to commit to a set number of clusters.
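For example, a Gaussian mixture with full covariance matrices handles exactly this: each component learns its own rotation and scale. A scikit-learn sketch of the same setup as the R code (seed and sizes are arbitrary):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two elongated clusters at a 45-degree angle, as in the R example
u = rng.normal(size=200)
v = np.concatenate([rng.normal(size=100), 10 + rng.normal(size=100)])
X = np.column_stack([10 * u - v, 10 * u + v])

# covariance_type='full': each component fits its own orientation,
# which k-means with plain Euclidean distance cannot represent
gmm = GaussianMixture(n_components=2, covariance_type='full',
                      n_init=5, random_state=0).fit(X)
labels = gmm.predict(X)
```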

nvasil said...

This is a deeper problem. Most of the time the Euclidean metric is not the correct one; a weighted Euclidean metric is the right one. There are some pretty good methods, like k-means with metric learning.

Also, you can find a fast implementation of k-means here. It has a weighted metric (although it is still under development). It was much faster than Weka/Matlab.

Dean Abbott said...

nvasil: Thanks for the comment and link; I will be looking it up. The biggest problem for me is that I usually use commercial tools, and if they don't have the particular option of weighted Euclidean or Mahalanobis distance, unless I can fake it (for example with Mahalanobis by pre-transforming by the covariance matrix), I'm out of luck. I guess that is an argument for using Matlab, right Will? :)
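The pre-transform Dean mentions can be sketched as a whitening step (a ZCA-style example with NumPy; after it, ordinary Euclidean distance on the transformed data equals Mahalanobis distance under the global covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated data like Ted's example: elongated at a 45-degree angle
u = rng.normal(size=200)
v = np.concatenate([rng.normal(size=100), 10 + rng.normal(size=100)])
X = np.column_stack([10 * u - v, 10 * u + v])

# Whiten: rotate onto the covariance eigenvectors, scale each direction
# to unit variance, rotate back. Any k-means tool can then be run on Xw.
cov = np.cov(X, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
Xw = (X - X.mean(axis=0)) @ W
```

As Ted notes, this only removes the *global* correlation; if each cluster has its own orientation, a mixture model is still the better fit.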

Dean Abbott said...

Just posted this reply to a related question about k-means clustering on Data Science Central, and thought it would be a good addition here. The question was: "How is 'k' determined in k-means clustering (using FASTCLUS)?"

If there is not an operational definition for the number of clusters, yes, you have to figure this out yourself. You can use an algorithm to figure it out, but how do you know the algorithm is trading off the # clusters vs. compactness the way you want?

You have to have some idea of what you want, of course, but usually in my consulting engagements where k was unknown we would do the following.
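The numeric trade-off I'm wary of can still be computed; the usual heuristic (not named above) is the "elbow" on within-cluster variance. A scikit-learn sketch on synthetic blobs where the right answer is k=3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs, so the "right" k is 3
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 5, 10)])

# inertia_ = total within-cluster sum of squares (compactness)
inertia = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in range(1, 7)}
# Inertia drops sharply up to k=3, then flattens: the elbow suggests k=3
```

The elbow answers the compactness question, but as argued above it cannot tell you whether that k is operationally sensible.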

1) Interpret the clusters
There are two ways to interpret clusters. First, compute the mean values of all the input variables to get the gist of where the clusters are centered. (Normalizing the input variables can greatly influence the formation of the clusters and these mean values; I have a blog post on this topic.)

The second way is to compute how the clusters differ from one another. You can compute the mean values of every variable in the clusters, but it could be that all the variables except one have the same mean for every cluster--it's just one variable that is really responsible for driving the formation of the clusters. But how do you easily find these differences, especially when you have perhaps dozens of input variables? You can eyeball it, but that is tedious and easy to get wrong. I prefer to find the differences algorithmically. How? By using decision trees to predict the cluster label from the same inputs. (After all, the one thing that stood in the way of doing supervised learning in the first place was that we didn't have labels for the data; now that we have clusters, we have record labels!) The tree doesn't have to be perfect--it just has to capture the gist of the key differences between clusters.
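A sketch of this decision-tree trick with scikit-learn (synthetic, made-up data in which AGE drives the clusters and NUMCHILD is noise):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)
# Synthetic donors: two age groups; number of children is just noise
age = np.concatenate([rng.normal(40, 3, 100), rng.normal(65, 3, 100)])
numchild = rng.integers(0, 4, 200).astype(float)
X = np.column_stack([age, numchild])

# Unsupervised step: form the clusters
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised step: a shallow tree explains what separates them
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)
print(export_text(tree, feature_names=["AGE", "NUMCHILD"]))
```

The printed rules split on AGE, flagging it as the variable responsible for the clusters, without any eyeballing of per-cluster means.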

2) Overlay the clusters with another measure of interest
If you have another important variable, even one that was not included in the cluster analysis, computing its mean value (or IQR) for each cluster gives you a sense of what the clusters may mean operationally. For example, if you are clustering customers, you can overlay demographics on top of the clusters (age, income, home value, etc.). Or, when I built fraud-related clusters where we had so few adjudicated fraud cases that we couldn't build supervised learning models, we could still overlay the fraud label for the relatively few cases we had, to get a sense of which clusters the fraudulent transactions landed in.
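A minimal sketch of the overlay (the cluster labels and fraud flags are made up):

```python
from collections import defaultdict
from statistics import mean

# Each record: the cluster it landed in, plus an overlay variable
# (here a fraud flag) that was NOT used to form the clusters
records = [
    {"cluster": 0, "fraud": 0}, {"cluster": 0, "fraud": 0},
    {"cluster": 1, "fraud": 0}, {"cluster": 1, "fraud": 1},
    {"cluster": 2, "fraud": 1}, {"cluster": 2, "fraud": 1},
]

by_cluster = defaultdict(list)
for r in records:
    by_cluster[r["cluster"]].append(r["fraud"])

# Mean of the overlay per cluster: here, the fraud rate
fraud_rate = {c: mean(flags) for c, flags in by_cluster.items()}
# Fraud concentrates in cluster 2, hinting at what that cluster means
```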

So there are huge differences between selecting the number of clusters based on operational concerns vs. numeric concerns. When operational concerns drive the choice, cluster compactness and separation may not be the most important aspects to consider (though sometimes they may be, too!).
