Thursday, April 02, 2009

Why normalization matters with K-Means

A question about K-means clustering in Clementine was posted here. I thought I knew the answer, but took the opportunity to prove it to myself.

I took the KDD-Cup 98 data and looked at just four fields: AGE, NUMCHILD, TARGET_D (the amount the recaptured lapsed donors gave) and LASTGIFT. I took only four to keep the problem simple, and chose variables with relatively large differences in mean values (where normalization might matter). The two monetary variables pose a further problem: both are severely positively skewed.

The following image shows the results of two clustering runs with the Clementine K-Means algorithm: the first on raw data, the second on normalized data. The normalization consisted of log transforms (for TARGET_D and LASTGIFT) followed by z-scores for all four fields (the log-transformed fields, AGE and NUMCHILD). I used the default of 5 clusters.
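For readers who want to reproduce the normalization outside Clementine, here is a minimal Python sketch of the two steps described above. The function names and toy values are hypothetical, and two details are assumptions the post does not settle: the use of log(1 + x) so that zero amounts are handled, and the population (rather than sample) standard deviation in the z-score.

```python
import math
import statistics

def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def log_then_zscore(values):
    """Log-transform a positively skewed, non-negative field, then z-score it."""
    # Assumption: log1p(x) = log(1 + x) is used so that zero amounts are valid.
    return zscore([math.log1p(v) for v in values])

# Hypothetical toy values, not taken from the KDD-Cup 98 data.
age = [35, 42, 58, 61, 47]
lastgift = [5.0, 10.0, 15.0, 25.0, 200.0]

age_z = zscore(age)                     # AGE and NUMCHILD: z-score only
lastgift_z = log_then_zscore(lastgift)  # TARGET_D and LASTGIFT: log, then z-score
```

After this step every field has mean 0 and standard deviation 1, so no single field dominates purely because of its units.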

Here are the results in tabular form. Note that I'm reporting unnormalized values for the "normalized" clusters even though the actual clusters were formed by the normalized values. This is purely for comparative purposes.

Note that:
1) the results are different, as measured by the counts in each cluster.
2) the unnormalized clusters are dominated by TARGET_D and LASTGIFT: one cluster contains the large values and the remaining clusters have little variance.
3) AGE and NUMCHILD have some similar breakouts (40s with more children and 40s with fewer children, for example).

So, the conclusion (to answer the original question) is that K-Means in Clementine does not normalize the data. Since Euclidean distance is used, the clusters will be influenced strongly by the magnitudes of the variables, and especially by outliers. Normalizing removes this bias. However, whether one wants this bias removed depends on what one wants to find: if a variable should influence the clusters more, one can steer the clusters in exactly this way, by increasing the relative magnitude of that field.
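To see why magnitude matters, it helps to look at how much each field contributes to the squared Euclidean distance between two records. The sketch below is illustrative only: the records are hypothetical (not from the KDD-Cup 98 data), the helper names are made up for this example, and the z-score uses the population standard deviation.

```python
import statistics

# Hypothetical records: (AGE, LASTGIFT).
records = [(35, 5.0), (42, 10.0), (58, 15.0), (61, 25.0), (47, 200.0)]

def zscore_columns(rows):
    """Z-score each column using the population standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

def share_of_squared_distance(a, b):
    """Fraction of the squared Euclidean distance contributed by each field."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(squared)
    return [s / total for s in squared]

# Compare the first record with the large-gift outlier, before and after scaling.
raw_share = share_of_squared_distance(records[0], records[4])
scaled = zscore_columns(records)
scaled_share = share_of_squared_distance(scaled[0], scaled[4])
```

On the raw values, LASTGIFT accounts for more than 99% of the squared distance between these two records; after z-scoring, AGE contributes a much larger share. Multiplying a z-scored field by a constant greater than 1 is the simple way to give it deliberate extra weight.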

One last issue that I didn't explore here is the effect of correlated variables (LASTGIFT and TARGET_D, to some degree, here). It seems to me that correlated variables will artificially bias the clusters toward natural groupings of those variables, though I have never measured the extent of this bias in a controlled way (maybe someone can point to a paper that shows this clearly).

61 comments:

Dean Abbott said...

FYI: I left a comment on this topic at the original site with some specifics on implementation of normalization in Clementine.

Will Dwinnell said...

This is a good point, Dean. I have wondered how much difference it might make to use different normalizations. The standard, of course, is subtract-and-divide to get zero-mean and unit-standard deviation, but some distributions are poorly characterized by these summaries. I have used subtract-and-divide to get zero-median and unit-IQR, but there are plenty of other options.
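A minimal Python sketch of the median/IQR variant, for comparison with the usual z-score. The function name and values are hypothetical, and the quartiles come from `statistics.quantiles` with its default "exclusive" method, which is only one of several quartile conventions.

```python
import statistics

def robust_scale(values):
    """Subtract the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    # With n=4, quantiles() returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

# Hypothetical skewed gift amounts with one large outlier.
gifts = [5.0, 10.0, 15.0, 25.0, 200.0]
scaled = robust_scale(gifts)
```

The outlier still ends up far from zero, but it no longer inflates the scale used for every other record the way it inflates a standard deviation.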

Ted Dunning ... apparently Bayesian said...

Highly correlated variables make it difficult to use k-means directly. Essentially what happens is that you have the same sort of scaling problem as you are discussing, but aligned at some angle relative to the original axes. Getting rid of that problem requires not just normalizing each axis independently, but using a correlation matrix to realign and scale the axes.

Here is some R code that illustrates this. First take some data that needs normalization like you describe:

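u = rnorm(200)                      # 200 standard-normal draws
v = c(rnorm(100), 10+rnorm(100))    # two groups, centered at 0 and at 10
plot(10*u, v, type='p')             # x is stretched 10x relative to y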

This gives us horizontally elongated clusters that can be handled well using per-variable normalization.

But if we take a correlated case:

x = 10*u - v    # both new axes are dominated by the shared 10*u term,
y = 10*u + v    # so x and y are strongly correlated
plot(x, y, type='p')

Then we get elongated clusters at a 45-degree angle. This can be very difficult to deal with using k-means because the correlation can easily differ from one cluster to the next. To deal with this, it is really better to use a more advanced clustering method such as a Gaussian mixture model or spectral clustering. Non-parametric Bayesian approaches to mixture modeling can be particularly useful since you don't have to commit to a set number of clusters.

Venki said...

Hi friends,
I have written a blog on Data Preprocessing emphasizing its significance through a simple example for Normalization. Please do visit and leave your comments.

http://intelligencemining.blogspot.com/2009/07/data-preprocessing-normalization.html

nvasil said...

This is a deeper problem. Most of the time the Euclidean metric is not the correct one. A weighted Euclidean metric is the right one. There are some pretty good methods, like k-means with metric learning.

Also, you can find a fast implementation of k-means here:
http://www.analytics1305.com/documentation/algorithm_reference.html
It has a weighted metric (although it is still under development). It was much faster than Weka/MATLAB.

Dean Abbott said...

nvasil: Thanks for the comment and link; I will be looking it up. The biggest problem for me is that I usually use commercial tools, and if they don't have the particular option of weighted Euclidean or Mahalanobis distance, unless I can fake it (for example with Mahalanobis by pre-whitening the data with the inverse square root of the covariance matrix), I'm out of luck. I guess that is an argument for using Matlab, right Will? :)
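As an illustration of the pre-transform trick, here is a minimal two-variable Python sketch. It is an assumption-laden example rather than anything a particular tool does: the function name and data are hypothetical, the covariance is the population covariance, and the transform uses the Cholesky factor of the covariance matrix, which is one of several choices that give the transformed data an identity covariance. Plain Euclidean distance on the output then equals Mahalanobis distance on the input.

```python
import math
import statistics

def whiten_2d(points):
    """Transform 2-D points so Euclidean distance equals Mahalanobis distance."""
    xs, ys = zip(*points)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(points)
    # Entries of the population covariance matrix.
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Cholesky factor L of the covariance: [[l11, 0], [l21, l22]].
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 ** 2)
    # Apply the inverse of L to the centered data (forward substitution).
    whitened = []
    for x, y in points:
        z1 = (x - mx) / l11
        z2 = ((y - my) - l21 * z1) / l22
        whitened.append((z1, z2))
    return whitened

# Hypothetical, strongly correlated data.
points = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5), (5.0, 9.0)]
white = whiten_2d(points)
```

The whitened columns can be fed to any k-means implementation that only offers Euclidean distance. Note that this uses one global covariance matrix, so it does not help when the correlation differs from cluster to cluster.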

Dean Abbott said...

Just posted this reply to a related question about k-means clustering on Data Science Central, and thought it would be a good addition here (http://www.analyticbridge.com/forum/topics/k-means-clustering), where the question was: "How is 'k' determined in k-means clustering (using FASTCLUS)?"

If there is not an operational definition for the number of clusters, yes, you have to figure this out yourself. You can use an algorithm to figure it out, but how do you know the algorithm is trading off the # clusters vs. compactness the way you want?

You have to have some idea of what you want, of course, but usually in my consulting engagements where k was unknown we would do the following.

1) Interpret the clusters
There are two ways to interpret clusters. First, we compute the mean values of all the input variables to get the gist of where the clusters are centered. (Normalizing the input variables can greatly influence the formation of clusters and these mean values. I have a blog post on this topic here: http://abbottanalytics.blogspot.com/2009/04/why-normalization-matters-with-k-means.html.)

The second way is to compute how the clusters differ from one another. You can compute the mean values of every variable in the clusters, but it could be that all the variables except one have the same mean for every cluster, and that one variable is really responsible for driving the formation of the clusters. But how do you easily find these differences, especially when you have perhaps dozens of input variables?
You can eyeball it, but that can be tedious and easy to get wrong. I prefer to find this algorithmically. How? By using decision trees to predict the cluster label from the same inputs. (After all, the one thing that stood in our way of doing supervised learning in the first place was that we didn't have labels for the data. Now that we have clusters, we have record labels!) The tree doesn't have to be perfect; it just has to capture the gist, so that you understand the key differences between clusters.

2) Overlay the clusters with another measure of interest
If you have another important variable, even one that was not included in the cluster analysis, you can compute its mean value (or IQR) for each cluster to get a sense for what the clusters may mean operationally. For example, if you are computing clusters of customers, you can overlay demographics on top of the clusters (age, income, home value, etc.). Or, when I built fraud-related clusters where we had so few adjudicated fraud cases that we couldn't build supervised learning models, we could still overlay the fraud label, even for the relatively few cases we had, to get a sense for which clusters those fraudulent transactions landed in.
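Both interpretation steps can be sketched in a few lines of Python. This is a simplified stand-in, not a full decision tree: `best_stump` (a hypothetical helper) searches only for the single best first split, scored by how many records the majority label on each side gets right, and `overlay_mean` (also hypothetical) averages an outside measure within each cluster. The data, field names, and labels are invented for illustration.

```python
from collections import Counter, defaultdict
import statistics

def best_stump(rows, labels, names):
    """Find the one field/threshold split that best predicts the cluster label."""
    best = (None, None, -1)  # (field name, threshold, correctly labeled count)
    for j, name in enumerate(names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = Counter(lab for row, lab in zip(rows, labels) if row[j] <= t)
            right = Counter(lab for row, lab in zip(rows, labels) if row[j] > t)
            correct = max(left.values()) + max(right.values())
            if correct > best[2]:
                best = (name, t, correct)
    return best

def overlay_mean(labels, measure):
    """Mean of an outside measure within each cluster."""
    groups = defaultdict(list)
    for lab, m in zip(labels, measure):
        groups[lab].append(m)
    return {lab: statistics.fmean(vals) for lab, vals in groups.items()}

# Hypothetical data: two input fields, cluster labels from an earlier k-means
# run, and an income measure that was not used to build the clusters.
names = ["age", "num_child"]
rows = [(30, 2), (32, 0), (31, 1), (60, 2), (62, 0), (61, 1)]
labels = ["A", "A", "A", "B", "B", "B"]
income = [40.0, 42.0, 44.0, 70.0, 72.0, 74.0]

field, threshold, correct = best_stump(rows, labels, names)
income_by_cluster = overlay_mean(labels, income)
```

In this toy case the split on age separates the two clusters completely, which says that age, not number of children, is what drives them. With real data a proper tree learner from your tool of choice would replace the single split.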

So there are huge differences between selecting the number of clusters based on operational concerns vs. numeric concerns. In the former case, cluster compactness and separation may not be the most important aspects to consider (though sometimes they may be, too!).
