Comments on Applied Data Science and Machine Learning: "Why normalization matters with K-Means" (Dean Abbott)

Anonymous (2015-12-05):
Just posted this reply to a related question about k-means clustering on Data Science Central (http://www.analyticbridge.com/forum/topics/k-means-clustering), and thought it would be a good addition here. The question was: "How is 'k' determined in k-means clustering (using FASTCLUS)?" If there is not an operational definition for the number of clusters, yes, …

Dean Abbott (2009-12-23):
nvasil: Thanks for the comment and link; I will be looking it up. The biggest problem for me is that I usually use commercial tools, and if they don't have the particular option of a weighted Euclidean or Mahalanobis distance, I'm out of luck unless I can fake it (for example, with Mahalanobis, by pre-transforming the data by the covariance matrix). I guess that is an argument for using Matlab, right?

nvasil (2009-12-22):
This is a deeper problem. Most of the time the Euclidean metric is not the correct one; a weighted Euclidean metric is the right one. There are some pretty good methods, like k-means with metric learning.
Also, you can find a fast implementation of k-means here: http://www.analytics1305.com/documentation/algorithm_reference.html. It has a weighted metric (although it is still under …

Anonymous (2009-07-16):
Hi friends,
I have written a blog post on data preprocessing, emphasizing its significance through a simple example of normalization. Please do visit and leave your comments: http://intelligencemining.blogspot.com/2009/07/data-preprocessing-normalization.html

Ted Dunning (2009-04-11):
Highly correlated variables make it difficult to use k-means directly. Essentially, what happens is that you have the same sort of scaling problem as you are discussing, but aligned at some angle relative to the original axes. Getting rid of that problem requires not just normalizing each axis independently, but using a correlation matrix to realign and scale the axes. Here is some R …

Will Dwinnell (2009-04-06):
This is a good point, Dean. I have wondered how much difference it might make to use different normalizations. The standard, of course, is subtract-and-divide to get zero mean and unit standard deviation, but some distributions are poorly characterized by these summaries.
I have used subtract-and-divide to get zero median and unit IQR, but there are plenty of other options.

Dean Abbott (2009-04-03):
FYI: I left a comment on this topic at the original site (http://www.kdkeys.net/forums/8781/ShowThread.aspx#8781) with some specifics on the implementation of normalization in Clementine.
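Dean's "fake it" trick in the thread above — running a plain Euclidean k-means after pre-transforming the data by the covariance matrix — works because Euclidean distance in the whitened space equals Mahalanobis distance in the original space. A minimal sketch in Python with NumPy (the data here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: plain Euclidean distance ignores the correlation,
# while Mahalanobis distance accounts for it.
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])

# Whitening transform: multiply by the inverse Cholesky factor of the
# covariance matrix. Euclidean distances between whitened points equal
# Mahalanobis distances between the original points.
cov = np.cov(X, rowvar=False)
L = np.linalg.cholesky(cov)
X_white = np.linalg.solve(L, X.T).T  # same as X @ inv(L).T

a, b = X[0], X[1]
mahalanobis = np.sqrt((a - b) @ np.linalg.inv(cov) @ (a - b))
euclidean_whitened = np.linalg.norm(X_white[0] - X_white[1])
print(np.isclose(mahalanobis, euclidean_whitened))  # prints True
```

Any Euclidean-only k-means tool can then be run on `X_white`, which also addresses Ted Dunning's point: whitening both rescales the axes and removes the correlation-induced tilt.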
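Will's alternative above — subtract the median and divide by the interquartile range rather than the mean and standard deviation — is one hedge against distributions that are poorly summarized by the mean and standard deviation. A small sketch in Python with NumPy (the data, with one deliberate outlier, is made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])  # one gross outlier

# Standard z-score: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

# Robust version: subtract the median, divide by the IQR.
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

# The outlier inflates the mean and standard deviation, so the z-scores
# of the ordinary points are squashed toward zero; the median/IQR
# scaling keeps their relative spread intact.
print(np.round(z[:3], 2))
print(np.round(robust[:3], 2))
```

With normalizations this different, the distances fed to k-means — and hence the resulting clusters — can change substantially, which is the point of the post.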