Tuesday, December 07, 2004

Find Correlated Variables Prior to Modeling

Many data sets contain highly correlated variables that measure the same kind of information in different ways. The same problem often occurs when in-house data is appended with third-party data (census data, for example). Some algorithms will build unstable models if two or more highly correlated variables are included in the model, and others will just slow down. Either way, it is a good idea to remove highly (linearly) correlated variables. But how do you identify them and remove them?

Frequently, data mining software packages allow you to measure correlation between variables, but they don’t typically allow you to select a variable subset based on some correlation threshold. A trick to use when dealing with relatively small data sets that can fit into Excel is to do the following. Export a snippet of the real-valued columns of data as tab or comma delimited, and load it into Excel. Use the correlation data analysis option to create the correlation matrix. Then use the conditional formatting option in Excel to highlight the cells where high correlations occur as one color (green), medium correlations as a second color (orange), and low correlations as a third color (blue). Typically I use logic like “if the cell value is not between 0.9 and –0.9, color the cell green.”
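For readers who would rather script the same idea than use Excel, here is a minimal sketch in plain Python. It computes the correlation matrix and labels each cell the way the conditional formatting does. The column names, the toy values, and the 0.5 cutoff for "medium" are assumptions made up for illustration; only the 0.9 cutoff comes from the post.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict so that matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

def label(r, high=0.9, medium=0.5):
    """Bucket a correlation by magnitude; the 0.5 'medium' cutoff is an assumed value."""
    magnitude = abs(r)
    if magnitude >= high:
        return "high"
    if magnitude >= medium:
        return "medium"
    return "low"

# Toy data, invented for this example: y tracks x closely, z does not.
columns = {
    "x": [1, 2, 3, 4, 5],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],
    "z": [5, 1, 4, 2, 3],
}

matrix = correlation_matrix(columns)
for a in columns:
    print(a, [label(matrix[a][b]) for b in columns])
```

The labels play the role of the cell colors: "high" cells are the ones to inspect for redundant variables.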

Once the cells are color coded, one typically sees blocks of data that are highly correlated with one another. The threshold depends on the application, but I typically use +/- 0.9 as a threshold. Only one of those variables is needed to represent that idea in the model; remove the others from the list of candidate inputs to the model. This process can remove half or more of the variables from consideration without losing the ability to build reliable models. Additionally, the visual correlation matrix provides insights into variable groupings not readily available without doing some kind of factor analysis or principal component analysis.

A sample correlation matrix is shown below.

[Image: sample correlation matrix with color-coded cells]

Bands 3, 4, and 5 are correlated with each other above the 0.9 level
Bands 8, 9, and 10 are correlated with each other above the 0.9 level
Bands 11 and 12 are correlated with each other above the 0.9 level

Therefore, one may want to remove bands 4, 5, 9, 10, and 12 from the candidate input list.
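The selection step can also be automated. The sketch below is one possible greedy rule, not necessarily how you would want to choose in every project: walk through the variables in order and drop any variable that is correlated above the threshold with one already kept. The band groupings are the ones listed above; the total of 12 bands and the placeholder values (0.95 within a group, 0.0 elsewhere) are assumptions, since the actual matrix values are not reproduced here.

```python
from itertools import combinations

def drop_correlated(names, corr, threshold=0.9):
    """Greedy filter: keep a variable unless it is highly correlated with one already kept."""
    kept, dropped = [], []
    for name in names:
        if any(abs(corr[frozenset((name, k))]) > threshold for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped

# Assumed setup: 12 bands, with the highly correlated groups named in the post.
bands = list(range(1, 13))
groups = [{3, 4, 5}, {8, 9, 10}, {11, 12}]

# Placeholder correlations: 0.95 inside a group, 0.0 otherwise (illustrative only).
corr = {}
for a, b in combinations(bands, 2):
    same_group = any(a in g and b in g for g in groups)
    corr[frozenset((a, b))] = 0.95 if same_group else 0.0

kept, dropped = drop_correlated(bands, corr)
print("keep:", kept)     # keep: [1, 2, 3, 6, 7, 8, 11]
print("drop:", dropped)  # drop: [4, 5, 9, 10, 12]
```

Because the rule keeps the first variable it meets in each group, the order of the list decides which representative survives. Reorder the list if you prefer to keep a different variable from a group.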

16 comments:

Anonymous said...

Hello, I have a question about removing variables before modelling.

When removing variables, should the significance of correlation play any role in whether or not I remove the variables? Or does removing variables solely depend on the correlation coefficient?

Should I choose a range that I do not want my coefficients to exceed? Can I say, for example, that I want to remove variables such that none of the coefficients between independent variables falls outside (-0.2, 0.2)? Is this the way to remove variables, by choosing an allowable coefficient range and removing based on that? Or does the significance of the coefficient play a role in removing them?

If a coefficient is 0.06 with a significance of 0.005, should the fact that the correlation is significant at this low of a level impact my decision to remove one of the two variables? Or is this just simply telling me that it is 99.5% certain that this is how the two variables will interact if the entire population is analyzed?

I apologize for explaining in so much detail, but I cannot find the answer anywhere (probably because I'm just dumb and it's super obvious).

Dean Abbott said...

a few comments...

first, the correlation filtering as I envisioned it here is intended to remove redundant variables, that is, variables with extremely high correlation magnitudes that indicate the pairs of variables contain the same information. I wouldn't use this method to touch correlations lower than 0.9, or maybe 0.8 at the low end.

significance is largely a result of data size (N). If you have a correlation coefficient of 0.9 but at a low significance, it is because there aren't many records to begin with; that's really the bigger problem! So for most of the data I interact with, everything is significant.

But that doesn't answer your question. :) If I saw a pair of variables with high correlation and low significance, I would probably still remove one of the two because even if the likelihood that the correlation is true is low, in the data I have right now it's still a high correlation. Algorithms will still be affected by the high correlation (collinearity in linear regression, bias in k-means, kNN, and PCA models, etc.). So even if I were to bootstrap or cross-validate, the data is still a problem that needs to be addressed.

Dean
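The point in the reply above, that significance is largely a function of N, can be illustrated with a rough calculation. This sketch uses the Fisher z-transformation with a normal approximation to get a two-sided p-value for the hypothesis of zero correlation. The function name and the sample sizes are assumptions chosen for illustration, and the approximation is crude for very small N.

```python
import math
from statistics import NormalDist

def approx_p_value(r, n):
    """Approximate two-sided p-value for H0: true correlation is 0 (Fisher z, normal approx.)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A very high correlation measured on only a handful of records (assumed N = 4).
p_small_n = approx_p_value(0.9, 4)

# A tiny correlation measured on many records (assumed N = 5000).
p_large_n = approx_p_value(0.06, 5000)

print(f"r = 0.90, N = 4:    p is about {p_small_n:.3f}")
print(f"r = 0.06, N = 5000: p is about {p_large_n:.6f}")
```

Under these assumptions the 0.9 correlation is not significant at the usual 0.05 level, while the 0.06 correlation is highly significant. That is why the filtering described in the post looks at the magnitude of the coefficient rather than its p-value.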
