Tuesday, December 07, 2004

Find Correlated Variables Prior to Modeling

Many data sets contain highly correlated variables that measure the same kind of information in different ways. The same problem often occurs when in-house data is appended with third-party data (census data, for example). Some algorithms build unstable models when two or more highly correlated variables are included, and others merely slow down. Either way, it is a good idea to remove highly (linearly) correlated variables. But how do you identify and remove them?

Data mining software packages frequently let you measure correlation between variables, but they don’t typically let you select a variable subset based on a correlation threshold. One trick, when the data set is small enough to fit into Excel, is the following. Export a snippet of the real-valued columns of data as tab- or comma-delimited text and load it into Excel. Use the Correlation option in the Data Analysis tool to create the correlation matrix. Then use Excel’s conditional formatting to highlight the cells: one color for high correlations (green), a second color for medium correlations (orange), and a third for low correlations (blue). Typically I use logic like “if the cell value is not between 0.9 and –0.9, color the cell green.”
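The same highlighting idea can be sketched outside Excel. Here is a minimal example, assuming pandas and NumPy are available; the data is synthetic and purely illustrative (b2 and b3 are noisy copies of b1, b4 is independent), and it simply prints the pairs that would be colored green under the +/- 0.9 rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: b2 and b3 are near-copies of b1; b4 is independent.
b1 = rng.normal(size=n)
df = pd.DataFrame({
    "b1": b1,
    "b2": b1 + rng.normal(scale=0.1, size=n),
    "b3": b1 + rng.normal(scale=0.1, size=n),
    "b4": rng.normal(size=n),
})

corr = df.corr()  # Pearson correlation matrix

# Flag the pairs beyond the +/-0.9 threshold (the "green" cells),
# looking only above the diagonal to avoid duplicates.
high = [(a, b, corr.loc[a, b])
        for i, a in enumerate(corr.columns)
        for b in corr.columns[i + 1:]
        if abs(corr.loc[a, b]) > 0.9]

for a, b, r in high:
    print(f"{a} vs {b}: r = {r:.3f}")
```

With this synthetic data, the three pairs among b1, b2, and b3 come out above 0.9, while every pair involving b4 stays well below it.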

Once the cells are color coded, one typically sees blocks of variables that are highly correlated with one another. The threshold depends on the application, but I typically use +/- 0.9. Only one of those variables is needed to represent that idea in the model; remove the others from the list of candidate inputs. This process can remove half or more of the variables from consideration without losing the ability to build reliable models. Additionally, the visual correlation matrix provides insights into variable groupings not readily available without doing some kind of factor analysis or principal component analysis.

A sample correlation matrix is shown below.

Bands 3, 4, and 5 are correlated with each other above the 0.9 level.
Bands 8, 9, and 10 are correlated with each other above the 0.9 level.
Bands 11 and 12 are correlated with each other above the 0.9 level.

Therefore, one may want to remove bands 4, 5, 9, 10, and 12 from the candidate input list.
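The keep-one-per-group step can be automated with a simple greedy pass. This is a sketch, not the post's actual procedure: the `drop_correlated` helper below is hypothetical, and it assumes a pandas DataFrame of candidate inputs. It walks the columns in order and keeps a column only if its correlation with every already-kept column stays within the threshold.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Greedy filter: keep the first variable of each correlated block."""
    corr = df.corr().abs()
    kept = []
    for col in corr.columns:
        # Keep this column only if it is not too correlated with any kept one.
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return kept

# Tiny synthetic demo echoing the band example: band4 and band5 are
# noisy copies of band3, while band8 is independent.
rng = np.random.default_rng(1)
base = rng.normal(size=100)
df = pd.DataFrame({
    "band3": base,
    "band4": base + rng.normal(scale=0.05, size=100),
    "band5": base + rng.normal(scale=0.05, size=100),
    "band8": rng.normal(size=100),
})
print(drop_correlated(df))
```

Here the filter keeps band3 and band8 and drops band4 and band5, mirroring the idea of keeping one representative per correlated block.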


Anonymous said...

Hello, I have a question regarding removing variables before modeling.

When removing variables, should the significance of correlation play any role in whether or not I remove the variables? Or does removing variables solely depend on the correlation coefficient?

Should I choose a range that I do not want my coefficients to exceed? Can I say, for example, that I want to remove variables such that none of the coefficients between independent variables falls outside (-0.2, 0.2)? Is this the way to remove variables? By choosing an allowable coefficient range and removing based on that? Or does the significance of the coefficient play a role in removing them?

If a coefficient is 0.06 with a significance of 0.005, should the fact that the correlation is significant at this low of a level impact my decision to remove one of the two variables? Or is this just simply telling me that it is 99.5% certain that this is how the two variables will interact if the entire population is analyzed?

I apologize for trying to explain so in detail, but I cannot find the answer anywhere (probably because I'm just dumb and it's super obvious).

Dean Abbott said...

a few comments...

first, the correlation filtering as I envisioned it here is intended to remove redundant variables, that is, variables whose extremely high correlation magnitudes indicate that the pairs of variables contain the same information. Correlations lower in magnitude than 0.9, or maybe 0.8 at the low end, I wouldn't touch using this method.

significance is largely a result of data size (N). If you have a correlation coefficient of 0.9 but at a low significance, it is because there aren't many records to begin with; that's really the bigger problem! So for most of the data I interact with, everything is significant.
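A quick sketch of that point, using the standard t-statistic for a Pearson correlation, t = r * sqrt((N - 2) / (1 - r^2)): the same coefficient becomes arbitrarily "significant" as N grows, which is why a tiny r like the 0.06 in the question above can carry a very small p-value on a large data set. (The formula is textbook-standard; the function name is just for illustration.)

```python
import math

def corr_t_stat(r: float, n: int) -> float:
    """t-statistic for testing a Pearson correlation r on n records."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# The same r = 0.06 goes from nowhere near significant to
# overwhelmingly significant as N grows.
for n in (10, 100, 10_000):
    print(n, round(corr_t_stat(0.06, n), 2))
```

So a "significant" correlation of 0.06 mostly tells you that N is large, not that the two variables are redundant.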

But that doesn't answer your question. :) If I saw a pair of variables with high correlation and low significance, I would probably still remove one of the two, because even if the likelihood that the correlation is true is low, in the data I have right now it's still a high correlation. Algorithms will still be affected by the high correlation (collinearity in linear regression, bias in k-means, k-NN, and PCA models, etc.). So even if I were to bootstrap or cross-validate, the data is still a problem that needs to be addressed.