Monday, March 14, 2005

Use Priors to Balance Class Counts

One well-known difficulty in building classification models occurs when one class vastly outnumbers the others. For example, if the output (target) variable is 95% 0s and 5% 1s, a neural network could predict that every record is a 0 and achieve 95% accuracy. Of course, this is a meaningless model. The situation arises when there are contradictions in the data, that is, when the same input patterns appear with output values of both 0 and 1. If more of those records have an output value of 0, the classifier will choose 0 as the more likely answer.

The most common correction when building neural networks for data with a large imbalance of class counts is simply to balance the counts of 0s and 1s, either by removing excess records with 0s or by duplicating records with 1s. That technique will be covered in a future issue of Abbott Insights™.

However, some algorithms can accomplish this balancing without sampling, by letting you specify the expected prior probability of each class value (priors). The CART decision tree algorithm is one that provides settings for this. The advantage is that no data is thrown away, yet the classifier won’t favor the overrepresented class value over the underrepresented one.
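
As an illustration, here is a minimal sketch of the idea using scikit-learn, whose DecisionTreeClassifier implements a CART-style tree. The package choice is my assumption; the original CART software exposes priors directly, while scikit-learn expresses the same idea through its class_weight parameter.

    # A sketch of prior-style balancing with a CART-style tree.
    # class_weight plays the role of priors: 'balanced' weights each class
    # inversely to its frequency, so the rare class is not simply ignored.
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic data with a 95/5 imbalance between 0s and 1s
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                               random_state=0)

    # Default tree: implicitly takes the training proportions as the priors
    plain = DecisionTreeClassifier(random_state=0).fit(X, y)

    # Weighted tree: treats the classes as equally likely a priori,
    # without removing or duplicating any records
    balanced = DecisionTreeClassifier(class_weight='balanced',
                                      random_state=0).fit(X, y)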

Monday, February 07, 2005

Create Three Sampled Data Sets, not Two

One often sees an appeal to split data into two data sets for modeling: a training set and a testing set. The training set is used to build a model, and the testing set is used to assess it. If model accuracy is good on the training set but poor on the testing set, one has a good indication that the model has been overfit; in other words, the model has picked up on patterns that are specific to the training data. In this case, the best course of action is to adjust parameters in the modeling algorithm so that a simpler model is created, whether that means fewer inputs (for neural networks, regression, nearest neighbor, etc.) or fewer nodes or splits (neural networks or decision trees). Then retrain and retest the model to see if results have improved, particularly on the testing data.

However, if one does this several times, or even dozens of times (which is common), the testing data ceases to be an independent assessment of model performance, because it has been used to change the inputs or algorithm parameters. Therefore, it is strongly recommended to hold out a third data set to perform a final validation. This validation step should occur only after training and testing have provided confidence that the model is good enough to deploy.
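
Below is a minimal sketch of such a three-way split, assuming scikit-learn's train_test_split utility and illustrative 60/20/20 proportions.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholder data; substitute your own modeling table
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = rng.integers(0, 2, size=1000)

    # Carve off 20% as the validation set, and leave it untouched until
    # training and testing suggest the model is good enough to deploy
    X_rest, X_valid, y_rest, y_valid = train_test_split(
        X, y, test_size=0.20, random_state=0)

    # Split the remainder into training (60% of the total) and testing
    # (20% of the total); 0.25 of the remaining 80% is 20%
    X_train, X_test, y_train, y_test = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=0)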



Tuesday, January 04, 2005

Beware of Outliers in Computing Correlations

Outliers can mislead summary statistics in a variety of ways. One such case is computing correlations, and it doesn’t take many outliers to change correlation coefficients significantly. Take, for example, 1,000 random samples of a variable X, uniformly distributed over the range [0,1]. A second variable, Y, is a combination of X and a second uniform random sample, constructed so that X and Y have a correlation coefficient of 0.99. Now suppose two outliers are introduced into the 1,000 data points. Their X values are unchanged, but their Y values are magnified by a factor of 10 and placed away from the trend of the original data points. The left figure below shows the strong correlation of X and Y, and the figure to the right shows the same data with the two outliers.

[Figure: scatter plots of X vs. Y. Left: the original data, with correlation 0.99. Right: the same data with the two outliers added.]

The statistics for the four variables are shown in Table 1 below. When one looks at the correlations (Table 2), however, the correlation drops from 0.99 to 0.49.

[Table 1: summary statistics for X, Y, and the outlier-modified X and Y.]

[Table 2: correlation matrix; the X-Y correlation falls from 0.99 to 0.49 once the two outliers are included.]

Therefore, test your data for outliers that could be influencing summary statistics. More on how to do that in a future issue of Abbott Insights™.
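
For the curious, the effect is easy to reproduce. Here is a small sketch using numpy; the noise level of 0.14 is my choice to make the clean correlation come out near 0.99, and the exact post-outlier coefficient depends on where the two outliers fall.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    x = rng.uniform(0, 1, n)
    # Mix in a little uniform noise so that corr(x, y) is roughly 0.99
    y = x + 0.14 * rng.uniform(0, 1, n)
    print(np.corrcoef(x, y)[0, 1])      # approximately 0.99

    # Introduce two outliers: same X values, Y magnified by a factor of 10
    y_out = y.copy()
    y_out[:2] = 10 * y[:2]
    print(np.corrcoef(x, y_out)[0, 1])  # drops sharply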

Tuesday, December 07, 2004

Find Correlated Variables Prior to Modeling

Many data sets contain highly correlated variables that measure the same kind of information in different ways. The same problem often occurs when in-house data is appended with third-party data (census data, for example). Some algorithms will build unstable models if two or more highly correlated variables are included, and others will just slow down. Either way, it is a good idea to remove highly (linearly) correlated variables. But how do you identify and remove them?

Data mining software packages frequently allow you to measure correlation between variables, but they don’t typically allow you to select a variable subset based on a correlation threshold. A trick for relatively small data sets that fit into Excel is the following. Export a snippet of the real-valued columns of data as tab- or comma-delimited text and load it into Excel. Use the correlation option of the Data Analysis tool to create the correlation matrix. Then use Excel’s conditional formatting to highlight cells with high correlations in one color (green), medium correlations in a second color (orange), and low correlations in a third color (blue). Typically I use logic like “if the cell value is not between 0.9 and –0.9, color the cell green.”

Once the cells are color coded, one typically sees blocks of variables that are highly correlated with one another. The threshold depends on the application, but I typically use +/- 0.9. Only one variable in each block is needed to represent that information in the model; remove the others from the list of candidate inputs. This process can remove half or more of the variables from consideration without losing the ability to build reliable models. Additionally, the visual correlation matrix provides insights into variable groupings that are not readily available without some kind of factor analysis or principal component analysis.
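
For data sets too large for Excel, the same screen can be done programmatically. Here is a rough sketch using pandas; the column names echo the band example below and are purely illustrative. For each pair of variables whose correlation exceeds the threshold, the first variable is kept and the second is marked for removal.

    import numpy as np
    import pandas as pd

    # Toy data: band4 is nearly a duplicate of band3
    rng = np.random.default_rng(0)
    base = rng.normal(size=1000)
    df = pd.DataFrame({'band3': base,
                       'band4': base + 0.05 * rng.normal(size=1000),
                       'band8': rng.normal(size=1000)})

    corr = df.corr()
    threshold = 0.9
    to_drop = set()
    cols = corr.columns
    for i in range(len(cols)):
        if cols[i] in to_drop:
            continue  # already slated for removal; skip its pairs
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                to_drop.add(cols[j])

    reduced = df.drop(columns=sorted(to_drop))  # one survivor per block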

A sample correlation matrix is shown below.

[Correlation matrix of the band variables, with highly correlated cells highlighted.]

Bands 3, 4, and 5 are correlated with each other above the 0.9 level.
Bands 8, 9, and 10 are correlated with each other above the 0.9 level.
Bands 11 and 12 are correlated with each other above the 0.9 level.

Therefore, one may want to remove bands 4, 5, 9, 10, and 12 from the candidate input list.

Wednesday, August 06, 2003

Data Mining: Art or Science?

Data Mining is as much art as science, and is not yet turn-key. Therefore, all those tips and tricks that one learns from experience (and are not in textbooks) can make or break a data mining project. This blog is intended to be a repository of user experiences in data mining.

If you want help with data mining, there are other sites that provide FAQs and tutorials; the links point to some of those resources. Please limit yourselves to success/failure stories and follow-up questions.