Tuesday, June 07, 2005

Beware of Being Fooled with Model Performance

Interpreting model performance is a minefield. If one wants model performance to be as good as possible, it is critical to define exactly what "good" means. How does one measure "goodness"? The easiest way to communicate performance is with a single-valued score, such as percent correct classification or R-squared. However, it is precisely this reduction of a model's complex behavior to a single number that can cause one to be fooled. A simple example follows.

Let's assume that a non-profit organization wants a model built that predicts the propensity of individuals to send donations, and that this model has 80+% classification accuracy, even on a test set. Furthermore, assume that the two indicators "Recent Donation Amount" (X1) and "Average Donation Amount" (X2) are two of the top predictors in the model. The figure at the left shows what a Support Vector Machine model did with this data. Even with the good accuracy, there is something disturbing about the model that isn't clear unless one sees a picture: the model isn't finding ranges of average and recent donation amounts that are associated with donors, but rather it is finding islands of donors. The second model (on the right) applied corrective measures to smooth the model, and it is much more pleasing. It is saying (roughly) that when someone donates between about $10 and $50 on average (X2), they are more likely to respond. It is smooth and there are no pockets of isolated donation amounts, making this model much more believable, even though some accuracy was lost in the process.







Thursday, May 05, 2005

Gain Insights by Building Models from Several Algorithms

A frequent question in data mining is "which algorithm performs best?" While I would argue that getting the data right is far more important than which algorithm is used, there are differences in algorithms that the analyst can use to his or her advantage.

Decision trees build rules to identify homogeneous groups of data (essentially, little rectangular chunks), so they build a whole set of local descriptions of the data. In contrast, regression builds a single global model of all the data. Neural networks are somewhere in between: a single global model is built, but it can be highly nonlinear, including finding local pockets of homogeneous data. It can be useful, therefore, to look at these different styles of models to understand general trends (regression) and local behaviour (trees). It is often the case that different algorithms arrive at the same level of performance in different ways. Seeing these differences can provide additional insight into the business processes underlying the models.
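
As a rough illustration of global versus local descriptions, here is a minimal sketch in plain Python. The toy data, the function names `fit_line` and `fit_stump`, and the use of a single-split "stump" in place of a full decision tree are all assumptions made for the example; they do not come from any particular tool.

```python
def fit_line(xs, ys):
    """Least-squares line: one global description of all the data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_stump(xs, ys):
    """Best single split (lowest squared error): two local descriptions."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

# Toy data: a gentle upward trend with a jump at x = 10.
xs = list(range(20))
ys = [0.1 * x + (2.0 if x >= 10 else 0.0) for x in xs]

slope, intercept = fit_line(xs, ys)
threshold, left_mean, right_mean = fit_stump(xs, ys)
print("global trend: slope", round(slope, 3))
print("local split at x =", threshold, "means:", round(left_mean, 2), round(right_mean, 2))
```

The line summarizes the overall trend, while the split points at where the behaviour changes; neither one alone tells the whole story.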

Friday, April 15, 2005

Beware of Automatic Handling of Categorical Variables

Categorical variables are often difficult to use because many data mining algorithms require that input (independent) variables be continuous. Fortunately, many data mining software tools handle this problem for you by converting the single categorical variable with "N" values into "N" new dummy variables, one new variable for each value. For example, if you have a field "State" with 50 text labels, the tools will automatically create 50 new variables with values 0 or 1. If a record has the value "MA" in the variable State, the new dummy column representing "MA" will have value "1", and all 49 other state dummy columns will have value "0". Because of this, analysts don't have to convert all their text and categorical variables to numeric variables prior to modeling.
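
To make the conversion concrete, here is a minimal sketch of dummy-variable expansion in plain Python. The function name `one_hot` and the `State_` column prefix are hypothetical; real tools do this internally and name the columns in their own way.

```python
def one_hot(values, prefix="State"):
    """Expand one categorical column into one 0/1 dummy column per distinct value."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

states = ["MA", "CA", "MA", "TX"]
dummies = one_hot(states)
for name, column in dummies.items():
    print(name, column)
print("one input became", len(dummies), "inputs")
```

With only three distinct values the growth looks harmless, but the same expansion on a field with hundreds of values yields hundreds of inputs.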

However, the automatic handling of categorical variables could cause problems that are hidden from you. Instead of having one input variable in your model (as it appears when you select input variables), you could have hundreds! This can affect decision trees (which are biased toward variables with more categories) and neural network sensitivities (which are often biased toward categorical variables with large numbers of categories). In other words, there is a hidden bias toward larger numbers of categories that could skew your interpretation of the models.

What should one do? First, be aware of these variables. During the data understanding stage of your data mining project, identify variables with large numbers of categories. This will at least alert you to the possibility of bias in your models or sensitivities. Second, if there are more than a dozen or two categories, consider binning the values by combining dummy variables with smaller counts into larger groups, or dropping them altogether. More on identifying the significance of categorical variable values in an upcoming Abbott Insights.
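
One way to carry out the binning step is sketched below in plain Python. The helper name `bin_rare_categories`, the `min_count` threshold of 10, and the "OTHER" label are assumptions chosen for illustration; the right threshold depends on your data.

```python
from collections import Counter

def bin_rare_categories(values, min_count, other_label="OTHER"):
    """Replace categories seen fewer than min_count times with one shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Toy column: three common states and two rare ones.
states = ["MA"] * 40 + ["CA"] * 35 + ["TX"] * 20 + ["VT"] * 3 + ["WY"] * 2

print("distinct values before:", len(set(states)))
binned = bin_rare_categories(states, min_count=10)
print("distinct values after:", len(set(binned)))
print(Counter(binned))
```

Counting distinct values per field, as in the first print statement, is also a quick way to spot the high-cardinality variables during data understanding.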

Monday, March 14, 2005

Use Priors to Balance Class Counts

One well-known difficulty in building classification models occurs when one class vastly outnumbers the other classes. For example, if the output (target) variable has 95% 0s and 5% 1s, a neural network could predict every record will be 0 and have 95% accuracy. Of course, this is a meaningless model. This occurs when there are contradictions in the data, that is, when the same input patterns appear with outputs of both 0 and 1. If more of those records have an output variable value equal to 0, the classifier will choose 0 as the more likely answer.
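
The arithmetic behind the 95% figure can be checked with a few lines of plain Python. The variable names and the 100-record toy target are assumptions for the sketch; the second measure, the fraction of actual 1s that were found, is added here only to show what the accuracy number hides.

```python
# Toy target with 95% 0s and 5% 1s.
labels = [0] * 95 + [1] * 5

# A "model" that predicts 0 for every record.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Fraction of the actual 1s that the model identified.
found_ones = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
fraction_of_ones_found = found_ones / sum(labels)

print("accuracy:", accuracy)
print("fraction of 1s found:", fraction_of_ones_found)
```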

The most common correction to make when building neural networks for data with a large imbalance of class counts is simply to balance the counts of 0s and 1s by removing excess records with 0s, or by duplicating records with 1s. That topic will be covered in a future issue of Abbott Insights™.
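
Both balancing approaches can be sketched in plain Python as follows. The function names, the `(features, label)` record layout, and the fixed random seed are assumptions for the example, not a prescription for how any tool does it.

```python
import random

def split_by_class(records):
    """Return (majority, minority) lists for records shaped (features, label)."""
    zeros = [r for r in records if r[1] == 0]
    ones = [r for r in records if r[1] == 1]
    return (zeros, ones) if len(zeros) >= len(ones) else (ones, zeros)

def balance_by_removal(records, rng):
    """Keep a random subset of the larger class, equal in size to the smaller one."""
    majority, minority = split_by_class(records)
    return rng.sample(majority, len(minority)) + minority

def balance_by_duplication(records, rng):
    """Duplicate randomly chosen records of the smaller class until counts match."""
    majority, minority = split_by_class(records)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

rng = random.Random(0)
records = [((i,), 0) for i in range(95)] + [((i,), 1) for i in range(95, 100)]

smaller = balance_by_removal(records, rng)
larger = balance_by_duplication(records, rng)
print(len(smaller), "records after removal")
print(len(larger), "records after duplication")
```

Removal throws data away and duplication repeats it, which is the trade-off the next paragraph addresses.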

However, some algorithms have a way to accomplish this balancing without sampling by specifying the expected prior probability of each class value (priors). The CART decision tree algorithm is one algorithm with settings to do this. The advantage is that no data is thrown away, yet the classifier won’t favor the overrepresented class value over the underrepresented one.
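
The effect of priors can be approximated by weighting each record according to its class, as in the sketch below. This is an illustration of the idea only, not CART's actual implementation; the function names, the equal priors of 0.5, and the toy "leaf" of ten records are all assumptions.

```python
from collections import Counter

def prior_weights(labels, priors):
    """Per-class weight that makes the weighted class shares match the priors."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: priors[cls] / (counts[cls] / n) for cls in counts}

def weighted_majority(labels, weights):
    """Class with the largest total weight among the given records."""
    totals = Counter()
    for y in labels:
        totals[y] += weights[y]
    return max(totals, key=totals.get)

labels = [0] * 95 + [1] * 5
weights = prior_weights(labels, {0: 0.5, 1: 0.5})

# A hypothetical group of records with 8 zeros and 2 ones.
leaf = [0] * 8 + [1] * 2
unweighted = weighted_majority(leaf, {0: 1.0, 1: 1.0})
weighted = weighted_majority(leaf, weights)

print("weights:", weights)
print("raw counts favor class", unweighted)
print("with equal priors the group is labeled", weighted)
```

The group holds 20% 1s, four times the 5% base rate, so once the priors are equalized it is labeled 1 even though no record was removed or duplicated.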

Monday, February 07, 2005

Create Three Sampled Data Sets, not Two

One often sees an appeal to split data into two data sets for modeling: a training set and a testing set. The training set is used to build a model, and the testing set is used to assess the model. If the model accuracy on the training set is good, but on the testing set is poor, one has a good indication that the model has been overfit, or in other words, that the model has picked up on patterns that are specific to the training data. In this case, the best course of action is to adjust parameters in the modeling algorithm so that a simpler model is created, whether that means fewer inputs in the model (for neural networks, regression, nearest neighbor, etc.), or fewer nodes or splits in the model (neural networks or decision trees). Then, retrain and retest the model to see if results have improved, particularly for the testing data.

However, if one does this several times, or even dozens of times (which is common), the testing data ceases to be an independent assessment of model performance because the testing data was used to change the inputs or algorithm parameters. Therefore, it is strongly recommended to have a third dataset to perform a final validation. This validation step should occur only after training and testing have provided confidence that the model is good enough to deploy.
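
A three-way split takes only a few lines. In the plain-Python sketch below, the function name `three_way_split`, the 60/20/20 proportions, and the fixed seed are assumptions for illustration; the point is only that the validation records are set aside once and never touched while tuning.

```python
import random

def three_way_split(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle once, then cut into training, testing, and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # held out until the very end
    return train, test, validation

records = list(range(1000))
train, test, validation = three_way_split(records)
print(len(train), len(test), len(validation))
```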



Tuesday, January 04, 2005

Beware of Outliers in Computing Correlations

Outliers can produce misleading summary statistics in a variety of ways. One such way is when computing correlations, and it doesn’t take many outliers to significantly change correlation coefficients. Take, for example, 1,000 random samples of a variable X, uniformly distributed over the range [0,1]. A second variable, Y, is a combination of X and a second uniform random sample so that X and Y have a correlation coefficient of 0.99. Now, suppose that two outliers are introduced among the 1,000 data points. The values of X are the same, but the values of Y are magnified by a factor of 10, and are placed away from the trend of the original data points. The left figure below shows the strong correlation of X and Y, and the figure to the right shows the same data with two outliers.








The statistics for the four variables are shown in Table 1 below. When one looks at the correlations, however (Table 2), the correlation drops suddenly from 0.99 to 0.49.
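
The experiment is easy to try for yourself. The plain-Python sketch below follows the same recipe under stated assumptions: the noise scale of 0.1425 is chosen so that the clean correlation comes out near 0.99, the seed is arbitrary, and the two outliers are made by multiplying the two largest Y values by 10. The original outliers were placed differently, so the exact correlation after the change will not match the figure quoted above.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
noise_scale = 0.1425  # assumed value; gives a correlation close to 0.99

xs = [rng.random() for _ in range(1000)]
ys = [x + noise_scale * rng.random() for x in xs]
r_clean = pearson(xs, ys)

# Introduce two outliers: magnify the two largest Y values by a factor of 10.
ys_outliers = list(ys)
for i in sorted(range(len(ys)), key=lambda k: ys[k])[-2:]:
    ys_outliers[i] = ys[i] * 10
r_outliers = pearson(xs, ys_outliers)

print("correlation without outliers:", round(r_clean, 3))
print("correlation with two outliers:", round(r_outliers, 3))
```

Only two of the 1,000 values of Y were changed, yet the correlation coefficient falls a long way.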








Therefore, test your data for outliers that could be influencing summary statistics. More on how to do that in a future issue of Abbott Insights™.