Thursday, May 01, 2014

Why Overfitting is More Dangerous than Just Poor Accuracy, Part I

Arguably, the most important safeguard in building predictive models is complexity regularization to avoid overfitting the data. When models are overfit, their accuracy is lower on new data that wasn’t seen during training, and therefore when these models are deployed, they will disappoint, sometimes even leading decision makers to believe that predictive modeling “doesn’t work”.

Overfitting, however, is thankfully a well-known problem, and every algorithm has ways to avoid it. CART® and C5 trees use pruning to remove branches that are prone to overfitting; CHAID trees require that splits be statistically significant before adding complexity to the tree. Neural networks use held-out data to stop training when accuracy on that data begins to degrade. Stepwise regression uses information-theoretic criteria like the Akaike Information Criterion (AIC), Minimum Description Length (MDL), or the Bayesian Information Criterion (BIC) to add terms only when the additional complexity is offset by a sufficient reduction in error.
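As a concrete illustration of this kind of complexity control, here is a minimal scikit-learn sketch (the tooling and dataset are my stand-ins, not the CART® or C5 products themselves) contrasting a fully grown decision tree with one trimmed by cost-complexity pruning:

```python
# A minimal sketch of complexity control via cost-complexity pruning
# (the same idea CART-style pruning uses); dataset is just a placeholder.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: fits the training data very closely.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: ccp_alpha penalizes complexity, trimming branches that
# don't pay for themselves with enough reduction in error.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, model in [("full", full_tree), ("pruned", pruned_tree)]:
    print(name,
          "| train acc:", round(model.score(X_train, y_train), 3),
          "| test acc:", round(model.score(X_test, y_test), 3),
          "| leaves:", model.get_n_leaves())
```

Typically the full tree looks better on training data but no better (or worse) on the test split, which is exactly the gap the pruning criteria are meant to close.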

But overfitting causes more problems than merely misclassifying cases in holdout data or incurring large errors in regression problems. Without loss of generality, this discussion will describe overfitting only in classification problems, but the same principles apply to regression problems as well.

One way modelers reduce the likelihood of overfit is to apply the principle of Occam’s Razor: if two models exhibit the same accuracy, we prefer the simpler model because it is more likely to generalize well. By simpler, we mean models that behave more simply, not models that merely appear simpler because they have fewer terms. John Elder (a regular contributor to the PA Times) has a fantastic discussion of this topic in the book by Seni and Elder, Ensemble Methods in Data Mining.

Consider this example contrasting linear and nonlinear models. The figure below shows decision boundaries for two models separating two classes of the famous Iris data (http://archive.ics.uci.edu/ml/datasets/Iris). On the left is the decision boundary from a linear model built using linear discriminant analysis (like LDA or the Fisher Discriminant), and on the right is a decision boundary built by a model using quadratic discriminant analysis (like the Bayes Rule). The image can be found at http://scikit-learn.org/0.5/auto_examples/plot_lda_vs_qda.html.

It appears that the accuracy of both models is the same (let’s assume that it is), yet the behavior of the models is very different. If new data to be classified appears in the upper left of the plot, the LDA model will call the data point versicolor whereas the QDA model will call it virginica. Which is correct? We don’t know from the training data, but we do know this: there is no justification in the data to increase the complexity of the model from linear to quadratic. We would probably prefer the linear model here.
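A rough sketch of the same contrast, assuming two Iris features and the versicolor/virginica classes (it won't reproduce the linked figure exactly, and the probe point is only illustrative):

```python
# Fit LDA and QDA to two Iris features, then query a point far from the
# training data; two equally accurate models may well disagree out there.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

iris = load_iris()
mask = iris.target > 0                 # keep versicolor (1) and virginica (2)
X = iris.data[mask][:, :2]             # sepal length, sepal width (an assumption)
y = iris.target[mask]

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

# A point analogous to the "upper left" of the plot, well outside the data.
far_point = np.array([[4.0, 4.5]])
print("LDA says:", iris.target_names[lda.predict(far_point)][0])
print("QDA says:", iris.target_names[qda.predict(far_point)][0])
```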



How models behave in regions without data is the entire reason for avoiding overfit. The issue in the figure above was model behavior during extrapolation, where we want to make sure the models behave in a reasonable way for values outside (larger than or smaller than) the data used in training. But models also need to behave well when they interpolate, meaning we want them to behave reasonably for data that falls in between the points that exist in the training data.

Consider the second figure below showing decision boundaries for two models built from a data set derived from the famous KDD Cup data from 1998. The two dimensions in this plot are Average Donation Amount (Y) and Recent Donation Amount (X). This data tells the story that higher values of average and recent donation amounts are related to higher likelihoods of donors responding; note that for the smallest values of both average and recent donation amount, at the very bottom left of the data, the regions are colored cyan.


Both models are built using the Support Vector Machines (SVM) algorithm, but with different values of the complexity constant, C. Obviously, the model on the left is more complex than the model on the right. The magenta regions represent responders and the cyan regions represent non-responders.

In the effort to be more accurate on training data, the model on the left creates closed decision boundaries around any and all groupings of responders. The model on the right joins these smaller blobs together into a larger blob where the model classifies data as responders. The complexity constant for the model on the right gives up some training accuracy to gain simplicity.
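The KDD Cup data isn't reproduced here, so the sketch below uses synthetic two-dimensional data as a stand-in and illustrative C values; the point is only how C trades training accuracy for boundary simplicity:

```python
# Two RBF-kernel SVMs on noisy 2-D data: a large C chases every training
# point, a small C accepts some training errors for a smoother boundary.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.35, random_state=0)

# Large C: tight, possibly closed, boundaries around small clusters.
svm_complex = SVC(kernel="rbf", gamma=2.0, C=1000.0).fit(X, y)

# Small C: gives up training accuracy to gain simplicity.
svm_simple = SVC(kernel="rbf", gamma=2.0, C=0.5).fit(X, y)

for name, model in [("complex (C=1000)", svm_complex), ("simple (C=0.5)", svm_simple)]:
    print(name,
          "| training accuracy:", round(model.score(X, y), 3),
          "| support vectors:", int(model.n_support_.sum()))
```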

Which model is more believable? The one on the left will exhibit strange interpolation properties: data in between the magenta blobs will be called non-responders, sometimes in very thin regions between magenta regions, and this behavior isn’t smooth or believable. The model on the right creates a single region of data to be classified as responders and is clearly better than the model on the left.

Beware of overfitting the data, and test models not just on testing or validation data but, if possible, on values not in the data to ensure their behavior, whether interpolating or extrapolating, is believable.
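One possible way to act on that advice, again sketched with placeholder data and model: score the model on a grid that extends past the training range and look for behavior that isn't believable, such as the predicted class flipping back and forth in sparse regions.

```python
# Probe a fitted model on a grid extending 50% beyond the training range
# and crudely summarize how often the predicted class flips along the grid.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
model = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

pad = 0.5 * (X.max(axis=0) - X.min(axis=0))
xs = np.linspace(X[:, 0].min() - pad[0], X[:, 0].max() + pad[0], 25)
ys = np.linspace(X[:, 1].min() - pad[1], X[:, 1].max() + pad[1], 25)
grid = np.array([[a, b] for a in xs for b in ys])

preds = model.predict(grid).reshape(25, 25)
# Many class changes between neighboring probe points can signal thin,
# unbelievable decision regions like the left-hand SVM above.
flips = int((preds[:, 1:] != preds[:, :-1]).sum())
print("class changes between neighboring grid points:", flips)
```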


In Part II, the problems overfitting causes for model interpretation will be addressed.

4 comments:

Anonymous said...

I would suggest that if you wish to classify a record that appears in the top left of the first figure, you cannot use either of the two models shown. The model is only relevant to the data on which it has been built. Once the data you wish to classify is out of this range, the model is no longer valid.

Dean Abbott said...

I agree with you that the models are only applicable to where the data was during training. Finding the gaps/empty areas in the decision space can be difficult, though. It's easy to test model inputs: if all the inputs exceed their max values, you know the model has to extrapolate.

But if some of the inputs exceed their ranges and others don't, the data could still be in a good location. Or worse yet, you can have outliers interior to the ranges of the variables that are still not in a stable place for model decisions. These are very difficult to find (remember that in multi-dimensional modeling, we can't look at the data and see these outliers). The second figure is a good example of this. Finding the right level of model complexity in these situations is very important.
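As an illustration of those two checks, here is a rough sketch: a per-input range test, plus a Mahalanobis-distance test as one possible way to flag "interior" outliers the range test misses (other distance or density measures would work just as well).

```python
# Flag extrapolation two ways: outside any single variable's training range,
# or far from the bulk of the training data by Mahalanobis distance.
import numpy as np

def extrapolation_flags(X_train, x_new):
    """Return (outside_any_range, mahalanobis_distance) for one scoring record."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    outside_range = bool(np.any((x_new < lo) | (x_new > hi)))

    mean = X_train.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X_train, rowvar=False))
    diff = x_new - mean
    mahal = float(np.sqrt(diff @ cov_inv @ diff))
    return outside_range, mahal

# Toy usage: correlated training data, plus a point that sits inside every
# single-variable range yet far from where the data actually lives.
rng = np.random.default_rng(0)
X_train = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=500)
interior_outlier = np.array([2.0, -2.0])
print(extrapolation_flags(X_train, interior_outlier))
```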

pickettbd said...

Your title drew me in. I certainly agree that overfitting is more dangerous than poor accuracy. I would also suggest that poor accuracy isn't very dangerous, making your assertion not terribly surprising (not to say it isn't a valid point, of course). If you create a model (or your learning algorithm does it for you) and the model performs poorly, you know it performs poorly up front. When you know your model isn't very good, you either don't use it or use it with caution, rendering it relatively harmless.

Overfitting, on the other hand, is dangerous (as you've identified) because it can be difficult to detect. You've referenced a few examples of how various models avoid overfitting and I believe anyone using these models must make themselves familiar with these techniques. Fine tuning parameters for various learning algorithms will, in many cases, be domain specific - requiring the analyst to take care.

I appreciated your Iris example. I agree that the QDA model is unnecessarily complex and the LDA model is much more appropriate. I appreciate your remarks about interpolation. We certainly do want our models to behave reasonably well for data in between existing data. I suppose that is the very purpose of creating a model in the first place. I must, however, respectfully disagree with the point about extrapolation. While it feels nice to have a model behave "reasonably" for points outside the range, we have no real way of measuring, even qualitatively, whether or not it is reasonable. If overfitting is a snake hiding in the grasses of your analysis awaiting the chance to poison your results, extrapolation must be some kind of hungry carnivore. It's like trying to protect yourself from the snake by using a wolf as a shield.

Finally, I think you've made a great point out of the KDD Cup example. Surely the SVM result on the left (with a bunch of small groups) is not very useful or intuitive. The one on the right is much more applicable. Overall, I think you're correct: overfitting is dangerous - much more dangerous than poor accuracy.

Dean Abbott said...

Thanks for your comments. You are correct that there is a bit of hyperbole going on with the title. The "dangerous" label would only apply if the model is used, of course.

What I'm most uncomfortable with in this post is how to detect the problems. Yes, there are obvious visual cues, and yes, we can examine training/testing accuracy metrics for consistency (though there is no agreed-upon standard for how different training and testing results can be before we suspect overfitting). Or better yet, if resampled data (like bootstrapped or cross-validated samples) behaves consistently in accuracy, I'm much more confident as well.
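A small sketch of that resampling check, with the dataset and algorithm as placeholders: if cross-validated accuracies are tightly clustered, overfitting is less of a worry.

```python
# Look at the spread of cross-validated accuracies as a rough consistency check.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("fold accuracies:", np.round(scores, 3))
print("mean:", round(scores.mean(), 3), "| spread (std):", round(scores.std(), 3))
```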

I think what I'm really trying to get at, and what I do in practice, is stability. If different modeling algorithms predict data consistently, then I'm more confident that the model will behave well. But I don't have a standard practice here... my methods change based on the algorithm (they can be unstable in different ways) and data size (bootstrapping 10M records doesn't help a lot unless the model itself is very very very complex).
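And a sketch of that cross-algorithm stability check, again with placeholder data and models: measure how often different algorithms agree on held-out records.

```python
# Train a few different algorithms and report pairwise prediction agreement.
from itertools import combinations
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
preds = {name: m.fit(X_train, y_train).predict(X_test) for name, m in models.items()}

for a, b in combinations(preds, 2):
    agreement = (preds[a] == preds[b]).mean()
    print(f"{a} vs {b}: agree on {agreement:.1%} of held-out records")
```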