Thursday, February 01, 2007

When some models are significantly better than others

I'm not a statistician, nor have I played one on TV. That’s not to say I’m not a big fan of statistics. In the age-old debate between data mining and statistics, there is much to say on both sides of the aisle. I find much of this debate unnecessary, and the conflicts have arisen as much over terminology as over the actual concepts, but there are some areas where I have found a sharp divide.

One of these areas is the idea of significance. Most of the accomplished statisticians I have spoken with are well-versed in p-values, t-values, and confidence intervals. Most data miners, on the other hand, have probably never heard of these or, if they have, never use them. Leaving aside the good reasons to use or not use these kinds of metrics, I think this typifies an interesting phenomenon in the data mining world: the lack of measures of significance. I want to consider that issue in the context of model selection: how does one assess whether two models are different enough that there are compelling reasons to select one over the other?

One example of this is what one sees when using a tool like Affinium Model (Unica Corporation)—a tool I like to use very much. If you are building a binary classification model, it will build for you, automatically, dozens, hundreds, potentially even thousands of models of all sorts (regression, neural networks, C&RT trees, CHAID trees, Naïve Bayes). After the models have been built, you get a list of the best models, sorted by whatever metric you have decided (typically area under the lift curve or response rate at a specified file depth). All of this is great. The table below shows a sample result:

Model          Rank  Total Lift  Algorithm

NeuralNet1131  1     79.23%      Backpropagation Neural Network
Bayes236       7     78.50%      Naive Bayes

Yes, the Neural Network model (NeuralNet1131) has won the competition and has the best total lift. But the question is this: is it significantly better than the other models? (Yes, linear regression was one of the options for a binary classification model—and this is a good thing, but a topic for another day). How much improvement is significant? There is no significance test applied here to tell us this. (to be continued…)
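One way to put a rough number on "how much improvement is significant" is a paired bootstrap on a holdout set. The sketch below is only an illustration under stated assumptions, not what Affinium Model does: the data are synthetic, the function names are made up for this example, and the metric is response rate at a fixed file depth rather than the tool's own total lift.

```python
import random

def response_rate_at_depth(scores, labels, depth=0.2):
    """Fraction of responders among the top `depth` share of records by score."""
    n_top = max(1, int(len(scores) * depth))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_top]
    return sum(labels[i] for i in top) / n_top

def bootstrap_difference(scores_a, scores_b, labels, n_boot=1000, depth=0.2, seed=0):
    """Percentile 95% interval for metric(A) - metric(B) on the same holdout records."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        # Resample records with replacement; both models see the same resample.
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        sa = [scores_a[i] for i in idx]
        sb = [scores_b[i] for i in idx]
        diffs.append(response_rate_at_depth(sa, lab, depth)
                     - response_rate_at_depth(sb, lab, depth))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# Synthetic holdout set (an assumption for illustration): about 10% responders,
# and two models whose scores are noisy versions of the true label.
rng = random.Random(42)
labels = [1 if rng.random() < 0.1 else 0 for _ in range(1000)]
scores_a = [y + rng.gauss(0, 1.0) for y in labels]
scores_b = [y + rng.gauss(0, 1.1) for y in labels]

low, high = bootstrap_difference(scores_a, scores_b, labels, n_boot=300)
print(f"95% interval for the difference in response rate: [{low:.3f}, {high:.3f}]")
```

If the interval comfortably includes zero, the holdout data give little reason to prefer one model over the other on this metric.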


MineThatData said...

Good observations ... in the ten models you list, if they were individually tested on an independent dataset, you'd probably find that each one worked equally well.

I've often found that as you get more specific with variables, and more technical (neural net vs. OLS), you get a model that is less able to derive good results when operating in the real world, on real customers/data.

That being said, I haven't come up with a measure, nor found a measure, that connects how well the model fits to how well the model scores an actual, live, independent set of similar data (i.e. holdout sample).

Dean Abbott said...

I'll comment further on these models next week, but there are similarities and differences. :)

Regarding how many variables to include in a final model, I've seen techniques that are all over the map. Usually I let the predictive accuracy determine this number, but not always. Too few variables and you have greater risk if any of those variables change characteristics. Too many and the models are harder to interpret and explain.

Will Dwinnell said...

I think that this is a matter of what question is being asked. Keep in mind the difference between models and modeling procedures which produce models.

I submit that error resampling is used to test the modeling procedure, not a single model (or set of models).

Wrapping the modeling procedure (everything: feature selection, predictor transformation, model fitting, etc.) within the error resampling process leaves the analyst with the answer to the question, "How well will a model perform on the population at large, if constructed by the modeling procedure in question?"

I argue that the error resampling approach does not say how much better the resulting model is than any other model, since there is no "second best model" or any other model being considered.

In your example, I assert that the data miner should treat selection of the modeling algorithm (regression, neural network, etc.) as another parameter to be optimized. This selection should happen mechanically, and within the error resampling outer loop.

From my perspective, the question, "How much better is the resulting model than a model built using some other procedure" is academic. The resulting model should perform as well in the future as error resampling indicates.

If one were concerned about the complexity cost paid for comparatively small performance gains, then that should be part of the performance measure being assessed by the error resampling process.
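A minimal sketch of the idea in this comment, with algorithm selection placed inside the resampling loop, might look like the following. Everything here is an assumption for illustration: the two toy "algorithms", the function names, the inner two-thirds split, and the synthetic data are not from the discussion.

```python
import random

def fit_majority(train):
    """Toy algorithm 1: always predict the majority class of the training data."""
    ones = sum(y for _, y in train)
    label = 1 if 2 * ones >= len(train) else 0
    return lambda x: label

def fit_threshold(train):
    """Toy algorithm 2: predict 1 when x exceeds the threshold with least training error."""
    best_t, best_err = None, None
    for t, _ in train:
        err = sum((x > t) != bool(y) for x, y in train)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return lambda x: int(x > best_t)

def error_rate(model, data):
    return sum(model(x) != y for x, y in data) / len(data)

CANDIDATES = {"majority": fit_majority, "threshold": fit_threshold}

def modeling_procedure(train, rng):
    """The whole procedure: inner split, pick the best algorithm, refit on all of train."""
    shuffled = train[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 2 // 3
    inner_train, inner_valid = shuffled[:cut], shuffled[cut:]
    best_name = min(
        CANDIDATES,
        key=lambda name: error_rate(CANDIDATES[name](inner_train), inner_valid),
    )
    return best_name, CANDIDATES[best_name](train)

def outer_cross_validation(data, k=5, seed=0):
    """Estimate the error of the procedure (selection included), not of one model."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    errors, chosen = [], []
    for i in range(k):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        name, model = modeling_procedure(train, rng)
        errors.append(error_rate(model, folds[i]))  # outer fold is never used for selection
        chosen.append(name)
    return sum(errors) / k, chosen

# Synthetic data (an assumption): one predictor x, shifted upward for class 1.
rng = random.Random(1)
data = []
for _ in range(300):
    y = rng.randint(0, 1)
    data.append((rng.gauss(y, 1.0), y))

estimate, chosen = outer_cross_validation(data)
print(f"estimated error of the whole procedure: {estimate:.3f}")
print("algorithm chosen in each outer fold:", chosen)
```

The number reported belongs to the procedure as a whole; it does not say how much better the chosen algorithm is than the runner-up, which is the distinction this comment draws.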

Dean Abbott said...

It is rare that I disagree with Will, but here is one case. Most likely the differences are semantic rather than substantive, but here goes!

Will states, "the question, 'How much better is the resulting model than a model built using some other procedure' is academic. The resulting model should perform as well in the future as error resampling indicates." I disagree, and in fact, I think this is exactly the question we most want to answer.

Let me put it another way. When I have a data set and want to build a predictive model, and if I don't have a particular algorithm I must use, I will build models using several algorithms, and different parameters within the algorithms (change architecture for neural networks, perhaps change splitting criteria for decision trees, change candidate inputs, etc.). This is the premise behind the list of model results shown in the original post. Understanding the differences between these models is not an academic exercise at all. I want to know two answers: 1) are the models and their predicted performance believable, and 2) is the relative performance of the models significantly different. In other words, for item 2, do I care which model I should select?

So what then to make of the final comment, "If one were concerned about the complexity cost paid for comparatively small performance gains, then that should be part of the performance measure being assessed by the error resampling process."? It is the word "small" that is interesting. How "small" is too "small" to be concerned about? At what point do the differences in performance become insignificant? This is the crux of the question I was trying to address, and hope I posed clearly.

I'd be happy to entertain further comments here, and hope this reply clarifies my position, even if we (Will and I) end up not agreeing!

Will Dwinnell said...

Once preliminary models are constructed, you will select among those n modeling algorithms using some criterion, for example by greatest total lift on a test set.

For the purposes of this conversation, the actual selection mechanism is not as important as the fact that some selection is being made. I argue that this larger process of utilizing various different algorithms and selecting among them is itself a modeling algorithm, albeit a more complex one. This larger process (which includes selection) is what I am saying should be tested, as opposed to the individual modeling algorithms or their resulting models. Those "micro" algorithms will be tested, but only for the purposes of selection.

There are many ways to do this work, but the approach I advocate has the advantage of being completely rigorous. Consider the implications of not taking this approach: Leaving aside questions of "best-versus-second best", what happens when none of the 10 modeling algorithms you've used perform adequately? Will you build an 11th? How will you test it? What about a 12th? This progression cannot continue indefinitely, or else some model will eventually perform well in the test set accidentally. With this open-ended approach, even a random model will ultimately "prove itself".

Rather than trying to answer your original question, I suppose I am suggesting an alternative perspective which avoids that stickiness in the first place. Of course, my approach is somewhat fatalist: by avoiding the cardinal modeling sin of revisiting the test data, we may sometimes find ourselves with no adequate model. In such a case one might reason that, assuming that the modeling process is competent, it might not be possible to build a better model with the given data (although management and clients don't like to hear this!).
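The danger of revisiting the test data can be seen in a small simulation. This is a sketch under artificial assumptions: the "models" are pure coin flips with no skill at all, the labels are balanced, and the function name is made up for this example.

```python
import random

def best_of_random_models(n_models, n_test=200, seed=0):
    """Best test-set accuracy among n_models models that guess at random."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_test)]
    best = 0.0
    for _ in range(n_models):
        preds = [rng.randint(0, 1) for _ in range(n_test)]
        accuracy = sum(p == y for p, y in zip(preds, labels)) / n_test
        best = max(best, accuracy)  # keep whichever model "won" on the same test set
    return best

for n_models in (1, 10, 100, 1000):
    print(n_models, "models tried, best test accuracy:", best_of_random_models(n_models))
```

Each model has a true accuracy of 50%, yet the best observed accuracy can only go up as more models are tried against the same test set, which is the sense in which a random model eventually "proves itself".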

Shane said...

You may be interested in this paper which discusses a number of statistical techniques for comparing algorithm performance...

Statistical Comparisons of Classifiers over Multiple Data Sets
Janez Demšar; Journal of Machine Learning Research 7(Jan):1-30, 2006.

Dean Abbott said...

Interestingly, I have been using methods from Dietterich's 1998 paper, which is cited in your reference, as a guide. So now the cat's out of the bag! I hope to have some empirical results soon, but have been bogged down a bit by real work this week.

Your post and reference give me more to chew on, that's for sure. Have you used either the Wilcoxon or Friedman tests in your work?

Shane said...

Yes, for one project I used the Friedman rank test to determine whether there was a significant difference in the first place, then followed up with a Holm post-hoc test to check which algorithms had a significant improvement over a base classifier.
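For readers who have not seen the two steps described here, a rough pure-Python sketch follows. It is not Shane's code: the accuracy table and the post-hoc p-values are hypothetical, the function names are made up, and the Friedman statistic is the plain version without a tie correction.

```python
def average_ranks(scores):
    """Rank one row of scores, 1 = highest score; tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks

def friedman_statistic(table):
    """Friedman chi-square for a table of N data sets (rows) by k algorithms (columns)."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    mean_ranks = [s / n for s in rank_sums]
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)
    return chi2, mean_ranks

def holm(p_values, alpha=0.05):
    """Holm step-down procedure: returns a reject/keep flag for each hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (m - step):
            reject[i] = True
        else:
            break  # once one hypothesis is kept, all larger p-values are kept too
    return reject

# Hypothetical accuracies: rows are data sets; columns are base classifier, A, B.
accuracy = [
    [0.80, 0.84, 0.81],
    [0.75, 0.79, 0.76],
    [0.90, 0.91, 0.89],
    [0.66, 0.72, 0.69],
    [0.85, 0.88, 0.86],
]
chi2, mean_ranks = friedman_statistic(accuracy)
print("Friedman chi-square:", round(chi2, 3), "mean ranks:", mean_ranks)

# Hypothetical p-values from comparing A and B against the base classifier.
print("Holm decisions:", holm([0.004, 0.20]))
```

The statistic would be compared against a chi-square distribution with k - 1 degrees of freedom, and the post-hoc step would run only if that first test rejects.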

Will Dwinnell said...

Wow, how preachy was my last comment?

Anyway, I thought people might be interested in comments on exactly this subject made by Breiman in his essay Statistical Modeling: The Two Cultures (see specifically the section titled Rashomon and the Multiplicity of Good Models):

Statistical Modeling: The Two Cultures