I'm not a statistician, nor have I played one on TV. That’s not to say I’m not a big fan of statistics. In the age-old debate between data mining and statistics, there is much to say on both sides of the aisle. While much of this kind of debate I find unnecessary, and conflicts have arisen as much over terminology rather than the actual concepts, there are some areas where I have found a sharp divide.
One of these areas is the idea of significance. Most statisticians who excel in their craft that I have spoken with are well-versed in discussions of p-values, t-values, and confidence intervals. Most data miners, on the other had, have probably never heard of these, or even if they have, never use them. Aside from the good reasons to use or not use these kind of metrics, I think it typifies an interesting phenomenon in the data mining world, which is the lack of measures of significance. I want to consider that issue in the context of model selection: how does one assess whether or not two models are different enough so that there are compelling reasons to select one over the other?
One example of this is what one sees when using a tool like Affinium Model (Unica Corporation)—a tool I like to use very much. If you are building a binary classification model, it will build for you, automatically, dozens, hundreds, potentially even thousands of models of all sorts (regression, neural networks, C&RT trees, CHAID trees, Naïve Bayes). After the models have been built, you get a list of the best models, sorted by whatever metric you have decided (typically area under the lift curve or response rate at a specified file depth). All of this is great. The table below shows a sample result:
NeuralNet1131...1....79.23%....Backpropagation Neural Network
Yes, the Neural Network model (NeuralNet1131) has won the competition and has the best total lift. But the question is this: is it significantly better than the other models? (Yes, linear regression was one of the options for a binary classification model—and this is a good thing, but a topic for another day). How much improvement is significant? There is no significance test applied here to tell us this. (to be continued…)