This post first appeared as "Predictive Models are not Statistical Models" on JT on EDM.
My friend and colleague James Taylor asked me last week to comment on a question regarding statistics vs. predictive analytics. The bulk of my reply is on James' blog; my full reply is here, reworked from my initial response to clarify some points further.
I have always loved reading the green "Sage" books, such as Understanding Regression Assumptions (Quantitative Applications in the Social Sciences) or Missing Data (Quantitative Applications in the Social Sciences), because they are brief, cover a single topic, and are well-written. As a data miner, though, I am also somewhat amused reading them, because they are obviously written by statisticians with the mindset that the model is king. This means that we either pre-specify a model (the hypothesis) or require that the model be fully interpretable, fully representing the process we are modeling. When the model is king, it's as if there is a model in the ether that we as modelers must find, and if we get coefficients in the model "wrong", or if the model errors are "wrong", we have to rebuild the data and then the model to get it all right.
In data mining and predictive analytics, the data is king. The algorithms often infer the model itself from the data (decision trees do this), and even when they only fit coefficients (as neural networks do), it's the accuracy that matters rather than the coefficients. Often, in the data mining world, we don't have to explain precisely why individuals behave as they do, so long as we can explain generally how they will behave. Model interpretation is often a matter of describing trends (the sensitivity or importance of variables).
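To make the "data is king" idea a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the synthetic two-variable data set, the one-split "decision stump" standing in for a decision tree, and the helper names `fit_stump`, `predict`, `accuracy`, and `permutation_importance`. The split variable and threshold are not specified in advance; they are searched for in the data. The model is then judged by holdout accuracy, and interpreted only through how much accuracy drops when each variable is shuffled.

```python
import random


def fit_stump(rows, labels):
    """Search every variable/threshold pair; keep the split with the fewest errors."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            for polarity in (0, 1):
                preds = [polarity if r[j] > t else 1 - polarity for r in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, j, t, polarity)
    _, j, t, polarity = best
    return j, t, polarity


def predict(stump, rows):
    j, t, polarity = stump
    return [polarity if r[j] > t else 1 - polarity for r in rows]


def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def permutation_importance(stump, rows, labels, feature, rng):
    """Drop in accuracy after shuffling one variable's values across rows."""
    base = accuracy(predict(stump, rows), labels)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    shuffled = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return base - accuracy(predict(stump, shuffled), labels)


rng = random.Random(0)

# Synthetic data (an assumption): the label depends on variable 0 only;
# variable 1 is pure noise.
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]
train_rows, test_rows = rows[:150], rows[150:]
train_labels, test_labels = labels[:150], labels[150:]

stump = fit_stump(train_rows, train_labels)
test_accuracy = accuracy(predict(stump, test_rows), test_labels)
importances = [
    permutation_importance(stump, test_rows, test_labels, f, rng)
    for f in range(2)
]

print("split on variable", stump[0], "at threshold", stump[1])
print("holdout accuracy:", test_accuracy)
print("permutation importances:", importances)
```

Nothing here checks whether a coefficient is "right". The only questions asked of the model are how well it predicts on data it has not seen and which variables it is sensitive to.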
I have always found David Hand's summaries of the two disciplines very useful, such as this one here; I found that he had a healthy respect for both disciplines.