Wednesday, February 21, 2007

Is Data Mining Too Complicated?

I just read an interesting post on InfoWorld entitled Data Mining Donald. In it there is a comment worth quoting:
Data mining is the future, and as of yet, it's still far too complicated for the ordinary IT guy to grasp.
Is this so? If data mining is too complicated for the typical IT guy, is it also too complicated for the typical grunt analyst?

Before I comment further, I'll just open it up to any commenters here. There are other very interesting and important parts of the post as well.

Friday, February 16, 2007

Another Perspective on Data Mining and Terrorism

Recently, much has been written about data mining's likely usefulness as a defense against terrorism. This posting takes "data mining" to mean sophisticated and rigorous statistical analysis, and excludes data-gathering functions. Privacy issues aside, claims have been made regarding data mining's technical capabilities as a tool in combating terrorism.

Very specific technical assertions have been made by other experts in this field, to the effect that predictive modeling is unlikely to provide a useful identification of individuals imminently carrying out physical attacks. The general reasoning has been that, despite the magnitude of their tragic handiwork, there have been too few positive instances for accurate model construction. As far as this specific assertion goes, I concur.

Unfortunately, this notion has somehow been expanded in the press, and in the on-line writings of authors who are not expert in this field. The much broader claim has been made that "data mining cannot help in the fight against terrorism because it does not work". Such overly general statements are demonstrably false. For example, a known significant component of international terrorism is its financing, notably through its use of money laundering, tax evasion and simple fraud. These financial crimes have been under attack by data mining for over 10 years.

Further, terrorist organizations, like other human organizations, involve human infrastructure. Behind the man actually conducting the attack stands a network of support personnel: handlers, trainers, planners and the like. I submit that data mining might be useful in identifying these individuals, given their much larger number. Whether or not this would work in practice could only be known by actually trying.

Lastly, the issues surrounding data mining's ability to tackle the problem of terrorism have frequently been dressed up in technical language by reference to the concepts of "false positives" and "false negatives", an argument I believe to be a straw man. Solutions to classification problems frequently involve the assessment of probabilities, rather than simple "terrorist" / "non-terrorist" outputs. The output of data mining in this case should not be used as a replacement for the judicial branch, but as a guide: estimated probabilities can be used to prioritize, rather than condemn, individuals under scrutiny.
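
To make that last point concrete, here is a minimal sketch in Python of the difference between a hard cutoff and a ranked review queue. The case identifiers, scores, and cutoff are all hypothetical, invented purely for illustration:

# Hypothetical (case_id, estimated probability) pairs from some classifier.
scored_cases = [
    ("case-001", 0.02),
    ("case-002", 0.31),
    ("case-003", 0.07),
    ("case-004", 0.64),
    ("case-005", 0.11),
]

# Hard classification: a fixed cutoff forces a yes/no call, which is what
# drives the false-positive / false-negative objection.
cutoff = 0.5
flagged = [case_id for case_id, p in scored_cases if p >= cutoff]
print("Flagged by hard cutoff:", flagged)

# Prioritization: sort by estimated probability and work down the list as
# investigative resources allow; cases are reviewed in a better order,
# and no one is automatically "condemned" by the model.
priority_queue = sorted(scored_cases, key=lambda case: case[1], reverse=True)
for case_id, prob in priority_queue:
    print(f"{case_id}: estimated probability {prob:.2f}")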

Tuesday, February 06, 2007

Quote of the day

"[Statistics] means never having to say you're sure."

I first heard this from John Elder, and it is documented here, where John presented a summary of the Symposium on the Interface conference for SIGKDD Explorations (June 2006), though I think he gave the talk initially at KDD-98.

John doesn't say who coined it, though, and I have never heard him name the person. Maybe so many people have said it that it has become one of those ubiquitous anonymous quotes, but in a quick search, the only place I found it was as a reference, in the title of a talk at a fisheries conference:

Hayes, D. B., and J. R. Bence. Managing under uncertainty, or, statistics is never having to say you're sure. Michigan Chapter of the American Fisheries Society, East Lansing, MI, 1996.

Monday, February 05, 2007

Poll: Model Selection

In the spirit of the latest posts on model selection, here is a poll to get feedback on that question. I understand that few practitioners always use the exact same metric to select models; this poll asks only which metric you use most often when you need a single number to select models (and input variables don't matter as much).




Thursday, February 01, 2007

When some models are significantly better than others

I'm not a statistician, nor have I played one on TV. That's not to say I'm not a big fan of statistics. In the age-old debate between data mining and statistics, there is much to say on both sides of the aisle. Much of this debate I find unnecessary, since the conflicts have arisen as much over terminology as over the actual concepts, but there are some areas where I have found a sharp divide.

One of these areas is the idea of significance. Most of the skilled statisticians I have spoken with are well-versed in discussions of p-values, t-values, and confidence intervals. Most data miners, on the other hand, have probably never heard of these, or, even if they have, never use them. Setting aside the good reasons to use or not use these kinds of metrics, I think this typifies an interesting phenomenon in the data mining world: the lack of measures of significance. I want to consider that issue in the context of model selection: how does one assess whether two models are different enough that there are compelling reasons to select one over the other?

One example of this is what one sees when using a tool like Affinium Model (Unica Corporation), a tool I like very much. If you are building a binary classification model, it will automatically build dozens, hundreds, potentially even thousands of models of all sorts (regression, neural networks, C&RT trees, CHAID trees, Naïve Bayes). After the models have been built, you get a list of the best models, sorted by whatever metric you have chosen (typically area under the lift curve, or response rate at a specified file depth). All of this is great. The table below shows a sample result:



Model            Rank   Total Lift   Algorithm
NeuralNet1131       1       79.23%   Backpropagation Neural Network
NeuralNet1097       2       79.20%   Backpropagation Neural Network
NeuralNet1136       3       79.18%   Backpropagation Neural Network
NeuralNet1117       4       79.10%   Backpropagation Neural Network
NeuralNet1103       5       79.09%   Backpropagation Neural Network
Logit774            6       78.91%   Logistic Regression
Bayes236            7       78.50%   Naive Bayes
LinReg461           8       78.48%   Linear Regression
CART39              9       75.75%   CART
CHAID5             10       75.27%   CHAID

Yes, the neural network model (NeuralNet1131) has won the competition and has the best total lift. But the question is this: is it significantly better than the other models? (Yes, linear regression was one of the options for a binary classification model; this is a good thing, but a topic for another day.) How much improvement is significant? There is no significance test applied here to tell us. (To be continued…)
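
As a preview of where this is heading, here is one way such a test could work: a minimal sketch in Python (using numpy) of a paired bootstrap comparison of response rate at a fixed file depth between two models scored on the same validation set. The data is simulated, and the depth, replication count, and stand-in model names are my own assumptions for illustration; this is not something Affinium Model itself does.

import numpy as np

rng = np.random.default_rng(0)

# Simulated validation set: true binary outcomes plus scores from two
# hypothetical models (stand-ins for, say, NeuralNet1131 and Logit774).
n = 5000
y = rng.binomial(1, 0.2, size=n)
score_a = y * rng.normal(0.6, 0.3, n) + (1 - y) * rng.normal(0.4, 0.3, n)
score_b = score_a + rng.normal(0.0, 0.15, n)  # a slightly noisier variant

def response_rate_at_depth(y_true, scores, depth=0.10):
    # Response rate among the top `depth` fraction of cases by score.
    k = max(1, int(round(depth * len(y_true))))
    top = np.argsort(scores)[::-1][:k]
    return y_true[top].mean()

observed_diff = (response_rate_at_depth(y, score_a)
                 - response_rate_at_depth(y, score_b))

# Paired bootstrap: resample cases with replacement, recompute the metric
# for both models on the same resample, and collect the differences.
diffs = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    diffs.append(response_rate_at_depth(y[idx], score_a[idx])
                 - response_rate_at_depth(y[idx], score_b[idx]))
lo, hi = np.percentile(diffs, [2.5, 97.5])

print(f"Observed difference in response rate: {observed_diff:.4f}")
print(f"95% bootstrap interval: [{lo:.4f}, {hi:.4f}]")

The important design choice is to resample cases: both models are re-scored on the same resample, so the comparison is paired, which matches the situation in the table above. If the interval comfortably excludes zero, the winner really is better; if it straddles zero, the difference may be nothing more than noise.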