Applied Data Science and Machine Learning: 12/01/2006

Wednesday, December 27, 2006

Two Book Recommendations

In my data mining courses, there are two books I always recommend to course attendees who are new to data mining. The first is Data Preparation for Data Mining by Dorian Pyle. I like this book because data preparation is usually the most time-consuming step in the data mining process, and it is the only book I know of that is written entirely for the purpose of data preparation (the second hit in the Amazon list I linked is a data prep book for SAS, but that one is SAS-specific).

The second book I recommend, for the analyst who is not a statistician, is Data Mining: Practical Machine Learning Tools and Techniques by Witten and Frank. They do a great job of describing algorithms and techniques in data mining in an intuitive way; there are few equations and derivations to cloud the issues for non-mathematicians. The biggest critique I have is that there is no description of neural networks, one of the key algorithms in data mining software packages. But that doesn't dampen my enthusiasm for the book. (If you would like a good, free description of neural networks, go to the SAS Neural Network FAQ.)

Data mining and Terrorism

Just came across an article by Jim Harper entitled Data Mining Can't Improve Our Security that argues persuasively against the use of data mining in such matters. He writes

Data mining for terrorism prediction has two fundamental flaws:

— First, terrorist acts and their precursors are too rare in our society for there to be patterns to find. There simply is no nugget of information to mine.

— Second, the lack of suitable patterns means that any algorithm used to turn up supposedly suspicious behavior or suspicious people will yield so many false positives as to make it useless. A list of potential terror suspects generated from pattern analysis would not be sufficiently targeted to justify investigating people on the list.

and concludes

Unfortunately, there is no magic bullet that solves the security conundrums created by terrorism. Data mining is a useful technique in many areas, but not this one.

I must say that in general I agree, and have argued the same privately to course attendees who were interested in this type of analysis. [A disclaimer: I have never worked with intelligence data, so these opinions are not based on direct experience with terrorist-related data.] However, I also believe that while the data is particularly difficult, it is not useless, and I therefore disagree with the final conclusion. The question I think should be asked is this: can data mining improve the ability of law enforcement to identify suspected terrorists? Now it may be that any improvement in information provided by data mining is not worth the effort (as he argues)--this I don't know. But if it can improve the odds of finding a dangerous individual 10-fold or 100-fold over what is currently done, then is it not helpful? (I won't comment on the privacy issues here--that is another important issue, but unrelated to his premise that data mining is not useful here.)
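To put rough numbers on both sides of this argument, here is a minimal Python sketch of the base-rate arithmetic. Every figure in it (the one-in-a-million base rate, the 99% detection rate, the 1% false-alarm rate) is a hypothetical value I picked for illustration; none comes from Harper's article or from real data.

```python
def posterior_given_flag(prior, true_positive_rate, false_positive_rate):
    """Bayes' rule: probability that a flagged person is a true positive."""
    flagged_true = prior * true_positive_rate
    flagged_false = (1.0 - prior) * false_positive_rate
    return flagged_true / (flagged_true + flagged_false)

# Hypothetical inputs, chosen only to illustrate the argument.
prior = 1e-6   # assumed base rate: 1 person in a million
tpr = 0.99     # assumed chance the model flags a true case
fpr = 0.01     # assumed chance the model flags an innocent person

posterior = posterior_given_flag(prior, tpr, fpr)
lift = posterior / prior  # improvement over picking people at random

print(f"P(true positive | flagged) = {posterior:.6f}")
print(f"Lift over random selection = {lift:.1f}x")
```

Under these assumptions, nearly everyone on the flagged list is a false positive, which is the point of the second flaw quoted above. At the same time, a flagged person is roughly 100 times more likely to be a true case than a randomly chosen one. Whether that kind of lift justifies the cost of investigating the list is exactly the open question.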

For those who have worked on difficult problems, the exploratory data analysis that takes place during a data mining project nearly always yields useful information, even if no final models are produced. Therefore, the conclusion to not do any data mining at all seems to me to be an overreaction.

Tuesday, December 19, 2006

The Best Data Mining Book of 2005

A bit late, but better late than never! Actually, I just heard Steven Levitt speak at SPSS Directions in November and was reminded, of course, of his book Freakonomics. In 2005, I recommended the book to my data mining course attendees as my favorite data mining book of the year, despite the term "data mining" never appearing (to the best of my knowledge) in the book at all. I think a quote in the preface summarizes why I liked it:

What interested Levitt were the stuff and riddles of everyday life... 'He (Levitt) is an intuitionist. He sifts through a pile of data to find a story no one else has found. He figures a way to measure an effect that veteran economists had declared unmeasurable.'

It was the idea of "sifting", a prominent term in the Gartner Group definition of data mining (and one that I like in particular), that struck me. All the examples Levitt gives in his book are examples of uncovering patterns in data that are not the most obvious answers, but rather are ones that fit the data better (in his opinion). I like the book because he approaches data with a forensic mindset.

Tuesday, December 05, 2006

Distance metrics

Will Dwinnell (a fellow poster here, who also has a Matlab-specific blog at http://matlabdatamining.blogspot.com) recently posted on Mahalanobis distance as an alternative to Euclidean distance. We are kindred spirits on this one, as I have long advocated the Mahalanobis distance, particularly for data that is close to being normally distributed (there are fixes to make numeric data more normally distributed, of course, but that's for another post, perhaps).

The reasons he gives are right on point, but I'd like to expand on the application side. I was first introduced to Mahalanobis distance in the context of Nearest Mean classifiers. In case anyone is not familiar with the M. distance, it weights the Euclidean distance by the inverse of the covariance matrix (think of Euclidean distance as weighting the distance by the Identity matrix). But it is very useful in many other contexts besides Nearest Mean classifiers or Radial Basis Function networks; in fact, any time an algorithm computes a distance, the M. distance is my preferred distance metric, including in k-nearest neighbor and clustering (like Isodata or K-Means).
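For readers without Matlab, here is a minimal sketch in Python with NumPy of the distance and of a nearest-mean classifier built on it. The covariance matrix, class means, and test point are made-up values for illustration, and the classifier assumes both classes share one covariance matrix, which is a simplification rather than a requirement of the method.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Distance from x to mean, weighted by the inverse of cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # Solve cov @ z = diff rather than forming the inverse explicitly.
    z = np.linalg.solve(cov, diff)
    return float(np.sqrt(diff @ z))

def nearest_mean(x, class_means, cov):
    """Assign x to the class whose mean is closest (shared covariance assumed)."""
    return min(class_means, key=lambda label: mahalanobis(x, class_means[label], cov))

# Hypothetical data: the first variable has four times the variance of the second.
cov = np.array([[4.0, 0.0],
                [0.0, 1.0]])
identity = np.eye(2)  # using the identity matrix gives plain Euclidean distance
class_means = {"A": [0.0, 0.0], "B": [2.0, 2.0]}
x = [3.0, 0.0]

d_mahal_a = mahalanobis(x, class_means["A"], cov)
d_euclid_a = mahalanobis(x, class_means["A"], identity)
label_mahal = nearest_mean(x, class_means, cov)
label_euclid = nearest_mean(x, class_means, identity)

print(d_mahal_a, d_euclid_a)
print(label_mahal, label_euclid)
```

With these made-up numbers the two metrics disagree on the class: the point is far from mean A in raw units, but only along the high-variance direction, so the covariance weighting pulls it back toward A.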

The problem is that very few data mining software packages have it. I was introduced to it in an obsolete tool called OLPARS (I also wrote code for it, including Perceptrons, RBFs, and some neural networks). But Matlab does it quite easily.