Tips, tricks, and comments related to topics in data science and machine learning. Used to be called "data mining and predictive analytics" but updated the title to reflect the language of the day!
Hosted by Dean Abbott, Abbott Analytics
Wednesday, December 27, 2006
Two Book Recommendations
The second book I recommend is for the analyst who is not a statistician is Data Mining: Practical Machine Learning Tools and Techniques by Witten and Frank. They do a great job of describing algorithms and techniques in data mining in an intuitive way; there are few equations and derivations to cloud the issues for non-mathematicians. The biggest critique I have is that there is no description of neural networks, one of the key algorithms in data mining software packages. But that doesn't dampen my enthusiasm for the book. (If you would like a good, free description of neurla networks, go to the SAS Neural Network FAQ.)
Data mining and Terrorism
Data mining for terrorism prediction has two fundamental flaws:
— First, terrorist acts and their precursors are too rare in our society for there to be patterns to find. There simply is no nugget of information to mine.
— Second, the lack of suitable patterns means that any algorithm used to turn up supposedly suspicious behavior or suspicious people will yield so many false positives as to make it useless. A list of potential terror suspects generated from pattern analysis would not be sufficiently targeted to justify investigating people on the list.
and concludes
Unfortunately, there is no magic bullet that solves the security conundrums created by terrorism. Data mining is a useful technique in many areas, but not this one.
I must say that in general I agree, and have argued the same privately to course attendees who were interested in this type of analysis. [A disclamor: I have never worked with intellgences data, so these opinions are not based on direct experience with terrorist-related data.] However, I also believe that while the data is particularly difficult, it is not useless, and therefore disagree with the final conclusion. The question I think should be asked is this: can data mining improve the ability of law enforcement to identify suspected terrorists. Now it may be that any improvement in information provided by data mining may not worth the effort (as he argues)--this I don't know. But if it can improve the odds of finding a dangerous individual 10-fold or 100-fold over what is currently done, then is it not helpful? (I won't comment on the privacy issues here--that is another important issue, but unrelated to his premise that data mining is not useful here).
For those who have worked on difficult problems, the exploratory data analysis that takes place during the data mining project nearly always yields useful information even if the no final models are produced. Therefore, the conclusion to not do any data mining at all seems to me to be an overreaction.
Tuesday, December 19, 2006
The Best Data Mining Book of 2005
What interested Levitt were the stuff and riddles of everyday life... 'He (Levitt) is an intuitionist. He sifts through a pile of data to find a story no one else has found. He figures a way to measure an effect that veteran economists had declared unmeasurable.It was the idea of "sifting", a prominent term in the Gartner Group definition (and one that I like in particular) that struck me. And all the examples Levitt gives in his book are examples of uncovering patterns in data that are not the most obvious answers, but rather are ones that fit the data better (in his opinion). I like the book because he approaches data with a forensic mindset.
Tuesday, December 05, 2006
Distance metrics
The reasons he gives are right on point, but I'd like to expand the application side. I was first introduced to Mahalanobis distance in the context of Nearest Mean classifiers. In case anyone is not familiar with the M. distance, it weights the Euclidean distance by the covariance matrix (think of Euclidean distance as weighing the distance by the Identity matrix). But it is very useful in many other contexts in addition to these or Radial Basis Function networks, and in fact, any time you compute a distance in an algorithm the M. distance is my preferred distance metric, including k-nearest neighbor, and clustering (like Isodata or K-Means).
The problem is that very few data mining software packages have it. I was introduced to it in an obsolete tool called OLPARS (I also wrote code for it, including Perceptrons, RBFs, and some neural networks). But Matlab does it quite easily.