## Monday, November 23, 2009

### Stratified Sampling vs. Posterior Probability Thresholds

One of the great things about conferences like the recent Predictive Analytics World is how many technical interactions one has with top practitioners; this past October was no exception. One such interaction was with Tim Manns, who blogs here. We were talking about Clementine and what to do with small populations of 1s in the target variable, which prompted me to jump onto my soapbox about an issue that I had never read about, but which occurs commonly in data mining problems such as response modeling and fraud detection.

The setup goes something like this: you have 1% responders, you build models, and the model "says" every record is a 0. My explanation for this was always that classification errors occur when the same pattern of inputs can produce both outcomes. In that situation, what is the best guess? The most commonly occurring value of the output variable. If you have 99% 0s, any given record is most likely a 0, and therefore data mining tools will produce the answer "0". The common remedy is to resample (stratify) the data so that one has equal numbers of 0s and 1s, and then rebuild the model. While this works, it misses an important factor.
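To see the effect concretely, here is a minimal sketch in Python using scikit-learn (my choice for illustration, not a tool discussed in this post). With 99%/1% data and uninformative inputs, the fitted model assigns every record a probability near the prior, so the default `predict()` call, which cuts at 0.5, labels everything 0.

```python
# Sketch of the "model says every record is a 0" effect on 99%/1% data.
# scikit-learn stands in here for whatever modeling tool you actually use.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 3))               # noisy inputs, unrelated to the target
y = (rng.random(n) < 0.01).astype(int)    # ~1% responders

model = LogisticRegression().fit(X, y)
pred = model.predict(X)                   # predict() thresholds probabilities at 0.5
print(pred.sum())                         # every record is called a 0
print(model.predict_proba(X)[:, 1].mean())  # probabilities hover near the 0.01 prior
```

The model itself is fine; it is the hard-coded 0.5 cutoff downstream of the probabilities that produces the all-0 answer.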

I can't claim credit for this insight (thanks Marie!). I was working on a consulting project with a statistician, and when we were building logistic regression models, I recommended resampling so we wouldn't have the "model calls everything a 0" problem. She seemed puzzled and asked why we didn't just threshold at the prior probability level. It was immediately clear that she was right, and I've been doing it ever since (with logistic regression and neural networks in particular).

What was she saying? First, it needs to be stated that no algorithm produces "decisions". Logistic regression produces probabilities. Neural networks produce confidence values (though I just had a conversation with one of the smartest machine learning guys I know who argued that neural networks produce true probabilities--maybe I'll blog on this another time). The decisions one sees ("all records are called 0s") are produced by the software, which interprets the probabilities or confidence values by thresholding them at 0.5--I don't think I've ever found a data mining software package that thresholds anywhere else. The software therefore assumes the prior probabilities of 0s and 1s are equal. When they are not (as with 99% 0s and 1% 1s), this threshold is completely inappropriate: the distribution of predicted probabilities will center roughly on the prior probability (0.01 for the 1% response rate problem), so almost nothing clears 0.5. I show some examples in my data mining course that make this clearer.

So what can one do? If one thresholds at 0.01 rather than 0.5, one gets a sensible confusion matrix out of the classification problem. Of course, if you use an ROC curve, lift chart, or gains chart to assess your model, you don't worry about thresholding anyway.
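Here is a hedged sketch of that fix, with simulated model scores (not the output of any particular tool): at the default 0.5 cutoff no record is ever called a 1, while cutting at the 0.01 prior yields a confusion matrix with real true positives in it.

```python
# Compare confusion matrices at the 0.5 default vs. the prior-probability cutoff.
# Scores are simulated to center near the prior while carrying some signal.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
y = (rng.random(n) < 0.01).astype(int)               # ~1% responders
score = np.clip(0.008 + 0.010 * y + rng.normal(0, 0.004, n), 0.0, 1.0)

def confusion(y_true, y_score, thresh):
    """Return (tp, fp, fn, tn) for calls made at the given score threshold."""
    pred = (y_score >= thresh).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    tn = int(((pred == 0) & (y_true == 0)).sum())
    return tp, fp, fn, tn

print(confusion(y, score, 0.5))    # at 0.5: nothing is ever called a 1
print(confusion(y, score, 0.01))   # at the prior: a usable confusion matrix
```

The second matrix still trades false positives for true positives, of course, but at least the trade-off is visible instead of being hidden behind an all-0 prediction.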

Which brings me back to the conversation with Tim Manns. I'm glad he tried it out himself, though I don't think one has to make the target variable continuous to make this work. Tim did his testing in Clementine, but the same holds for any other data mining software tool. Tim's trick is sound: if you make the [0,1] target variable numeric, you can build a neural network just fine and the predicted value is "exposed". In Clementine, if you keep it as a "flag" variable, you would instead threshold the propensity value (\$NRP-target).

So, read Tim's post (and his other posts!). This trick can be used with nearly any tool--I've done it with Matlab and Tibco Spotfire Miner, among others.

Now, if only tools would include an option to threshold the propensity at either 0.5 or the prior probability (or, more precisely, the proportion of 1s in the training data).
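The missing option is trivial to sketch. Here is a hypothetical helper (the function name and signature are my invention, not any vendor's API) that thresholds scores at the proportion of 1s in the training data:

```python
# Sketch of the option I wish tools offered: classify by thresholding scores
# at the training-data proportion of 1s rather than at a fixed 0.5.
def classify_at_prior(scores, train_labels):
    """Return 0/1 calls, using the training proportion of 1s as the cutoff."""
    prior = sum(train_labels) / len(train_labels)
    return [1 if s >= prior else 0 for s in scores]

train_labels = [0] * 99 + [1]                               # 1% responders
print(classify_at_prior([0.005, 0.02, 0.5], train_labels))  # [0, 1, 1]
```

One line of arithmetic to compute the prior, one comparison per record--it is hard to see why a fixed 0.5 remains the only choice.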

## Thursday, November 12, 2009

### San Diego Forum on Analytics -- review

I just got back from the half-day Forum on Analytics in San Diego, which included a keynote by Wayne Peacock (now with Inevit, but formerly VP of BI at Netflix), who spoke on how pervasive analytics was and is at Netflix, covering areas as diverse as finance, customer service, marketing, network optimization, operations, and product development. It was particularly interesting to me that as of 2006 their data warehouse was not in place; instead they had a "data landfill" (term of the day for me!). The other quote from his talk that I found provocative was about their web site: "If the web site doesn't go down once a year, we aren't pushing hard enough." However, this is changing somewhat because of their online content delivery, where websites going down have a much bigger downside!

The rest of the morning contained 3 panel discussions, and it was interesting in and of itself to see which topics were considered most important: Mining Biodata, Web 3.0, and Job Opportunities in Analytics.

During the Biodata panel, Nancy Miller Latimer of Accelrys, Inc. mentioned in passing a software tool they have developed to do essentially visual programming of biodata; it looks like the typical visual-programming interface of Clementine, Enterprise Miner, Tibco Spotfire Miner, or Polyanalyst (and so many other tools, including Statistica and Weka), but their tool is specific to biodata, including loading technical papers, chemical structure data, etc. I've been fascinated for years by the relatively parallel paths taken by the bioinformatics/cheminformatics world and the data mining world: very similar ideas, but very different toolsets because of the very different characteristics of the data. Much was said about the future of sequencing of the human genome: 2 humans sequenced in 2007, 6+ in 2008, perhaps 150 in 2009, and growing exponentially (faster than Moore's law). There was talk of the \$1000 human genome sequence coming soon.

The Web 3.0 panel included 2 folks from Intuit touting a Facebook campaign done to grow use of TurboTax virally. Interesting stuff, but I'm still dubious of the effect of social networking on all but the under-30 crowd. I think I'll finally begin to tweet, but only out of curiosity, not because I expect anything of business value from it. Is it inevitable that Facebook, Twitter, and YouTube will become mainstream ways to develop business? For me? I don't see how yet.

Lastly, on analytics jobs in San Diego...there are over 100 analytics companies in San Diego (most of them undoubtedly small or micro, like me), and there was an evangelistic cry for San Diego to become an analytics cluster in the U.S. I think this is actually possible, and it has been the case to some degree for some time now. I had forgotten about Keylime (a San Diego web company) being purchased by Yahoo, and Websidestory being purchased by Omniture. Of course Fair Isaac and HNC were discussed as well. Time will tell, and right now things are tough all around, though Kanani Masterson of TriStaff Group said there were currently 225 analytics / web analytics job openings, so things aren't completely dead.

All in all, it was a lot to pack into a morning.