One of the great things about conferences like the recent Predictive Analytics World is how many technical interactions one has with top practitioners; this past October was no exception. One such interaction was with Tim Manns, who blogs here. We were talking about Clementine and what to do with small populations of 1s in the target variable, which prompted me to jump onto my soapbox with an issue that I had never read about, but which occurs commonly in data mining problems such as response modeling and fraud detection.
The setup goes something like this: you have 1% responders, you build models, and the model "says" every record is a 0. My explanation for this was always that errors in classification models take place when the same pattern of inputs can produce both outcomes. In this situation, what is the best guess? The most commonly occurring output variable value. If you have 99% 0s, that is most likely a 0, and therefore data mining tools will produce the answer "0". The common solution to this is to resample the data (stratify) so that one has equal numbers of 0s and 1s in the data, and then rebuild the model. While this is true, it misses an important factor.
I can't claim credit for this (thanks Marie!). I was working on a consulting project with a statistician, and when we were building logistic regression models, I recommended resampling so we don't have the "model calls everything a 0" problem. She seemed puzzled by this, and asked why not threshold at the prior probability level. It was clear right away that this is true, and I've been doing it ever since (with logistic regression or neural networks in particular).
What was she saying? First, it needs to be stated that no algorithm produces "decisions". Logistic regression produces probabilities. Neural networks produce confidence values (though I just had a conversation with one of the smartest machine learning guys I know who talked about neural networks producing true probabilities--maybe I'll blog on this more another time). The decisions that one sees ("all records are called 0s") are produced by the software, interpreting the probabilities or confidence values by thresholding them at 0.5. It is always 0.5. I don't think I've ever found a data mining software package that doesn't threshold at 0.5, in fact. So the software expects the prior probabilities of 0s and 1s to be equal. When they are not (like with 99% 0s and 1% 1s), this threshold is completely inappropriate; the distribution of predicted probabilities will be centered roughly on the prior probability level (0.01 for the 1% response rate problem). I show some examples of this in my data mining course that make this clearer.
So what can one do? If one thresholds at 0.01 rather than 0.5, one gets a nice confusion matrix out of the classification problem. Of course if you use a ROC curve, Lift Chart or Gains Chart to assess your model, you don't worry about thresholding anyway.
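To make the difference concrete, here is a minimal sketch in plain Python. The scores are made up for illustration (they do not come from any real model), and `confusion_counts` is a hypothetical helper, not a function from Clementine or any other tool. It simply counts the confusion matrix cells at a given threshold for 1,000 records with a 1% response rate.

```python
def confusion_counts(scores, labels, threshold):
    """Return (tn, fp, fn, tp), calling a record a 1 when score >= threshold."""
    tn = fp = fn = tp = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Made-up model scores: 10 responders and 990 non-responders (1% response rate).
# No score comes anywhere near 0.5, which is typical with such a low prior.
responder_scores = [0.30, 0.20, 0.12, 0.08, 0.05, 0.04, 0.03, 0.02, 0.008, 0.004]
nonresponder_scores = [0.02] * 90 + [0.005] * 900

scores = responder_scores + nonresponder_scores
labels = [1] * len(responder_scores) + [0] * len(nonresponder_scores)

# Prior probability of a 1, estimated as the proportion of 1s in the data.
prior = sum(labels) / len(labels)

at_half = confusion_counts(scores, labels, 0.5)
at_prior = confusion_counts(scores, labels, prior)

print("threshold 0.5   (tn, fp, fn, tp):", at_half)
print("threshold prior (tn, fp, fn, tp):", at_prior)
```

With these illustrative scores, the 0.5 threshold calls every record a 0, while thresholding at the prior of 0.01 flags 8 of the 10 responders at the cost of 90 false alarms.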
Which brings me to the conversation with Tim Manns. I'm glad he tried it out himself, though I don't think one has to make the target variable continuous to make this work. Tim did his testing in Clementine, but the same holds for any other data mining software tool. What Tim's trick does is correct: if you make the [0,1] target variable numeric, you can build a neural network just fine and the predicted value is "exposed". In Clementine, if you keep it as a "flag" variable, you would threshold the propensity value ($NRP-target).
So, read Tim's post (and his other posts!). This trick can be used with nearly any tool (I've done it with Matlab and Tibco Spotfire Miner, among others).
Now, if tools would only include an option to threshold the propensity at 0.5 or the prior probability (or more precisely, the proportion in the training data).
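As a rough idea of what such an option might look like, here is a small sketch. The function name `label_records` and its `threshold` argument are hypothetical, invented for this illustration; no existing tool is claimed to expose this interface.

```python
def label_records(scores, training_labels, threshold="prior"):
    """Turn scores into 0/1 decisions.

    threshold="prior": cut at the proportion of 1s in the training data.
    threshold="half":  cut at the conventional 0.5.
    Any number:        cut at that value.
    """
    if threshold == "prior":
        cutoff = sum(training_labels) / len(training_labels)
    elif threshold == "half":
        cutoff = 0.5
    else:
        cutoff = float(threshold)
    return [1 if score >= cutoff else 0 for score in scores]


# Illustrative data: 2% of the training records are 1s.
training_labels = [1] * 2 + [0] * 98
new_scores = [0.004, 0.015, 0.03, 0.40, 0.65]

by_prior = label_records(new_scores, training_labels)          # cutoff 0.02
by_half = label_records(new_scores, training_labels, "half")   # cutoff 0.5

print(by_prior)
print(by_half)
```

The cutoff is taken from the training labels rather than from the records being scored, matching the "proportion in the training data" wording above.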
Interesting post. I'm only starting to appreciate the concept of a balanced sample!
Hey Dean,
Good topic, good post. This is one of my favorites as well. I wrote a short article with a similar slant a few months ago. (See http://www.discoverycorpsinc.com/data-mining-visualization/data_mining_misconceptions_1.html)
Hope you and others find it useful.
-Tim
Of course, there are often some other good reasons why you might want to oversample, especially if your chosen tool is taking too long to process the full file due to multiple rows or columns. Often there is a lot of junk information in the zeroes, so nothing is really being lost.
The corollary of Dean's post is that choosing a sensible decision boundary for a classification problem with multiple classes is much harder. Setting the thresholds to the priors is great if you can, and you might be able to get a similar output from setting up multiple {0,1} outputs in a neural network corresponding to the classes. However, I usually end up having to look at multiple ROC curves and coming up with something sensible to show my client.
--James
James, you are absolutely correct. I don't mean to imply at all that the posterior thresholding does anything particularly interesting. I usually bring this point up mostly to argue that stratification of the target variable is not necessary, though still may be useful, especially when there is a huge volume of 0s in the data (this is worthy of another post by itself).
Personally, I almost always use something like ROC curves, gains charts, profit charts, etc. to assess models. I love the Provost / Fawcett paper from maybe 10 years ago on examining the convex hull of ROC curves when one is unclear about the precise tradeoffs between sensitivity and false alarms.
It is also possible to use one-class classification or novelty detection for imbalanced data.
Hi Dean,
My problem is now that I have to think of another excuse not to use a confusion matrix :)
Thanks for pointing out the Provost/Fawcett paper! I think I read it once (it looks really familiar). I downloaded it and other related papers from that site. They will be great resources to refer to in future.
I ran into a similar issue in my master's dissertation, using a neural network on the stock market!
I felt the same: I replaced the categorical values (-1, 0 and 1) with a continuous probability and after that applied a threshold to split it into 3 classes.
very good blog!
Pedro
http://www.pedrocgd.blogspot.com
With my web data, I simply use stratified sampling since I have many more clickers than non-clickers. Thanks for this very interesting post Dean!
Great post Dean, thanks! On your point about algorithms not producing decisions/solutions - do you mean that there is always a person who has to interpret the information the algorithm produces? Is it feasible in the near future for a complex algorithm to actually produce a "decision" or action? I think there are companies trying to produce predictive analytics solutions that will reduce the number of points in the process that humans need to make a judgement call. Thoughts?
I mean the algorithms themselves, not the people. An algorithm like a decision tree doesn't generate a 1/0 answer for binary classification (nor does any other algorithm). They generate probabilities (or some other number between 0 and 1). So I'm not addressing people in the loop at all here, though that is a good question to ask. I'm only addressing how to utilize what the algorithms produce.