Thursday, April 26, 2012

Another Wisdom of Crowds Prediction Win at eMetrics / Predictive Analytics World

This past week at Predictive Analytics World / Toronto (PAW) has been a great time for connecting with thought leaders and practitioners in the field. Sometimes there are unexpected pleasures as well, which is certainly the case this time. One of the exhibitors for the eMetrics conference, co-locating with PAW at the venue, was Unilytics, a web analytics company. At their booth there was a cylindrical container filled with crumpled dollar bills with a sign soliciting predictions of how many dollar bills were in the container (the winner getting all the dollars). After watching the announcement of the winner, who guessed $352, only $10 off from the actual $362, I thought this would be the perfect opportunity for another Wisdom of Crowds test,just like the one conducted 9 months ago and blogged on here.
Two Unilytics employees at the booth, Gary Panchoo and Keith MacDonald, were kind enough to indulge me and my request to compute the average of all the guesses. John Elder was also there, licking his wounds from finished a close second as his guess, $374 was off by $12, a mere $2 away from the winning entry! The results of the analysis are here (summary statistics created by JMP Pro 10 for the mac). In summary, the results are as follows:

Dollar Bill Guess Scores

MethodGuess ValueError
Ensemble/Average (N=61)3653
Winning Guess (person)35210
John Elder37412
Guess without outlier (2000), 3rd place33824
Median, 19th place27587

So once again, the average of the entries (the "Crowds" answer) beat the single best entry. What is fascinating to me about this is not that the average won (though this in of itself isn't terribly surprising), but rather how it won. Summary statistics are below. Note that the Median is 275, far below the mean. Not too how skewed the distribution of guesses are (skew = 3.35). The fact that the guesses are skewed positively for a relatively small answer (362) isn't a surprise, but the amount of skew is a bit surprising to me. What these statistics tell us is that while the mean value of the guesses would have been the winner, a more robust statistic would not, meaning that the skew was critical in obtaining a good guess. Or put another way, people more often than not under-guessed by quite a bit (the median is off by 87). Or put a third way, the outlier (2000) which one might naturally want to discount because it was a crazy guess was instrumental to the average being correct. In the prior post on this from July 2011, I trimmed the guesses, removing the "crazy" ones. So when should we remove the wild guesses and when shouldn't we? (If I had removed the 2000, the "average" still would have finished 3rd). I have no answer to when the guesses are not reasonable, but wasn't inclined to remove the 2000 initially here. Full stats from JMP are below, with the histogram showing the amount of skew that exists in this data.

Distribution of Dollar Bill Guesses - Built with JMP

Summary Statistics

StatisticGuess Value
Std Dev299.80071
Std Err Mean38.385548
Upper 95% Mean441.78253
Lower 95% Mean288.21747
2% Trimmed Mean331.45614
Interquartile Range185.5

Note: The mode shown is the smallest of 2 modes with a count of 3.


QuantileGuess Value

Monday, April 09, 2012

Dilbert, Database marketing and spam

Ruben's comment that referred to spam reminded me of an old Dilbert comic which conveys the misconception about database marketing (e-marketing) and spam.

I know Ruben well and know he was poking fun, though I still have to correct folks who after finding out I do "data mining" actually comment that I'm responsible for spam. Answer: "No, I'm the reason you don't get as much spam!"

Friday, April 06, 2012

What I'm Working On

Sometimes folks ask me what I'm doing, so I thought I'd share a few things on my plate right now:

Courses and Conferences
1. Reading several papers for the KDD 2012 Conference Industrial / Government Track
2. Preparing for the Predictive Analytics World / Toronto "Advanced Methods Hands-on:
Predictive Modeling Techniques
" workshop on April 27. I'm using the Statsoft Statistica package.
3. Starting preparation for a talk at the Salford Analytics and Data Mining Conference 2012, "A More Transparent Interpretation of Health Club Surveys" on May 24. It will highlight use of the CART software package in the analysis. This was work that motivated interviews with New York Times reporter Charles Duhigg, and ended with a mention (albeit *very* briefly) in the fascinating new book by Duhigg, "The Power of Habit: Why We Do What We Do in Life and Business".
4. Working through data exercises for the next UCSD-Extension Text Mining Course on May 11, 18, and 25th. I'm using KNIME for this course.

Approximately 80% of my time is spent on active consulting. While I can't describe most of the work I'm doing, my current clients are in the following domains:
1. Web Analytics and email remarketing for retail via a great startup company, Smarter Remarketer headed by Angel Morales (Founder and Chief Innovation Officer), Howard Bates (CEO), and me (Founder and Chief Scientist).
2. Customer Acquisition, web/online and offline (2 clients)
3. Tax Modeling (2 clients)
4. Data mining software tool selection for a large health care provider.

Here's to productive application of predictive analytics!

Thursday, April 05, 2012

Why Defining the Target Variable in Predictive Analytics is Critical

Every data mining project begins with defining what problem will be solved. I won't describe the CRISP-DM process here, but I use that general framework often when working with customers so they have an idea of the process.

Part of the problem definition is defining the target variable. I argue that this is the most critical step in the process that relates to the data, and more important than data preparation, missing value imputation, and the algorithm that is used to build models, as important as they all are.

The target variable carries with it allthe information that summarizes the outcome we would like to predict from the perspective of the algorithms we use to build the predictive models. Yet this can be misleading is many ways. I'm addressing one way we can be fooled by the target variable here, and please indulge me to lead you down the path.

Let's say we are building fraud models in our organization. Let's assume that in our organization, the process for determining fraud is first to identify possible fraud cases (by tips or predictive models), then assign the case to a manager who determines which investigator will get the case (assuming the manager believes there is value in investigating the case), then assign the case to an investigator, and if fraud is found, the case is tried in court, and ultimately a conviction is made or the party is found not guilty.

Our organization would like to prioritize which cases should be sent to investigators using predictive modeling. It is decided that we will use as a target variable all cases that were found to be fraudulent, that is, all cases that had been tried and a conviction achieved. Let's assume here that all individuals involved are good at their jobs and do not make arbitrary or poor decisions (which of course is also a problem!)

Let's also put aside for a moment the time lag involved here (a problem itself) and just consider the conviction as a target variable. What does the target variable actually convey to us? Of course our desire is that this target variable conveys fraud risk. Certainly when the conviction has occurred, we have high confidence that the case was indeed fraudulent, so the "1"s are strong and clear labels for fraud.

But, what about the "0"s? Which cases do they include?
--cases never investigated (i.e., we suspect they are not fraud, but don't know)
--cases assigned to a manager who never assigned the case (he/she didn't think they were worth investigating).
--cases assigned to an investigator but the investigation has not yet been completed, or was never completed, or was determined not contain fraud
--cases that went to court but was found "not guilty"

Remember, all of these are given the identical label: "0"

That means that any cases that look on the surface to be fraudulent, but there were insufficient resources to investigate them, are called "not fraudulent. That means cases that were investigated but the investigator was taken off the case to investigate other cases are called "not fraudulent". It means too that court cases that were thrown out of court due to a technicality unrelated to the fraud itself are called "not fraud".

In other words, the target variable defined as only the "final conviction" represents not only the risk of fraud for a case, but also the investigation and legal system. Perhaps complex cases that are high risk are thrown out because they aren't (at this particular time, with these particular investigators) worth the time. Is this what we want to predict? I would argue "no". We want our target variable to represent the risk, not the system.

This is why when I work on fraud detection problems, the definition of the target variable takes time: we have to find measures that represent risk and are informative and consistent, but don't measure the system itself. For different customers this means different trade-offs, but usually it means using a measure from earlier in the process.

So in summary, think carefully about the target variable you are defining, and don't be surprised when your predictive models predict exactly what you told them to!