Thursday, May 27, 2010

PAKDD-10 Data Mining Competition Winner: Ensembles Again!

The PAKDD-10 Data Mining Competition results are in: ensembles took the top 4 positions, and I think the top 5. The winner used Stochastic Gradient Boosting and Random Forests in Statistica; second place used a combination of logistic regression and Stochastic Gradient Boosting (with Salford Systems CART for some feature extraction). Interestingly to me, the 5th place finisher used WEKA, an open source software tool.

The problem was credit risk, with biased data for building the models. This is a good way to run the competition because it is the problem we usually face anyway: the data was collected from historic interactions with the company and is biased by the approaches the company has used in the past, rather than being a pure random sample. Model performance was judged on Area under the Curve (AUC), with the KS distance as the tie breaker (it's not every day I hear folks pull out the KS distance!).
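For readers less familiar with the two metrics, here is a minimal sketch of how they can be computed from labels and model scores. It is not the competition's scoring code: the function names and the tiny data set are made up for illustration, and the KS distance is taken here as the largest gap between the cumulative score distributions of the two classes.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_distance(labels, scores):
    """Largest gap between the empirical score CDFs of the two classes."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    gaps = []
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        gaps.append(abs(cdf_pos - cdf_neg))
    return max(gaps)


# Hypothetical data: 1 = bad credit risk, 0 = good; a higher score means riskier.
labels = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.10, 0.20, 0.30, 0.40, 0.45, 0.60, 0.70, 0.90]

example_auc = auc(labels, scores)
example_ks = ks_distance(labels, scores)
print(f"AUC = {example_auc:.3f}, KS = {example_ks:.3f}")
```

The pairwise AUC loop is quadratic in the number of records, which is fine for a toy example but not how one would score a full competition file.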

One submission in particular commented on the difference between how algorithms build models and the metric used to evaluate them. CART uses the Gini Index, logistic regression the log-odds, and neural networks (usually) minimize mean squared error; none of these directly maximizes AUC. But this topic is worthy of another post.
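A small made-up example shows why the distinction matters. It does not come from any competition entry; the two sets of scores are hypothetical. One "model" has the lower mean squared error, while the other ranks the cases perfectly and so has the higher AUC.

```python
def mse(labels, scores):
    """Mean squared error between 0/1 labels and predicted scores."""
    return sum((y - s) ** 2 for y, s in zip(labels, scores)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]

# Hypothetical model A: confident scores, but one negative outranks one positive.
model_a = [0.10, 0.60, 0.55, 0.90]
# Hypothetical model B: timid scores near 0.5, but the ordering is perfect.
model_b = [0.40, 0.45, 0.50, 0.55]

for name, scores in (("A", model_a), ("B", model_b)):
    print(f"model {name}: MSE = {mse(labels, scores):.4f}, AUC = {auc(labels, scores):.2f}")
```

Model A wins on squared error and model B wins on AUC, so a learner that only minimizes squared error has no particular reason to prefer B.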

Tuesday, May 25, 2010

The Trimmed Mean has Intuitive Appeal

I was listening to Colin Cowherd of ESPN radio this morning and he made a very interesting observation that we data miners know, or at least should know and make good use of. The context was evaluating teams and programs: are they dynasties, or are they built off of one great player or coach? Lakers? Dynasty. Celtics? Dynasty. Bulls? Without Jordan, they have been a mediocre franchise. The Lakers without Magic are still a dynasty. The Celtics without Bird are still a dynasty.

So his rule of thumb that he applied to college football programs was this: remove the best coach and the worst coach, and then assess the program. If they are still a great program, they are truly a dynasty.

This is the trimmed (truncated) mean idea; he was applying it intuitively, but it is quite valuable in practice. When we assess customer lifetime value, if a small percentage of the customers generate 95% of the profits, examining those outliers or the long tail, while valuable, does not get at the general trend. When I was analyzing IRS corporate tax returns, the correlation between two line items (that I won't identify here!) was more than 90% over the 30K+ returns. But when we removed the largest 50 corporations, the correlation between these line items dropped to under 30%. Why? Because the tail drove the relationship; the overall trend didn't apply to the entire population. It is easy to be fooled by summary statistics for this reason: they assume characteristics about the data that may not be true.
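The effect is easy to reproduce on synthetic data. The sketch below is not the IRS data (all of the numbers are invented): 20 small "returns" whose two line items are nearly unrelated, plus three giant ones that fall roughly on a line. The helper names `pearson` and `drop_largest` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def drop_largest(pairs, k):
    """Remove the k records with the largest first line item."""
    return sorted(pairs, key=lambda p: p[0])[: len(pairs) - k]


# Synthetic bulk: the second item is a scrambled copy of the first, so nearly uncorrelated.
bulk = [(i, (9 * i) % 20) for i in range(20)]
# Synthetic giants: three huge records that lie close to the line y = x.
giants = [(1000, 1000), (2000, 2100), (3000, 2900)]
everyone = bulk + giants

r_all = pearson(*zip(*everyone))
r_trimmed = pearson(*zip(*drop_largest(everyone, k=3)))
print(f"all records: r = {r_all:.3f}; without the 3 largest: r = {r_trimmed:.3f}")
```

With the giants included the correlation is close to 1; without them it is close to 0, even though 20 of the 23 records never changed.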

This all gets back to nonlinearity in the data: if outliers behave differently than the general population, assess them based on the truncated populations. If outliers exist in your data, get the gist from the trimmed mean or median to reduce the bias from the outliers. We know this intuitively, but sometimes we forget to do it and make misleading inferences.
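As a minimal sketch of the rule of thumb above (drop the best and the worst, then assess), here is a trimmed mean next to the plain mean and the median. The profit figures are invented, and `trimmed_mean` is a hypothetical helper that trims a fixed count `k` from each end rather than a percentage.

```python
from statistics import mean, median


def trimmed_mean(values, k=1):
    """Mean after dropping the k smallest and k largest values."""
    if 2 * k >= len(values):
        raise ValueError("k trims away every value")
    ordered = sorted(values)
    return mean(ordered[k : len(ordered) - k])


# Hypothetical profit per customer: nine ordinary customers and one whale.
profits = [10, 12, 11, 9, 13, 10, 12, 11, 10, 5000]

plain = mean(profits)
trimmed = trimmed_mean(profits, k=1)
middle = median(profits)
print(f"mean = {plain}, trimmed mean = {trimmed}, median = {middle}")
```

The plain mean (about 510) describes no actual customer here, while the trimmed mean and the median both land near 11, which is the general trend.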

[UPDATE] I neglected to reference a former post that shows the problem of outliers in computing correlation coefficients: Beware of Outliers in Computing Correlations.

Sunday, May 23, 2010

Upcoming DMRadio Interview: Analytics and Business Rules

On June 3rd, a week from this Thursday, I'll be participating in my third DMRadio interview, this time on business rules (the first two were related to text mining, including this one last year). I have always found these interviews enjoyable to do. I'll probably be discussing an inductive rule discovery process I participated in with a Fortune 500 company (and described at last February's Predictive Analytics World Conference in San Francisco).

Even if you can't be there "live", you can download the interview later.

Thursday, May 20, 2010

Data Mining as a Top Career

More good news for data miners:

Data mining. The field involves extracting specific information or patterns from large databases. Career prospects are available in areas including advertising technology, scientific research and law enforcement.
I think they got it right: data mining (and its siblings Predictive Analytics and Business Analytics) is growing in appeal. But more importantly, I see organizations believing they can do it.

Of course time will tell. One sign will be how many more resumes (unsolicited) I get!

Tuesday, May 11, 2010

Web Analytics and Predictive Analytics: Comments from eMetrics

I just got back from the latest (and my first) eMetrics conference in San Jose, CA last week, and was very impressed by the practical nature of the conference. It was also quite a different experience for me to be in a setting where I knew very few people. I was there to co-present, with Angel Morales, "Behavioral Driven Marketing Attribution". Angel and I are co-founders of SmarterRemarketer, a new web analytics company, and the solution we described is just one nut we are trying to crack in the industry.

This post though is related to the overlap between web analytics and predictive analytics: very little right now. It really is a different world, and for many I spoke with, the mere mention of "predictive analytics" resulted in one of those unknowing looks back at me. In fairness, much that was spoken to me resulted in the same look!

One such topic was that of "use cases", a term used over and over in talks, but one that I don't encounter in the data mining world. We describe "case studies", but a "use case" is a smaller and more specific example of something interesting or unusual in how individuals or groups of individuals interact with web sites (I hope I got that right). The key though is that this is a thread of usage. In data mining, it is more typical that predictive models are built, and then to understand why the models are the way they are, one might trace through some of the more interesting branches of a tree or unusual variable combinations in something similar to this "use case" idea.

First, what to commend... The analyses I saw were quite good: customer segmentation, A/B testing, web page layout, some attribution, etc. There was a great keynote by Joe Megibow of Expedia describing how Expedia's entire web presence has changed in the past year. One of my favorite bloggers, Kevin Hillstrom of MineThatData fame, gave a presentation praising the power of conditional probabilities (very nice!). Lastly, there was one more keynote by someone I had never heard of (not to my credit), but who is obviously a great communicator and is well-known in the web analytics world, Avinash Kaushik. One idea I liked very much from his keynote was the long tail: the tail of the distribution of keywords that bring visitors to his website contains many times more visits than his top 10. In the data mining world, of course, this would push us to characterize these sparsely populated items differently so they have more influence in any predictive models. Lots to think about.

But I digress. The lack of data mining and predictive analytics at this conference begs (at least from me) the question: why not? They are swimming in data, have important business questions that need to be solved, and clearly not all of these are being solved well enough. That will be the subject of my next post.

Monday, May 10, 2010

Rexer Analytics Data Mining Survey

Calling all data miners! I encourage all to fill out the survey--it is the most complete survey of the data mining world that I am aware of. Use the link and code below, and stay tuned to see the results later in the year.

Survey Link:
Access Code:  RS2458

The full description sent by Karl Rexer is below:

Rexer Analytics, a data mining consulting firm, is conducting our fourth annual survey of the analytic behaviors, views and preferences of data mining professionals.  We would greatly appreciate it if you would:

1) Participate in this survey, and
2) Tell other data miners about the survey (forward this email to them).

Thank you.  Forwarding the survey to others is invaluable for our “snowball sample methodology”.  It helps the survey reach a wide and diverse group of data miners.   Thank you also to everyone who participated in previous Data Miner Surveys, and especially to the people who provided suggestions for new questions and other survey modifications.  This year’s survey incorporates many ideas from survey participants.

Your responses are completely confidential: no information you provide on the survey will be shared with anyone outside of Rexer Analytics.  All reporting of the survey findings will be done in the aggregate, and no findings will be written in such a way as to identify any of the participants.  This research is not being conducted for any third party, but is solely for the purpose of Rexer Analytics to disseminate the findings throughout the data mining community via publication, conference presentations, and personal contact. 

If you would like a summary of last year’s or this year’s findings emailed to you, there will be a place at the end of the survey to leave your email address.  You can also email us directly if you have any questions about this research or to request research summaries.

To participate, please click on the link below and enter the access code in the space provided.  The survey should take approximately 20 minutes to complete.  Anyone who has had this email forwarded to them should use the access code in the forwarded email.

Survey Link:
Access Code:  RS2458

Thank you for your time.  We hope the results from this survey provide useful information to the data mining community.