Thursday, May 27, 2010

PAKDD-10 Data Mining Competition Winner: Ensembles Again!

The PAKDD-10 Data Mining Competition results are in: ensembles took the top four positions, and I think the fifth as well. The winner used Stochastic Gradient Boosting and Random Forests in Statistica; second place used a combination of logistic regression and Stochastic Gradient Boosting (with Salford Systems CART for some feature extraction). Interestingly to me, the fifth-place finisher used WEKA, an open source software tool.

The problem was credit risk, with biased data for building the models. That is a good way to run the competition because it is the problem we usually face anyway: the data were collected from historic interactions with the company and are biased by the approaches the company has used in the past, rather than being a pure random sample. Model performance was judged by Area Under the ROC Curve (AUC), with the KS distance as the tie breaker (it's not every day I hear folks pull out the KS distance!).
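For readers who have not computed these two metrics by hand, here is a minimal sketch in plain Python. The labels and scores are made-up toy values, not competition data, and the function names `auc` and `ks_distance` are my own. AUC is computed as the fraction of (positive, negative) pairs in which the positive case gets the higher score, and the KS distance as the largest gap between the score distributions of the two classes.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ks_distance(labels, scores):
    """Largest gap between the empirical score CDFs of the two classes."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


# Toy data (hypothetical): 1 = bad credit risk, 0 = good.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

print("AUC:", auc(labels, scores))
print("KS: ", ks_distance(labels, scores))
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration but not for a real scoring set; a rank-based calculation gives the same AUC much faster.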

One submission in particular commented on the difference between how algorithms build models and the metric used to evaluate them. CART splits on the Gini index, logistic regression fits the log-odds by maximum likelihood, and neural networks (usually) minimize mean squared error; none of these directly maximizes AUC. But this topic is worthy of another post.
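A small illustration of that mismatch, using made-up scores rather than anything from the competition: the sketch below compares two hypothetical models on four cases. Model B has the lower mean squared error, yet model A ranks the cases better and so has the higher AUC. The names `mse`, `auc`, `model_a`, and `model_b` are mine.

```python
def mse(labels, scores):
    """Mean squared error between 0/1 labels and predicted scores."""
    return sum((y - s) ** 2 for y, s in zip(labels, scores)) / len(labels)


def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]

# Model A: timid scores near 0.5, but every positive outranks every negative.
model_a = [0.40, 0.45, 0.50, 0.55]
# Model B: confident scores, but one negative outranks one positive.
model_b = [0.05, 0.60, 0.55, 0.95]

for name, scores in [("A", model_a), ("B", model_b)]:
    print(name, "MSE:", round(mse(labels, scores), 4),
          "AUC:", auc(labels, scores))
```

A learner that minimizes squared error would prefer model B, while a judge scoring on AUC would prefer model A, which is the gap the submission was pointing at.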
