Tuesday, May 01, 2007

Comparison of Algorithms at PAKDD2007


At the link in the title are the results from the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2007). The dataset for the competition was a cross-selling dataset, but that is not of interest to me here. What interests me is this: which algorithms did the best, and were their results significantly different?

A note about the image in the post: I took the results and sorted them by area under the ROC curve (AUC). The results were already color-coded into groups (winners, top 10, and top 20); I changed the colors to make them more legible. I also added red bold text to the algorithm implementations that included an ensemble (note that after I did this, I discovered that the winning Probit model was also an ensemble).

And, for those who don't want to look at the image, the top four models were as follows:

AUC       Rank   Modeling Technique
70.01%    1      TreeNet + Logistic Regression
69.99%    2      Probit Regression
69.62%    3      MLP + n-Tuple Classifier
69.61%    4      TreeNet


First, note that all four winners used ensembles, but ensembles of three different base algorithms: trees (TreeNet), neural networks, and probits. The differences between these results are quite small (arguably not significant, though more testing would be needed to show this). The conclusion I draw, then, is that the ensemble is more important than the algorithm, so long as there are good predictors, variation in the data used to build the models, and sufficient diversity in the predictions issued by the individual models.
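To make that concrete, here is a minimal sketch (Python with scikit-learn on synthetic data - not any competitor's actual pipeline) of averaging the predicted probabilities of three diverse base learners and comparing AUCs:

```python
# Minimal sketch (not the competitors' code): average the predicted
# probabilities of three diverse base learners and compare the ensemble's
# AUC with each individual model's AUC. Synthetic data stands in for the
# PAKDD cross-selling dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

models = {
    "boosted trees": GradientBoostingClassifier(random_state=0),
    "logistic":      LogisticRegression(max_iter=1000),
    "mlp":           MLPClassifier(hidden_layer_sizes=(20,), max_iter=500,
                                   random_state=0),
}

probs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    probs[name] = model.predict_proba(X_te)[:, 1]
    print(f"{name:14s} AUC = {roc_auc_score(y_te, probs[name]):.4f}")

# The ensemble: a simple average of the three probability estimates.
ensemble = np.mean(list(probs.values()), axis=0)
print(f"{'ensemble':14s} AUC = {roc_auc_score(y_te, ensemble):.4f}")
```

On most draws of data like this, the averaged probabilities land at or near the best single model's AUC, which is the pattern the PAKDD leaderboard suggests.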

I have not yet looked at the individual results to see how much preprocessing was necessary for each of the techniques, but I suspect that less was needed for the TreeNet models simply because of the inherent ability of CART-style trees to handle missing data, outliers, and mixed categorical/numeric data.

Second, and related to the first point: while I still argue that, generally speaking, trees are less accurate than neural networks or SVMs, ensembles level the playing field. What surprised me the most was that the logistic regression and Probit ensembles performed as well as they did - not because of the algorithms themselves, but because I had not yet been convinced that Probits or Logits consistently work well in ensembles. This is more evidence that they do (though I need to read further into how they were constructed before I can comment on why they did as well as they did).

11 comments:

Sandro Saitta said...

The results are interesting, although it is difficult to know whether an algorithm's good results are due to its efficiency or its popularity. I'm also surprised not to see SVMs in the top 20.

Dean Abbott said...

You're right that the sample size here is too small to infer much about particular algorithms (like SVMs).

What I think is most interesting is that, regardless of the base algorithm used - tree, probit, neural net, ... - the top entries all combined their models into an ensemble.

Therefore, my conclusion is that the base algorithm is not nearly as important (purely for performance) as combining predictions. What do you think?

Anonymous said...

I've also been doing some analysis of the results and have come to a similar conclusion: how the ensembling is done is more important than the algorithm.

Findings to date:

1. An average ranking of the top 27 submissions would have won the competition.

2. An average ranking of the top 36 submissions would have come runner up.

3. Our submission would have jumped 5 places if we had taken an average rank of our 2 models (NN + GAM) rather than a logistic-based weighting. Hence no change in algorithms, just in how they are combined.

4. Taking an average rank seems to give better results than an average score.

5. The TreeNet + Logistic winner appears to be very close to a linear combination of the ranks of two other individual entries, hence the synergy of the committee can be quantified.

6. We built 10 neural network models, each on a random 90% of the data, and then averaged the rank order. The committee model was significantly better than any individual model and gave the same score as the winner. Adding more than 10 models to the committee did not improve things - the combination seemed to plateau. (A rough sketch of this rank-averaging approach appears after this list.)
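A rough sketch of the rank-averaging committee in item 6 (not the actual competition code; scikit-learn models and synthetic data are stand-ins for the real entries):

```python
# Sketch of a rank-averaging committee: fit several models on random 90%
# subsamples of the training data, rank-order each model's scores on the
# holdout set, and average the ranks.
import numpy as np
from scipy.stats import rankdata
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=1)

rng = np.random.default_rng(1)
n_models = 10
rank_sum = np.zeros(len(y_te))
for i in range(n_models):
    # each committee member sees a random 90% of the training data
    idx = rng.choice(len(y_tr), size=int(0.9 * len(y_tr)), replace=False)
    nn = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=i)
    nn.fit(X_tr[idx], y_tr[idx])
    scores = nn.predict_proba(X_te)[:, 1]
    rank_sum += rankdata(scores)        # combine ranks, not raw scores

avg_rank = rank_sum / n_models
print("committee AUC (average rank):", roc_auc_score(y_te, avg_rank))
```

Averaging ranks rather than raw scores puts the committee members on a common scale, which is the distinction item 4 above draws.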

I'll post a more comprehensive report at www.tiberius.biz next week.

Phil

Will Dwinnell said...

It's always interesting to see the results of this kind of "shoot-out". I find Sandro's comment astute, and I suppose there are a variety of biases in this sort of contest, such as who uses which tools versus who has the time or incentive to compete.

A few years ago, Tjen-Sien Lim and colleagues undertook some fairly broad empirical analyses of machine learning algorithms:

http://citeseer.ist.psu.edu/171304.html

http://citeseer.ist.psu.edu/lim00comparison.html

Invariably, though, anyone with any experience can think of a modeling algorithm or commercial offering which was not included, or criticize the choice of data sets used in the evaluation.

At least one article was published in NIPS that compared several diverse learning systems and argued that while model performance varied somewhat, modeling algorithms differed much more in terms of things other than model accuracy: training time, recall time, memory requirements, etc.

Dean Abbott said...

I think that one of the most interesting treatments of the "algorithm shootout" was done in the 90s as part of the StatLog project and captured in a book entitled "Machine Learning, Neural and Statistical Classification", edited by Michie, Spiegelhalter and Taylor. It's available online in PDF format here. Not only did they compare the algorithms, but they also clustered the results to see which algorithms behaved similarly.

As Sandro and Will point out, there is, of course, much more going on than just the performance. In the StatLog project, they only used implementations of algorithms that were freely available, and from what I remember the algorithms were run with default settings. The great thing about the PAKDD results is that they are all optimized results, removing the bias that we all have (or most of us have) toward one algorithm over another :) I always prefer these kinds of comparisons over an individual's own comparisons.

To cover Will's point (and the related point I made in a prior comment), I would want another measure of how much data prep was done to achieve each solution - that adds to the complexity of the solution and to the time it takes to build it. This would include transformations of fields, missing value treatment, etc.

Phil's comments have prompted me to think about doing a meta-analysis of the results myself. If I go down that road, I'll post again, but until then I'll look for Phil's findings and cross-link them here.

Will Dwinnell said...

"Phil's comments have prompted me to think about doing a meta-analysis of the results myself."

Yes, please do! I, for one, would be very interested.

Anonymous said...

For this data set I don't think the algorithm is that important, as there is no really complicated equation involved in getting a good solution. In our paper 083 we give some very simple rules for who to target and who not to target. I think the experience of the model builder, in knowing some tricks of the trade, is just as important.

What we did see was that the final results of most submissions were well below what they were expecting - most who published training results quoted 0.72. This is what prompted me to investigate further, as I wanted to know where I had gone wrong. What I have subsequently discovered and learned is the best thing that came out of this: our future models will be improved by the process. (They were not that bad in the first place; we came 2nd in accuracy in last year's competition.)

Some other findings of note include...

In the training set the split of non-responders to responders was 40000/700, whereas in the holdout set it was 8000/350. The division of the non-responders was done randomly but was not stratified, meaning the non-responders in the two sets did not look exactly the same with respect to the distribution of the input variables. This is always going to be an issue when there are not many targets in the first place.

This is why ensembles and bagging worked the best on this data. In a way, an ensemble of different algorithms is doing a similar thing to bagging of the same algorithm, if the different algorithms are homing in on different features of the data.
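One illustrative way to check for this kind of train/holdout mismatch (purely a sketch; the column names and data below are invented, not from the competition) is to compare each input's distribution across the two samples with a two-sample Kolmogorov-Smirnov test:

```python
# Illustrative check (not from the competition): compare the distribution
# of each input variable between the training and holdout samples using a
# two-sample Kolmogorov-Smirnov test. Column names and data are invented.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = pd.DataFrame({"age": rng.normal(40, 10, 700),
                      "balance": rng.lognormal(8, 1, 700)})
holdout = pd.DataFrame({"age": rng.normal(42, 10, 350),   # slight drift
                        "balance": rng.lognormal(8, 1, 350)})

for col in train.columns:
    res = ks_2samp(train[col], holdout[col])
    flag = "  <-- distributions differ" if res.pvalue < 0.05 else ""
    print(f"{col:10s} KS = {res.statistic:.3f}, p = {res.pvalue:.3f}{flag}")
```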

The best solution I found was a linear addition of the rank orders of the submissions that came 2nd, 3rd, 4th, 7th, and 22nd. If you look, you will see there are lots of different algorithms in the mix here, and it was significantly better than the winner.

Anyway more at http://www.tiberius.biz later. Will post here when I get it written up.

Phil

Anonymous said...

I think the secret of the competition data is simple: in these data there are no strong patterns characteristic of the different classes. Therefore, rather "rough" methods win the competition. The participants who understood that the emphasis should be placed not on the accuracy of the solution to the competition problem, but on the stability of that solution, are the ones who won.

Will Dwinnell said...

The last commenter makes some interesting points. In reflecting on them, I have two further thoughts:

1. It is absolutely amazing how many competitors continue to focus on apparent ("in-sample") model performance, only to see their scores crash when assessed on test data. One would have thought we'd be past this by now, but I suppose there is still a large fraction of analysts who "torture the data until it talks" by using inadequate validation procedures. I know from direct observation that this happens in industry with distressing frequency. (A toy illustration of this trap follows the second point.)

2. These contests are always more interesting (and more meaningful, in my opinion), when multiple problems are posed.
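To illustrate the first point with a toy sketch (synthetic data and a deliberately overfit model; nothing to do with the actual PAKDD entries): an unpruned decision tree looks perfect on its own training data and far worse on a proper holdout set.

```python
# Toy illustration of the "in-sample" trap: an overfit model looks superb
# on its own training data and much worse on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

overfit = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unpruned tree
print("in-sample AUC:", roc_auc_score(y_tr, overfit.predict_proba(X_tr)[:, 1]))
print("holdout AUC:  ", roc_auc_score(y_te, overfit.predict_proba(X_te)[:, 1]))
```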

Anonymous said...

Respected Will,
Thanks for your comment on my formulations.
I agree with your thoughts.
But for me it is interesting to analyze data that contain complex patterns. For example, I often meet such situations when processing medical data. On such data the strengths and weaknesses of the various algorithms stand out very clearly, since there we are dealing with a full-scale combinatorial problem.
V.D.

Anonymous said...

I've finally gotten around to documenting some further findings on this data:

http://www.tiberius.biz/pakdd07.html