Saturday, September 07, 2013

On Data Mining Contests

Data mining contests have grown in popularity over the years, from the annual competitions at technical conferences to the continuous stream of events at sites like Kaggle. This has yielded several benefits: many experts get to work on difficult problems, novices get a chance to work on real data, and successful solutions are showcased. These competitions have even garnered the attention of the mainstream press. While believing that the spread of these technical contests has been largely positive, this author feels that their limitations are worth noting.

Despite using real data, the problems, as formulated, are somewhat artificial. Questions of sampling and initial variable selection have already been decided, as have the evaluation function and the model's part in the ultimate solution. To some extent, these are necessary constraints, but they are constraints nonetheless. In real-world data mining, all of these questions are the responsibility of the data miner and his or her clients, and they are not trivial considerations. In most larger organizations, there is enough data that there is always "one more" table in the database which could be tapped for candidate predictors. Likewise, how the model might best be positioned as part of the total solution is not always obvious, especially in more complex problems. A minority of contests permit the use of outside data, but even this is somewhat unrealistic: real organizations have budgets for the purchase of outside data, such as demographic data to be appended to a customer population, whereas I've yet to learn of anyone paying for outside variables to append to competition data.

Another issue is the large number of competitors that these contests attract. Though it is good to have many analysts take a crack at a problem, one must wonder about the statistical significance of results when hundreds of statisticians test God-only-knows how many hypotheses against the same data. Further, the number of competitors and the similarity of top contestants' performance figures make selection of a single "winner" a dubious proposition.
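To make the "winner" problem concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration, not drawn from any actual contest: 200 competitors, a 1,000-case test set, and models that all have exactly the same true accuracy of 80%. It also treats each competitor's errors as independent, which real entries scored on a shared test set would not be.

```python
import random

def simulate_leaderboard(n_competitors=200, n_test=1000, true_accuracy=0.80, seed=0):
    """Score equally skilled (hypothetical) competitors on a finite test set.

    Every competitor has the same true accuracy, so any spread in the
    returned scores is due to test-set sampling noise alone.
    Returns the observed accuracies sorted best-first.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_competitors):
        # Each test case is classified correctly with probability true_accuracy.
        correct = sum(rng.random() < true_accuracy for _ in range(n_test))
        scores.append(correct / n_test)
    return sorted(scores, reverse=True)

scores = simulate_leaderboard()
print(f"winner:    {scores[0]:.3f}")
print(f"runner-up: {scores[1]:.3f}")
print(f"median:    {scores[len(scores) // 2]:.3f}")
```

Under these assumptions the "winner" finishes noticeably ahead of the median entry even though no competitor is actually better than any other, which is the sense in which crowning a single winner from closely bunched scores is dubious.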

Finally, it has become rather common for winners of the contests to construct solutions of vast proportions, typically ensembles of a gigantic number of base models. While such models may be feasible to deploy in some circumstances, they are far too computationally demanding to execute on many real databases quickly enough to be practical.

Some of these criticisms are probably unavoidable, especially the ones regarding the pre-selected, pre-digested contest data. Still, it'd be interesting to see future data mining competitions address at least some of these issues. For one thing, solution sizes (lines of SQL or C++ or something similar) might be limited to something which ordinary IT departments would be capable of executing during a typical overnight run. For another, scoring contestants on their average performance across an increased number of tasks might begin to improve the significance of differences among contestants' performances.
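The point about averaging across tasks can be sketched with another small simulation. The numbers are again assumptions picked for illustration: two hypothetical contestants whose true scores differ by 0.01, with independent Gaussian noise of standard deviation 0.02 on each task's measured score. The function estimates how often the truly better contestant ends up ahead when ranked on the average of `n_tasks` scores.

```python
import random

def better_wins_rate(n_tasks, n_trials=2000, skill_gap=0.01, noise_sd=0.02, seed=1):
    """Estimate how often the truly better contestant has the higher average score.

    All parameters are illustrative assumptions: contestant A's true score is
    0.80 + skill_gap, contestant B's is 0.80, and each task adds independent
    Gaussian noise with standard deviation noise_sd to the measured score.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        a = sum(rng.gauss(0.80 + skill_gap, noise_sd) for _ in range(n_tasks)) / n_tasks
        b = sum(rng.gauss(0.80, noise_sd) for _ in range(n_tasks)) / n_tasks
        wins += a > b
    return wins / n_trials

for n in (1, 4, 16):
    print(f"{n:2d} task(s): better contestant ranked first in {better_wins_rate(n):.0%} of trials")
```

With a single task the better contestant wins only somewhat more often than a coin flip would predict; averaging over more tasks shrinks the noise and makes the ranking more trustworthy, though only insofar as the tasks really are independent measurements of the same skill.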