Saturday, September 07, 2013

On Data Mining Contests

Data mining contests have grown in popularity over the years, from the annual competitions at technical conferences to the continuous stream of events at sites like Kaggle. This has yielded several benefits: it allows many experts to work on difficult problems, gives novices a chance to work on real data, and showcases successful solutions. These competitions have even garnered the attention of the mainstream press. While this author believes the spread of these technical contests has been largely positive, their limitations are worth noting.

Despite using real data, the problems, as formulated, are somewhat artificial. Questions of sampling and initial variable selection have already been decided, as have the evaluation function and the model's part in the ultimate solution. To some extent, these are necessary constraints, but they are constraints nonetheless. In real-world data mining, all of these questions are the responsibility of the data miner and his or her clients, and they are not trivial considerations. In most larger organizations, the data is extensive enough that there is always "one more" table in the database that could be tapped for candidate predictors. Likewise, how the model might best be positioned as part of the total solution is not always obvious, especially in more complex problems. A minority of contests permit the use of outside data, but even this is somewhat unrealistic, since real organizations have budgets for the purchase of outside data, such as demographic data to be appended to a customer population. I've yet to learn of anyone paying for outside variables to append to competition data, though.

Another issue is the large number of competitors which these contests attract. Though it is good to have many analysts take a crack at a problem, one must wonder about the statistical significance of having hundreds of statisticians test God-only-knows how many hypotheses against the same data. Further, the number of competitors and the similarity of top contestants' performance figures make selection of a single "winner" a dubious proposition.
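
To make the multiple-comparisons concern concrete, consider a minimal simulation sketch (the accuracy, test-set size, and competitor count below are made up purely for illustration): even if every entrant's model were equally good, the top of the leaderboard would still pull ahead of the field by chance alone.

import random

random.seed(0)

TRUE_ACCURACY = 0.80    # every competitor's model is assumed equally good
TEST_SET_SIZE = 10_000  # size of the shared hold-out set (hypothetical)
N_COMPETITORS = 500     # number of entrants scored on that set (hypothetical)

def observed_accuracy():
    # Accuracy one competitor happens to achieve on the finite hold-out set.
    correct = sum(random.random() < TRUE_ACCURACY for _ in range(TEST_SET_SIZE))
    return correct / TEST_SET_SIZE

scores = sorted(observed_accuracy() for _ in range(N_COMPETITORS))
print(f"worst: {scores[0]:.4f}  median: {scores[len(scores) // 2]:.4f}  best: {scores[-1]:.4f}")
# The spread from worst to best here is pure sampling noise, yet a
# leaderboard would still crown the top score as the single winner.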

Finally, it has become rather common for winners of these contests to construct solutions of vast proportions, typically ensembles built from enormous numbers of base models. While such models may be feasible to deploy in some circumstances, they are far too computationally demanding to execute on many real databases quickly enough to be practical.
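
A rough back-of-envelope calculation (all figures hypothetical) illustrates the scale of the problem: scoring a very large ensemble over a sizable customer table can blow well past a nightly batch window, even though any single base model would finish in minutes.

# All figures below are invented for illustration only.
ROWS_TO_SCORE         = 50_000_000  # customer records scored in the nightly run
BASE_MODELS           = 1_000       # base models in a hypothetical winning ensemble
SECONDS_PER_MODEL_ROW = 5e-6        # assumed time to apply one base model to one row

ensemble_hours = ROWS_TO_SCORE * BASE_MODELS * SECONDS_PER_MODEL_ROW / 3600
single_minutes = ROWS_TO_SCORE * SECONDS_PER_MODEL_ROW / 60

print(f"full ensemble: {ensemble_hours:.0f} hours")    # roughly 69 hours: far beyond an overnight run
print(f"single model:  {single_minutes:.0f} minutes")  # roughly 4 minutes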

Some of these criticisms are probably unavoidable, especially the ones regarding the pre-selected, pre-digested contest data. Still, it'd be interesting to see future data mining competitions address at least some of these issues. For one thing, solution sizes (lines of SQL or C++ or something similar) could be limited to something that ordinary IT departments would be capable of executing during a typical overnight run. Averaging across a larger number of tasks might also begin to improve the significance of differences among contestants' performances.
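
As a rough illustration of that last point (the margin, noise level, and task counts below are invented for the sake of the sketch), a small simulation shows how a modest true difference between two contestants becomes much more likely to show up in the final ranking as the number of tasks grows:

import random

random.seed(0)

TRUE_MARGIN = 0.2   # contestant A's true per-task advantage over B (hypothetical)
NOISE_SD    = 1.0   # per-task scoring noise for each contestant (hypothetical)
TRIALS      = 5_000

def a_ranked_first(n_tasks):
    # Fraction of simulated competitions in which A's average score beats B's.
    wins = 0
    for _ in range(TRIALS):
        # Per-task score difference: the true margin plus noise from both contestants.
        diffs = [TRUE_MARGIN + random.gauss(0, NOISE_SD * 2 ** 0.5) for _ in range(n_tasks)]
        wins += sum(diffs) / n_tasks > 0
    return wins / TRIALS

for n in (1, 5, 25, 100):
    print(f"{n:3d} tasks: A finishes ahead in {a_ranked_first(n):.0%} of simulated contests")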

4 comments:

Anonymous said...

I haven't seen any winning solutions on Kaggle that a moderately skilled IT person couldn't implement in a day with AWS alone. Many of the winners share their code in their follow-up interviews: http://blog.kaggle.com/
I have yet to see an implementation that would be restricted by a company's IT bandwidth.

Will Dwinnell said...

Even on a conventional system, in something like C++ or SQL, this will depend on a number of factors, such as implementation cost and required recall time. Note that the winning entry of the much-vaunted Netflix Prize was never implemented for exactly this reason.

Further, many businesses rely on turnkey solutions from vendors which provide much more limited math and logic capability. In fact, one system I've worked with severely limits the precision of values (this would include incoming values, coefficients and intermediate calculations), forcing the people who customize the system to preemptively scale values which might underflow, then scale back when they're done.
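
Roughly, that workaround looks something like the sketch below (the precision floor and scale factor are made-up values, not those of any particular product): small intermediate products are formed on scaled copies of the inputs, and the fixed factor is divided back out once the result is large enough to hold.

PRECISION_FLOOR = 1e-4  # smallest magnitude the hypothetical system can hold
SCALE = 1e3             # factor chosen so intermediate products stay above that floor

def weighted_sum(values, weights):
    # Each individual product would fall below the floor, so it is formed on a
    # scaled copy of the input; the total is scaled back once it is large enough.
    total = 0.0
    for v, w in zip(values, weights):
        total += (v * SCALE) * w
    return total / SCALE

# Fifty contributions of 0.002 * 0.03 = 6e-05 each (below the floor individually),
# yet the overall total of 3e-03 is safely representable after scaling back.
print(weighted_sum([0.002] * 50, [0.03] * 50))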

Sandro Saitta said...

I definitely agree with your post, Will. I discuss this topic in the forthcoming bulletin of the Swiss Statistical Society.

Although very useful, a standard competition focuses on modelling, which represents only 5-10% of a data mining project in industry. A competition won't (yet) cover critical steps such as business understanding and deployment challenges.

Unknown said...

I agree with the comments on this post. These competitions exclude some of the important processes that any DM practitioner needs to go through (data processing, feature selection, cost estimation, etc.) in order to let participants focus just on model performance. Applying models and algorithms to the provided data is not so hard nowadays if substantial computational power is available. Therefore, running models or model ensembles, sweeping through parameters to find the sweet spot with machine learning toolkits and packages, might not be so interesting to those in industry. As mentioned, what people really want to know is how the entire DM process, from data cleaning to deployment, can be accomplished within a reasonable amount of time, computational power, and cost. I hope that some competitions will start offering tasks that are applicable to more realistic business situations, such as the one suggested at the end of the post.