Predictive modeling competitions, once the arena for a few data mining conferences, have now become big business. Kaggle (kaggle.com) is perhaps the best-known forum for modeling competitions, and it relies on a crowd-sourcing mentality: the more people who try to solve a problem, the more likely it is that someone will create an excellent solution to that problem.
The participants, and there have been tens of thousands of them since their 2011 beginning, sometimes have no predictive modeling background and sometimes have an extensive data science background. Some very clever algorithms and solutions have been developed with, on some occasions, ground-breaking results.
One conclusion to draw from these competitions is that what we need in the predictive analytics space is more data scientists with different, innovative ideas for solving problems, and perhaps more in-depth training of data scientists so they can create these innovative solutions. After all, the Netflix prize winner created a solution that was an ensemble of model ensembles, comprised of hundreds of models (not a Kaggle competition, but one created by and for Netflix).
This idea of the importance of machine learning expertise was the topic of a Strata conference debate in 2012, tackling the question, “which is more important, domain expertise or machine learning expertise”, or the way it was phrased for the debate, “who should your first hire be: a domain expert or data scientist?”
The conclusion of the majority at the Strata conference was that machine learning is more important, but even the moderator, Mark Driscoll, concluded the following:
“Could you currently prepare your data for a Kaggle competition? If so, then hire a machine learner. If not, hire a data scientist who has the domain expertise and the data hacking skills to get you there.” (http://medriscoll.com/post/18784448854/the-data-science-debate-domain-expertise-or-machine)
The point is that defining the competition objectives and the data needed to solve the problem is critically important. Non-domain experts, the data scientists, cannot hope to understand the domain well enough to determine what the most effective question to answer would be, where to find the data to build a modeling data set, what the target variable should be, and how one should assess which model is best. These decisions are specific to the business domain.
Even companies building the same kinds of models, let’s say customer retention or churn, will approach them differently depending on the kind of business, the lead time needed to act on potential churners, and the metrics for churn that relate to ROI for that company. I’ve built models for companies in the same domain area that took very different approaches; even though I had some domain experience from company 1, that didn’t translate into developing business objectives well for company 2.
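To make this concrete, here is a minimal sketch in Python of how the "best" churn model can depend on the business metric. Everything in it is hypothetical: the two models, their confusion-matrix counts, the dollar figures, and the simplifying assumption that every correctly flagged churner who receives an offer is retained. It is an illustration of the idea, not a recipe from any particular project.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # churners correctly flagged
    fp: int  # loyal customers flagged by mistake
    fn: int  # churners the model missed
    tn: int  # loyal customers correctly left alone


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of customers classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def campaign_value(c: ConfusionCounts, saved_value: float, offer_cost: float) -> float:
    """Net value of sending a retention offer to every flagged customer.

    Assumption (hypothetical): each flagged true churner is retained and is
    worth `saved_value`; each offer sent costs `offer_cost`.
    """
    return c.tp * saved_value - (c.tp + c.fp) * offer_cost


def best_model(models: dict, saved_value: float, offer_cost: float) -> str:
    """Name of the model with the highest campaign value for this business."""
    return max(
        models,
        key=lambda name: campaign_value(models[name], saved_value, offer_cost),
    )


# Made-up results for two models scored on the same 1,000-customer holdout set.
models = {
    "precise": ConfusionCounts(tp=60, fp=20, fn=40, tn=880),
    "broad": ConfusionCounts(tp=90, fp=200, fn=10, tn=700),
}

# Company 1: a retained customer is worth a lot and the offer is cheap.
company_1_choice = best_model(models, saved_value=500, offer_cost=10)

# Company 2: a retained customer is worth little and the offer is expensive.
company_2_choice = best_model(models, saved_value=50, offer_cost=20)

print(company_1_choice, company_2_choice)
```

With these made-up numbers, the "precise" model has the higher accuracy, yet company 1 is better off with the "broad" model because catching more churners outweighs the wasted offers. Company 2 is better off with the "precise" model. Only someone who knows the business can supply the figures that decide between them.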
It’s the partnership that matters. I often think of these partnerships within an organization as a three-legged stool, where all three legs are needed for the modeling project to succeed: a business stakeholder who understands what business objectives matter to the company and how to articulate them, IT staff who know where the data is, what it means, and how to access it, and the analysts who know how to take the data and the business objectives and translate them into modeling objectives that address the business problem. Without all three, projects fail. We modelers could build the best models in the world and still solve the wrong problem exceedingly well!
(first posted at http://www.predictiveanalyticsworld.com/patimes/a-good-business-objective-beats-a-good-algorithm/)
4 comments:
Wow, I'd never heard of kaggle.com before. I just took a peek, and I'd really love to take a week off work and just play with Flight Quest 2!
Also, since when does Mark Driscoll know anything about big data?
Kaggle has been tremendously successful--glad that you are interested (if you ever would have time to give it a go). The amazing thing is that winners do not necessarily have any formal statistics or mathematics background; some have good ideas that are unorthodox. I think you'd do well!
I particularly like Mark's dual personality :)
I completely agree that data mining requires the cooperation of both domain experts and machine learning experts. I actually think that the question, “which is more important, domain expertise or machine learning expertise”, shows a flawed attitude. It is impossible to say that one is more or less important than the other. They both have their roles to play and without each other they will reach flawed outcomes.
Still, I think it is particularly important that domain experts are around to help both initially prepare the data and to evaluate the meaning of the learned model. Much data in the business world is dirty and it is essential that domain experts are available to oversee the cleaning of the data and reduce the amount of bias introduced during cleaning. It is also very important that they are available to provide cost metrics to the machine learning expert and also examine the results and evaluate whether these statistically significant results are actually usable to the business.
One of my favorite stories in this regard was shared during my undergraduate education. As part of a data mining class, students were presented data from the university bookstore and were asked to analyze it however they saw fit. At the end of the semester they were extremely excited to share their results, as they had found three items that always sold together! When they presented their results they were told that the bookstore already knew the items sold together because they were only sold as a bundle. Thus the perfect result from the students’ model turned out to be perfectly useless. If only they had applied more domain knowledge throughout the process, they might have reached results that were actually interesting. This just goes to show that we need all three legs of the stool you mentioned.
Interesting article. One question that the article discusses is, “which is more important, domain expertise or machine learning expertise”, or, “who should your first hire be: a domain expert or data scientist?”
The question is a good one in that it underscores a very important aspect of data mining that is dealt with every day in industry, but with all due respect, it seems a bit of a no-brainer to me. Each is essential but not sufficient.
A domain specialist naively throwing data into an algorithm and hoping for the best is likely to get an interesting but invalid result, while the data scientist who knows nothing about the data is likely to get a valid but not useful result.