Predictive Modeling competitions, once the arena for a few data mining conferences, has now become big business. Kaggle (kaggle.com) is perhaps the most well-known forum for modeling competitions, using a crowd-sourcing mentality: if more people try to solve a problem, the likelihood that someone will create an excellent solution to that problem increases.
The participants, and there have been 10s of thousands of participants since their 2011 beginning, sometimes have no predictive modeling background and sometimes an extensive data science background. Some very clever algorithms and solutions have been developed with, on some occasions, ground-breaking results
One conclusion to draw from these competitions is that what we need in the predictive analytics space is more data scientists with different, innovative ideas for solving problems, and perhaps more in-depth training of data scientists so they can create these innovative solutions. After all, the Netflix prize winner created a solution that was an ensemble of model ensembles, comprised of hundreds of models (not a Kaggle competition, but one created by and for Netflix).
This idea of the importance of machine learning expertise was the topic of a Strata conference debate in 2012, tackling the question, “which is more important, domain expertise or machine learning expertise”, or the way it was phrased for the debate, “who should your first hire be: a domain expert or data scientist?”
The conclusion of the majority at the Strata conference was the machine learning is more important, but even the moderator, Mark Driscoll, concluded the following, “Could you currently prepare your data for a Kaggle competition? If so, then hire a machine learner. If not, hire a data scientist who has the domain expertise and the data hacking skills to get you there.” (http://medriscoll.com/post/18784448854/the-data-science-debate-domain-expertise-or-machine)
The point is that defining the competition objectives and the data needed to solve the problem is critically important. Non-domain experts, the data scientists, can not ever hope to understand the domain well enough to determine what the most effective question to answer would be, where to find the data to build a modeling data set, what the target variable should be, and how one should assess which model is best. These are business domain specific.
Even companies building the same kinds of models, let’s say customer retention or churn, will approach them differently depending on the kind of business, the lead time needed to act on potential churners, and the metrics for churn that relate to ROI for that company. I’ve build models for companies in the same domain area that took very different approaches; even though I had some domain experience from customer 1, that didn’t translate into developing business objectives well for company 2.
It’s the partnership that matters. I often think of these partnerships within an organization as the three-legged stool, all of which are needed for the modeling project to succeed: a business stakeholder who understands what business objectives matter to the company and how to articulate them, IT staff who know where the data is, what it means, and how to access it, and the analysts who know how to take the data and the business objectives and translate them into modeling objectives that address the business problem. Without all three, projects fail. We modelers could build the best models in the world that solve the wrong problem exceedingly well!
(first posted at http://www.predictiveanalyticsworld.com/patimes/a-good-business-objective-beats-a-good-algorithm/)