Of the resources consumed in data mining projects, the most precious
(read: "expensive") is time, especially the time of the human analyst.
Hence, a significant question for the analyst is how best to allocate
his or her time.
Long and continuing experience
indicates clearly that the most productive use of time in such work is
that dedicated to data preparation. I apologize if this seems like an
old topic to the reader, but it is an important lesson which seems to be
forgotten annually, as each new technical development presents itself. A
surprising number of authors, particularly the on-line variety, come to
the conclusion that "the latest thing"* will spare us from needing to
prepare and enhance the data.
I offer, as yet another
data point in favor of this perspective, a recent conversation I had
with a colleague. He and a small team conducted parallel modeling
efforts for a shared client. Using the same base data, they constructed
separate predictive models. His model and theirs achieved similar test
performance. The team used a random forest, while he used
logistic regression, one of the simplest modeling techniques. The team
was perplexed at the similarity in model performance. My associate asked
them how they had handled missing values. They responded that they
filled them in. He asked exactly how they had filled the missing values.
The response was that they set them all to zeros (!). By not taking the
time and effort to comprehensively address this issue, they had forced
their model to do the significant extra work of filling in these gaps
itself. Consider that some fraction of their data budget was spent on fixing this mistake, rather than being used to create a better model. Note, too, that it is far easier (less code, fewer input
variables to monitor, less to go wrong) to deploy a modestly-sized
logistic regression than any random forest.
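To make the missing-value point concrete, here is a minimal sketch contrasting a blanket zero fill with one common alternative: filling with the median of the observed values and keeping a separate "was missing" indicator. The conversation above does not say how my colleague actually handled the gaps, so the median-plus-flag approach, the function names and the sample ages are all assumptions made purely for illustration.

```python
from statistics import median


def impute_zero(values):
    """Replace every missing entry (None) with 0.0, as the team did."""
    return [0.0 if v is None else v for v in values]


def impute_median_with_flag(values):
    """Fill missing entries with the median of the observed values.

    Also returns a 0/1 indicator per row, so a downstream model can
    still learn from the fact that a value was missing.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return filled, flags


# Hypothetical input variable: customer age, with two missing entries.
ages = [34.0, None, 51.0, 29.0, None, 42.0]

zero_filled = impute_zero(ages)
median_filled, was_missing = impute_median_with_flag(ages)

print(zero_filled)    # missing customers now look like newborns
print(median_filled)  # missing customers look like a typical customer
print(was_missing)    # and the model is told which rows were filled
```

With zeros, the model must discover on its own that an "age" of 0 is a code for "unknown" rather than a real value. With the second approach, that information is handed over explicitly. Which fill is appropriate depends on the variable and on why values are missing, so treat this as a starting point rather than a recipe.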
Given this
context, it is curious to note that so much of what is published (again,
especially on-line; think of titles such as: "The 10 Learning
Algorithms Every Data Scientist Must Know") and so many job listings
emphasize, almost to the point of exclusivity, learning algorithms, as
opposed to practical questions of data sampling, data preparation and
enhancement, variable reduction, solving the business problem (instead of the technical one) or the ability to deploy the final product.
* For "the latest thing", you may fill in, variously, neural networks,
decision trees, SVM, random forests, GPUs, deep learning or whatever
comes out as next year's "next big thing".
If the most expensive resource is the time of the human analyst, how is letting the model do the significant extra work of filling in the gaps in the missing data itself a bad thing? I do not condone this as good practice (it’s terrible, I agree), but the time of the human analyst is saved when they don’t have to deal with the missing values themselves.
Of course, I am assuming that the time the model takes to fill in these missing values is no longer than it would take for the human analyst to do something more clever than filling in the missing values with zeros himself. If the results of both models achieve similar test performance, there isn’t much motivation to deal with the missing values in a better way.
I do agree that it is more important to understand the data you are working with and the scalability of your solution than it is to know of different algorithms you can throw at it. But how do you think that this importance can be conveyed over the shout of the “latest thing”? Should data scientists focus more on statistics? Does the tone of the entire science need to change? Even though a universal learner has been proven not to exist (No Free Lunch), it seems like that is what everyone is trying to find.
Why do so many people focus, as you say, “almost to the point of exclusivity,” on the learning algorithms, especially the latest and the greatest? As you point out, this is a problem, but I don’t think it is limited to the narrow scope of data mining algorithms. Consider this statement by Karl Popper, “we are not students of some subject matter, but students of problems. And problems may cut right across the borders of any subject matter or discipline.” While each of us has chosen a specific discipline or subject matter, ultimately we are, or at least should be, seeking to advance mankind’s knowledge, solving the problems we face, and discovering the problems that we do not yet know exist.
There can sometimes exist a tendency to become myopic. When we solve a particular problem that relates to our discipline, we get excited and start to carve out in our minds what it means to belong to discipline X. Over time, as more problems are solved and papers are published, we think we understand our discipline, where its boundaries lie, and which problems do not belong. We begin to limit ourselves to the study of a set of permissible problems and we begin to accept only certain types of solutions, in our case, “the latest thing.” This is a great danger and we need to remember what Karl Popper said. For mankind to continue its pace of advancement and learning, we must seek out and embrace the study of problems whose solutions span the learning of multiple disciplines.
"Long and continuing experience indicates clearly that the most productive use of time in such work is that dedicated to data preparation."
I recently participated in the Kaggle Toxic Comments competition, and I was really surprised by two things: 1. everyone in the top 1000 entries or so had above 95% accuracy, and 2. the teams who won basically used the same models as everyone else. The thing that gave them their competitive edge was data preparation. They augmented their comments by translating them into different languages in order to get more data. They assigned pseudo-labels to the test data because they noticed the test and train sets followed pretty different distributions. While the rest of us focused on throwing another model at the data and ensembling a large number of models, the winners took time to prepare a solid dataset.
I've really enjoyed following the "Tidy Data" movement in the R community. Tools like Weka / scikit-learn / MLR are making it much easier to chuck data into the "latest thing" model, and there are some really awesome tools in the tidyverse that facilitate manipulating data into formats. But I see intelligent data preparation as a major aspect of data mining that will be much more difficult to automate.