Of the resources consumed in data mining projects, the most precious
(read: "expensive") is time, especially the time of the human analyst.
Hence, a significant question for the analyst is how best to allocate
his or her time.
Long and continuing experience
indicates clearly that the most productive use of time in such work is
that dedicated to data preparation. I apologize if this seems like an
old topic to the reader, but it is an important lesson which seems to be
forgotten annually, as each new technical development presents itself. A
surprising number of authors, particularly the on-line variety, come to
the conclusion that "the latest thing"* will spare us from needing to
prepare and enhance the data.
I offer, as yet another
data point in favor of careful data preparation, a recent conversation I had
with a colleague. He and a small team conducted parallel modeling
efforts for a shared client. Using the same base data, they constructed
separate predictive models. His model and theirs achieved similar test
performance. The team used a random forest, while he used
logistic regression, one of the simplest modeling techniques. The team
was perplexed at the similarity in model performance. My associate asked
them how they had handled missing values. They responded that they
filled them in. He asked exactly how they had filled the missing values.
The response was that they set them all to zeros (!). By not taking the
time and effort to comprehensively address this issue, they had forced
their model to do the significant extra work of filling in these gaps
itself. Consider that some fraction of their data budget was spent
compensating for this mistake, rather than on building a better model.
Note, too, that it is far easier (less code, fewer input
variables to monitor, less to go wrong) to deploy a modestly-sized
logistic regression than any random forest.
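To make the missing-value point concrete, here is a minimal sketch in plain Python. It is an illustration only, not the procedure anyone in the story above actually used: the column (`ages`) and the function names are hypothetical, and median-plus-indicator is just one common alternative to zero filling. The idea is to fill each gap with a plausible value and to record, in a separate flag, that the value was originally missing.

```python
from statistics import median


def zero_fill(values):
    """Naive approach: replace every missing entry (None) with 0."""
    return [0 if v is None else v for v in values]


def impute_with_indicator(values):
    """Replace missing entries (None) with the median of the observed
    values, and return a parallel 0/1 flag marking imputed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return filled, flags


# Hypothetical column: customer age, with two missing entries.
ages = [34, None, 51, 29, None, 42]

zeroed = zero_fill(ages)                          # gaps become age 0
filled, was_missing = impute_with_indicator(ages)  # gaps become the median

print(zeroed)
print(filled)
print(was_missing)
```

With zero filling, the two unknown customers look like newborns, and the model has to learn that "age 0" really means "unknown". With the second approach, the filled value sits in the middle of the observed range and the flag carries the "was missing" information explicitly, so either a logistic regression or a random forest can use it directly.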
Given this
context, it is curious to note that so much of what is published (again,
especially on-line; think of titles such as "The 10 Learning
Algorithms Every Data Scientist Must Know") and so many job listings
emphasize learning algorithms, almost to the point of exclusivity, as
opposed to practical questions of data sampling, data preparation and
enhancement, variable reduction, solving the business problem (instead
of the technical one) or the ability to deploy the final product.
* For "the latest thing", you may fill in, variously, neural networks,
decision trees, SVM, random forests, GPUs, deep learning or whatever
comes out as next year's "next big thing".