Monday, March 20, 2017

A Question of Resource Allocation

Of the resources consumed in data mining projects, the most precious (read: "expensive") is time, especially the time of the human analyst. Hence, a significant question for the analyst is how best to allocate his or her time.

Long and continuing experience indicates clearly that the most productive use of time in such work is that dedicated to data preparation. I apologize if this seems like an old topic to the reader, but it is an important lesson that seems to be forgotten annually, as each new technical development presents itself. A surprising number of authors (particularly the on-line variety) conclude that "the latest thing"* will spare us from needing to prepare and enhance the data.

I offer, as yet another data point in favor of this perspective, a recent conversation I had with a colleague. He and a small team conducted parallel modeling efforts for a shared client: using the same base data, they constructed separate predictive models, and his model and theirs achieved similar test performance. The team used a random forest, while he used logistic regression, one of the simplest modeling techniques, so the team was perplexed at the similarity in performance. My colleague asked them how they had handled missing values. They responded that they had filled them in. He asked exactly how they had filled them. The response was that they set them all to zero (!). By not taking the time and effort to address this issue comprehensively, they had forced their model to do the significant extra work of untangling those disguised gaps itself. Some fraction of the information in their data was spent compensating for this mistake, rather than going toward a better model. Note, too, that it is far easier (less code, fewer input variables to monitor, less to go wrong) to deploy a modestly-sized logistic regression than any random forest.
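
The anecdote does not tell us what tools either side used, but the damage done by zero-filling is easy to reproduce. Below is a small illustrative sketch in Python (pandas and scikit-learn are my choices here, not the teams', and the toy data is entirely invented): a dataset in which missingness itself carries information, comparing naive zero-filling against median imputation with an explicit missing-value flag.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    # Toy data: income (in $1000s) drives the outcome, and higher earners
    # are more likely to leave the field blank, so missingness is informative.
    n = 2000
    income = rng.lognormal(mean=3.6, sigma=0.5, size=n)
    y = (income > np.median(income)).astype(int)
    p_missing = np.where(income > np.quantile(income, 0.6), 0.65, 0.05)
    observed = pd.Series(np.where(rng.random(n) < p_missing, np.nan, income))

    # Approach 1: fill every missing value with zero, as in the anecdote.
    # Zero sits below any real income, so the (mostly high-earning) missing
    # records are pushed toward exactly the wrong end of the scale.
    X_zero = observed.fillna(0.0).to_frame("income")

    # Approach 2: fill with the observed median and add an explicit
    # missing-value flag, so "the value was absent" is preserved as its
    # own piece of information instead of being disguised as a number.
    X_flag = observed.fillna(observed.median()).to_frame("income")
    X_flag["income_missing"] = observed.isna().astype(int)

    for name, X in [("zero-fill", X_zero), ("median + flag", X_flag)]:
        acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
        print(f"{name:>15}: mean CV accuracy = {acc:.3f}")

With zero-filling, the single linear coefficient must serve two conflicting masters: a positive relationship for observed incomes, and a cluster of mostly high-earning records parked at zero. The flag variable lets a simple model keep those two effects separate, which is the kind of preparation work the team skipped.
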

Given this context, it is curious that so much of what is published (again, especially on-line; think of titles such as "The 10 Learning Algorithms Every Data Scientist Must Know") and so many job listings emphasize learning algorithms, almost to the point of exclusivity, as opposed to the practical questions of data sampling, data preparation and enhancement, variable reduction, solving the business problem (instead of merely the technical one), or the ability to deploy the final product.


* For "the latest thing", you may substitute, variously, neural networks, decision trees, support vector machines, random forests, GPUs, deep learning, or whatever emerges as next year's "next big thing".