Monday, March 20, 2017

A Question of Resource Allocation

Of the resources consumed in data mining projects, the most precious (read: "expensive") is time, especially the time of the human analyst. Hence, a significant question for the analyst is how best to allocate his or her time.

Long and continuing experience indicates clearly that the most productive use of time in such work is that dedicated to data preparation. I apologize if this seems like an old topic to the reader, but it is an important lesson which seems to be forgotten annually, as each new technical development presents itself. A surprising number of authors, particularly the on-line variety, come to the conclusion that "the latest thing"* will spare us from needing to prepare and enhance the data.

I offer, as yet another data point in favor of careful data preparation, a recent conversation I had with a colleague. He and a small team conducted parallel modeling efforts for a shared client. Using the same base data, they constructed separate predictive models. His model and theirs achieved similar test performance. The team used a random forest, while he used logistic regression, one of the simplest modeling techniques. The team was perplexed at the similarity in model performance. My associate asked them how they had handled missing values. They responded that they filled them in. He asked exactly how they had filled the missing values. The response was that they set them all to zeros (!). By not taking the time and effort to comprehensively address this issue, they had forced their model to do the significant extra work of filling in these gaps itself. Consider that some fraction of their data budget was spent on fixing this mistake, rather than being used to create a better model. Note, too, that it is far easier (less code, fewer input variables to monitor, less to go wrong) to deploy a modestly-sized logistic regression than any random forest.
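
To make the cost of zero-filling concrete, here is a minimal sketch in Python. It is not the team's code or my colleague's, neither of which is described in any detail above; the column `ages`, the helper `impute_with_indicator`, and the choice of median filling are all assumptions made purely for illustration. The idea is that a zero looks like a legitimate value, so a model must discover on its own that some zeros mean "unknown", whereas a median fill plus an explicit missing flag hands it that information directly.

```python
from statistics import median


def impute_with_indicator(values, fill="median"):
    """Fill None gaps in a numeric column and flag which rows were missing.

    Hypothetical helper for illustration only.
    fill="median" uses the median of the observed values;
    any other setting falls back to the zero-fill from the anecdote.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")

    fill_value = median(observed) if fill == "median" else 0.0

    filled = [fill_value if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]  # 1 = value was missing
    return filled, flags


# Made-up column: customer ages with two missing entries.
ages = [34, None, 51, 29, None, 46]

# Zero-fill: the gaps become implausible "age 0" records.
zero_filled, _ = impute_with_indicator(ages, fill="zero")

# Median fill plus a flag the model can use as an extra input variable.
median_filled, was_missing = impute_with_indicator(ages, fill="median")

print(zero_filled)    # [34, 0.0, 51, 29, 0.0, 46]
print(median_filled)  # [34, 40.0, 51, 29, 40.0, 46]
print(was_missing)    # [0, 1, 0, 0, 1, 0]
```

Median filling is only one of several reasonable choices, and the right one depends on why the values are missing. The explicit flag is the part that spares the model from having to rediscover the gaps for itself.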

Given this context, it is curious to note that so much of what is published (again, especially on-line; think of titles such as: "The 10 Learning Algorithms Every Data Scientist Must Know") and so many job listings emphasize, almost to the point of exclusivity, learning algorithms, as opposed to practical questions of data sampling, data preparation and enhancement, variable reduction, solving the business problem (instead of the technical one) or the ability to deploy the final product.

* For "the latest thing", you may fill in, variously, neural networks, decision trees, SVM, random forests, GPUs, deep learning or whatever comes out as next year's "next big thing".


Jane said...

If the most expensive resource is the time of the human analyst, how is letting the model do the significant extra work of filling in the gaps of the missing data itself a bad thing? I do not condone this as good practice (it’s terrible, I agree), but the time of the human analyst is saved when they don’t have to deal with the missing values themselves.

Of course, I am assuming that the time the model takes to fill in these missing values is no longer than it would take for the human analyst to do something more clever than filling in the missing values with zeros himself. If the results of both models achieve similar test performance, there isn’t much motivation to deal with the missing values in a better way.

I do agree that it is more important to understand the data you are working with and the scalability of your solution than it is to know of different algorithms you can throw at it. But how do you think that this importance can be conveyed over the shouting about the “latest thing”? Should data scientists focus more on statistics? Does the tone of the entire science need to change? Even though it has been proven that no universal learner exists (No Free Lunch), it seems like that is what everyone is trying to find.

Brandon said...

Why do so many people focus, as you say, “almost to the point of exclusivity,” on the learning algorithms, especially the latest and the greatest? As you point out, this is a problem, but I don’t think it is limited to the narrow scope of data mining algorithms. Consider this statement by Karl Popper: “we are not students of some subject matter, but students of problems. And problems may cut right across the borders of any subject matter or discipline.” While each of us has chosen a specific discipline or subject matter, ultimately we are, or at least should be, seeking to advance mankind’s knowledge, solving the problems we face, and discovering the problems that we do not yet know exist.

There can sometimes exist a tendency to become myopic. When we solve a particular problem that relates to our discipline, we get excited and start to carve out in our minds what it means to belong to discipline X. Over time, as more problems are solved and papers are published, we think we understand our discipline, where its boundaries lie, and which problems do not belong. We begin to limit ourselves to the study of a set of permissible problems and we begin to accept only certain types of solutions, in our case, “the latest thing.” This is a great danger and we need to remember what Karl Popper said. For mankind to continue its pace of advancement and learning, we must seek out and embrace the study of problems whose solutions span the learning of multiple disciplines.

Unknown said...

"Long and continuing experience indicates clearly that the most productive use of time in such work is that dedicated to data preparation." -

I recently participated in the Kaggle Toxic Comments competition, and I was really surprised by two things: 1. everyone in the top 1000 entries or so had above 95% accuracy, and 2. the teams who won basically used the same models as everyone else. The thing that gave them their competitive edge was data preparation. They augmented their comments by translating them into different languages in order to get more data. They assigned pseudo-labels to the test data because they noticed the test and train sets followed pretty different distributions. While the rest of us focused on throwing another model at the data and ensembling a large number of models, the winners took time to prepare a solid dataset.

I've really enjoyed following the "Tidy Data" movement in the R community. Tools like Weka / scikit-learn / MLR are making it much easier to chuck data into the "latest thing" model, and there are some really awesome tools in the tidyverse that facilitate manipulating data into formats. But I see intelligent data preparation as a major aspect of data mining that will be much more difficult to automate.
