Monday, March 20, 2017

A Question of Resource Allocation

Of the resources consumed in data mining projects, the most precious (read: "expensive") is time, especially the time of the human analyst. Hence, a significant question for the analyst is how best to allocate his or her time.

Long and continuing experience indicates clearly that the most productive use of time in such work is that dedicated to data preparation. I apologize if this seems like an old topic to the reader, but it is an important lesson which seems to be forgotten annually, as each new technical development presents itself. A surprising number of authors, particularly the on-line variety, come to the conclusion that "the latest thing"* will spare us from needing to prepare and enhance the data.

I offer, as yet another data point in favor of this perspective, a recent conversation I had with a colleague. He and a small team conducted parallel modeling efforts for a shared client. Using the same base data, they constructed separate predictive models. His model and theirs achieved similar test performance. The team used a random forest, while he used logistic regression, one of the simplest modeling techniques. The team was perplexed at the similarity in model performance. My associate asked them how they had handled missing values. They responded that they filled them in. He asked exactly how they had filled the missing values. The response was that they set them all to zeros (!). By not taking the time and effort to comprehensively address this issue, they had forced their model to do the significant extra work of filling in these gaps itself. Consider that some fraction of their data budget was spent on fixing this mistake, rather than being used to create a better model. Note, too, that it is far easier (less code, fewer input variables to monitor, less to go wrong) to deploy a modestly-sized logistic regression than any random forest.
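
To make the contrast with zero-filling concrete, here is a minimal sketch in Python of one common alternative: replace each gap with the median of the observed values and keep a separate flag recording where the gaps were. It is an illustration only, and not necessarily what the colleague in the story did. The function name `impute_with_indicator` and the `income` column are hypothetical, and missing entries are assumed to be represented as `None`.

```python
from statistics import median


def impute_with_indicator(values):
    """Fill missing entries (None) with the median of the observed values.

    Returns the filled column plus a parallel 0/1 list marking which
    entries were originally missing, so that information is not lost.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing


# Hypothetical column (income in thousands) with two gaps.
income = [42.0, None, 58.0, 61.0, None, 39.0]

# What the team in the story did: every gap becomes a zero.
zero_filled = [0.0 if v is None else v for v in income]

# The alternative sketched here: median fill plus a missingness flag.
imputed, was_missing = impute_with_indicator(income)

print(zero_filled)
print(imputed)
print(was_missing)
```

In this toy column, zero-filling makes the two gaps look like customers with no income at all and drags the column mean down to about 33.3. Median imputation leaves the mean at 50.0, and the flag lets even a simple model treat "missing" as information in its own right.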

Given this context, it is curious to note that so much of what is published (again, especially on-line; think of titles such as: "The 10 Learning Algorithms Every Data Scientist Must Know") and so many job listings emphasize learning algorithms, almost to the point of exclusivity, as opposed to practical questions of data sampling, data preparation and enhancement, variable reduction, solving the business problem (instead of the technical one) or the ability to deploy the final product.

* For "the latest thing", you may fill in, variously, neural networks, decision trees, SVM, random forests, GPUs, deep learning or whatever comes out as next year's "next big thing".


Jane said...

If the most expensive resource is the time of the human analyst, how is letting the model do the significant extra work of filling in the gaps left by the missing data a bad thing? I do not condone this as good practice (it’s terrible, I agree), but the time of the human analyst is saved when they don’t have to deal with the missing values themselves.

Of course, I am assuming that the time the model takes to fill in these missing values is no longer than it would take for the human analyst to do something more clever than filling in the missing values with zeros himself. If the results of both models achieve similar test performance, there isn’t much motivation to deal with the missing values in a better way.

I do agree that it is more important to understand the data you are working with and the scalability of your solution than it is to know of different algorithms you can throw at it. But how do you think that this importance can be conveyed over the shout of the “latest thing”? Should data scientists focus more on statistics? Does the tone of the entire science need to change? Even though a universal learner has been proven not to exist (No Free Lunch), it seems like that is what everyone is trying to find.

Brandon said...

Why do so many people focus, as you say, “almost to the point of exclusivity,” on the learning algorithms, especially the latest and the greatest? As you point out, this is a problem, but I don’t think it is limited to the narrow scope of data mining algorithms. Consider this statement by Karl Popper: “we are not students of some subject matter, but students of problems. And problems may cut right across the borders of any subject matter or discipline.” While each of us has chosen a specific discipline or subject matter, ultimately we are, or at least should be, seeking to advance mankind’s knowledge, solving the problems we face, and discovering the problems that we do not yet know exist.

There can sometimes exist a tendency to become myopic. When we solve a particular problem that relates to our discipline, we get excited and start to carve out in our minds what it means to belong to discipline X. Over time, as more problems are solved and papers are published, we think we understand our discipline, where its boundaries lie, and which problems do not belong. We begin to limit ourselves to the study of a set of permissible problems and we begin to accept only certain types of solutions, in our case, “the latest thing.” This is a great danger and we need to remember what Karl Popper said. For mankind to continue its pace of advancement and learning, we must seek out and embrace the study of problems whose solutions span the learning of multiple disciplines.

Unknown said...

"Long and continuing experience indicates clearly that the most productive use of time in such work is that dedicated to data preparation."

I recently participated in the Kaggle Toxic Comments competition, and I was really surprised by two things: 1. everyone in the top 1000 entries or so had above 95% accuracy, and 2. the teams who won basically used the same models as everyone else. The thing that gave them their competitive edge was data preparation. They augmented their comments by translating them into different languages in order to get more data. They assigned pseudo-labels to the test data because they noticed the test and train sets followed pretty different distributions. While the rest of us focused on throwing another model at the data and ensembling a large number of models, the winners took time to prepare a solid dataset.

I've really enjoyed following the "Tidy Data" movement in the R community. Tools like Weka / scikit-learn / MLR are making it much easier to chuck data into the "latest thing" model, and there are some really awesome tools in the tidyverse that facilitate manipulating data into formats. But I see intelligent data preparation as a major aspect of data mining that will be much more difficult to automate.
