Monday, June 13, 2011

What do Data Miners Need to Learn?

I've been asked by several folks recently what they need to learn to succeed in data mining and predictive analytics. This is a different twist on the question I also get, namely what degree should one get to be a good (albeit "green") data miner. Usually, the latter question gets the answer "it doesn't matter" because I know so many great data miners without a statistics or mathematics degree. Understandably, there are many non-stats/math degrees that have a very strong statistics or mathematics component, such as psychology, political science, and engineering to name a few. But then again, you don't necessarily have to load up on the stats/math courses in these disciplines either.

So the question of "what to learn" applies across majors, whether undergraduate or graduate. Of course, statistics and machine learning courses are directly applicable. However, the answer I've been giving recently to the question of what new data miners need to learn (assuming they will learn algorithms) has centered on two other topics: databases and business.

I had no specific coursework or experience in either when I began my career. In the 80s, databases were not as commonplace in the DoD world where I began my career; we usually worked with flat files provided to us by a customer, even if these files were quite large. Now, most customers I work with have their data stored in databases or data marts, and as a result, we data miners often must lean on DBAs or an IT layer of people to get at the data. This would be fine except that (1) the data that is provided to data miners is often not the complete data we need or at least would like to have before building models, (2) we sometimes won't know how valuable data is until we look at it, and (3) communication with IT is often slow and laden with political issues inherent in many organizations.

On the other hand, IT is often reluctant to give analysts significant freedom to query databases because of the harm they can do (wise!): data miners in general have a poor understanding of how databases work and which queries are dangerous or computationally expensive.

Therefore, I am increasingly of the opinion that a master's program in data mining, or a data mining certificate program, should contain at least one course on databases. That course should include at least some database design component, but for the most part should emphasize a user's perspective. It is probably more realistic to require this for a degree than for a certificate, but it could be included in both. I know that for me, in considering new hires, a candidate with SQL or SAS experience would have an advantage.
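To make the point about expensive queries a little more concrete, here is a minimal sketch of the kind of user's-perspective habit such a course might teach: asking the database how it plans to run a query before running it. It uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN statement. The table, column, and index names (transactions, customer_id, amount, idx_customer) and the helper functions are made up for illustration, and the exact wording of the plan text varies by SQLite version; other databases have their own equivalents.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory table with an index on customer_id (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    rows = [(i, i % 100, float(i % 7)) for i in range(1, 1001)]
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_customer ON transactions (customer_id)")
    return conn


def query_plan(conn, sql, params=()):
    """Return the human-readable plan steps for a query without executing it."""
    cur = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    # The last column of each returned row is the plan description.
    return [row[-1] for row in cur.fetchall()]


conn = build_demo_db()

# Filtering on the indexed column: the plan should mention the index.
print(query_plan(conn, "SELECT * FROM transactions WHERE customer_id = ?", (42,)))

# Filtering on an unindexed column: the plan should show a scan of the whole table.
print(query_plan(conn, "SELECT * FROM transactions WHERE amount > ?", (5.0,)))
```

On a table with a thousand rows the difference is invisible, but on a production table with hundreds of millions of rows a full scan is exactly the sort of query that makes IT nervous.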

For the second issue, business experience, some might be concerned that "experience" is too narrow for a degree program. After all, if someone has experience in building response models, what good would that do for PayPal if they are looking to build fraud models? My reply is "a lot"! Building models on real data (meaning messy) to solve a real problem (meaning identifying a target variable that conveys the business decision to be improved) requires a thought process that isn't related to knowing algorithms or data.

Building "real-world" models requires a translation of business objectives to data mining objectives (as described in the Business Understanding section of CRISP-DM, pdf here). When I have interviewed young data miners in the past, it is those who have had to go through this process that are better prepared to begin the job right away, and it is those who recognize the value here who do better at solving problems in a way that impacts decisions rather than finding cool, innovative solutions that never see the light of day. (UPDATE: the crisp-dm.org site is no longer up--see comments section. The CRISP-DM 1.0 document however can still be downloaded here, with higher resolution graphics, by the way!)

My challenge to the universities that are adding degree programs in data mining and predictive analytics, or are offering certificate programs, is then to include courses on how to access data (databases) and how to solve problems (business objectives, perhaps by offering a practicum with a local company).

20 comments:

dlayne76 said...

Hi Dean,

The new MS in Predictive Analytics at Northwestern includes database design as well as project management. I am considering the program, although with 10 years of software development and database experience, this part of the program may be redundant for me. Actually, if you had some free time, I would love to hear your initial impression of the program.

http://www.predictive-analytics.northwestern.edu/curriculum/

Best,

Deron

Dean Abbott said...

Deron:

Interestingly, I just had my first contact with Northwestern a few weeks ago, which in part prompted this post. I'm waiting to hear more from them directly about the program and what they plan on covering. I also recently had a conversation with a market research company about another university that is planning a master's program in statistics and analytics, where we discussed the exact same issues. Algorithms are the easy part (relatively).

Applying algorithms to specific business problems (business understanding in the CRISP-DM terminology) and the pull/push from/to the database are more challenging logistically.

I'll be happy to post on what I find out.

Eric Flores said...

Excellent article. I was just confused by the "or" in 'SQL or SAS', as one is not a substitute for the other.

I relate the Business Understanding part to the typical IS Business Analysis, which takes care of pulling the idea from the business user's brain (sometimes in the hard way) and stating it as discrete, testable requirements.

I have found that most projects in the MS in Data Mining at CCSU require you to build Business Analysis skills. Projects resemble a real business situation, and students collaborate to reach a solution (even though all work is individual).

http://web.ccsu.edu/datamining/master.html

The course in Database Systems is an elective course - which I appreciate. Having over 15 years of experience designing database systems, I prefer to spend that time learning SAS or Web Mining.

sewa mobil said...

Nice article, thanks for the information.

Kristin said...

Hi Dean,

Way back when I did my MS at NCSU, we took a year of consulting. Each stat student was assigned to a stat prof who consulted around the university with people conducting research for publication. It was invaluable. I learned how to reframe questions to get answers, how to figure out the true goal of the problem, and that the analysis the client thinks they want is not always what will shed light on the problem...but sometimes you have to do it their way and another way to get them to see the light. I also learned to pose questions in terms of the expected insights being sought, such as, when we complete the analysis if the answer is 'x' what decision will you make? Alternatively, if the answer is 'y' what decision will you make? And, what is preventing you from making that decision now, without any analysis?

Krish said...

Very nice article, Dean. I liked the callout for understanding technical subjects like database design as an important skill for data mining aspirants to have.

Building on the same theme, another important area of understanding is overall systems design and, particularly, the robustness of systems. Ultimately a data mining solution will presumably fit within an IT production environment, and these environments typically operate under constraints such as processing speed, storage space, and speed of update. They also have upstream and downstream dependencies that a data miner would do well to be aware of. A well-designed course would at least spend some time on the systems engineering aspects.

Best, Krish

Anonymous said...

Looks like the link in the post (Business Understanding section of CRISP-DM) is broken.

Dean Abbott said...

"Anonymous": Less than a week ago, it appears that the CRISP-DM.org web site was redirected to IBM. SPSS owned the domain for all of this time, and I assume that IBM took ownership of the domain after the acquisition. Why crisp-dm.org has been taken down, I don't know (though am trying to find out).

I'll update the post to include the new URL to the CRISP-DM document (which looks like it has been "freshened up"). It is here: ftp://ftp.software.ibm.com/software/analytics/spss/support/Modeler/Documentation/14/UserManual/CRISP-DM.pdf

Dean

Anonymous said...

I agree: there is no course that will teach you how to translate business objectives into a computational algorithm. But I believe that a deep understanding of kernel methods is a must for a serious data miner.

Anonymous said...

Nice article. We're iterating on a new predictive analytics tool and looking for early adopters to join the private beta. Our initial use case is to "predict anything" about underbanked/thin file consumers. If interested, learn more at http://bit.ly/q5C0IZ

Arch said...

Hi Dean, What is your suggestion for data engineers who want to do some of the predictive analysis that you have outlined here? We have a number of Comp Sc engineers with strong db and data warehousing backgrounds who don't have enough of a stats background to do forecasting and analytics. They do basic reporting just fine and are very fast at getting the data and understanding it. But demand smoothing, time series analysis, Hadoop, etc.: how does one best go about learning this? Thanks for any pointers you can provide.

Arch said...

Also, the course at Northwestern is exactly what we need, but it's a full-blown MS and very costly, and we don't have a company match. Is there some certificate program in predictive analysis and stats that someone could take? We contacted Northwestern, and this course is brand new, in its first year now, so there are no alumni for us to recruit from yet.

ranjini said...

I actually enjoyed reading through this posting. Many thanks.





Amr said...

Hi Dean,

I'm exploring the possibility of making a career transition into data mining and predictive analytics (I've worked in bioinformatics in a genomic sequencing center for ten years).

I was wondering if you'd be willing to exchange a few e-mails with me to help with my exploration.

If so, please e-mail me at osiris1975 at verizon dot net or leave me an address I can contact you at here. Thanks!

Clark Miners said...

Your article is very good and very useful for us. Thank you for giving us such useful and valuable information. May you continue to provide information and insight that is always helpful to us.

Regards

Clark Miners

Anonymous said...

One thing I would recommend is some teaching of the difference between machine learning and parametric statistics. The null hypothesis and best performance of a properly structured SVM bound the real predictive capability of the data, assuming the problem is set correctly. Also, a lesson in sensitivity, specificity, and positive and negative predictive value adds a lot to one's perspective.

Dean Abbott said...

Anonymous:

Parametric methods vs. machine learning in many ways is at the core of the differences in the great statistics vs. data mining divide! I do think that it is useful to understand the strengths and weaknesses of both (though I am clearly in the ML camp).

Regarding sensitivity and specificity, these are almost always covered in a good data mining course (often in discussions of ROC curves).
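As a rough illustration of the four quantities mentioned in this exchange, here is a minimal sketch that computes them from the cells of a binary confusion matrix. The function name and the counts are hypothetical, chosen only to show the arithmetic; a cell total of zero is returned as None rather than raising an error.

```python
def classification_rates(tp, fp, tn, fn):
    """Compute sensitivity, specificity, PPV, and NPV from confusion-matrix counts.

    tp, fp, tn, fn are the counts of true positives, false positives,
    true negatives, and false negatives.
    """
    def ratio(numerator, denominator):
        # Avoid dividing by zero when a row or column of the matrix is empty.
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of actual positives that are caught
        "specificity": ratio(tn, tn + fp),  # share of actual negatives correctly passed
        "ppv": ratio(tp, tp + fp),          # share of flagged cases that are truly positive
        "npv": ratio(tn, tn + fn),          # share of unflagged cases that are truly negative
    }


# Made-up counts: 100 actual positives and 900 actual negatives.
rates = classification_rates(tp=80, fp=90, tn=810, fn=20)
for name, value in rates.items():
    print(name, round(value, 3))
```

With these made-up counts, sensitivity and specificity both look strong, yet fewer than half of the flagged cases are true positives, which is the kind of perspective the commenter is pointing to when the positive class is rare.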

Anonymous said...

I started as a consultant at the "economics of the firm" level and found that understanding a business problem in terms of an "optimize subject to constraints" problem has been invaluable. Data scientists need to learn how to consult understanding out of the intuitive experts.

jesus said...




Thanks for your excellent guide man






Unknown said...

Excellent post. If anyone's interested in learning more about data center management software please check out the resource below.

Data center management software