Thursday, April 17, 2008

Data Mining survey

Karl Rexer of Rexer Analytics conducted an extensive survey of data miners in 2007, and reported on those results here at Quirks.com (a site I had never heard of before--unfortunately, you have to register to see it).

This is not to be confused with their 2008 survey, whose results I expect will be out soon.

A few interesting items in the survey results:

• Correspondingly, the most commonly used algorithms are regression (79 percent), decision trees (77 percent) and cluster analysis (72 percent). Again, this reflects what we have seen in our own work. Regression certainly remains the algorithm of choice for large sections of the academic community and within the financial services sector. More and more data miners, however, are using decision trees, and cluster analysis has long been the bedrock of the marketing community.


I find it interesting in and of itself that academics are participating in a data mining survey, and I don't mean that in a negative way. I have viewed data mining more as a business-centric way of thinking, and to have regression advocates participate in a survey of this type is a good sign. Of course it could also mean that business folks don't have the time to fill out surveys :)

• SPSS, SPSS Clementine, and SAS are the three most frequently utilized analytic tools and were each used in 2006 by more than 40 percent of data miners. Forty-five percent of data miners also employed their own code in 2006. Respondents were asked about 26 different software packages from the powerhouses above to less-visible and -utilized packages such as Chordiant, Fair Isaac and KXEN.


Clementine usually shows up at the top of the KDnuggets survey, and I've never been sure whether that was because of the typical KDnuggets user or because it reflected true general use in the data mining community. This gives further evidence that its use is more widespread. The fact that SPSS and SAS are the others shows the dominance in the survey of statisticians or academicians. I rarely find heavy SPSS or SAS users among technical business analysts.

• Comparisons of reported 2006 use and planned 2007 use show that there is increasing interest in the Oracle Data Mining tool, and decreasing interest in C4.5/C5.0/See5. It will be interesting to see how these trends develop over time and if other tools find greater prominence in the future.


I concur from my experience. I would put SQL Server in that category as well. I think the C4.5 popularity was largely due to licensing.

• The primary factors data miners consider when selecting an analytic tool are: 1) the dependability and stability of software, 2) the ability to handle large data sets, and 3) data manipulation capabilities. Data miners were least interested in the reputation of the software and the software’s compatibility either with other programs or with software used by colleagues.


This looks like the responses of technical people--very much common sense. I wonder what decision makers would say? I would think reputation ranks much higher among them.

• The top challenges facing data miners are dirty data, data access and explaining data mining to others. Over three-quarters of data miners listed dirty data as one of the major challenges that they face. This is again consistent with our own experience and the conventional wisdom discussed at data mining conferences: a significant proportion of most projects consist of data understanding, data cleaning and data preparation.


No surprises here! However, once one goes through this process, its importance is reduced (because it is solved).

Thanks to Rexer Analytics for putting this and the 2008 survey together. I'm looking forward to those results.

4 comments:

Anonymous said...

Thanks for the summary and your comments. Recently discovered your blog, cool stuff! Kudos to you, awesome work.

Sandro Saitta said...

"Of course it could also mean that business folks don't have the time to fill out surveys :)"

But they still have time to blog ;-)

More seriously, thanks for the summary. Usually (though of course it's not a universal truth), the techniques used in industry now are the ones that were developed 20 years ago in academia. So I find the result regarding the use of regression in academia a bit strange (it certainly means academia as a whole and not data mining and machine learning researchers).

GS said...

As part of my research, I used the WEKA data mining tool in grad school. I could download the code for decision trees or classification algorithms and make my own changes, which I did.

Industry-specific tools would cater to industry-related priorities such as scalability, volume of data sets, ease of use (you don't need to know how decision trees work internally, but you can use them easily), visualization of results with graphs, etc. (which are lacking in free open source software such as WEKA).

Data mining concepts and algorithms originated in academia. Heavy research is being carried out on some of these algorithms, how they can be applied to different domains, and so on. Academia focuses on academic contributions. With respect to the survey, I think that, more than using the tools and finding out how good they are, academic contributions might involve contributions to the algorithms behind the tools.