Applied Data Science and Machine Learning: data mining software

Wednesday, May 28, 2008

What data mining software to buy?

This post (http://www.dmreview.com/issues/2007_46/10001040-1.html?portal=analytics) is an interesting example of the assessment of analytics software. The key paragraph is the conclusion, where Mr. Raab states:

Instead of a horserace between product features, this approach puts the focus where it should be: on value to your business. It recognizes that the value of a new tool depends on the other tools already available, and it forces evaluation teams to explicitly study the impact of different tools on different users. By creating a clearer picture of how each new tool will impact the way work actually gets done within the company, it leads to more realistic product assessments and ultimately to more productive selection choices.

I couldn't agree more. For the past 10 years, since the Elder and Abbott review of data mining software presented at KDD-98 (on my web site), I've tried to think of ways to summarize data mining software. The obvious way is by features, such as which algorithms a product has. The usability of a tool is another characteristic to add, as John, Philip Matkovsky and I wrote in "An Evaluation of High-End Data Mining Tools for Fraud Detection". I've also described the different packages by the kind of interface (wizard, menu-driven, block-diagram, command line, etc.).

It's not easy to provide a summary in this multi-dimensional view of data mining tools. Sounds like an opportunity for predictive modeling!

Thursday, April 17, 2008

Data Mining survey

Karl Rexer of Rexer Analytics conducted an extensive survey of data miners in 2007, and reported on those results here at Quirks.com (a site I had never heard of before--unfortunately, you have to register to see it).

This is not to be confused with their 2008 survey, whose results I expect will be out soon.

A few interesting items in the survey results:

• Correspondingly, the most commonly used algorithms are regression (79 percent), decision trees (77 percent) and cluster analysis (72 percent). Again, this reflects what we have seen in our own work. Regression certainly remains the algorithm of choice for large sections of the academic community and within the financial services sector. More and more data miners, however, are using decision trees, and cluster analysis has long been the bedrock of the marketing community.

I find it interesting in and of itself that academics are participating in a data mining survey, and I don't mean that in a negative way. I have viewed data mining more as a business-centric way of thinking, and to have regression advocates participate in a survey of this type is a good sign. Of course, it could also mean that business folks don't have the time to fill out surveys :)

• SPSS, SPSS Clementine, and SAS are the three most frequently utilized analytic tools and were each used in 2006 by more than 40 percent of data miners. Forty-five percent of data miners also employed their own code in 2006. Respondents were asked about 26 different software packages from the powerhouses above to less-visible and -utilized packages such as Chordiant, Fair Isaac and KXEN.

Clementine usually shows up at the top of the KDnuggets survey, and I've never been sure whether that was because of the typical KDnuggets user or because it reflected true general use in the data mining community. This survey gives further evidence that its use is more widespread. The fact that SPSS and SAS are the other two shows the dominance of statisticians or academicians in the survey. I rarely find heavy SPSS or SAS users among technical business analysts.

• Comparisons of reported 2006 use and planned 2007 use show that there is increasing interest in the Oracle Data Mining tool, and decreasing interest in C4.5/C5.0/See5. It will be interesting to see how these trends develop over time and if other tools find greater prominence in the future.

I concur from my experience. I would put SQL Server in that category as well. I think the C4.5 popularity was largely due to licensing.

• The primary factors data miners consider when selecting an analytic tool are: 1) the dependability and stability of software, 2) the ability to handle large data sets, and 3) data manipulation capabilities. Data miners were least interested in the reputation of the software and the software’s compatibility either with other programs or with software used by colleagues.

This looks like the responses of technical people--very much common sense. I wonder what decision makers would say? I would think reputation ranks much higher among them.

• The top challenges facing data miners are dirty data, data access and explaining data mining to others. Over three-quarters of data miners listed dirty data as one of the major challenges that they face. This is again consistent with our own experience and the conventional wisdom discussed at data mining conferences: a significant proportion of most projects consist of data understanding, data cleaning and data preparation.

No surprises here! However, once one goes through this process, its importance is reduced (because it is solved).

Thanks to Rexer Analytics for putting this and the 2008 survey together. I'm looking forward to those results.
