Thursday, May 24, 2007

KDnuggets 2007 Poll

The frenzy surrounding the annual KDnuggets software poll is finally over. The results are available at:

Data Mining / Analytic Software Tools (May 2007)

A number of statistical issues have been raised regarding this particular survey, but I will highlight only one here: The survey now includes separate counts for votes cast by people who voted for a single item and by those who voted for multiple items. This is partly a response to "get out the vote" efforts made by some vendors.

Anyway, some interesting highlights:

1. Free tools made a good showing. In the lead among free tools: Yale (103 votes).
2. "Your Own Code" (61 votes) did respectably well.
3. Despite not having data mining-specific components, MATLAB (30 votes), which is my favorite tool, was more popular than a number of well-known commercial data mining tools.


Jeff Zanooda said...

Another interesting question would be "what's your N"? The size of the data affects the choice of algorithms and tools, and one would expect very different results from people working with relatively small datasets (hundreds or thousands of observations) vs. large datasets (tens of thousands or more observations).

Sandro Saitta said...

I definitely agree with Jeff, and I think the same remark also applies to where you work (i.e. research or industry). For example, in research, people usually use tools such as Matlab, Excel, R, Yale, Weka, etc. simply because they are either free or the university will pay for them. Instead of separating commercial and free data mining software, it would be interesting to see the results of two polls, separating research and industry users.

Will Dwinnell said...

You both make good points. It'd also be interesting to divide users by things like 'number of models deployed' or 'total economic value of models deployed'.

For the record, I work in industry and I use MATLAB.

Sandro Saitta said...

I really love Matlab, but it is surprising that it is used in industry. I'm curious about how you deal with the license issue. I mean, not every company has a Matlab license or would like to buy one, right? How do you handle working with companies that are not using Matlab but are interested in having a program or function they can use/modify?

Will Dwinnell said...

I got my company to buy a (Windows) license. MATLAB on the desktop is cheap, especially compared to some of the data mining applications on the market. Annual maintenance is very cheap.

Dean Abbott said...

Are there royalties for using Matlab source code in production?

Will Dwinnell said...

Typically, I deploy models as generated source code, which means no licensing issues.

Anonymous said...

Do you use the matlab code generation products or translate the model by hand?

Dean Abbott said...

Matlab can export its own code to C (I don't know about other languages)--I've done this before and the code was actually pretty good (and bug free).

It would be quite a painful experience to do it by hand!

Anonymous said...

Hi Dean,

I saw that Mathworks had some add-ons for code generation. Did base Matlab do this type of export when you used it? Was it the desktop version?

Will Dwinnell said...

The MathWorks (the MATLAB vendor) offers tools for converting MATLAB code for use in other platforms, but this is not what I'm talking about.

Anyone using MATLAB for data mining will be writing some of their own code anyway, and discovered models will be stored in variables which are accessible to the MATLAB programmer. Given this situation, it is trivial to write a MATLAB routine to generate some text which is code. The idea is that the data miner's MATLAB program fits the model, and then generates (MATLAB or non-MATLAB) code implementing the model (as text output).

Imagine any reasonably capable programming environment in which one might fit models (like MATLAB, C, Pascal, Java, etc.). Once the coefficients are discovered for, say, a logistic regression, how hard is it to write a routine to loop over the predictors, write them out with a multiplication sign and a formatted coefficient and then spit out a standard chunk of text which performs the logistic transform at the end? Other models will be more complicated, but are still easy to spit out mechanically. Meaningful comments (variable summaries, the date of code generation, etc.) are also easy to add dynamically.
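
To make the logistic regression case concrete, here is a minimal sketch of such a generator. It is written in Python rather than MATLAB purely for illustration, and it emits C text. The function name generate_logistic_code, its parameters, the predictor names and the coefficient values are all hypothetical, not taken from any real model or tool.

```python
from datetime import date


def generate_logistic_code(intercept, coefficients, func_name="score", generated_on=None):
    """Return C source text implementing a fitted logistic regression.

    `coefficients` maps predictor name -> fitted coefficient. This layout
    is an assumption made for the sketch, not any particular tool's format.
    """
    generated_on = generated_on or date.today().isoformat()
    names = list(coefficients)
    args = ", ".join(f"double {name}" for name in names)

    lines = [
        # Dynamically generated comments: generation date and variable summary.
        f"/* Logistic regression model, generated {generated_on} */",
        f"/* Predictors: {', '.join(names)} */",
        "#include <math.h>",
        "",
        f"double {func_name}({args})",
        "{",
        f"    double z = {intercept:.10g};",
    ]
    # Loop over the predictors, writing each one out with a
    # multiplication sign and a formatted coefficient.
    for name, coef in coefficients.items():
        lines.append(f"    z += {coef:.10g} * {name};")
    # Standard chunk of text that performs the logistic transform.
    lines += [
        "    return 1.0 / (1.0 + exp(-z));  /* logistic transform */",
        "}",
    ]
    return "\n".join(lines)


# Example with made-up coefficients; the output is plain C text.
print(generate_logistic_code(-1.5, {"age": 0.03, "income": 2e-05},
                             generated_on="2007-05-24"))
```

The generated text has no dependency on the modeling environment, which is why deploying it raises no licensing questions. Other target languages would only need different string templates.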

Ralph Winters said...

I'm bemused at the simplistic poll question "What tool(s) did you use in 2007?"

What does "use" mean?

"use" could mean anything from having an evaluation version of a product, to using it on a daily basis.

Not very informative, to me at least.

Ralph Winters

Gregory Piatetsky-Shapiro said...

Thank you for the suggestions - good ideas for future polls. I am the first to admit that the KDnuggets annual software poll is not perfect, but at least it is interesting.
Multi-question surveys tend to produce a much smaller response, so I am trying to formulate an interesting poll using a 1-question format.

Gregory Piatetsky-Shapiro

Anonymous said...

We recently completed a 27-item survey of data miners (N=314). We will complete the analysis soon. The question of whether people with different data set sizes choose different tools is an excellent suggestion, and one that we will address.

Anonymous said...

The KD-Nuggets "polls" are unscientific and subject to selection bias. So the results are very unreliable and practically meaningless. Only a scientific poll based on statistical sampling will reveal meaningful results and trends.

Dean Abbott said...

A response to some of the critiques:

While the polls are not controlled (they are self-reported and subject to bias from the KDNuggets readership and the "get out the vote" efforts of vendors), I wouldn't say they are meaningless. They do reflect the interests of the KDNuggets crowd, and that in and of itself is interesting.

Karl's poll (the results of which will hopefully be published soon) will be interesting, but this too will have some biases because it is self-initiated, not interviewed, and it is long enough that there will be some dropouts. But that's the nature of polling and surveys anyway.

Regarding the critique of the word "use", I don't have a problem with that word per se, though you (Ralph) are correct in pointing to the vague nature of the word. Getting some idea of the frequency of use may overcome some of that problem (daily, weekly, monthly). However, that reflects on the people who use the software--I take the poll as measuring (however imperfectly) the software installations that get used, and don't just sit there on the shelf.