## Tuesday, January 09, 2007

### Free Data Mining Software Poll Results, and notes on Sample Size

I inadvertently closed the poll and couldn't figure out how to reopen it. Since it had already been up for a week, I decided to leave it closed.

The results are:
WEKA: 11 (55%)
YALE: 4 (20%)
R: 3 (15%)
Custom: 1 (5%)
Other: 1 (5%)
Total Votes: 20

But is there anything significant? Is WEKA significantly more popular than YALE or R? Well, this is outside of my expertise--after all, the word "significant" is rarely used in data mining circles :)--but it seems to me that the answer is "yes". Why?

By starting with the standard sample size formula, and using the WEKA percentage as the hypothesis (55%, or 0.55), we are only 68% confident that this 55% can be achieved, even with a sample size of 25 (larger than the 20 votes I actually had). So, on its own, it is not a particularly significant finding that WEKA is more popular than the other tools.
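Here is a rough sketch of the arithmetic behind that 68% figure. It assumes that the "standard sample size formula" means the usual normal-approximation formula for a proportion, n = z^2 * p * (1 - p) / E^2, solved here for the margin E. The function name `margin_of_error` and the choice of z = 1 for roughly 68% confidence are my own labels for illustration, not something taken from the poll software.

```python
import math


def margin_of_error(p, n, z):
    """Half-width E of the normal-approximation interval p +/- E.

    Rearranged from n = z^2 * p * (1 - p) / E^2 (an assumed reading of
    the 'standard sample size formula').
    """
    return z * math.sqrt(p * (1.0 - p) / n)


p_weka = 0.55  # WEKA's share of the votes
z_68 = 1.0     # about 68% of a normal distribution lies within 1 standard error

# Compare the actual number of votes (20) with the sample size of 25.
for n in (20, 25):
    e = margin_of_error(p_weka, n, z_68)
    print(f"n={n}: 55% +/- {100 * e:.1f}% at roughly 68% confidence")
```

Under these assumptions the margin is about 10 percentage points at n = 25 and about 11 points at n = 20, which is why the 55% by itself is not very convincing.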

Plugging in the numbers for just WEKA and YALE (if that were the extent of the survey, forcing everyone to vote between just those two, which of course did not happen, but play along for a bit...), where the difference was 55% to 20%, we find that for a sample size of 15 (11 votes + 4 votes), we would have been more than 99% confident that the 55% +/- 35% can be achieved.
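The "more than 99%" figure can be reproduced with a small sketch, again assuming the normal-approximation formula n = z^2 * p * (1 - p) / E^2, this time solved for z and converted to a two-sided confidence level. The helper name `confidence_for_margin` is hypothetical, and this is only one plausible reading of the calculation, not a rigorous test.

```python
import math
from statistics import NormalDist  # standard library, Python 3.8+


def confidence_for_margin(p, n, margin):
    """Two-sided confidence that the true proportion lies within p +/- margin.

    Uses the normal approximation: z = margin / sqrt(p * (1 - p) / n).
    """
    standard_error = math.sqrt(p * (1.0 - p) / n)
    z = margin / standard_error
    return 2.0 * NormalDist().cdf(z) - 1.0


# WEKA at 55%, a margin of 35 points (55% vs. 20%), and 11 + 4 = 15 votes.
conf = confidence_for_margin(p=0.55, n=15, margin=0.35)
print(f"confidence: {100 * conf:.1f}%")
```

With these inputs the z value comes out a bit above 2.7, which corresponds to a confidence level just over 99%. With only 15 votes the normal approximation is crude, so treat this as a back-of-the-envelope check.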

I'll try another poll once the numbers coming to this blog go up a bit. Thanks for participating!

#### 2 comments:

John Aitchison said...

The significance of differences between percentages from the same sample can get somewhat complicated, depending on exactly what hypothesis you are trying to test and on how many alternatives there are. If there are more than two, you have a multinomial distribution.

Have a look at
http://www.pollster.com/mystery_pollster/when_is_a_lead_really_a_lead.php
for a fairly accessible discussion of when a lead is really a lead.

And you can also look at my blog posting on how you would use simulation in this situation.

John Aitchison said...

oops, forgot to put in the URL

http://dsanalytics.com/dsblog/why-simulation-is-better-than-statistics_80

also, blogger gurus, why is it that I cannot put my website URL when I post as "other" without it getting mangled by blogger? Do I have to start up a blog on blogger to get it done right? thanks