Applied Data Science and Machine Learning: Yet another "Wisdom of Crowds" success

Friday, July 29, 2011

I was at the Federal Building in downtown San Diego for a consulting job and met some representatives of a life and disability insurance company who were giving away a big-screen HD TV to whoever came closest to guessing the number of M&Ms (chocolate and peanut butter filled) in a container. Because they do this often, I won't show the specific container they use.

I offered to make a guess of the total, but only if I could see all of the guesses so far. I was drawing on the Wisdom of Crowds example from Chapter 1 of the book, where the average of a set of independent guesses tends to outperform even an expert's best guess. I've run the same experiment many times in data mining courses I've taught, and have found the same phenomenon.

I collected data from 77 individuals (including myself) shown here (sorted for convenience, but this makes no difference in the analysis):
37
625
772
784
875
888
903
929
983
987
1001
1015
1040
1080
1080
1124
1245
1250
1450
1500
1536
1596
1600
1774
1875
1929
1972
1976
1995
2000
2012
2033
2143
2150
2200
2221
2235
2251
2321
2331
2412
2500
2500
2550
2571
2599
2672
2714
2735
2777
2777
2803
2832
2873
2931
3001
3101
3250
3333
3362
3500
3500
3501
3501
3583
3661
3670
3697
3832
3872
4280
4700
4797
5205
5225
5257
9886
10000
187952

Note there are a few flaky ones in the lot. The last two were easy to spot (so I put them at the bottom of my list). The idea, of course, is to just take the average of the guesses.

Average all: 4932
Average all without 37 and 187952: 2626

Then I looked at the histogram and decided that the guesses close to 10000 were also too flaky to include.

So I removed all data points greater than 8000, which took away 2 samples and left a mean of 2436.
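For anyone who wants to check the arithmetic, here is a minimal Python sketch using the 79 values exactly as listed above. The cutoffs (drop 37, drop everything above 8000) come from the text; the variable names are illustrative. On the list as printed, the overall mean rounds to 4932, while the trimmed mean comes out near 2423, the figure mentioned in the PS. Appending the two guesses mentioned in the PS (2550 and 3250), which is an assumption about how the 2436 figure was obtained, brings it to about 2436.

```python
from statistics import mean

# The 79 guesses exactly as listed in the post.
guesses = [
    37, 625, 772, 784, 875, 888, 903, 929, 983, 987,
    1001, 1015, 1040, 1080, 1080, 1124, 1245, 1250, 1450, 1500,
    1536, 1596, 1600, 1774, 1875, 1929, 1972, 1976, 1995, 2000,
    2012, 2033, 2143, 2150, 2200, 2221, 2235, 2251, 2321, 2331,
    2412, 2500, 2500, 2550, 2571, 2599, 2672, 2714, 2735, 2777,
    2777, 2803, 2832, 2873, 2931, 3001, 3101, 3250, 3333, 3362,
    3500, 3500, 3501, 3501, 3583, 3661, 3670, 3697, 3832, 3872,
    4280, 4700, 4797, 5205, 5225, 5257, 9886, 10000, 187952,
]

# Cutoffs from the text: drop the lone low guess (37) and everything above 8000.
trimmed = [g for g in guesses if g != 37 and g <= 8000]

# Assumption: the two guesses mentioned in the PS are appended to the trimmed list.
with_ps_guesses = trimmed + [2550, 3250]

mean_all = round(mean(guesses))
mean_trimmed = round(mean(trimmed))
mean_with_ps = round(mean(with_ps_guesses))

print(len(guesses), mean_all)              # 79 4932
print(len(trimmed), mean_trimmed)          # 75 2423
print(len(with_ps_guesses), mean_with_ps)  # 77 2436
```

Hard-coded cutoffs keep the sketch close to what the post describes; a percentile-based trim would be the more mechanical alternative.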

So now for the outcome:
Actual Count: 2464
Average of trimmed sample: 2436 (error 28)
Best individual guess: 2500 (error 36)
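The comparison above can be sketched in a few lines of Python. This is only an illustration: the actual count (2464) and the crowd average (2436) are copied from the post rather than recomputed, the guesses are the 79 values as listed, and the variable names are made up for the example.

```python
ACTUAL = 2464         # actual count, as reported in the post
CROWD_AVERAGE = 2436  # trimmed mean, as reported in the post

# The 79 guesses exactly as listed in the post.
guesses = [
    37, 625, 772, 784, 875, 888, 903, 929, 983, 987,
    1001, 1015, 1040, 1080, 1080, 1124, 1245, 1250, 1450, 1500,
    1536, 1596, 1600, 1774, 1875, 1929, 1972, 1976, 1995, 2000,
    2012, 2033, 2143, 2150, 2200, 2221, 2235, 2251, 2321, 2331,
    2412, 2500, 2500, 2550, 2571, 2599, 2672, 2714, 2735, 2777,
    2777, 2803, 2832, 2873, 2931, 3001, 3101, 3250, 3333, 3362,
    3500, 3500, 3501, 3501, 3583, 3661, 3670, 3697, 3832, 3872,
    4280, 4700, 4797, 5205, 5225, 5257, 9886, 10000, 187952,
]

# Absolute error of every individual guess, smallest first.
errors = sorted((abs(g - ACTUAL), g) for g in guesses)
best_error, best_guess = errors[0]

# How many people made a guess with that same smallest error.
n_best = sum(1 for g in guesses if abs(g - ACTUAL) == best_error)

# Error of the crowd average and the number of individuals who beat it.
crowd_error = abs(CROWD_AVERAGE - ACTUAL)
n_closer = sum(1 for g in guesses if abs(g - ACTUAL) < crowd_error)

print(best_guess, best_error, n_best)  # 2500 36 2
print(crowd_error, n_closer)           # 28 0
```

No individual guess on the list lands closer than the average, and the best individual guess appears twice.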

So, amazingly, the average won, though I wouldn't have been disappointed at all if it had finished 3rd or 4th, because it still would have been a great guess.

Wisdom of Crowds wins again!

PS The guess I actually reported to the insurance agents was 2423, because at that point I had omitted my own original guess (2550, if you must know, made before I looked at any other guesses) and my co-worker's guess of 3250. Including them brings the mean up a bit, to 2436. The average would have lost (barely) if I had not included them: 2423 misses by 41, against 36 for the best individual guess.

PPS So how will they split the winnings, since two people guessed the same value? I won't recommend the saw approach. I hope they ask each of the two guessers to modify their guess, and require that each change be at least one.

PPPS Note: the charts were done using JMP Pro 9 for the Macintosh

7 comments:

Thomas said...: shouldn't you rather compare the median and not the average in this case?

btw, your 'wisdom of crowds' link puts the book directly in the shopping cart. great way to finance this blog huh? ;); 6:53 PM
Dean Abbott said...: The median is fine to use, though in this case it would do worse.

Sorry about the shopping cart--I didn't realize it was doing that (I'll fix it). And, I don't get any money from amazon any more for links on this blog thanks to the California legislature, so I *wish* it were the case that amazon could finance the blog. :); 7:04 PM
Will Dwinnell said...: 1. Interesting topic, and thanks for providing the actual data.

2. Other summaries might be preferred to the mean and median, such as a (mechanically) trimmed mean: The mean being sensitive to data which misbehaves, and the median suffering from weak statistical efficiency (in many common circumstances).; 7:39 AM
Dean Abbott said...: Will--fully agree. The final answer for me was exactly a mechanically trimmed mean where the top 3 and bottom 1 entries were removed because they were such extreme outliers.

If the data isn't skewed after removing the outliers, the mean and median should be similar. If the data is skewed at this point, there is something else wrong because people don't typically guess 'skewed'.

For this data, the skew is only 0.5 (kurtosis is -0.15), so while there are some differences in the final guess, mean guess is 2436 and median guess is 2331. The median would have tied for 7th--still pretty good.; 7:58 AM
Data Mining for Intelligence said...: Interesting analysis. Btw, I got a copy of Wisdom of Crowds in kindle...; 11:31 PM
ranjini said...: My cousin recommended this blog and she was totally right keep up the fantastic work!

Embedded Systems Course; 3:36 AM
sourcing from china said...: I think Your offer is well but make some changes for this, Over all your post in nice!

sourcing from china; 10:34 AM
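As a footnote to the mean-versus-median exchange in the comments above, here is a small Python sketch of the median calculation. It rests on an assumption: the pool is the printed list with the same cutoffs as in the post (drop 37 and everything above 8000), plus the two guesses mentioned in the PS (2550 and 3250). Under that assumption the median is 2331 and it would tie for 7th, which agrees with the figures quoted in the comments. Names are illustrative.

```python
from statistics import median

ACTUAL = 2464  # actual count, as reported in the post

# The 79 guesses exactly as listed in the post.
guesses = [
    37, 625, 772, 784, 875, 888, 903, 929, 983, 987,
    1001, 1015, 1040, 1080, 1080, 1124, 1245, 1250, 1450, 1500,
    1536, 1596, 1600, 1774, 1875, 1929, 1972, 1976, 1995, 2000,
    2012, 2033, 2143, 2150, 2200, 2221, 2235, 2251, 2321, 2331,
    2412, 2500, 2500, 2550, 2571, 2599, 2672, 2714, 2735, 2777,
    2777, 2803, 2832, 2873, 2931, 3001, 3101, 3250, 3333, 3362,
    3500, 3500, 3501, 3501, 3583, 3661, 3670, 3697, 3832, 3872,
    4280, 4700, 4797, 5205, 5225, 5257, 9886, 10000, 187952,
]

# Assumption: same cutoffs as the post, plus the two guesses from the PS.
pool = [g for g in guesses if g != 37 and g <= 8000] + [2550, 3250]

median_guess = median(pool)
median_error = abs(median_guess - ACTUAL)

# Rank of the median among individual guesses (1 = closest to the actual count).
n_closer = sum(1 for g in pool if abs(g - ACTUAL) < median_error)
median_rank = n_closer + 1

print(len(pool), median_guess, median_error)  # 77 2331 133
print(median_rank)                            # 7
```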
