Monday, April 25, 2011

Statistical Rules of Thumb, part II

A while back, Will Dwinnell posted on two books, one of which is one of my favorites as well:

Will mentioned a few general topics covered in the book, but I thought I would mention two specific ones that I agree with wholeheartedly.

7.3: Always Graph the Data
In this section he quotes E.R. Tufte as follows (Abbott quoting van Belle quoting Tufte):
Graphical Excellence is that which gives the viewer the greatest number of ideas in the shortest time with the least ink in the shortest space.

While I'm not so sure I agree with the superlatives, I certainly agree with the gist: excellence in graphics is parsimonious, clear, insightful, and informationally rich. Contrast this with another rule of thumb:

7.4: Never use a Pie Chart
Well, that's not exactly rocket science; pie charts have plenty of detractors. The only thing worse than a pie chart is a 3-D pie chart!

7.6: Stacked Barcharts are Worse than Bargraphs.
Perhaps the biggest problem with stacked bar graphs (such as the one here) is that you cannot see clearly the comparison between the colored values in the bins.

(a good summary of why they are problematic is in Stephen Few's Newsletter, which you can download here)

I have found that data shown in a chart like this can be shown better in a table, perhaps with some conditional formatting (in Excel) or other color coding to push the eye toward the key differences in values. For continuous data, this often means binning a variable (akin to the histogram) and creating a cross-tab. The key is clarity--make the table so that the key information is obvious.
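As a sketch of that approach, here is one way to bin a continuous variable and build a cross-tab in pandas (the data and column names here are hypothetical, purely for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a continuous predictor and a binary response
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=500),
    "responded": rng.choice(["yes", "no"], size=500, p=[0.3, 0.7]),
})

# Bin the continuous variable, akin to choosing histogram bins
df["age_bin"] = pd.cut(df["age"], bins=[18, 30, 45, 60, 80],
                       include_lowest=True)

# The cross-tab: counts per bin and outcome, easier to compare
# side by side than colored segments stacked on top of one another
xtab = pd.crosstab(df["age_bin"], df["responded"])
print(xtab)
```

From here, conditional formatting (in Excel) or simple color coding of the table cells can draw the eye to the key differences in values.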

Tuesday, April 19, 2011

Rexer Analytics data mining survey

Rexer Analytics, a data mining consulting firm, is conducting its 5th annual survey of the analytic behaviors, views, and preferences of data mining professionals. I urge all of you to respond to the survey and help us all better understand the nature of the data mining and predictive analytics industry. The following text contains their instructions and overview.

If you want to skip the verbiage and just get on with the survey, use code RL3X1 and go here.

Your responses are completely confidential: no information you provide on the survey will be shared with anyone outside of Rexer Analytics. All reporting of the survey findings will be done in the aggregate, and no findings will be written in such a way as to identify any of the participants. This research is not being conducted for any third party, but is solely for the purpose of Rexer Analytics to disseminate the findings throughout the data mining community via publication, conference presentations, and personal contact.

To participate, please click on the link below and enter the access code in the space provided. The survey should take approximately 20 minutes to complete. Anyone who has had this email forwarded to them should use the access code in the forwarded email.

Survey Link:
Access Code: RL3X1

If you would like a summary of last year's or this year's findings emailed to you, there will be a place at the end of the survey to leave your email address. You can also email us directly if you have any questions about this research or to request research summaries. Here are links to the highlights of the previous years' surveys. Contact us if you want summary reports from any of these years.
-- 2010 survey highlights:
-- 2009 survey highlights:
-- 2008 survey highlights:
-- 2007 survey highlights:

Thank you for your time. We hope this research program continues to provide useful information to the data mining community.


Karl Rexer, PhD

Monday, April 11, 2011

Predictive Models are not Statistical Models — JT on EDM

This post first appeared at Predictive Models are not Statistical Models — JT on EDM

My friend and colleague James Taylor asked me last week to comment on a question regarding statistics vs. predictive analytics. The bulk of my reply is on James' blog; my full reply is here, reworked from my initial response to clarify some points further.

I have always loved reading the green "Sage" books, such as Understanding Regression Assumptions (Quantitative Applications in the Social Sciences)
or Missing Data (Quantitative Applications in the Social Sciences) because they are brief, cover a single topic, and are well-written. As a data miner though, I am also somewhat amused reading them because they are obviously written by statisticians with the mindset that the model is king. This means that we either pre-specify a model (the hypothesis) or require the model be fully interpretable, fully representing the process we are modeling. When the model is king, it's as if there is a model in the ether that we as modelers must find, and if we get coefficients in the model "wrong", or if the model errors are "wrong", we have to rebuild the data and then the model to get it all right.

In data mining and predictive analytics, the data is king. These techniques often induce the model from the data (decision trees do this), or, even when they only fit coefficients (as neural networks do), it is the predictive accuracy that matters rather than the coefficients themselves. Often, in the data mining world, we don't have to explain precisely why individuals behave as they do so long as we can describe generally how they will behave. Model interpretation is often limited to describing trends (sensitivity or importance of variables).
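To make the contrast concrete, here is a small sketch of the data-is-king view using scikit-learn and synthetic data (the dataset and parameter choices are illustrative assumptions, not from the original post): the tree structure is induced from the data, the model is judged by holdout accuracy, and variable importances serve only as a rough descriptive summary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real modeling problem
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The tree is induced from the data; no model form is pre-specified
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

# Held-out accuracy is the yardstick...
print("holdout accuracy:", tree.score(X_test, y_test))

# ...while feature importances describe trends, not a causal story
print("importances:", tree.feature_importances_)
```

Nothing here asks whether the splits recover "the" true model in the ether; the question is simply whether the fitted model predicts well on data it has not seen.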

I have always found David Hand's summaries of the two disciplines very useful, such as this one here; I found that he had a healthy respect for both disciplines.