Friday, November 04, 2011

Statistical Rules Of Thumb, part III: Always Visualize the Data

While rereading Statistical Rules of Thumb, as I do from time to time, I came across this gem: always visualize the data. (Note: I live in CA, so I get no money from these Amazon links.)

Van Belle uses the term "Graph" rather than "Visualize", but it is the same idea. The point is to visualize the data in addition to computing summary statistics. Summaries are useful but can be deceiving: any time you summarize data you lose some information unless the distributions are well behaved. A scatterplot, histogram, or box-and-whiskers plot can reveal the ways the summaries fool you. I've seen this myself, especially with variables that have outliers or that are bi- or tri-modal.
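To make the bi-modal case concrete, here is a minimal sketch using synthetic data (nothing here comes from the book or from a real project). It builds one unimodal and one bimodal sample with roughly the same mean and standard deviation, then prints a crude text histogram of each. The mode locations, sample size, seed, and bin layout are all arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 10_000

# Unimodal: standard normal. Bimodal: an equal mix of two narrow normals
# centered at -0.9 and +0.9, with spread chosen so the overall variance
# is about 1 (0.9**2 + 0.19 = 1.0).
unimodal = [rng.gauss(0, 1) for _ in range(n)]
bimodal = [rng.gauss(rng.choice([-0.9, 0.9]), 0.19 ** 0.5) for _ in range(n)]

def histogram(data, lo=-3.0, hi=3.0, bins=12):
    """Count values per equal-width bin on [lo, hi); values outside are ignored."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in data:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return counts

for name, data in [("unimodal", unimodal), ("bimodal", bimodal)]:
    print(f"{name}: mean={mean(data):.2f} sd={stdev(data):.2f}")
    for i, count in enumerate(histogram(data)):
        left_edge = -3.0 + 0.5 * i
        print(f"{left_edge:5.1f} | " + "#" * (count // 50))
```

The two summary lines should look almost the same, while the second histogram shows a dip in the middle that the mean and standard deviation cannot express.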

One of the most famous examples of this effect is Anscombe's Quartet. I'm including the Wikipedia image of the plots here:

All four datasets have nearly identical x means, y means, x standard deviations, y standard deviations, x-y Pearson correlation coefficients, and regression lines of y on x, so the summaries reveal none of the differences in the data.
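The matching summaries are easy to check. The sketch below uses the standard published values of Anscombe's four datasets and computes each summary with only the Python standard library; the helper name `summarize` is my own, not something from the book.

```python
from statistics import mean, stdev

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
anscombe = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return the summary statistics that Anscombe's datasets share."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)  # sample standard deviations
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / sx ** 2        # least-squares slope of y on x
    return {
        "mean_x": mx, "mean_y": my, "sd_x": sx, "sd_y": sy,
        "r": cov / (sx * sy),    # Pearson correlation
        "slope": slope, "intercept": my - slope * mx,
    }

summaries = {name: summarize(x, y) for name, (x, y) in anscombe.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

Rounded to two decimals, the four rows come out the same, even though the scatterplots look nothing alike.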

I use correlations a lot to get the gist of the relationships in the data, and I've seen how correlations can deceive. In one project, we had 30K data points with a correlation of 0.9+. When we removed just 100 of these data points (those with the largest magnitudes of x and y), the correlation shrank to 0.23.
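The effect is easy to reproduce with made-up numbers. The sketch below is not the project data; it is a synthetic stand-in in which 29,900 weakly related points are joined by 100 extreme points lying close to the line y = x. The coefficients, ranges, and seed are assumptions chosen only to show the same kind of collapse.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable
points = []

# Bulk of the data: a weak linear relationship buried in noise.
for _ in range(29_900):
    x = rng.gauss(0, 1)
    points.append((x, 0.2 * x + rng.gauss(0, 1)))

# A small number of extreme points that fall almost exactly on y = x.
for _ in range(100):
    x = rng.choice([-1, 1]) * rng.uniform(30, 100)
    points.append((x, x + rng.gauss(0, 1)))

r_all = pearson(*zip(*points))

# Drop the 100 points farthest from the origin, then recompute.
trimmed = sorted(points, key=lambda p: p[0] ** 2 + p[1] ** 2)[:-100]
r_trimmed = pearson(*zip(*trimmed))

print(f"all {len(points)} points: r = {r_all:.2f}")
print(f"after removing 100:  r = {r_trimmed:.2f}")
```

A scatterplot of `points` would show the problem immediately: a dense blob near the origin and a thin scattering of far-off points that drive the correlation on their own.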

Most data mining software now makes it easy to visualize data. Avail yourself of these tools to avoid surprises in your data later.

