Wednesday, August 15, 2007

The latest Y2K bug--and why mean values don't tell the whole story

I was interested in the recent hubbub over surface temperatures, first written up in NASA's Daily Tech and then picked up by other news sources. (Note: the article doesn't render well for me in Firefox, but it is fine in IE.)

However, I found this article from the Climate Audit blog, which describes the data, even more interesting. From a data mining / statistics perspective, it was the distribution of the errors that caught my attention. I had read in the media (sorry, I don't remember where) that the Y2K error in the data produced an average error of 0.15 deg. C, which didn't seem too bad. But the blog's author describes the errors as (1) bimodal, (2) positively skewed (hence the positive average error), and (3) typically much larger than 0.15 deg. C. So while the average doesn't look bad, the individual surface temperature errors are indeed significant.

Once again, averages can mask data issues. Better to augment averages with other metrics, or better yet, visualize!
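To make the point concrete, here is a small Python sketch. The numbers are made up for illustration (they are NOT the actual GISS station errors); I only chose them so that the mean comes out near 0.15 deg. C while the distribution is bimodal and positively skewed. The function names are my own, not from any library.

```python
import statistics

# Synthetic, made-up errors in deg. C (an assumption for illustration,
# not real station data): a larger cluster below zero and a smaller
# cluster well above zero.
errors = [-0.30] * 30 + [-0.20] * 30 + [0.70] * 20 + [0.80] * 20


def summarize(values):
    """Return the mean alongside metrics that reveal what it hides."""
    mean = statistics.mean(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "pstdev": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
        # Fraction of values whose magnitude exceeds the magnitude of the mean.
        "share_bigger_than_mean": sum(abs(v) > abs(mean) for v in values)
        / len(values),
    }


def histogram_counts(values, bins=6):
    """Count values in equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would land in bin index `bins`, so clamp it.
        idx = min(int((v - lo) / step), bins - 1)
        counts[idx] += 1
    return counts


def text_histogram(values, bins=6, width=40):
    """Render a crude text histogram, one line per bin."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    counts = histogram_counts(values, bins)
    peak = max(counts)
    lines = []
    for i, count in enumerate(counts):
        left_edge = lo + i * step
        bar = "#" * round(width * count / peak)
        lines.append(f"{left_edge:+.2f} | {bar} {count}")
    return "\n".join(lines)


for name, value in summarize(errors).items():
    print(f"{name:>24}: {value:.3f}")
print(text_histogram(errors))
```

With this toy data the mean looks harmless, but the median sits in the lower cluster and every single value is larger in magnitude than the mean. The text histogram shows the two clumps with an empty gap between them, which no single summary number would reveal.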

Anonymous said...

Yes, I've been following this a little too, and it is fascinating on so many levels--sampling plan, data quality, analysis, and interpretation. And the debate is tinged with the normal religious fervor you would expect.

A lot of the data I'm working with now is (heavily) positively skewed, so means don't give a useful picture of it. My impression is that the analysis of the GISS data also used moving averages.