Tuesday, November 06, 2012

Why Predictive Modelers Should be Suspicious of Statistical Tests (or why the Redskin Rule fools us)

Well, the danger is really not the statistical test per se; it is the interpretation of the statistical test.

Yesterday I tweeted (@deanabb) this fun factoid: "Redskins predict Romney wins POTUS #overfit. if Redskins lose home game before election => challenger wins (17/18) http://www.usatoday.com/story/gameon/2012/11/04/nfl-redskins-rule-romney/1681023/" I frankly had never heard of this "rule" before and found it quite striking. It even has its own Wikipedia page (http://en.wikipedia.org/wiki/Redskins_Rule).

For those of us in the predictive analytics or data mining community, and those of us who use statistical tests to help interpret small data, we know that 17/18 is a hugely significant finding. This can frequently be good: statistical tests help us gain intuition about the value of relationships in data even when they aren't obvious.

In this case, an appropriate test is a chi-square test based on the two binary variables (1) did the Redskins win on the Sunday before the general election (call it the input or predictor variable) vs. (2) did the incumbent political party win the general election for President of the United States (POTUS).

According to the Redskins Rule, the two outcomes have lined up in 17 of 18 elections since 1940. Could this be by chance? If we apply the chi-square test to it, it sure does look significant! (chi-square = 14.4, p < 0.001). I like the decision tree representation of this that shows how significant it is (built using the Interactive CHAID tree in IBM Modeler on Redskin Rule data I put together here):

It's great data--9 Redskin wins, 9 Redskin losses, great chi-square statistic!
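As a rough check on the numbers quoted above, here is a minimal Python sketch of the chi-square calculation. The post gives only the margins (9 wins, 9 losses) and the 17-of-18 hit rate, so the exact cell counts below are an assumption: the single miss is placed in the Redskins-loss row. Placing it in the other row gives the same statistic. The helper names are mine, and no continuity correction is applied.

```python
import math

# Assumed 2x2 table: rows = Redskins result, columns = incumbent-party result.
# Only the margins and the 17/18 hit rate come from the post; putting the
# single miss in the Redskins-loss row is an assumption.
table = [
    [9, 0],  # Redskins won:  incumbent party won, incumbent party lost
    [1, 8],  # Redskins lost: incumbent party won, incumbent party lost
]


def chi_square_2x2(t):
    """Pearson chi-square statistic for a 2x2 table, no continuity correction."""
    (a, b), (c, d) = t
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_df1(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


chi2 = chi_square_2x2(table)
p = p_value_df1(chi2)
print(f"chi-square = {chi2:.1f}, p = {p:.6f}")
```

Under these assumptions the statistic comes out at 14.4, matching the figure above. Note that the expected cell counts here are only 4 to 5, so the chi-square approximation is itself rough at this sample size.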

OK, so it's obvious that this is just another spurious correlation in the spirit of all of those fun examples in history, such as the Super Bowl winning conference predicting whether the stock market would go up or down in the next year, at a stunning 20 of 22 correct. It was even the subject of academic papers!

The broader question (and concern) for predictive modelers is this: how do we recognize when the correlations we have uncovered in the data are merely spurious? This can happen especially when we don't have deep domain knowledge and therefore wouldn't necessarily identify variables or interactions as spurious. In examples such as the election or stock market predictions, no amount of "hold out" samples, cross-validation or bootstrap sampling would uncover the problem: it is in the data itself.

We need to think about this because inductive learning techniques search through hundreds, thousands, even millions of variables and combinations of variables. The phenomenon of "over searching" is a real danger with inductive algorithms as they search and search for patterns in the input space. Jensen and Cohen have a very nice and readable paper on this topic (PDF here). For trees, they recommend using the Bonferroni adjustment which does help penalize the combinatorics associated with splits. But our problem here goes far deeper than overfitting due to combinatorics.
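To make the Bonferroni idea concrete, here is a small Python sketch of the generic textbook adjustment: multiply the raw p-value by the number of candidates searched and cap the result at 1. This is not necessarily the exact procedure in the Jensen and Cohen paper. The raw p-value of 0.00015 is my approximation for chi-square = 14.4 with one degree of freedom, and the candidate counts and the 0.05 threshold are illustrative assumptions.

```python
def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-adjusted p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


raw_p = 0.00015  # approximate unadjusted p-value for chi-square = 14.4, 1 df
alpha = 0.05     # illustrative significance threshold

# Hypothetical numbers of candidate variables searched before finding the "rule".
for n_candidates in (1, 100, 1000, 10000):
    adjusted = bonferroni_adjust(raw_p, n_candidates)
    verdict = "significant" if adjusted < alpha else "not significant"
    print(f"{n_candidates:>6} candidates: adjusted p = {adjusted:.4f} ({verdict})")
```

The same raw p-value that looks overwhelming for a single pre-specified test stops being significant once it is treated as the best of a thousand candidates.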

Of course the root problem with all of these spurious correlations is small data. Even when we have lots of data, some algorithms, such as decision trees, rule induction and nearest neighbor (i.e., algorithms that build bottom-up), make decisions based on smaller populations; I'll call this the "illusion of big data". Anytime decisions are made from populations of 15, 20, 30 or even 50 examples, there is a danger that our search through hundreds of variables will turn up a spurious relationship.

What do we do about this? First, make sure you have enough data so that these small-data effects don't bite you. This is why I strongly recommend doing data audits and looking for categorical variables that contain levels with at most dozens of examples--these are potential overfitting categories.
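A data audit of this kind can be sketched in a few lines of Python. The function name, the example column, and the cutoff of 50 are all hypothetical; the right threshold depends on your data and on how many variables are being searched.

```python
from collections import Counter


def sparse_levels(values, min_count=50):
    """Return {level: count} for levels that occur fewer than min_count times."""
    counts = Counter(values)
    return {level: n for level, n in counts.items() if n < min_count}


# Hypothetical categorical column: two common levels and two rare ones.
state = ["CA"] * 4000 + ["TX"] * 2500 + ["WY"] * 30 + ["VT"] * 12

# Levels flagged here are candidates for grouping, or at least for extra
# skepticism about any pattern a model finds in them.
print(sparse_levels(state, min_count=50))
```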

Second, don't hold strongly to any patterns discovered in your data based solely on the data, especially if they are based on relatively small sample sizes. These must be validated with domain experts. Decision trees are notorious for allowing splits deep in the trees that are "statistically significant" but dangerous nevertheless because of small data sizes.

Third, the gist of your models has to make sense. If it doesn't, put on your "Freakonomics" hat and dig in to understand why the patterns were detected by the models. In our Redskin Rule, clearly this doesn't make sense causally, but sometimes the pattern picked up by the algorithm is just a surrogate for a real relationship.

Nevertheless, I'm still curious to see if the Redskin Rule will prove to be correct once again. This year it predicts a Romney win because the Redskins lost and therefore the incumbent party (D) by the rule should lose.

UPDATE: by way of comparison...the chance of having 17/18 or 18/18 coin flips turn up heads (or tails--we're assuming a fair coin after all!) is about 7 in 100,000, or roughly 1 in 14,000. Put another way, if we examined 14K candidate variables unrelated to POTUS trends, we would expect about one of them to line up 17/18 or 18/18 of the time. Unusual? Yes. Impossible? No!
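The coin-flip arithmetic can be reproduced with a short Python sketch. It treats each candidate variable as an independent fair coin that must match at least 17 of the 18 election outcomes; the independence of the candidates is an assumption made for illustration.

```python
from math import comb

n_elections = 18

# Probability that a fair coin matches at least 17 of 18 outcomes.
p_hit = sum(comb(n_elections, k) for k in (17, 18)) / 2 ** n_elections

n_candidates = 14_000  # unrelated candidate variables, the 14K figure in the post
expected_hits = n_candidates * p_hit

# Chance that at least one candidate lines up, assuming independence.
p_at_least_one = 1 - (1 - p_hit) ** n_candidates

print(f"P(17 or 18 of 18) = {p_hit:.6f} (about 1 in {1 / p_hit:,.0f})")
print(f"expected spurious hits among {n_candidates:,}: {expected_hits:.2f}")
print(f"P(at least one spurious hit) = {p_at_least_one:.2f}")
```

With about one expected hit among 14K candidates, the chance of seeing at least one such "rule" is better than even.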


Sandro Saitta said...

Very interesting article! About spurious correlation, what do you think of this study: http://www.psychologytoday.com/blog/the-scientific-fundamentalist/201010/why-intelligent-people-drink-more-alcohol

Dean Abbott said...

All I know is that as an undergraduate, the heaviest drinkers were not the best students!

Do you think this is a spurious correlation (the alcohol article)? It is surprising to me. I do wonder how they controlled for so many factors in a single study and what the quality of the data is (was it self-report on alcohol consumption?). I'll think about this one though...

A second item that comes to mind is a New Yorker Magazine article entitled "The Truth Wears Off: Is there something wrong with the scientific method?" that describes how many studies accepted as true (with statistically significant findings) could not be replicated later. (see http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer)

Anonymous said...

When you say two binary variables do you mean two dichotomous variables? In that case wouldn't you use Cramer's V or Phi based on whether or not it was a 2 x 2 (contingency or cross tabulation) table?

Dean Abbott said...

Anonymous: Isn't Cramer's V for a 2x2 essentially the same as a chi-square test? I think it is (V is the square root of the chi-square statistic divided by the sample size).

And yes, I use the terminology binary--I'm not a statistician by training, and it seems to me that statisticians are more inclined to use "dichotomous" than applied mathematicians (where I had my start).

Peter Bruce said...

A simple, but fundamental, related analogy is this: Suppose a person told you that she could toss a coin 15 times and have it come up heads each time, and then she did it on her first try. You would think that pretty remarkable. Now suppose you asked all 35,000 people in Fenway Park to toss a coin 15 times, and one of them raises his hand and says he has 15 heads. Not remarkable at all (the odds are in favor of it happening, in fact).

Dean Abbott said...

exactly--nice example.

Your example is a case where human intuition is good; most everyone I think would agree that with 35K people, the result is unremarkable.

Other times human intuition is not so good, like the amazement people have when a room of 23 people contains 2 with the exact same birthday (month/day). Not amazing at all, but it seems to be.

Dean Abbott said...

An interesting NPR story on intuition and statistics is here (note: the interview is about 9 minutes). It has very interesting examples on predicting whether someone in jail is likely to reoffend once on parole.

Peter Bruce said...

Another NPR story: Several years ago, after a woman won the Texas lottery for the second time, an NPR reporter called to interview me about how unusual this was. I explained that it would be very unusual, indeed, for a given individual to buy two tickets and have them both be big winners - about like having a meteorite hit your bus. However, I went on to explain that the probability of this happening to some person at some time in some state was very much higher. In fact, I pointed out, it had indeed happened before. When I heard the edited interview on the air, however, all that came through was the bit about the bus being hit by the meteor. The larger point was either not understood by the reporter, or not deemed to be the story of interest.

Spencer said...

Good post. I had never heard of the Redskin Rule before.

I think the key is to keep in mind your 2nd and 3rd points: Consult with domain experts. If you or they can't find a reasonable explanation for the relationship, it's probably bunk!