Tuesday, October 23, 2007

Follow-Up to: Statistics: Why Do So Many Hate It?

How would you run this regression?
A relationship between beer expenditure and income was tested. The relationship may be qualitatively affected by gender. How would you test the hypothesis that women spend less money on beer than men?

My guess is that this is a homework question, and that the teacher wants students to use a dummy variable to represent gender, so that a simple interpretation of gender's coefficient will reveal the answer.
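As a rough sketch of that dummy-variable approach, the Python below fits beer = b0 + b1*income + b2*female by ordinary least squares and forms a t statistic for the gender coefficient. Everything in it is assumed for illustration: the ten data rows are made up, the names (invert, ols, female) are mine, and the small matrix routines only stand in for whatever statistics package one would normally use.

```python
# Toy data, invented for illustration: (income in $1000s/month, female dummy, beer spend in $/month)
data = [
    (2.0, 0, 21.0), (3.0, 0, 24.0), (4.0, 0, 31.0), (5.0, 0, 34.0), (6.0, 0, 40.0),
    (2.0, 1, 14.0), (3.0, 1, 21.0), (4.0, 1, 24.0), (5.0, 1, 31.0), (6.0, 1, 35.0),
]


def invert(M):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    A = [list(row) + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [a / p for a in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [row[n:] for row in A]


def ols(X, y):
    """Return (coefficients, standard errors) for ordinary least squares."""
    n, k = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance
    se = [(sigma2 * inv[i][i]) ** 0.5 for i in range(k)]
    return beta, se


# Design matrix columns: intercept, income, female dummy (1 = woman, 0 = man)
X = [[1.0, income, float(female)] for income, female, _ in data]
y = [beer for _, _, beer in data]

beta, se = ols(X, y)
t_female = beta[2] / se[2]

print("intercept, income, female:", [round(b, 3) for b in beta])
print("t statistic for female coefficient:", round(t_female, 3))
```

On this toy data the gender coefficient comes out at about -5, meaning that at any given income the fitted line for women sits about five dollars below the line for men. The hypothesis in the question is one-sided, so the t statistic would be compared against the lower tail of a t distribution with n - 3 degrees of freedom. Note that this model forces the two lines to be parallel.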

In reality, of course, the interaction of income and gender may yield a more nuanced answer. What if two regressions were performed, one for men and the other for women, with income as the predictor and beer expenditure as the target, and the regression lines crossed? Such a result precludes so simple a response as "men spend more on beer".
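To make the crossing-lines scenario concrete, here is a minimal Python sketch that fits one line per gender and solves for the income at which the two fitted lines meet. The numbers are invented and deliberately noise-free so the crossing is easy to see, and fit_line is a hypothetical helper rather than a library routine. Fitting the two groups separately gives the same fitted lines as a single regression that includes a gender dummy and an income-by-gender interaction term.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


# Invented data: income in $1000s/month, beer spend in $/month
income = [1.0, 3.0, 5.0, 7.0]
beer_men = [15.0, 25.0, 35.0, 45.0]    # steep line: spending rises quickly with income
beer_women = [24.0, 28.0, 32.0, 36.0]  # flatter line that starts higher

a_m, b_m = fit_line(income, beer_men)
a_w, b_w = fit_line(income, beer_women)

# The lines cross where a_m + b_m*x == a_w + b_w*x (only defined if the slopes differ)
crossing = (a_w - a_m) / (b_m - b_w)

print("men:   intercept %.2f, slope %.2f" % (a_m, b_m))
print("women: intercept %.2f, slope %.2f" % (a_w, b_w))
print("lines cross at income = %.2f" % crossing)
```

Below the crossing income the women in this made-up sample outspend the men, and above it the men outspend the women, so "who spends more?" has no single answer without saying at what income.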

This question suggests another reason so many people hate statistics: its subtlety. The annoying thing about reality, which is the subject of statistical study, is that it is so complicated. Even things which seem simple will often reveal surprisingly complex behavior. The problem is that people don't want complicated answers. My response is that it is foolish to expect simple solutions to complicated problems. Still, the fundamental, irreducible complexity of reality, which is mirrored in statistics, also drives negative feelings toward statistics.

Dean Abbott said...

Your example also demonstrates one of the difficulties with linear regression models: they are global linear functions. If the trend doesn't exist globally, the model will miss this characteristic unless one is fortunate enough to find just the right transformation to linearize the relationship.

In the example of beer-spend by gender, a decision tree may (and I repeat "may") provide more insight because it can find local, homogeneous behavior.

Will Dwinnell said...

You're right, of course, and this only reinforces my point. I suggest that if the analysis better approximated reality by using a fancy transformation of the data, nonlinear regression or decision trees, the analysis would become too complex for some fraction of the population. They would feel frustrated by the complexity of the statistics, but only, I contend, because it reflects a reality which is more complex than they are willing or prepared to deal with.

Ralph Winters said...

I will also chime in and suggest that the hypothesis, as stated, actually has nothing to do with income; it relates only to gender and expenditure. It is easy to assume the income/spending correlation. This is just another illustration of how precise we need to be when asking questions. As Will stated, the subtleties are there, and as data miners we all need to take the time to clarify these kinds of things before we even start the analysis.

Ralph Winters

Anonymous said...

Excellent comments. The answer should be as simple as possible and no simpler; thus it doesn't have to be simple at all, but its complexity should represent the reality of the phenomenon under study.

Many people are simply looking for the simple answer but nothing that isn't simple will be...

Jay