Tuesday, October 23, 2007

Follow-Up to: Statistics: Why Do So Many Hate It?

How would you run this regression?
A relationship between beer expenditure and income was tested. The relationship may be qualitatively affected by gender. How would you test the hypothesis that women spend less money on beer than men?

My guess is that this is a homework question, and that the teacher wants students to use a dummy variable to represent gender, so that a simple interpretation of gender's coefficient will reveal the answer.

In reality, of course, the interaction of income and gender may yield a more nuanced answer. What if two regressions were performed, one for men and the other for women, with income as the predictor and beer expenditure as the target, and the regression lines crossed? Such a result precludes so simple a response as "men spend more on beer".
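The crossing-lines scenario can be sketched in a few lines of Python. This is only an illustration: the data are invented, and the helper names (`fit_line`, `crossing_income`, `predict`) are hypothetical. Fitting a separate line for each gender yields the same coefficients as a single regression with a gender dummy plus an income-by-gender interaction term, so it is a convenient way to see what the interaction does.

```python
# Hypothetical data, invented purely for illustration.
# income: thousands of dollars per year; beer: dollars per month.
income = [10, 20, 30, 40, 50, 60]
beer_men = [14, 20, 30, 36, 46, 52]
beer_women = [24, 25, 30, 31, 36, 37]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def crossing_income(line_a, line_b):
    """Income at which two fitted lines meet, or None if they are parallel."""
    (a0, a1), (b0, b1) = line_a, line_b
    if a1 == b1:
        return None
    return (b0 - a0) / (a1 - b1)


def predict(line, x):
    """Predicted beer spending at income x for a fitted (intercept, slope)."""
    intercept, slope = line
    return intercept + slope * x


men = fit_line(income, beer_men)
women = fit_line(income, beer_women)
cross = crossing_income(men, women)

print("men:   intercept=%.2f slope=%.3f" % men)
print("women: intercept=%.2f slope=%.3f" % women)
print("lines cross at income of about %.1f thousand" % cross)

# The sign of the gender gap depends on which side of the crossing we look at.
for x in (15, 50):
    gap = predict(men, x) - predict(women, x)
    print("income %d: men minus women = %.2f" % (x, gap))
```

In this made-up sample the women's line sits above the men's at low incomes and below it at high incomes, so a single dummy-variable coefficient would average over two opposite effects.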

This question suggests another reason so many people hate statistics: its subtlety. The annoying thing about reality (which is the subject of statistical study) is that it is so complicated. Even things which seem simple will often reveal surprisingly complex behavior. The problem is that people don't want complicated answers. My response is that it is foolish to expect simple solutions to complicated problems, but the fundamental, irreducible complexity of reality, which is mirrored in statistics, also drives negative feelings toward statistics.

Wednesday, October 17, 2007

Statistics: Why Do So Many Hate It?

In Why is Statistics So Scary?, the Sep-26-2007 posting to the Math Stats And Data Mining Web log, the author wonders why so many people exhibit negative reactions to statistics.

I've had occasion to wonder about the same thing. I make my living largely from statistics, and have frequently received unfavorable reactions when I explain my work to others. Invariably, such respondents admit the great usefulness of statistics, so that is not the source of this negativity. I am certain that individual natural aptitude for this sort of work varies, but I do not believe that this accounts for the majority of negative feelings towards statistics.

Having received formal education in what I call "traditional" or "classical" statistics, and having since assisted others studying statistics in the same context, I suggest that one major impediment for many people is the total reliance by classical statisticians on a large set of very narrowly focused techniques. While they serve admirably in many situations, it is worth noting the disadvantages of classical statistical techniques:

1. Because they are so highly specialized, there are many of these techniques to remember.

2. It is also necessary to remember the appropriate applications of these techniques.

3. Broadly, classical statistics involves many assumptions. Violation of said assumptions may invalidate the results of these techniques.

Classical techniques were developed largely during a time without the benefit of rapid, inexpensive computation, which is very different from the environment we enjoy today.

The above were major motivations for me to embrace newer analytical methods (data mining, bootstrapping, etc.) in my professional life. Admittedly, newer methods have disadvantages of their own (not the least of which is their hunger for data), but it's been my experience that newer methods tend to be easier to understand, more broadly applicable and, consequently, simpler to apply.
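To give a flavor of how little machinery one of these newer methods needs, here is a minimal sketch of a percentile bootstrap confidence interval for a median. The sample is invented, the function name `bootstrap_ci` and its parameters are hypothetical, and the percentile interval is only one of several bootstrap variants.

```python
import random
import statistics

# Hypothetical sample, invented for illustration
# (say, monthly beer spending in dollars; note the skew).
sample = [12, 15, 18, 20, 22, 25, 27, 30, 34, 41, 55, 80, 120]


def bootstrap_ci(data, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(data).

    Resample the data with replacement, recompute the statistic each time,
    and read the interval off the sorted resampled statistics.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]


lower, upper = bootstrap_ci(sample, statistics.median)
print("sample median:", statistics.median(sample))
print("approximate 95% interval for the median:", (lower, upper))
```

The same function works unchanged for a mean, a trimmed mean, or any other statistic passed as `stat`, with no distributional formula to look up. The price is computation, plus the hunger for data noted above: with a sample this small the interval is rough.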

I think the broader educational question is: Would students be better served by one or more years of torture, imperfectly or incorrectly learning myriad methods which will soon be forgotten, or by the provision of a few widely useful tools and an elemental level of understanding?

Tuesday, October 16, 2007

See The World

I recently had the pleasure of attending the Insightful Impact 2007 conference, where I especially enjoyed a presentation on ensemble methods by two young, up-and-coming, aspiring data miners: Brian Siegel and his side-kick... Deke Abbott, or Dean Abner, or some such.

I am frequently asked what the best way is to learn about data mining (or machine learning, statistics, etc.). I get a great deal of information from reading, either books or white papers and reports which are available for free, on-line. Another great learning experience is attending conferences and trade shows. I don't travel a great deal and find it convenient to attend whatever free or cheap events happen to be close by. I also try to get to KDD when it's on the east coast of the United States. Aside from the presentations, events like these are an opportunity to get away from the muggles and spend some time with other data miners. I highly recommend it.

Nice job, Dean and Brian.