I'd like to revisit an issue we covered here back in 2007: Statistics: Why Do So Many Hate It?. Recent comments made to me, both in private conversation ("Statistics? I hated that class in college!") and in print, prompt me to reconsider this issue.
One thing which occurs to me is that many people have a tendency to think of statistics in an isolated way. This world view keeps statistics at bay, as something which is done separately from other business activities, and, importantly, which is done and understood only by the statisticians. This is very far from the ideal which I suggest, in which statistics (including data mining) are much more integrated with the business processes of which they are a part.
In my opinion, this is a strange way to frame statistics. As an analogy, imagine if, when asked to produce a report, a business team turned to their "English guy" with the expectation that he would do all the writing. I am not suggesting that everyone needs to do the heavy lifting that data miners do, but that everyone should accept some responsibility for data mining's contribution to the business process. Managers, for example, who throw up their hands with the excuse that they are "not numbers people" forfeit control over an important part of their business function. It is healthier for everyone involved, I submit, if statistics moves away from being a black art, and statisticians become less of an arcane priesthood.
Sunday, March 06, 2011
Thursday, October 28, 2010
A humorous explanation of p-values
After Will's great post on sample sizes, which referenced the YouTube video entitled Statistics vs. Marketing, I found an equally funny and informative explanation of p-values here.
Aside from the esoteric explanations of what a p-value is, there is a point I make often with customers: statistical significance (from p-values) is not the same thing as operational significance. Just because you find a p-value of less than 0.05 doesn't mean the result is useful for anything! Enjoy.
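To make the distinction concrete, here is a small sketch in plain Python (standard library only; all numbers invented for illustration). Two groups whose true means differ by a trivial 0.01 units still produce a minuscule p-value once the samples are large enough — significant, but operationally worthless.

```python
import math
import random

def two_sample_z_test(a, b):
    """Large-sample two-sample test: returns (mean difference, two-sided
    p-value). Uses the normal approximation, which is fine at this n."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return ma - mb, p

random.seed(1)
# Two groups whose true means differ by a mere 0.01 units...
a = [random.gauss(100.00, 1.0) for _ in range(1_000_000)]
b = [random.gauss(100.01, 1.0) for _ in range(1_000_000)]
diff, p = two_sample_z_test(a, b)
print(f"difference = {diff:.4f}, p-value = {p:.3g}")
# ...yet with a million observations per group, even that trivial
# difference is "statistically significant" by any conventional cutoff.
```

The point of the sketch: sample size drives the p-value as much as the effect does, so "p < 0.05" by itself says nothing about whether the difference is big enough to act on.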
Labels: humor, statistics
Saturday, April 25, 2009
Taking Assumptions With A Grain Of Salt
Occasionally, I come across descriptions of clustering or modeling techniques which include mention of "assumptions" being made by the algorithm. The "assumption" of normal errors from the linear model in least-squares regression is a good example. The "assumption" of Gaussian-distributed classes in discriminant analysis is another. I imagine that such assertions must leave novices with some questions and hesitation. What happens if these assumptions are not met? Can techniques ever be used if their assumptions are not tested and met? How badly can the assumption be broken before things go horribly wrong? It is important to understand the implications of these assumptions, and how they affect analysis.
In fact, such assumptions are made by the theorist who designed the algorithm, not by the algorithm itself. Most often, they are necessary for some proof of optimality to hold. Considering myself the practical sort, I do not worry too much about these assumptions. What matters to me and my clients is how well the model works in practice (which can be assessed via test data), not how well its assumptions are met. Generally, such assumptions are rarely, if ever, strictly met in practice, and most of these algorithms do reasonably well even under such circumstances. A particular modeling algorithm may well be the best one available, despite not having its assumptions met.
My advice is to be aware of these assumptions to better understand the behavior of the algorithms one is using. Evaluate the performance of a specific modeling technique, not by looking back to its assumptions, but by looking forward to expected behavior, as indicated by rigorous out-of-sample and out-of-time testing.
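As a sketch of that advice (a toy example with invented numbers, not anyone's real data): fit a least-squares line to data whose errors are deliberately skewed — violating the usual normality assumption — and judge the model by its held-out error rather than by the assumption.

```python
import random

def fit_ols(xs, ys):
    """Simple least-squares line: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

random.seed(42)
# True line y = 2x + 1, with skewed (exponential) errors centered at zero --
# a deliberate violation of the normal-errors assumption.
x = [random.uniform(0, 10) for _ in range(2_000)]
y = [2.0 * xi + 1.0 + (random.expovariate(1.0) - 1.0) for xi in x]

# Judge the model by out-of-sample error, not by checking its assumptions.
x_train, y_train = x[:1_000], y[:1_000]
x_test, y_test = x[1_000:], y[1_000:]
slope, intercept = fit_ols(x_train, y_train)
rmse = (sum((yi - (slope * xi + intercept)) ** 2
            for xi, yi in zip(x_test, y_test)) / len(x_test)) ** 0.5
print(f"slope={slope:.3f}, intercept={intercept:.3f}, test RMSE={rmse:.3f}")
```

Despite the broken assumption, the fitted slope lands close to the true value of 2 and the test RMSE is close to the error's true spread — which is exactly the out-of-sample evidence I would trust over any assumption check.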
Labels: model assessment, statistics
Tuesday, October 23, 2007
Follow-Up to: Statistics: Why Do So Many Hate It?
In a question posted Oct-14-2007 to Yahoo! Answers, user lifetimestudentofmath asked:
How would you run this regression?
A relationship between beer expenditure and income was tested. The relationship may be qualitatively affected by gender. How would you test the hypothesis that women spend less money on beer than men?
My guess is that this is a homework question, and that the teacher wants students to use a dummy variable to represent gender, so that a simple interpretation of gender's coefficient will reveal the answer.
In reality, of course, the interaction of income and gender may yield a more nuanced answer. What if two regressions were performed, one for men and the other for women, with income as the predictor and beer expenditure as the target, and the regression lines crossed? Such a result precludes so simple a response as "men spend more on beer".
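A quick simulation makes the point (all numbers here are hypothetical, chosen only so that the two regression lines cross within the observed income range):

```python
import random

def fit_line(xs, ys):
    """Least-squares slope and intercept for one group."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

random.seed(7)
income = [random.uniform(20, 100) for _ in range(500)]  # hypothetical, $1000s
# Hypothetical generating process in which the lines cross (at income = 60):
men   = [(x, 0.0 + 0.60 * x + random.gauss(0, 5)) for x in income]
women = [(x, 30.0 + 0.10 * x + random.gauss(0, 5)) for x in income]

m_slope, m_int = fit_line(*zip(*men))
w_slope, w_int = fit_line(*zip(*women))
crossing = (w_int - m_int) / (m_slope - w_slope)
print(f"men:   spend = {m_int:.1f} + {m_slope:.3f} * income")
print(f"women: spend = {w_int:.1f} + {w_slope:.3f} * income")
print(f"lines cross near income = {crossing:.0f}")
# Below the crossing income women spend more; above it, men do --
# so no single "who spends more on beer" answer exists.
```

A single dummy-variable coefficient would average over this crossing and hide it; separate regressions (or an income-by-gender interaction term) reveal it.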
This question suggests another reason so many people hate statistics: its subtlety. The annoying thing about reality (which is the subject of statistical study) is that it is so complicated. Even things which seem simple will often reveal surprisingly complex behavior. The problem is that people don't want complicated answers. My response is that it is foolish to expect simple solutions to complicated problems, but the fundamental, irreducible complexity of reality, which is mirrored in statistics, also drives negative feelings toward statistics.
Labels: hate, statistics
Wednesday, October 17, 2007
Statistics: Why Do So Many Hate It?
In Why is Statistics So Scary?, the Sep-26-2007 posting to the Math Stats And Data Mining Web log, the author wonders why so many people exhibit negative reactions to statistics.
I've had occasion to wonder about the same thing. I make my living largely from statistics, and have frequently received unfavorable reactions when I explain my work to others. Invariably, such respondents admit the great usefulness of statistics, so that is not the source of this negativity. I am certain that individual natural aptitude for this sort of work varies, but I do not believe that this accounts for the majority of negative feelings towards statistics.
Having received formal education in what I call "traditional" or "classical" statistics, and having since assisted others studying statistics in the same context, I suggest that one major impediment for many people is the total reliance by classical statisticians on a large set of very narrowly focused techniques. While they serve admirably in many situations, it is worth noting the disadvantages of classical statistical techniques:
1. Being so highly specialized, there are many of these techniques to remember.
2. It is also necessary to remember the appropriate applications of these techniques.
3. Broadly, classical statistics involves many assumptions. Violation of said assumptions may invalidate the results of these techniques.
Classical techniques were developed largely during a time without the benefit of rapid, inexpensive computation, which is very different from the environment we enjoy today.
The above were major motivations for me to embrace newer analytical methods (data mining, bootstrapping, etc.) in my professional life. Admittedly, newer methods have disadvantages of their own (not the least of which is their hunger for data), but it's been my experience that newer methods tend to be easier to understand, more broadly applicable and, consequently, simpler to apply.
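As a sketch of why those newer methods feel simpler, here is a minimal percentile-bootstrap confidence interval in plain Python (invented sample; no distributional assumptions, no formula lookup — just resampling and computation):

```python
import random

def bootstrap_ci(data, stat, n_boot=5_000, alpha=0.05):
    """Percentile bootstrap confidence interval for any statistic:
    resample the data with replacement, recompute the statistic each
    time, and read off the empirical percentiles."""
    reps = sorted(
        stat([random.choice(data) for _ in data]) for _ in range(n_boot)
    )
    lo = reps[int(n_boot * alpha / 2)]
    hi = reps[int(n_boot * (1 - alpha / 2))]
    return lo, hi

random.seed(0)
# A small, skewed sample, where classical normal-theory intervals are shaky.
sample = [random.expovariate(1.0) for _ in range(40)]
mean = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_ci(sample, mean)
print(f"sample mean = {mean(sample):.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Note that the same twelve-line function works unchanged for the median, a correlation, or any other statistic you can compute — one broadly applicable tool in place of a shelf of specialized formulas, which is precisely the trade I describe above. The cost, as noted, is hunger for data and computation.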
I think the broader educational question is: Would students be better served by one or more years of torture, imperfectly or incorrectly learning myriad methods which will soon be forgotten, or by the provision of a few widely useful tools and an elementary level of understanding?
Labels: statistics, torture