In many fields, it is common to find a gap between theorists and practitioners. As stereotypes, theorists have a reputation for sniffing at anything which has not been optimized and proven to the nth degree, while practitioners show little interest in theory, as it "only ever works on paper".

I have been amazed at both extremes of this spectrum. Academic and standards journals seem to publish mostly articles that either solve theoretical problems which will never arise in practice (but which permit solutions that are elegant or can be optimized to some ridiculous level) or offer solutions that are trivial variations on previous work. The same goes for most master's and doctoral theses. On the other hand, I was shocked when software development colleagues (consultants: the last word in practice over theory) were unfamiliar with two's complement arithmetic.

Data mining is certainly not immune to this problem. Not long ago, I came upon technical documentation for a linear regression which had been "fixed" by a logarithmic transformation of the dependent variable. (There is a correct way to fit coefficients in this circumstance, but that was not done in this case.) Even more astounding was the polynomial curve fit which was applied to "undo" the log transformation, to get back to the original units! Sadly, the practitioners in question did not even recognize the classic symptom of their error: residuals were much larger at the high end of their plots.
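To make the log-transformation issue concrete, here is a minimal sketch in Python. It is not the documentation's method, and it may not be the correction alluded to above; it assumes a single predictor, multiplicative noise, and synthetic data. It fits the regression on log(y), then returns to the original units by exponentiating instead of fitting a second curve. The function names (`fit_line`, `fit_log_linear`) and all data values are hypothetical. The smearing factor (mean of the exponentiated residuals, often attributed to Duan) is one common way to correct the bias from naive back-transformation.

```python
import math
import random


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def fit_log_linear(xs, ys):
    """Fit log(y) = a + b*x; return coefficients, smearing factor, predictor."""
    log_ys = [math.log(y) for y in ys]  # requires every y > 0
    a, b = fit_line(xs, log_ys)
    residuals = [ly - (a + b * x) for x, ly in zip(xs, log_ys)]
    # Smearing factor: average of exp(residual) on the log scale.
    # exp(a + b*x) alone tracks the conditional median, not the mean.
    smear = sum(math.exp(r) for r in residuals) / len(residuals)

    def predict(x, mean=True):
        base = math.exp(a + b * x)  # back-transform by exponentiating
        return base * smear if mean else base

    return a, b, smear, predict


# Synthetic data (an assumption for illustration): y grows exponentially
# in x with multiplicative noise, so errors in original units widen at
# the high end even though they are constant on the log scale.
rng = random.Random(0)
xs = [i / 10 for i in range(1, 201)]
ys = [math.exp(0.5 + 0.3 * x + rng.gauss(0, 0.4)) for x in xs]

a, b, smear, predict = fit_log_linear(xs, ys)
print(f"slope on log scale: {b:.3f}, smearing factor: {smear:.3f}")
```

Calling `predict(x, mean=False)` gives the plain exponentiated fit, and `predict(x)` applies the smearing factor. With data like this, residuals in the original units grow with the fitted value. That is expected behavior of a multiplicative model, and it is better judged on the log scale than patched with a second curve fit.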

Data miners (statisticians, quantitative analysts, forecasters, etc.) come from a variety of fields, and enjoy diverse levels of formal training. Grounding in theory follows suit. The people we work for typically are capable of identifying only the most egregious technical errors in our work. This sets the stage for potential problems.

As a practitioner, I have found much that is useful in theory and suggest that it is a fountain worth returning to from time to time. Reviewing new developments in our field and searching for useful techniques and guidance will benefit data miners, regardless of their seniority.

## Friday, September 24, 2010

## Tuesday, September 07, 2010

### DM Radio - Predictive Analytics and Fraud Detection

I'll be on DM Radio Thursday September 9 at 3pm EDT. Here's the blurb:

How many ways to catch a thief? More and more, thanks to predictive analytics, data-as-a-service and other clever computing tricks. Stopping fraud in its tracks can save customers, money and more. Tune into this episode of DM Radio to find out how. We'll hear from Eric Siegel, Prediction Impact; Erick Brethenoux, SPSS; Jason Trunk, Quest Software and Dean Abbott, Abbott Analytics.

## Thursday, September 02, 2010

### Leo Breiman quote about statisticians

One nice thing about having to move offices is that it forces you to go through old papers and folders. I found my folder containing KDD 97 conference notes, including quotes in the tutorial by David Hand from Leo Breiman (1995):

One problem in the field of statistics has been that everyone wants to be a theorist. Part of this is envy - the real sciences are based on mathematical theory. In the universities for this century, the glamor and prestige has been in mathematical models and theorems, no matter how irrelevant.

I love this quote because it highlights the divide between the practical and the elegant or sophisticated. Data mining and predictive analytics are "low-brow" sciences, empirical, and practical. That doesn't mean that the mathematics aren't important; they are very much so. But while we wait for the elegances of a theory to trickle down to us, we still need solutions.

In courses I teach, one of my objectives is to take the mathematics of the algorithms and translate the practical meaning of what they do into understandable pieces so that practitioners can manipulate learning rates and hidden units, Gini and twoing, radial kernels and polynomial kernels. Understanding backprop isn't important to most practitioners, but understanding how one can improve the performance of backprop is very much a key topic for practitioners.

We need more Breimans to pave the way toward practical innovations in predictive modeling.
