There's too many men, too many people making too many problems, and not much love to go 'round. Can't you see? This is the land of confusion.
-Genesis, Land of Confusion
In my travels, I have encountered a wide variety of people who use mathematics to analyze data and make predictions. They go by a variety of titles and work in many different fields. My first job out of college was working in an econometrics group for the Port Authority of New York and New Jersey, in the Twin Towers. The emphasis there was on traditional econometric techniques. Later in my career, I worked as a consultant for SKF, a large manufacturing firm, with engineers who emphasized quality control techniques. Most recently, I have been working with bankers doing credit scoring and the like. Surprise, surprise: the bankers have their own way of doing things, too. I won't bore the reader with the myriad other diverse quantitative analysts I've met in between, because you probably already get the idea.
These industry-specific sub-disciplines of analysis developed largely in isolation and, unfortunately, most are quite parochial. For the most part, technique has become stagnant, reflecting old rules of thumb that are outdated, if they were ever valid in the first place.
Many people say that data mining (modeling, forecasting, etc.) is "part art, part science". I agree, but the science should give parameters to the art. Creativity in the combined discipline of quantitative model-building does not give license to venture beyond the absolutes that statistical science has provided. From this perspective, there are some things which should always be practiced, and some which should never be practiced: the Do and the Do Not. Everything in between is up to the taste of the analyst.
Sadly, many practitioners and even entire industries have become arthritic by establishing new, would-be "absolutes" beyond the dictates of probability theory. Some of these rules attempt to expand the Do by setting capricious limits on modeling which are not theoretically justified. The director of risk management at one credit card company once told me that a "good" model had about 8 or 10 inputs. Naturally, that is nonsense. The number of input variables should be determined by the data via appropriate testing, not some rule of thumb. Other rules try to expand the Do Not by prohibiting practices which are well established by both theory and experiment.
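To make the point concrete, here is a minimal sketch of letting the data decide how many inputs a model keeps, rather than fixing the count at "8 or 10" by fiat. It uses greedy forward selection scored by k-fold cross-validated error on an ordinary least-squares model; the synthetic data, variable names, and stopping rule are all illustrative assumptions, not a prescription.

```python
# Illustrative sketch (assumed setup): choose the number of inputs by
# cross-validated testing, not by a rule of thumb.
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 12
X = rng.normal(size=(n, p))
# Only the first three candidate inputs actually drive the response.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.8 * X[:, 2] + rng.normal(scale=0.5, size=n)

def cv_mse(X_sub, y, k=5):
    """Mean squared error of OLS under k-fold cross-validation."""
    idx = np.arange(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        A = np.column_stack([np.ones(len(train)), X_sub[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        B = np.column_stack([np.ones(len(fold)), X_sub[fold]])
        errs.append(np.mean((y[fold] - B @ coef) ** 2))
    return float(np.mean(errs))

# Greedy forward selection: keep adding whichever input most improves
# out-of-sample error, and stop when no addition helps.
selected, remaining = [], list(range(p))
best = cv_mse(np.empty((n, 0)), y)  # intercept-only baseline
improved = True
while improved and remaining:
    improved = False
    scores = {j: cv_mse(X[:, selected + [j]], y) for j in remaining}
    j_best = min(scores, key=scores.get)
    if scores[j_best] < best:
        best, improved = scores[j_best], True
        selected.append(j_best)
        remaining.remove(j_best)

print(sorted(selected))
```

On this synthetic problem the procedure recovers the genuinely informative inputs and stops; on real data the same logic might settle on 3 inputs or 30, and either answer is legitimate because it was earned by testing.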
As a data miner ("statistician", "econometrician", "forecaster", "meteorologist", "quality technician", "direct marketer", etc.), it is one's responsibility to continue to study the latest literature to understand how the collective knowledge of Do and Do Not has progressed. This is the only way to avoid arbitrary processes which both hold back empirical modeling and push it into serious mistakes.