Friday, November 03, 2006

In Praise of Simple Models

At a time when ever more subtle and complex modeling techniques are emerging, it is interesting to note the continuing effectiveness of comparatively simple modeling methods, such as logistic regression. In my work in the finance industry over the years, I have found repeated success with such methods. One reason for their success is that real data may not be as complex as is commonly believed, especially in high-dimensional spaces; see the literature on Holte's 1R algorithm for an example. Also, labeling a model or modeling technique as "simple" can be deceptive. Outside of clinical or academic settings, most real-world models based on transformed (or even untransformed) linear functions are helped by derived features, which can make the model function quite complex in the space of the raw variables.
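To make that last point concrete, here is a minimal sketch (in Python with scikit-learn, on made-up account data) of a logistic regression fit on derived features: the model is linear in the derived features, yet its decision boundary is curved in the space of the raw inputs, because the feature map itself is nonlinear. The variable names and the target rule are invented for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical raw variables: balance and income for 1,000 accounts.
rng = np.random.default_rng(0)
balance = rng.uniform(100, 10_000, size=1_000)
income = rng.uniform(20_000, 200_000, size=1_000)

# Derived features: a utilization-style ratio and a log transform.
X_derived = np.column_stack([balance / income, np.log(balance)])

# Synthetic target driven by the derived features (illustration only).
y = (X_derived[:, 0] * 50 + X_derived[:, 1] > 9).astype(int)

# The model is linear in the derived features ...
model = LogisticRegression(max_iter=1000).fit(X_derived, y)

# ... but its decision boundary is curved in the raw (balance, income)
# space, because balance/income and log(balance) are nonlinear maps.
print(model.coef_, model.intercept_)
```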

Some observations:

1. Simple models are typically very fast to train, allowing more time for handling other aspects of the modeling problem, such as attribute selection.

2. Most non-technical people are more comfortable with explanations of linear models than of any other kind. In regulated industries, this is a tremendous benefit.

3. When most modelers think of "simple models", linear regression, linear discriminants and logistic regression come to mind, but there are other, less well-known options, such as extreme-value regression (also called complementary log-log regression). Indeed, the transfer function, where one is needed, can be any monotonic function. Further, the linear portion of the model need not be fit by traditional methods: for one commercial application, I trained a linear model with a global optimizer to maximize class separation (AUC). That model proved highly effective and has now been in service for 18 months (see the first sketch after this list).

4. Complex models can be built out of collections of simple ones (see the second sketch after this list). One of my recent responsibilities was to create and maintain a predictive model used in the management of several billion dollars' worth of assets. Any individual case falls into one of 20 mutually-exclusive segments, each with its own logistic regression.
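First, a rough sketch of the idea in point 3: fitting the coefficients of a plain linear scoring function with a global optimizer so as to maximize AUC. The commercial application itself is not described here, so the data are synthetic and scipy's differential evolution is simply a stand-in for whatever optimizer was actually used. Note that AUC depends only on the ranking of the scores, so no transfer function is needed during the search.

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

# Stand-in data; the real application's inputs are not public.
X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)

def negative_auc(weights):
    """Score cases with a plain linear function and return -AUC.

    AUC is a ranking measure, so the raw linear score is enough;
    no logistic (or other) transfer function is required here.
    """
    scores = X @ weights
    return -roc_auc_score(y, scores)

# Global search over the linear coefficients to maximize class separation.
result = differential_evolution(
    negative_auc,
    bounds=[(-5.0, 5.0)] * X.shape[1],
    seed=0,
    maxiter=50,
)

print("best AUC found:", -result.fun)
```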
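Second, a sketch of the segmented arrangement in point 4: one logistic regression per mutually-exclusive segment, with each case routed to its segment's model at scoring time. The segment labels, features and three-segment setup below are invented for illustration; the production model described above uses 20 segments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class SegmentedLogistic:
    """One logistic regression per mutually-exclusive segment."""

    def fit(self, X, y, segment):
        # segment: integer label assigning each case to exactly one segment.
        self.models_ = {
            s: LogisticRegression(max_iter=1000).fit(X[segment == s], y[segment == s])
            for s in np.unique(segment)
        }
        return self

    def predict_proba(self, X, segment):
        # Route each case to the model trained on its segment.
        out = np.empty(len(X))
        for s, model in self.models_.items():
            mask = segment == s
            out[mask] = model.predict_proba(X[mask])[:, 1]
        return out

# Toy usage with 3 segments and made-up data.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
segment = rng.integers(0, 3, size=600)
y = (X[:, 0] + rng.normal(size=600) > 0).astype(int)

probs = SegmentedLogistic().fit(X, y, segment).predict_proba(X, segment)
print(probs[:5])
```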

I recommend Generalized Linear Models by McCullagh and Nelder for readers interested in exploring this subject further.
