Thursday, November 30, 2006

Error Measures

All models must be assessed somehow. Despite the existence of a bewildering array of performance measures, much commercial modeling software provides a surprisingly limited range of options. I will provide a short introduction of such measures in this article.

Numerical Models (Regressions)

Mean Squared Error (MSE) is by far the most common measure of numerical model performance. It is simply the average of the squares of the differences between the predicted and actual values. It is a reasonably good measure of performance, though it could be argued that it overemphasizes the importance of larger errors. Many modeling procedures directly minimize the MSE.

Mean Absolute Error (MAE) is similar to the Mean Squared Error, but it uses absolute values instead of squaring. This measure is not as popular as MSE, though its meaning is more intuitive (the "average error").

Bias is the average of the differences between the predicted and actual values. With this measure, positive errors cancel out negative ones. Bias is intended to assess how much higher or lower predictions are, on average, than actual values.

Mean Absolute Percent Error (MAPE) is the average of the absolute errors, as a percentage of the actual values. This is a relative measure of error, which is useful when larger errors are more acceptable on larger actual values.


Classifiers come in two basic varieties: those which produce class outputs, and those which produce probabilities of classes.

Classifiers: Class Output

Accuracy is the proportion of the time that the predicted class equals the actual class, usually expressed as a percentage. It's meaning is straightforward, but may obscure important differences in costs associated with different errors. The classic example of such costs is the medical diagnostic situation, in which one can err be either: 1. keeping a healthy patient in the hospital (low cost), or 2. sending home a sick patient (very high cost).

Classifiers: Probability Output

These classifiers need to be checked for both the accuracy of their probabilities (Do cases predicted to have a 5% (30%, 80%, etc.) probability really belong to the target class 5% (30%, 80%, etc.) of the time?) and their ability to separate the classes in question.

Accuracy can be measured using many of the same metrics used to evaluate numerical models (MSE, MAE, etc.). One interesting alternative which is specific to classification, the informational loss, is based on information theory and is described in Data Mining by Witten and Frank (ISBN 1-55860-552-5).

Some applications (as in marketing) are focused on how many items from the target class can be identified in the best so-many percent of the population. If for example, one only has the resources to mail marketing literature to 10% of the customer file, the ideal would be to pack as many actual respondents as possible into that best 10%. The mirror situation is typified by lenders who wish to cram as many bad loans as possible into the worst 10% of their file. Probably the most popular measure of class separation at present in the literature is the Area Under the ROC Curve (AUC or AUROC), which is like measuring separation across the whole spectrum.

The intrepid data miner is invited to explore these performance measures and related topics on his or her own:

confusion matrix
sensitivity and specificity

Tuesday, November 14, 2006

Ensembles everywhere

After reading the ensembles of cluster article recently, I just say another ariticle in IEEE PAMI entitled "Rotation Forest: A New Classifier Ensemble Method". The approach is interesting: much like Random Forests (where the diversity of trees used in the ensemble are developed by both bootstrap sampling and random variable selection), there is a random selection of variables to use in trees. But the twist here (and the "rotation") is by using PCA on the random subset of candidate variables.

I'm sure there are near infinite ways to tweak the ideas of random record/variable selection, but once again the keys to the success of ensembles here and always are:
1) Diversity in information (i.e., data) the modeling algorithm sees.
2) Algorithms that are weak learners benefit from ensembles

Random Forests, it seems to me, works well because not only are trees coarse, blunt (and unstable) predictors, but they are greedy searches that can be fooled into going down a sub-optimal path. By constraining the splits to contain only some of the variables, the tree is forced out of it's greedy perspective to consider other ways to achieve the solution. This new algorithm does the same thing, with the twist of using PCA to develop linear projections of the original data (subsets to be more precise).

I think we'll be seing more and more variations on the same theme in the coming years.

Free And Inexpensive Data Mining Software

I recently came across an article on-line which included the claim that data mining was "hugely expensive". I disagree. Given reasonably capable desktop hardware, and a qualified data miner (who is the most expensive component of data mining cost!), a variety of capable data mining software packages are available for free or relatively little (meaning < US$100). I cannot vouch that any one of these tools is a good fit for your specific application, but they are certainly worth a look:

SNNS (Stuttgart Neural Network Simulator)
YALE (Yet Another Learning Environment)

There are some commercial tools which sell for less than US$250, such as:


Of course, there is always the "roll-your-own" approach, in which the data miner constructs or gathers his or her own tools. The internet houses a wealth of source code in a variety of languages. Aside from searching on general-purpose Internet search engines for things like:

"quadratic discriminant" "source code"


backpropagation Java

...there are also source code repositories, such as:

Google Code
MATLAB Central

Sunday, November 12, 2006

Seeking the "Best" Model or, Data Mining's El Dorado

As their explorations of South America in the 1500s progressed, Spanish explorers encountered stories of El Dorado, originally "a man covered in gold", but eventually coming to mean a lost city of gold. These stories varied in the details, but at the center of all of them was the gold. Such legends circulated widely and provoked considerable interest. Significant energy (and time and money and lives) were spent in an effort to discover the location of El Dorado, without success.

There apparently was a real ritual in part of South America involving a king covered in gold dust, which perhaps gave rise to the legend. It has also been theorized that the people inhabiting lands being consumed by the Spanish empire essentially told the Spanish whatever they wanted to hear- whatever it took to get them to go away. Either way, the Spanish searched in vain for something that wasn't there.

Today, organizations search for models which approximate reality. A frequent feature of this search, particularly prevalent among novices is the search for the "best" model. Models vary in their performance, with some clearly being "better" than others. It is natural enough, then, to think that data mining is an undertaking whose ideal outcome is the "best" model.

Some time ago, I was being interviewed by auditors regarding methods I used in building models of customer behavior. They asked questions about various aspects data gathering, model building and testing methods. Their questions eventually came around to the issue of how I knew that my models were the best possible. I told them that I did not know that my models were the best possible. I told them that, in fact, I knew that my models were not the best possible. I said that I knew that my models were technically sound, had historically been reliable over long periods of time and that they answered the needs of the business. Searching for a "best possible" model, I asserted, was an academic exercise.

How much time a given modeling effort deserves is an open question, and I am not advocating doing a bad job. Remember, though, that with enough data, a search for better models could continue indefinitely, but this would be a fool's errand. There will always be modeling techniques and derived variables not yet explored. That a deployed model fails to meet some unknown and theoretical ideal is only important as a missed opportunity (but a potentially very expensive one to explore). Much more important is whether a model meets a business need and whether it is likely to continue to do so. This implies that time in data mining is better spent validating models, rather than exhaustively chasing trivial improvements in apparent performance.

Small News Item: I have started another, tool-specific Web log, Data Mining in MATLAB.

Tuesday, November 07, 2006

Family Recipe For Neural Networks

I continue to read accounts of neural networks (specifically multilayer perceptrons) which characterize them as "slow" to train or "difficult" to configure. I suspect that most such descriptions are written by people with relatively little experience using neural networks. I believe that people who have enjoyed success with neural networks, like myself, have developed some standard procedure for configuring and training them. What follows is my basic neural network recipe. Season to taste.

Training to "Convergence"

Don't. Use early stopping instead. Use a separately drawn "tuning" data set to track the model fit. Train until the error on the "tuning" set reaches a minimum, then stop. Training further implies over-fitting.

The other sensible alternative is to train until convergence, but limit the number of hidden neurons. This is theoretically acceptable, but in practice requires longer training runs and more experimentation than early stopping.

Number of Hidden Layers

Use one hidden layer, always. In my experience, stacking on extra layers rarely improves performance enough to justify the computational effort. The time spent on longer training times and experimentation with the combination of 2 or more hidden layers is better spent on other performance-boosting efforts, such as dreaming up better derived features.

Number of Hidden Neurons

Training with early stopping means that over-fitting should not be an issue, so I generally don't worry about having "too many" hidden neurons unless memory use or training time really get out of hand. Since early stopping typically means shorter training runs (less passes over the data) than "training to convergence", one can afford a little experimentation here.

Surprisingly small hidden layers (as few as, say, 5 neurons) can often be effective, and are quick to train. If this does not produce satisfactory results, make the hidden layer larger by perhaps 50% at a time.

Scaling Inputs

If your software does not scale the input variables automatically, it may be beneficial to do so even though it is theoretically unneccessary. Wildly different input variable ranges can slow training.

Scaling Outputs

If your software does not do this automatically, this is neccessary as neural networks have a limited output range, either 0.0 to 1.0 or -1.0 to +1.0. Incidentally, unless I have a good reason, I don't bother with slightly reduced scales (like 0.001 to 0.999) as some authors suggest.

There are many other "tricks" one can use to turbo-charge performance, naturally, but this basic recipe has served me well over the years. Despite other analysts' stories of needing thousands of passes over the data, in my experience (with many thousands of training observations and over one hundred inputs), training with early stopping will often complete in less than 50 runs. Your mileage, of course, may vary.

What I've been reading lately: Common Errors in Statistics (and How to Avoid Them) Second Edition by Phillip I. Good and James W. Hardin (ISBN: 0471794317).

Sunday, November 05, 2006

Keep Your Eye On The Problem

An occupational hazard of statistical modeling is to view the model as the solution to the problem. Direct application of modeling in an effort to leap directly from the available data to a final result is not always feasible or efficient. Models are always part of a solution, and guaging how large a role they play in the solution is important.

Consider the solution described in the paper, Fuzzy Decision System for Threshold Selection to Cluster Cauliflower Plant Blobs From Field Visual Images, by L. García-Pérez, M. C. García-Alegre, J. Marchant and T. Hague, which involves an agricultural machine vision problem. One fundamental step in the solution of this particular problem is segmentation of the digitized image into two classes: plants and background. In this case, the authors have elected to employ a single threshold of brightness to be applied to the entire image, above which pixels will be regarded as plants, and below which pixels will be regarded as background.

A complete solution to this problem accepts image data as input and produces a suitable threshold value as output. Having gotten in the habit of viewing the model as the solution, it is easy for the modeler to prematurely conclude that his or her model should, by itself, bridge this entire gap.

In the paper mentioned above, however, the authors took a different perspective. They constructed a model (a "policy", really- in this case composed of fuzzy logic rules) which indicates only by how much a given threshold should be incremented or decremented to improve performance. The threshold is initialized and updated by iteratively firing the model to modify the threshold. This is essentially a search across candidate threshold values. When the threshold value stabilizes (when the model indicates zero suggested change in the threshold), the process is finished and the final threshold is output. Note that the model does not directly determine the threshold. This model accepts image data and a candidate threshold as input and produces a suggested change for the threshold value as output. The total solution, which does accept image data as input and produces a suitable threshold value as output is composed of the fuzzy logic model as well as a simple looping search mechanism.

I suggest that it is worth trying to find the best way to leverage the power of modeling, whether as a complete solution unto itself or otherwise. Re-arranging what, exactly, is treated as model input and model output is one way to accomplish this.

Friday, November 03, 2006

In Praise of Simpler Models -- at least in practice

Originally, this was goingt o be a comment, but it became so long, that I'm just posting it.

I'll never forget the hubbub that was generated by a talk by Pedros Domingos, I think at the 1998 KDD conference in New York City. His talk, "Occam's Two Razors: The Sharp and the Blunt" turned the usually calm and measured data mining crowd into an agitated hoard! I mean there was even yelling there (and if you've never been to a conference of statisticians or data miner, you won't know how unbelievable this scene was. This is from the Abstract, also posted at Citeseer in the link above:

Occam's razor has been the subject of much controversy. This paper argues
that this is partly because it has been interpreted in two quite different ways,
the first of which (simplicity is a goal in itself) is essentially correct,
while the second (simplicity leads to greater accuracy) is not. The paper
reviews the large variety of theoretical arguments and empirical evidence for
and against the "second razor," and concludes that the balance is strongly
against it.

John Elder provided a very useful and reasoned explanation for the apparent difficulties with Occam's Razor here (in a summary of KDD '98 presented at my favorite conference, the Symposium on the Interface). To quote,

Jianming Ye presented an intriguing alternative metric, Generalized
Degrees of Freedom, which finds the sensitivity of the fitted values to
perturbations in the outputs [5]. This allows complexity to be measured
and compared in an entirely experimental manner [Ye, J. (March 1998). On Measuring and Correcting the Effects of Data Mining and Model Selection. Journal of the American Statistical Association 93, no. 441, pp. 120-131.]

The key idea is this: just because a model has lots of weights or splits, doesn't mean it is acting in a complex way. John has shown subsequently (at the most recent Salford Systems conference in San Diego, March 2006) that using the GDoF idea, ensembles of trees can actually behave less complex than individual trees within the ensemble.

So I agree with the idea that simpler models are usually better, and for that reason, I always try linear models first. However, sometimes the best models in terms of accuracy are complex looking, but in reality behave in relatively simple ways.

In Praise of Simple Models

At a time when ever more subtle and complex modeling techniques are emerging, it is interesting to note the continuing effectiveness of comparatively simple modeling methods, such as logistic regression. In my work over the past so many years in the finance industry, I have found repeated success employing such methods. One reason for their success is that data may not be as complex as is commonly believed, especially in high-dimensional spaces. See writings on Holte's 1R algorithm as an example. Also, labeling a model or modeling technique as "simple" may be deceptive. With the exception of clinical or academic settings, the majority of real-world models based on transformed (or even un-transformed) linear functions are helped by the use of derived features which make the model function potentially very complex in the space of raw variables.

Some observations:

1. Simple models are typically very fast to train, allowing more time for handling other aspects of the modeling problem, such as attribute selection.

2. Most non-technical people are more comfrotable with explanations of linear-based models than any other kind. In regulated industries, this is of tremendous benefit.

3. When most modelers think of "simple models", linear regression, linear discriminants and logistic regression come to mind, but there are other, less-well known options, such as extreme-value regression (also called complementary log-log regression). Indeed, the transfer function, where one is needed, can be any monotonic function. Further, the linear portion of the model need not be fit by traditional methods. I trained a linear model via a global optimizer to maximize class separation (AUC) for a commercial application. This model was found to be highly effective and presently has been in service for 18 months.

4. Complex models can be built out of collections of simple ones. One of my recent responsibilities was to create and maintain a predictive model used in the management of several billion dollars worth of assets. Any individual case falls into one of 20 mutually-exclusive segements, each with its own logistic regression.

I recommend Generalized Linear Models by McCullagh and Nelder for readers interested in exploring this subject further.