Thursday, November 30, 2006

Error Measures

All models must be assessed somehow. Despite the existence of a bewildering array of performance measures, much commercial modeling software provides a surprisingly limited range of options. I will provide a short introduction to such measures in this article.

Numerical Models (Regressions)

Mean Squared Error (MSE) is by far the most common measure of numerical model performance. It is simply the average of the squares of the differences between the predicted and actual values. It is a reasonably good measure of performance, though it could be argued that it overemphasizes the importance of larger errors. Many modeling procedures directly minimize the MSE.

Mean Absolute Error (MAE) is similar to the Mean Squared Error, but it uses absolute values instead of squaring. This measure is not as popular as MSE, though its meaning is more intuitive (the "average error").

Bias is the average of the differences between the predicted and actual values. With this measure, positive errors cancel out negative ones. Bias is intended to assess how much higher or lower predictions are, on average, than actual values.

Mean Absolute Percent Error (MAPE) is the average of the absolute errors, as a percentage of the actual values. This is a relative measure of error, which is useful when larger errors are more acceptable on larger actual values.
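
The four measures above are simple enough to compute by hand. The following is a minimal sketch in plain Python, not taken from any particular software package; the function names and the three-case sample data are made up for illustration. Errors are taken as predicted minus actual, so a positive bias means predictions run high on average.

```python
def mse(actual, predicted):
    """Mean Squared Error: average of squared differences."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean Absolute Error: average of absolute differences."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def bias(actual, predicted):
    """Average signed difference; positive and negative errors cancel."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percent Error; undefined if any actual value is zero."""
    ratios = (abs(p - a) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * sum(ratios) / len(actual)


# Hypothetical data: the errors are +10, -10 and +40.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 440.0]

print("MSE :", mse(actual, predicted))
print("MAE :", mae(actual, predicted))
print("Bias:", bias(actual, predicted))
print("MAPE:", mape(actual, predicted))
```

On this sample, the single error of 40 accounts for 1600 of the 1800 total squared error, which is the emphasis on larger errors mentioned under MSE. The same error is only 10% of its actual value, so it weighs no more in the MAPE than the error of 10 on the actual value of 100.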


Classifiers come in two basic varieties: those which produce class outputs, and those which produce probabilities of classes.

Classifiers: Class Output

Accuracy is the proportion of the time that the predicted class equals the actual class, usually expressed as a percentage. Its meaning is straightforward, but it may obscure important differences in the costs associated with different errors. The classic example of such costs is the medical diagnostic situation, in which one can err by either: 1. keeping a healthy patient in the hospital (low cost), or 2. sending home a sick patient (very high cost).
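
To make the hospital example concrete, here is a small sketch in Python. The cost figures (1 for keeping a healthy patient, 50 for sending home a sick one) are invented purely for illustration, as are the two sets of predictions; only the idea of unequal error costs comes from the discussion above.

```python
def accuracy(actual, predicted):
    """Proportion of cases where the predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def average_cost(actual, predicted, cost):
    """Average cost per case, looked up by (actual, predicted) pair."""
    return sum(cost[(a, p)] for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical costs, keyed by (actual class, predicted class).
cost = {
    ("sick", "sick"): 0,
    ("healthy", "healthy"): 0,
    ("healthy", "sick"): 1,    # healthy patient kept in the hospital
    ("sick", "healthy"): 50,   # sick patient sent home
}

actual = ["sick", "sick", "healthy", "healthy", "healthy"]
model_a = ["sick", "healthy", "healthy", "healthy", "healthy"]  # sends one sick patient home
model_b = ["sick", "sick", "sick", "healthy", "healthy"]        # keeps one healthy patient

for name, predicted in (("A", model_a), ("B", model_b)):
    print(name, accuracy(actual, predicted), average_cost(actual, predicted, cost))
```

Both hypothetical models make exactly one mistake in five cases, so their accuracy is identical, yet under the assumed costs model A is fifty times as expensive per case as model B.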

Classifiers: Probability Output

These classifiers need to be checked for both the accuracy of their probabilities (Do cases predicted to have a 5% (30%, 80%, etc.) probability really belong to the target class 5% (30%, 80%, etc.) of the time?) and their ability to separate the classes in question.
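
One simple way to check the accuracy of the probabilities is to group cases into bins by predicted probability and compare the average prediction in each bin with the observed rate of the target class. The sketch below assumes equal-width bins and 0/1 outcomes; the function name, the bin count and the sample data are all illustrative choices rather than a standard procedure.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Return (count, mean predicted probability, observed rate) per non-empty bin.

    probs    -- predicted probabilities of the target class, each in [0, 1]
    outcomes -- 1 if the case belongs to the target class, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            rows.append((len(members), mean_p, rate))
    return rows


# Hypothetical scores: four cases predicted at 10%, four at 90%.
probs = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 1, 1, 1, 0]

for count, mean_p, rate in calibration_table(probs, outcomes):
    print(count, round(mean_p, 3), round(rate, 3))
```

In this made-up sample the cases predicted at 10% actually belong to the target class 25% of the time, and those predicted at 90% do so 75% of the time, so the probabilities are too extreme even though the two groups are ranked correctly.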

Accuracy can be measured using many of the same metrics used to evaluate numerical models (MSE, MAE, etc.). One interesting alternative which is specific to classification, the informational loss, is based on information theory and is described in Data Mining by Witten and Frank (ISBN 1-55860-552-5).
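
As a sketch of both ideas for a two-class problem: the MSE applied to predicted probabilities and 0/1 outcomes (commonly called the Brier score), and the informational loss, which, as I understand the definition, is the negative base-2 logarithm of the probability the model assigned to the class that actually occurred. Averaging the loss over cases is my own choice here, and the function names are hypothetical.

```python
import math


def brier_score(probs, outcomes):
    """MSE between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def informational_loss(probs, outcomes):
    """Average of -log2(probability assigned to the actual class), in bits.

    Raises ValueError if the model gave the actual class a probability of
    zero, since the loss is infinite in that case.
    """
    total = 0.0
    for p, y in zip(probs, outcomes):
        p_actual = p if y == 1 else 1.0 - p
        total += -math.log2(p_actual)
    return total / len(probs)


# A model that always says 50% loses exactly one bit per case.
print(informational_loss([0.5, 0.5], [1, 0]))
print(brier_score([0.5, 0.5], [1, 0]))
```

Unlike MSE, the informational loss grows without bound as the probability given to the actual class approaches zero, so it punishes confident wrong predictions very heavily.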

Some applications (as in marketing) are focused on how many items from the target class can be identified in the best so-many percent of the population. If, for example, one only has the resources to mail marketing literature to 10% of the customer file, the ideal would be to pack as many actual respondents as possible into that best 10%. The mirror situation is typified by lenders who wish to cram as many bad loans as possible into the worst 10% of their file. Probably the most popular measure of class separation at present in the literature is the Area Under the ROC Curve (AUC or AUROC), which summarizes separation across the whole range of possible cutoffs rather than at a single one.
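
Both views can be sketched in a few lines of Python. The first function reports what share of all target-class cases falls in the top-scoring fraction of the file; the second computes the AUC through its pairwise interpretation, namely the proportion of (target, non-target) pairs in which the target case receives the higher score, with ties counted as half. The function names and the ten-case sample are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def capture_rate(scores, outcomes, fraction=0.10):
    """Share of all target-class cases found in the top `fraction` by score."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(fraction * len(ranked))))
    hits_in_top = sum(y for _, y in ranked[:n_top])
    return hits_in_top / sum(outcomes)


def auc(scores, outcomes):
    """Area under the ROC curve, computed by comparing every target/non-target pair."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical file of ten customers, two of whom responded.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(capture_rate(scores, outcomes, fraction=0.2))
print(auc(scores, outcomes))
```

Dividing the capture rate by the fraction mailed gives the lift: here the top 20% of the file holds half of the respondents, a lift of 2.5.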

The intrepid data miner is invited to explore these performance measures and related topics on his or her own:

confusion matrix
sensitivity and specificity
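
As a starting point for that exploration, here is a minimal sketch of a two-class confusion matrix and the sensitivity and specificity derived from it. The function names and the eight-case sample are hypothetical; sensitivity is the proportion of actual positives predicted positive, and specificity is the proportion of actual negatives predicted negative.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, tn, fn) for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # true positive
        elif p == positive:
            fp += 1          # false positive (false alarm)
        elif a == positive:
            fn += 1          # false negative (miss)
        else:
            tn += 1          # true negative
    return tp, fp, tn, fn


def sensitivity(tp, fp, tn, fn):
    """Proportion of actual positives that were predicted positive."""
    return tp / (tp + fn)


def specificity(tp, fp, tn, fn):
    """Proportion of actual negatives that were predicted negative."""
    return tn / (tn + fp)


# Hypothetical sample: three actual positives, five actual negatives.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print("tp, fp, tn, fn:", counts)
print("sensitivity:", sensitivity(*counts))
print("specificity:", specificity(*counts))
```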


Dean Abbott said...

This is a nice summary of metrics. The problem I have with error measures, especially for comparing classifier solutions, is that they often don't measure what we're interested in. When I build a fraud detection model, I frankly don't care about what goes on with most of the data--I just want the very highest confidence or probability values to be related to fraud. Put in other terms, I want something like the top decile to perform best, even if the bottom 9 deciles are not as good.

In my data mining courses I beat this drum (perhaps too much): match the evaluation criterion one uses to select and grade models as closely as possible to the business objective. Often this will mean rank-ordering the predictions from highest to lowest, and then selecting the top N% of the list (marketing folks often use Lift, while radar and sonar folks like ROC curves to trade off false alerts with hits--these are nearly identical ways of viewing the model's predictive results).

Many data mining software packages now allow you to rank models by some criterion like this (ROC, Lift, Gains, etc.), and I think for most practitioners, this is the preferred way to assess model performance and select models.


With my very small experience with data mining involving regression problems, I think the correlation coefficient or squared correlation coefficient between outputs and expected values is the best measure of a model's accuracy. Also, I think a good model should have a good leave-one-out correlation coefficient.

Anonymous said...

If the relationship between the actual and predicted output is linear, then the correlation coefficient is a good measure.

However, if it is not, then r is at best poor, and at worst misleading. If, for example, the model errors are heteroskedastic, then, yes, you can transform the inputs and/or output to correct the problem so that the correlation coefficient or R^2 can be used as a good metric for assessing errors. But this can be time consuming in itself, especially when large numbers of inputs are candidates for the model.

But even leaving that aside, my point was that for many applications I'm involved with (direct mail/response modeling and fraud detection are just two examples), a metric that assesses a model based on all the data (like R^2 does) just isn't necessary--I only care about how well the model does on a subset of the data. For a response model, as long as the top 3 deciles provide good lift, I don't care if the rest of the file is rank-ordered well.

So I agree with you on R^2, but conditionally, and in most of, but not all of, the modeling I do, I prefer using another metric.