Monday, March 14, 2005

Use Priors to Balance Class Counts

One well-known difficulty in building classification models occurs when one class vastly outnumbers the others. For example, if the output (target) variable has 95% 0s and 5% 1s, a neural network could predict that every record is 0 and achieve 95% accuracy. Of course, this is a meaningless model. The problem arises when there are contradictions in the data, that is, when records with the same input pattern have output values of both 0 and 1. If more of those records have an output value of 0, the classifier will choose 0 as the more likely answer.
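To make the arithmetic concrete, here is a minimal sketch in Python. The 100-record target column is made up to match the 95%/5% split above, and the "model" is simply a rule that always answers 0.

```python
# Hypothetical target column: 95 zeros and 5 ones, matching the 95%/5% example.
targets = [0] * 95 + [1] * 5

# A "model" that ignores its inputs and always predicts the majority class.
predictions = [0] * len(targets)

# Fraction of records predicted correctly.
accuracy = sum(p == t for p, t in zip(predictions, targets)) / len(targets)

# Fraction of the actual 1s that the model identified.
ones_found = sum(p == 1 and t == 1 for p, t in zip(predictions, targets))
recall_on_ones = ones_found / targets.count(1)

print(accuracy)        # 0.95
print(recall_on_ones)  # 0.0
```

The accuracy looks respectable, but the model never finds a single 1, which is usually the class of interest.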

The most common correction when building neural networks for data with a large class imbalance is simply to balance the counts of 0s and 1s, either by removing excess records with 0s or by duplicating records with 1s. That approach will be covered in a future issue of Abbott Insights™.
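As a rough illustration of the two sampling options, the sketch below uses only the Python standard library. The record layout (an id paired with a target value) and the fixed random seed are assumptions made for the example, not part of any particular tool.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical records as (record_id, target) pairs: 95 zeros and 5 ones.
records = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]
zeros = [r for r in records if r[1] == 0]
ones = [r for r in records if r[1] == 1]

# Option 1: remove excess 0s by keeping a random subset the size of the 1s.
undersampled = rng.sample(zeros, len(ones)) + ones

# Option 2: duplicate 1s (sampling with replacement) until they match the 0s.
oversampled = zeros + rng.choices(ones, k=len(zeros))

print(len(undersampled), len(oversampled))  # 10 190
```

Note the trade-off: the first option discards 90 of the 95 records with 0s, while the second keeps everything but repeats each record with a 1 many times.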

However, some algorithms can accomplish this balancing without sampling: you specify the expected prior probability of each class value (the priors). The CART decision tree algorithm is one algorithm with settings to do this. The advantage is that no data is thrown away, yet the classifier won’t favor the overrepresented class value over the underrepresented one.
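One way to see how priors can change a tree's answer is to look at a single node. The sketch below assumes a commonly described formulation in which each class's count in the node is divided by that class's total count and then weighted by its prior; the function name and all counts are hypothetical, and the settings in any particular CART implementation may work differently.

```python
def node_class_probs(node_counts, total_counts, priors):
    """Class probabilities at one tree node, reweighted by priors.

    node_counts[j]  : records of class j that land in this node
    total_counts[j] : records of class j in the whole training set
    priors[j]       : assumed prior probability of class j
    """
    weighted = {j: priors[j] * node_counts[j] / total_counts[j]
                for j in node_counts}
    norm = sum(weighted.values())
    return {j: w / norm for j, w in weighted.items()}


# Hypothetical training set with a 95%/5% split, and one node within the tree.
total_counts = {0: 950, 1: 50}
node_counts = {0: 60, 1: 20}  # 0s still outnumber 1s in this node

# Priors equal to the observed class shares reproduce the raw proportions.
by_data = node_class_probs(node_counts, total_counts, {0: 0.95, 1: 0.05})

# Equal priors judge each class by the share of its own records in the node.
by_equal = node_class_probs(node_counts, total_counts, {0: 0.5, 1: 0.5})

print(max(by_data, key=by_data.get))    # 0
print(max(by_equal, key=by_equal.get))  # 1
```

With equal priors the node is labeled 1, because it holds 40% of all the 1s but only about 6% of all the 0s, and no records had to be removed or duplicated to get there.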