Monday, February 07, 2005

Create Three Sampled Data Sets, not Two

One often sees an appeal to split data into two data sets for modeling: a training set and a testing set. The training set is used to build a model, and the testing set is used to assess the model. If the model accuracy on the training set is good, but on the testing set is poor, one has a good indication that the model has been overfit, or in other words, the model has picked up on patterns in the modeling data that are specific to the training data. In this case, the best course of action is to adjust parameters in the modeling algorithm so that a simpler model is created, whether it means fewer inputs in a model (for neural networks, regression, nearest neighbor, etc.), or fewer nodes or splits in the model (neural networks or decision trees). Then, retrain and retest the data to see if results have improved, particularly for the testing data.

However, if one does this several times, or even dozens of times (which is common), the testing data ceases to be an independent assessment of model performance because the testing data was used to change the inputs or algorithm parameters. Therefore, it is strongly recommended to have a third dataset to perform a final validation. This validation step should occur only after training and testing have provided confidence that the model is good enough to deploy.