Tuesday, November 07, 2006

Family Recipe For Neural Networks

I continue to read accounts of neural networks (specifically multilayer perceptrons) which characterize them as "slow" to train or "difficult" to configure. I suspect that most such descriptions are written by people with relatively little experience using neural networks. I believe that people who have enjoyed success with neural networks, like myself, have developed some standard procedure for configuring and training them. What follows is my basic neural network recipe. Season to taste.

Training to "Convergence"

Don't. Use early stopping instead. Use a separately drawn "tuning" data set to track the model fit. Train until the error on the "tuning" set reaches a minimum, then stop. Training further implies over-fitting.

The other sensible alternative is to train until convergence, but limit the number of hidden neurons. This is theoretically acceptable, but in practice requires longer training runs and more experimentation than early stopping.
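The early-stopping logic itself is only a few lines. Here is an illustrative sketch (the function name and the `patience` parameter are my own additions, not from the recipe): track the tuning-set error after each pass over the data, remember the pass with the lowest error, and stop once the error has failed to improve for a few passes in a row.

```python
import numpy as np

def early_stopping_epoch(tuning_errors, patience=5):
    """Return (best_epoch, best_error) under early stopping.

    `tuning_errors` stands in for the per-pass error measured on a
    separately drawn "tuning" set; a real run would compute each value
    after a pass over the training data.  Stop once the tuning error
    has not improved for `patience` consecutive passes, and report the
    pass with the minimum tuning error seen so far.
    """
    best_err = np.inf
    best_epoch = 0
    since_best = 0
    for epoch, err in enumerate(tuning_errors):
        if err < best_err:
            best_err, best_epoch, since_best = err, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # tuning error has bottomed out; stop training
    return best_epoch, best_err

# Simulated tuning error: falls, then rises as over-fitting sets in.
errs = [1.0, 0.7, 0.5, 0.45, 0.44, 0.46, 0.50, 0.55, 0.61, 0.70]
epoch, err = early_stopping_epoch(errs, patience=3)
# Training stops shortly after pass 4, where the error was lowest.
```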

Number of Hidden Layers

Use one hidden layer, always. In my experience, stacking on extra layers rarely improves performance enough to justify the computational effort. The time spent on longer training times and experimentation with the combination of 2 or more hidden layers is better spent on other performance-boosting efforts, such as dreaming up better derived features.

Number of Hidden Neurons

Training with early stopping means that over-fitting should not be an issue, so I generally don't worry about having "too many" hidden neurons unless memory use or training time really get out of hand. Since early stopping typically means shorter training runs (fewer passes over the data) than "training to convergence", one can afford a little experimentation here.

Surprisingly small hidden layers (as few as, say, 5 neurons) can often be effective, and are quick to train. If this does not produce satisfactory results, make the hidden layer larger by perhaps 50% at a time.
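That growth schedule is easy to write down. A sketch (the helper name, the cap of 200 neurons, and the exact rounding are my own arbitrary choices, not part of the recipe):

```python
def hidden_layer_schedule(start=5, grow=1.5, max_neurons=200):
    """Candidate hidden-layer sizes: start small, grow ~50% per try.

    Try each size in order until results are satisfactory; the cap
    just keeps the experiment from getting out of hand.
    """
    sizes = []
    n = start
    while n <= max_neurons:
        sizes.append(n)
        # Grow by ~50%, always by at least one neuron.
        n = max(n + 1, int(round(n * grow)))
    return sizes
```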

Scaling Inputs

If your software does not scale the input variables automatically, it may be beneficial to do so even though it is theoretically unnecessary. Wildly different input variable ranges can slow training.
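If the software leaves this to you, standardizing each input column is a common way to do it (a sketch; the function name is mine, and standardization is one reasonable choice among several):

```python
import numpy as np

def standardize(X):
    """Scale each input column to zero mean and unit variance.

    Returns the scaled matrix plus the per-column mean and standard
    deviation, which must be saved and re-applied to any new data
    fed to the trained network.
    """
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)  # guard against constant columns
    return (X - mu) / sd, mu, sd

# Two inputs with wildly different ranges:
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])
Xs, mu, sd = standardize(X)
```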

Scaling Outputs

If your software does not do this automatically, this is necessary, since neural networks have a limited output range: either 0.0 to 1.0 or -1.0 to +1.0. Incidentally, unless I have a good reason, I don't bother with slightly reduced scales (like 0.001 to 0.999) as some authors suggest.
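Mapping the targets into the network's output range and back is a one-liner each way. A sketch assuming the 0.0-to-1.0 case (the function names are mine):

```python
import numpy as np

def scale_targets(y, lo, hi):
    """Map targets from [lo, hi] into the network's 0-to-1 range."""
    return (y - lo) / (hi - lo)

def unscale_outputs(ys, lo, hi):
    """Map network outputs back to the original target units."""
    return lo + ys * (hi - lo)

y = np.array([10.0, 20.0, 30.0])
ys = scale_targets(y, 10.0, 30.0)      # now in [0, 1]
back = unscale_outputs(ys, 10.0, 30.0)  # back in original units
```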

There are many other "tricks" one can use to turbo-charge performance, naturally, but this basic recipe has served me well over the years. Despite other analysts' stories of needing thousands of passes over the data, in my experience (with many thousands of training observations and over one hundred inputs), training with early stopping will often complete in fewer than 50 passes. Your mileage, of course, may vary.

What I've been reading lately: Common Errors in Statistics (and How to Avoid Them) Second Edition by Phillip I. Good and James W. Hardin (ISBN: 0471794317).


Anonymous said...

Another good reference to make ANN work is the book by Orr and Müller: Neural Networks: Tricks of the Trade (1999)

Dean Abbott said...

So many good "rules of thumb" here (I'll contact you directly about using some of them in my data mining course!)

I just got back from the SPSS Directions conference (I'll post on that in a day or two) and heard several times the usual mantra about neural networks being too difficult to use, or too much of a "black box". I disagree, and will soon post why I believe that is the case as well. Thanks for the insights.

Dean Abbott said...

The scaling tip is interesting. About 15 years ago I used to do the same with neural networks--take classification problems and rather than having the net predict "0" and "1", I'd change the outputs to "0.05" and "0.95" or so, and the network converged faster. Why? I suspected it was two things. First, the "0" and "1" values were at the fringe of the tails of the sigmoid, where the derivative information is smallest. Second, a lot of software ran in single-precision mode then, and perhaps the lack of precision made the derivatives essentially 0.

At any rate, I don't do this anymore and don't think it matters as much these days, but haven't tested it again for awhile.

Alkanen said...

Regarding stopping early:
How do you ensure satisfactory results when you stop early?

Do you pick a new training/testing set pair and repeat the fine-tuning again ad infinitum until the error is small enough?

I've just recently begun using ANN, just for fun, and I've found that if I stop as the test set values begin to deteriorate, the output from the nets will be way too bad to be useful.

Will Dwinnell said...

I divide the data before training into "train" and "test" sets.
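That split can be sketched in a few lines (the 25% tuning fraction and the function name are my own assumptions; the post doesn't specify a fraction):

```python
import numpy as np

def split_train_tune(X, y, tune_frac=0.25, seed=0):
    """Randomly partition data into a training set and a separately
    drawn "tuning" set used to decide when to stop training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_tune = int(len(X) * tune_frac)
    tune, train = idx[:n_tune], idx[n_tune:]
    return X[train], y[train], X[tune], y[tune]

X = np.arange(20.0).reshape(10, 2)
y = np.arange(10)
X_train, y_train, X_tune, y_tune = split_train_tune(X, y)
```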