I continue to read accounts of neural networks (specifically multilayer perceptrons) which characterize them as "slow" to train or "difficult" to configure. I suspect that most such descriptions are written by people with relatively little experience using neural networks. I believe that people who have enjoyed success with neural networks, like myself, have developed some standard procedure for configuring and training them. What follows is my basic neural network recipe. Season to taste.
Training to "Convergence"
Don't. Use early stopping instead: track the model fit on a separately drawn "tuning" data set, train until the error on that set reaches a minimum, then stop. Training any further only over-fits.
The other sensible alternative is to train until convergence, but limit the number of hidden neurons. This is theoretically acceptable, but in practice requires longer training runs and more experimentation than early stopping.
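For those who like to see code, here is a tiny sketch of the early-stopping approach described above, using scikit-learn's MLPRegressor on made-up data; the library, the parameter settings, and the data are only stand-ins for whatever you actually use.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Made-up training data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

net = MLPRegressor(
    hidden_layer_sizes=(5,),   # one small hidden layer (see below)
    early_stopping=True,       # hold out part of the data as a "tuning" set
    validation_fraction=0.15,  # size of that tuning set
    n_iter_no_change=10,       # stop once the tuning score stops improving
    max_iter=500,
)
net.fit(X, y)
print(net.n_iter_, "passes over the data were needed")

The n_iter_no_change setting is just one convenient way of deciding that the tuning error has reached its minimum; other packages expose the same idea under different names.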
Number of Hidden Layers
Use one hidden layer, always. In my experience, stacking on extra layers rarely improves performance enough to justify the computational effort. The time spent on longer training runs and on experimenting with combinations of 2 or more hidden layers is better spent on other performance-boosting efforts, such as dreaming up better derived features.
Number of Hidden Neurons
Training with early stopping means that over-fitting should not be an issue, so I generally don't worry about having "too many" hidden neurons unless memory use or training time really gets out of hand. Since early stopping typically means shorter training runs (fewer passes over the data) than "training to convergence", one can afford a little experimentation here.
Surprisingly small hidden layers (as few as, say, 5 neurons) can often be effective, and are quick to train. If this does not produce satisfactory results, make the hidden layer larger by perhaps 50% at a time.
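Here is a rough sketch of that search, growing the hidden layer by about 50% at a time and keeping whichever size gives the lowest error on the tuning set; again, scikit-learn and the invented data are only stand-ins for whatever you actually use.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Made-up data, purely for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 20))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=2000)
X_train, X_tune, y_train, y_tune = train_test_split(X, y, test_size=0.2, random_state=0)

size, tuning_error = 5, {}
while size <= 40:                      # an arbitrary ceiling for this sketch
    net = MLPRegressor(hidden_layer_sizes=(size,), early_stopping=True,
                       max_iter=500, random_state=0)
    net.fit(X_train, y_train)
    tuning_error[size] = mean_squared_error(y_tune, net.predict(X_tune))
    size = int(round(size * 1.5))      # enlarge by roughly 50% each step

print("lowest tuning error at", min(tuning_error, key=tuning_error.get), "hidden neurons")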
Scaling the Input Variables
If your software does not scale the input variables automatically, it may be beneficial to do so even though it is theoretically unnecessary. Wildly different input variable ranges can slow training.
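A quick sketch of what I mean, using scikit-learn's StandardScaler (zero mean, unit variance) on two invented columns with wildly different ranges:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = np.column_stack([
    rng.normal(0.0, 1.0, size=500),          # roughly -3 to +3
    rng.normal(50000.0, 20000.0, size=500),  # a wildly different range
])

scaler = StandardScaler().fit(X)   # learn the means and spreads from the training data
X_scaled = scaler.transform(X)     # apply the same transform to any later data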
Scaling the Target Variable
If your software does not scale the target variable automatically, doing so yourself is necessary, as neural networks have a limited output range, either 0.0 to 1.0 or -1.0 to +1.0. Incidentally, unless I have a good reason, I don't bother with slightly reduced scales (like 0.001 to 0.999), as some authors suggest.
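If you do have to handle this yourself, the arithmetic is simple; here is a sketch with numpy and an invented target, squeezing it into 0.0 to 1.0 and mapping predictions back afterwards:

import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(100.0, 25.0, size=500)          # raw target, far outside 0.0 to 1.0

y_min, y_max = y.min(), y.max()
y_scaled = (y - y_min) / (y_max - y_min)       # now inside 0.0 to 1.0 for training

# ... train the network on y_scaled, then rescale its outputs ...
pred_scaled = y_scaled                         # stand-in for the network's predictions
pred = pred_scaled * (y_max - y_min) + y_min   # back in the target's original units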
There are many other "tricks" one can use to turbo-charge performance, naturally, but this basic recipe has served me well over the years. Despite other analysts' stories of needing thousands of passes over the data, in my experience (with many thousands of training observations and over one hundred inputs), training with early stopping will often complete in fewer than 50 passes. Your mileage, of course, may vary.
What I've been reading lately: Common Errors in Statistics (and How to Avoid Them) Second Edition by Phillip I. Good and James W. Hardin (ISBN: 0471794317).