Thursday, May 05, 2011

Number of Hidden Layer Neurons to Use

In the linkedin.com Artificial Neural Networks group, a question arose about how many hidden neurons one should choose. I've never found a fully satisfactory answer to this, but there are quite a lot of guesses and rules of thumb out there.

I've always liked Warren Sarle's neural network FAQ, which includes a discussion on this topic.

There is another reference on the web that I agree with only about 50%, but the references are excellent: http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-10.html.

My personal preference is to use software that experiments with multiple architectures and selects the one that performs best on held-out data. Better still are the algorithms that also select (i.e., prune) inputs as well. As I teach in my courses, I've spent far too many hours of my life selecting neural network architectures and re-training, so I'd much rather let the software do it for me.
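For readers who want to try this themselves, here is a rough sketch of the idea using scikit-learn's grid search as a stand-in (this is only an illustration of automated architecture selection on held-out data, not the specific software I use):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Toy data standing in for a real modeling problem
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try several single-hidden-layer sizes and keep the one that
# scores best on held-out (cross-validated) data
param_grid = {"hidden_layer_sizes": [(3,), (5,), (8,), (12,), (18,)]}
search = GridSearchCV(
    MLPClassifier(max_iter=2000, random_state=0),
    param_grid,
    cv=5,
)
search.fit(X_train, y_train)

print("Best architecture:", search.best_params_)
print("Held-out accuracy:", search.best_score_)
print("Test accuracy:", search.score(X_test, y_test))
```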

11 comments:

Will Dwinnell said...

Assuming reasonable choices for the other architectural features (transfer function, weight initialization, etc.), my preference is to always use a single hidden layer and limit experimentation. Starting with a fairly small number of hidden nodes (3 or 4), multiply up by about 50% each iteration (more if you want to be more aggressive).

Exploring many other combinations can be extremely time consuming and is unlikely to be fruitful.
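In code, the loop I have in mind looks something like this (just a sketch, with scikit-learn as an example implementation; the 50% factor and the cap are rough guidelines, not hard rules):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Toy data standing in for a real problem
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

n_hidden = 4                       # start small (3 or 4 hidden nodes)
best_size, best_score = None, -1.0
while n_hidden <= 64:              # arbitrary cap for this sketch
    net = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                        max_iter=2000, random_state=0)
    score = cross_val_score(net, X, y, cv=5).mean()
    print(f"{n_hidden:3d} hidden nodes: {score:.3f}")
    if score > best_score:
        best_size, best_score = n_hidden, score
    n_hidden = int(round(n_hidden * 1.5))   # grow by about 50% each iteration

print("Best single hidden layer size:", best_size)
```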

Tim Manns said...

On the occasions I configured the hidden layer I always went with 1 layer, but used excessive numbers of hidden neurons. I recall reading/hearing ages ago (I have no clue of the source) that larger numbers of hidden neurons make it less likely that a sub-optimal/local solution will be found. For this reason I err on the side of caution and specify a large (in comparison to the output layer) number of hidden neurons.
As a habit I also often chuck lots of inputs at a simple backprop NN. I know I should probably account for collinearity, but that's one thing I often drop due to time constraints, focusing instead on good validation against true holdout samples. Based on this type of validation testing I haven't seen problems with being generous with the number of hidden neurons. To me it just means things run a bit slower...

TimManns

Dean Abbott said...

Will: I completely agree that combinations are time consuming. In courses I always say I've spent too many hours of my life selecting NN architectures. I prefer using a software implementation that tries them automatically, and better yet, will prune inputs that aren't useful too. The only time I use 2 hidden layers now is when I have a very complex problem and I just want to "give it a shot".

Cristian Mesiano said...

Years ago, I was playing with recursive neural networks to learn brain ERP waves.
To find the best network topology I arranged a sort of gradient technique to analyze how well different hidden layer configurations fit the expected answer.
The network model was a custom back-propagation net (but in principle it should also work with the standard ...and IMHO boring multi-layer net).
I wasn't able to find a generalized model, but analyzing the gradient descent in a specific domain I remember that the strategy worked very well.
BTW... with the SVM I left the neural network paradigm completely: too weak!
Regards
http://textanddatamining.blogspot.com/

Anonymous said...

Cristian, have you really found SVMs to yield more accurate results? In general I find that people promote SVMs over ANNs, but when you look into their experiments, the bias away from ANNs is evident and they've set up the ANN as a straw man. I've had multiple people tell me that if I used a linear SVM instead of neural nets I'd see an x% performance gain. In my own experience, though, I haven't seen performance improvements from SVMs (it could just be the domain I work in).

As for hidden units, I've found that once you hone in on the general range of hidden units that works best, there isn't much difference between small changes to the number. For example, 30 hidden units and 35 hidden units may not make much difference, though decreasing it to 15 would. I've found more variation from changes to the learning rate than I have from small hidden unit changes.
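For what it's worth, the kind of comparison I mean can be run with a quick sketch like this (hypothetical data and settings, with scikit-learn standing in for whatever net implementation you use):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=25, random_state=1)

# Compare small changes in hidden units against changes in learning rate
for n_hidden in (15, 30, 35):
    for lr in (0.0001, 0.001, 0.01):
        net = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                            learning_rate_init=lr,
                            max_iter=2000, random_state=1)
        acc = cross_val_score(net, X, y, cv=3).mean()
        print(f"hidden={n_hidden:2d}  lr={lr:g}  accuracy={acc:.3f}")
```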

Dean Abbott said...

I generally agree with Anonymous on this one--I have seen no evidence that SVMs are uniformly better than NNets either in my own work or in data mining competitions (where experts will optimize the algorithms). I fully acknowledge that I'm better at NNets than SVMs, so I look to these other indications of performance as better validation methods. Will, what do you think?

Cristian Mesiano said...

@Dean, of course we obtain better results using the methods that we know better.
Maybe my strong opinion is conditioned by my love for the statistical theory behind the SVM (with special regard for the empirical risk based on the Vapnik-Chervonenkis dimension).
BTW, reading my old notes written during the nights lost trying to find the best net topology, I found a trick I used to boost the learning of the net: basically, instead of working with the hidden layers and hidden neurons, I duplicated the expected output.
This heuristic gave my experiments good results, but I'm not sure that it can be used in every context.
More details (and some tests on a trivial example) are reported on my blog at:
http://textanddatamining.blogspot.com/2011/11/neural-nets-tips-and-tricks-add-recall.html
…sometimes a dirty trick like that can do better than complex algorithms.
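To make the idea concrete, here is a minimal sketch of the duplication trick (my original code was a custom net; scikit-learn's MLPRegressor stands in here, so take it only as an illustration of the idea, not my exact implementation):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Standard net: a single output target
single = MLPRegressor(hidden_layer_sizes=(10,), max_iter=3000, random_state=0)
single.fit(X_tr, y_tr)

# "Duplicated output" trick: train against two identical copies of the target,
# so the hidden layer receives two error paths for the same signal
y_tr_dup = np.column_stack([y_tr, y_tr])
dup = MLPRegressor(hidden_layer_sizes=(10,), max_iter=3000, random_state=0)
dup.fit(X_tr, y_tr_dup)

# Average the duplicated outputs at prediction time
pred_dup = dup.predict(X_te).mean(axis=1)
r2_dup = 1 - ((y_te - pred_dup) ** 2).sum() / ((y_te - y_te.mean()) ** 2).sum()

print("single-output R^2:   ", single.score(X_te, y_te))
print("duplicated-output R^2:", r2_dup)
```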
Cheers
cristian
(hoping that my English is understandable enough)

Will Dwinnell said...

I seriously doubt that any algorithm will best all others at the typical mix of problems. I have heard heavy promotion of SVM, but have not seen evidence that it is universally better than neural networks or anything else for that matter. I figure that SVM will eventually become another tool in the toolbox.

Dean Abbott said...

Cristian:

If I understand what you did correctly, the second output neuron will have weights that in general would be different. In general, it seems to me, this would give two paths for correction with the hidden layer neurons (though of course not with the output layer neurons).

Given this, it seems that the network would likely train faster (depending somewhat on the random initialization of weights in the output layer neurons), which is nice. But what about accuracy? At first I thought you were going to combine the predictions of these output neurons into some sort of ensemble. I've never thought of building a neural network ensemble in that way before and wonder if it could work.

Dean Abbott said...

Will:

I'm 110% with you. I suspect that most SVM proponents (who say SVM is superior to, say, a neural network) really are describing their expertise with SVM over and above their expertise with nets. I haven't seen any indication that they are inherently superior on actual problems.

Given that ensembles of many kinds have been winning these competitions, it seems that ensembles give you more than the choice of any specific algorithm does.