Estimating the number of neurons and number of layers of an artificial neural network

This is a really hard problem.

The more internal structure a network has, the better that network will be at representing complex solutions. On the other hand, too much internal structure is slower, may cause training to diverge, or lead to overfitting -- which would prevent your network from generalizing well to new data.

People have traditionally approached this problem in several different ways:

  1. Try different configurations, see what works best. You can divide your training set into two pieces -- one for training, one for evaluation -- and then train and evaluate different approaches. Unfortunately it sounds like in your case this experimental approach isn't available.

  2. Use a rule of thumb. A lot of people have come up with a lot of guesses as to what works best. Concerning the number of neurons in the hidden layer, people have speculated that (for example) it should (a) be between the input and output layer size, (b) set to something near (inputs+outputs) * 2/3, or (c) never larger than twice the size of the input layer.

    The problem with rules of thumb is that they don't always take into account vital pieces of information, like how "difficult" the problem is, what the size of the training and testing sets are, etc. Consequently, these rules are often used as rough starting points for the "let's-try-a-bunch-of-things-and-see-what-works-best" approach.

  3. Use an algorithm that dynamically adjusts the network configuration. Algorithms like Cascade Correlation start with a minimal network, then add hidden nodes during training. This can make your experimental setup a bit simpler, and (in theory) can result in better performance (because you won't accidentally use an inappropriate number of hidden nodes).

There's a lot of research on this subject -- so if you're really interested, there is a lot to read. Check out the citations on this summary, in particular:

  • Lawrence, S., Giles, C.L., and Tsoi, A.C. (1996), "What size neural network gives optimal generalization? Convergence properties of backpropagation". Technical Report UMIACS-TR-96-22 and CS-TR-3617, Institute for Advanced Computer Studies, University of Maryland, College Park.

  • Elisseeff, A., and Paugam-Moisy, H. (1997), "Size of multilayer networks for exact learning: analytic approach". Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp.162-168.


In practice, this is not difficult (based on having coded and trained dozens of MLPs).

In a textbook sense, getting the architecture "right" is hard--i.e., to tune your network architecture such that performance (resolution) cannot be improved by further optimization of the architecture is hard, i agree. But only in rare cases is that degree of optimization required.

In practice, to meet or exceed the prediction accuracy from a neural network required by your spec, you almost never need to spend a lot of time with the network architecture--three reasons why this is true:

  • most of the parameters required to specify the network architecture are fixed once you have decided on your data model (number of features in the input vector, whether the desired response variable is numerical or categorical, and if the latter, how many unique class labels you've chosen);

  • the few remaining architecture parameters that are in fact tunable, are nearly always (100% of the time in my experience) highly constrained by those fixed architecture parameters--i.e., the values of those parameters are tightly bounded by a max and min value; and

  • the optimal architecture does not have to be determined before training begins, indeed, it is very common for neural network code to include a small module to programmatically tune the network architecture during training (by removing nodes whose weight values are approaching zero--usually called "pruning.")

enter image description here

According to the Table above, the architecture of a neural network is completely specified by six parameters (the six cells in the interior grid). Two of those (number of layer type for the input and output layers) are always one and one--neural networks have a single input layer and a single output layer. Your NN must have at least one input layer and one output layer--no more, no less. Second, the number of nodes comprising each of those two layers is fixed--the input layer, by the size of the input vector--i.e., the number of nodes in the input layer is equal to the length of the input vector (actually one more neuron is nearly always added to the input layer as a bias node).

Similarly, the output layer size is fixed by the response variable (single node for numerical response variable, and (assuming softmax is used, if response variable is a class label, the the number of nodes in the output layer simply equals the number of unique class labels).

That leaves just two parameters for which there is any discretion at all--the number of hidden layers and the number of nodes comprising each of those layers.

The Number of Hidden Layers

if your data is linearly separable (which you often know by the time you begin coding a NN) then you don't need any hidden layers at all. (If that's in fact the case, i would not use a NN for this problem--choose a simpler linear classifier). The first of these--the number of hidden layers--is nearly always one. There is a lot of empirical weight behind this presumption--in practice very few problems that cannot be solved with a single hidden layer become soluble by adding another hidden layer. Likewise, there is a consensus is the performance difference from adding additional hidden layers: the situations in which performance improves with a second (or third, etc.) hidden layer are very small. One hidden layer is sufficient for the large majority of problems.

In your question, you mentioned that for whatever reason, you cannot find the optimum network architecture by trial-and-error. Another way to tune your NN configuration (without using trial-and-error) is 'pruning'. The gist of this technique is removing nodes from the network during training by identifying those nodes which, if removed from the network, would not noticeably affect network performance (i.e., resolution of the data). (Even without using a formal pruning technique, you can get a rough idea of which nodes are not important by looking at your weight matrix after training; look for weights very close to zero--it's the nodes on either end of those weights that are often removed during pruning.) Obviously, if you use a pruning algorithm during training then begin with a network configuration that is more likely to have excess (i.e., 'prunable') nodes--in other words, when deciding on a network architecture, err on the side of more neurons, if you add a pruning step.

Put another way, by applying a pruning algorithm to your network during training, you can much closer to an optimized network configuration than any a priori theory is ever likely to give you.

The Number of Nodes Comprising the Hidden Layer

but what about the number of nodes comprising the hidden layer? Granted this value is more or less unconstrained--i.e., it can be smaller or larger than the size of the input layer. Beyond that, as you probably know, there's a mountain of commentary on the question of hidden layer configuration in NNs (see the famous NN FAQ for an excellent summary of that commentary). There are many empirically-derived rules-of-thumb, but of these, the most commonly relied on is the size of the hidden layer is between the input and output layers. Jeff Heaton, author of "Introduction to Neural Networks in Java" offers a few more, which are recited on the page i just linked to. Likewise, a scan of the application-oriented neural network literature, will almost certainly reveal that the hidden layer size is usually between the input and output layer sizes. But between doesn't mean in the middle; in fact, it is usually better to set the hidden layer size closer to the size of the input vector. The reason is that if the hidden layer is too small, the network might have difficultly converging. For the initial configuration, err on the larger size--a larger hidden layer gives the network more capacity which helps it converge, compared with a smaller hidden layer. Indeed, this justification is often used to recommend a hidden layer size larger than (more nodes) the input layer--ie, begin with an initial architecture that will encourage quick convergence, after which you can prune the 'excess' nodes (identify the nodes in the hidden layer with very low weight values and eliminate them from your re-factored network).