How to determine the number of layers and nodes of a neural network

Rules? As many as you want, and none. Here is an excerpt from the Neural Network FAQ, which is a good page to consult for basic questions:

  How many hidden units should I use?

    There is no way to determine a good network topology just from the number of inputs and outputs. It depends critically on the number of training examples and the complexity of the classification you are trying to learn. There are problems with one input and one output that require millions of hidden units, and problems with a million inputs and a million outputs that require only one hidden unit, or none at all.
    Some books and articles offer "rules of thumb" for choosing a topology -- Ninputs plus Noutputs divided by two, maybe with a square root in there somewhere -- but such rules are total garbage. Other rules relate to the number of examples available: Use at most so many hidden units that the number of weights in the network times 10 is smaller than the number of examples. Such rules are only concerned with overfitting and are unreliable as well.

In your case, however, one can definitely say that the network is much too complex (even if you applied strong regularization). Why so many hidden layers? Start with one hidden layer -- despite the deep learning euphoria -- and with a minimum of hidden nodes. Increase the number of hidden nodes until you get good performance. Only if that is not enough would I add further layers. Further, use cross-validation and appropriate regularization.
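
For example, here is a minimal sketch of that "start small, grow, cross-validate" advice (my own illustration, not from the FAQ; it assumes scikit-learn and uses a synthetic dataset in place of your data):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPClassifier

    # Placeholder data; substitute your own feature matrix X and labels y.
    X, y = make_classification(n_samples=2000, n_features=20, n_classes=3,
                               n_informative=10, random_state=0)

    # One hidden layer, growing the number of hidden nodes; cross-validation
    # picks the smallest size (and regularization strength) that performs well.
    param_grid = {
        "hidden_layer_sizes": [(2,), (4,), (8,), (16,), (32,), (64,)],
        "alpha": [1e-4, 1e-3, 1e-2],  # L2 regularization strength
    }
    search = GridSearchCV(MLPClassifier(max_iter=1000, random_state=0),
                          param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)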


As they said, there is no "magic" rule to calculate the number of hidden layers and nodes of a neural network, but there are some tips or recommendations that can help you find the best ones.

The number of hidden nodes is based on a relationship between:

  • Number of input and output nodes
  • Amount of training data available
  • Complexity of the function that is trying to be learned
  • The training algorithm

To minimize the error and have a trained network that generalizes well, you need to pick an optimal number of hidden layers, as well as nodes in each hidden layer.

  • Too few nodes will lead to high error for your system as the predictive factors might be too complex for a small number of nodes to capture

  • Too many nodes will overfit to your training data and not generalize well

You can find some general advice on this page:

Section - How many hidden units should I use?

If your data is linearly separable, then you don't need any hidden layers at all. Otherwise, there is a fair consensus about the effect of adding hidden layers: the situations in which performance improves with a second (or third, etc.) hidden layer are very few. Therefore, one hidden layer is sufficient for the large majority of problems.
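
As a quick illustration of the "no hidden layers" case (a sketch of my own, assuming scikit-learn): a network with no hidden layer is just logistic regression, and it already fits linearly separable data perfectly.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # separable by the line x1 + x2 = 0

    # No hidden layer: a single sigmoid/softmax unit, i.e. logistic regression.
    clf = LogisticRegression().fit(X, y)
    print(clf.score(X, y))  # ~1.0 on this separable toy data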

There are some empirically derived rules of thumb; of these, the most commonly relied upon is that 'the optimal size of the hidden layer is usually between the size of the input layer and the size of the output layer'.

In sum, for most problems, one could probably get decent performance by setting the hidden layer configuration using just two rules:

  • The number of hidden layers equals one
  • The number of neurons in that layer is the mean of the number of neurons in the input and output layers (a small sketch of this heuristic follows the list).
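
For instance, a tiny (hypothetical) helper expressing those two rules:

    def rule_of_thumb_hidden_size(n_inputs: int, n_outputs: int) -> int:
        """One hidden layer sized as the mean of the input and output layers."""
        return round((n_inputs + n_outputs) / 2)

    # For the task in this question: 387 inputs, 3 outputs.
    print(rule_of_thumb_hidden_size(387, 3))  # -> 195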

Layers

As Yoshua Bengio, head of the Montreal Institute for Learning Algorithms, remarks:

"Very simple. Just keep adding layers until the test error does not improve anymore."

A method recommended by Geoff Hinton is to add layers until you start to overfit your training set. Then you add dropout or another regularization method.
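
A rough sketch of that workflow (my own illustration, assuming Keras/TensorFlow and already-split arrays X_train, y_train, X_val, y_val; the layer sizes are placeholders):

    import tensorflow as tf

    def build_model(n_features, n_classes, hidden_sizes, dropout_rate=0.0):
        model = tf.keras.Sequential([tf.keras.Input(shape=(n_features,))])
        for size in hidden_sizes:
            model.add(tf.keras.layers.Dense(size, activation="relu"))
            if dropout_rate > 0:
                model.add(tf.keras.layers.Dropout(dropout_rate))
        model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # Keep adding layers (e.g. [64], [64, 64], [64, 64, 64], ...) until the
    # training accuracy pulls well ahead of the validation accuracy, then
    # re-train with dropout_rate > 0 (e.g. 0.5) as the regularizer.
    # model = build_model(n_features, n_classes, hidden_sizes=[64, 64], dropout_rate=0.5)
    # model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50)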

Nodes

For your task:

  • The input layer should contain 387 nodes, one for each feature.
  • The output layer should contain 3 nodes, one for each class.
  • For the hidden layers, I find that gradually decreasing the number of neurons in each layer works quite well (this list of tips and tricks agrees with this when creating autoencoders for compression tasks). Perhaps try 200 in the first hidden layer and 100 in the second; again, this is a hyper-parameter to be optimised and is very dependent on dataset size. A sketch of this architecture follows the list.
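
One way to write that suggestion down (a sketch only, using Keras; 387 and 3 are fixed by the task, while 200 and 100 are the tunable hyper-parameters mentioned above):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(387,)),                     # 387 input features
        tf.keras.layers.Dense(200, activation="relu"),    # first hidden layer
        tf.keras.layers.Dense(100, activation="relu"),    # second, narrower hidden layer
        tf.keras.layers.Dense(3, activation="softmax"),   # one output node per class
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])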