Neural Network returning NaN as output

At a quick look, and based on your analysis of the multiplier variants, it seems the NaN is produced by an arithmetic underflow, caused by your gradients being too small (too close to absolute 0).

This is the most suspicious part of the code:

    f[j + i * 7] = (rows[j][i] == 0 ? .5f : rows[j][i] == 1 ? 0f : 1f);

If rows[j][i] == 1 then 0f is stored. I don't know exactly how that value is handled inside the network (or even by Java), but an exact zero, or anything that decays towards zero during training, gives the arithmetic very little to work with.

Even if your code later perturbs that 0f somehow, the resulting values still run the risk of ending up too close to zero. Because of the limited precision of floating point, values very close to zero lose precision or underflow entirely, and the operations performed on them (0/0, 0 * Infinity and the like) are what end up producing the NaN.

These values have a very friendly name: subnormal numbers.

Any non-zero number with magnitude smaller than the smallest normal number is subnormal.
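
A quick way to see where that range starts and ends, in plain Java (nothing DL4J-specific; the variable names are just for illustration):

    public class SubnormalDemo {
        public static void main(String[] args) {
            float smallestNormal = Float.MIN_NORMAL;   // 1.17549435E-38, the smallest normal float
            float subnormal = smallestNormal / 2f;     // below MIN_NORMAL: still nonzero, but precision degrades
            float underflowed = Float.MIN_VALUE / 2f;  // below the smallest subnormal: rounds to 0.0

            System.out.println(subnormal);             // ~5.9E-39, a subnormal number
            System.out.println(underflowed);           // 0.0, an arithmetic underflow
        }
    }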


From IEEE_754:

As with IEEE 754-1985, the standard recommends 0 for signaling NaNs, 1 for quiet NaNs, so that a signaling NaN can be quieted by changing only this bit to 1, while the reverse could yield the encoding of an infinity.

The takeaway here is not the encoding itself, but that once your values hit zero or drift into the subnormal range, ordinary operations on them can produce a NaN, and a NaN contaminates everything computed from it afterwards.
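
As an illustration of that last point (these exact operations are hypothetical, not taken from your code), this is how a zero or underflowed value turns into a NaN and then propagates:

    public class NaNDemo {
        public static void main(String[] args) {
            float g = 0f;                                     // a stored 0f, or a gradient that underflowed to zero
            System.out.println(g / g);                        // 0/0          -> NaN
            System.out.println(g * Float.POSITIVE_INFINITY);  // 0 * Infinity -> NaN
            System.out.println((g / g) + 1f);                 // NaN + anything is still NaN: it propagates
        }
    }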


Even though the name is misleading, Float.MIN_VALUE is a positive value higher than 0: it is the smallest positive nonzero float, 2^-149 (about 1.4E-45).


The actual minimum float value is, in fact, -Float.MAX_VALUE.
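
A quick check of the three constants (for example, pasted into jshell):

    System.out.println(Float.MIN_VALUE);   // 1.4E-45        -> smallest positive (subnormal) float
    System.out.println(Float.MIN_NORMAL);  // 1.17549435E-38 -> smallest positive normal float
    System.out.println(-Float.MAX_VALUE);  // -3.4028235E38  -> the actual minimum finite float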

Is floating point math subnormal?


Normalizing the gradients

If you confirm that the issue is caused only by the 0f values, you could simply replace them with other values that represent something similar: Float.MIN_VALUE, Float.MIN_NORMAL, and so on. Do the same in any other parts of the code where this scenario could happen. Take these just as examples, and experiment with the ranges (a fuller loop sketch follows these one-liners):

    rows[j][i] == 1 ? Float.MIN_VALUE : 1f;

    rows[j][i] == 1 ? Float.MIN_NORMAL : Float.MAX_VALUE/2;

    rows[j][i] == 1 ? -Float.MAX_VALUE/2 : Float.MAX_VALUE/2;
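
Put back into the encoding loop, one of those variants could look like the sketch below. It assumes, as in your snippet, that rows[j][i] holds the cell state and that f is the float[42] input vector (a 6 x 7 board, matching nIn(42)); the loop bounds and the EPSILON value are assumptions to adjust, not taken from your code:

    // Sketch only: board encoding that avoids storing an exact 0f.
    static float[] encodeBoard(int[][] rows) {
        final float EPSILON = 1e-3f;                      // arbitrary small-but-normal stand-in for 0f
        float[] f = new float[42];                        // assumes a 6 x 7 board
        for (int i = 0; i < 6; i++) {
            for (int j = 0; j < 7; j++) {
                f[j + i * 7] = (rows[j][i] == 0 ? 0.5f    // empty cell
                              : rows[j][i] == 1 ? EPSILON // instead of 0f
                              : 1f);                      // any other cell state
            }
        }
        return f;
    }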

Even so, this could still lead to a NaN, depending on how those values are manipulated later. If that happens, you should normalize the gradients instead. You can apply a GradientNormalization strategy for this. In your network initialization, something like this should be defined for each layer (or at least for the problematic ones):

    new NeuralNetConfiguration.Builder()
        .weightInit(WeightInit.XAVIER)
        (...)
        .layer(new DenseLayer.Builder().nIn(42).nOut(30).activation(Activation.RELU)
            .weightInit(WeightInit.XAVIER)
            .gradientNormalization(GradientNormalization.RenormalizeL2PerLayer) // this
            .build())
        (...)

There are different strategies, so choose the one that fits your setup best, and decide which layers should use one (a layer-level sketch follows the list). The options are:

GradientNormalization

  • RenormalizeL2PerLayer

    Rescale gradients by dividing by the L2 norm of all gradients for the layer.

  • RenormalizeL2PerParamType

    Rescale gradients by dividing by the L2 norm of the gradients, separately for each type of parameter within the layer. This differs from RenormalizeL2PerLayer in that each parameter type (weight, bias, etc.) is normalized separately. For example, in an MLP/feed-forward network (where G is the gradient vector), the output is as follows:

        GOut_weight = G_weight / l2(G_weight)
        GOut_bias   = G_bias   / l2(G_bias)

  • ClipElementWiseAbsoluteValue

    Clip the gradients on a per-element basis. For each gradient g, set g <- sign(g) * min(maxAllowedValue, |g|); i.e., if a parameter gradient has absolute value greater than the threshold, truncate it. For example, if threshold = 5, then values in the range -5 < g < 5 are unmodified; values < -5 are set to -5; values > 5 are set to 5.

  • ClipL2PerLayer

    Conditional renormalization. Somewhat similar to RenormalizeL2PerLayer, this strategy scales the gradients if and only if the L2 norm of the gradients (for the entire layer) exceeds a specified threshold. Specifically, if G is the gradient vector for the layer, then:

        GOut = G                            if l2Norm(G) < threshold (i.e., no change)
        GOut = threshold * G / l2Norm(G)    otherwise

  • ClipL2PerParamType

    Conditional renormalization. Very similar to ClipL2PerLayer, however instead of clipping per layer, clipping is done on each parameter type separately. For example, in a recurrent neural network, input weight gradients, recurrent weight gradients and bias gradients are all clipped separately.
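
For instance, picking one of the clipping strategies above and wiring it into a single layer might look like this (a sketch against the same DL4J builder API as before; the 5.0 threshold is only an example value to tune):

    new DenseLayer.Builder()
        .nIn(42).nOut(30)
        .activation(Activation.RELU)
        .weightInit(WeightInit.XAVIER)
        .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
        .gradientNormalizationThreshold(5.0)  // per-element gradients outside [-5, 5] are truncated to +/-5
        .build();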


Here you can find a complete example of applying these GradientNormalization strategies.