keras BatchNormalization axis clarification

The confusion is due to the meaning of axis in np.mean versus in BatchNormalization.

When we take the mean along an axis, we collapse that dimension and preserve all other dimensions. In your example data.mean(axis=0) collapses the 0-axis, which is the vertical dimension of data.
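As a quick illustration of the NumPy convention, here is a minimal sketch. The array below is a hypothetical stand-in for `data` from the question (3 samples by 4 features), not the asker's actual values.

```python
import numpy as np

# Hypothetical stand-in for `data`: 3 samples (rows) by 4 features (columns).
data = np.arange(12, dtype=np.float64).reshape(3, 4)

col_mean = data.mean(axis=0)  # collapses axis 0: one value per column
row_mean = data.mean(axis=1)  # collapses axis 1: one value per row

print(col_mean.shape)  # (4,)
print(row_mean.shape)  # (3,)
```

The axis passed to `mean` is the one that disappears from the result's shape.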

When we compute a BatchNormalization along an axis, we preserve the dimensions of the array, and we normalize with respect to the mean and standard deviation over every other axis. So in your 2D example BatchNormalization with axis=1 is subtracting the mean for axis=0, just as you expect. This is why bn.moving_mean has shape (4,).
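To make the "kept axis" convention concrete, the following is a NumPy emulation of the behaviour described above, not the Keras implementation. The helper `normalize_keeping_axis`, the random input, and the `eps` value are assumptions made for illustration only.

```python
import numpy as np

def normalize_keeping_axis(x, axis, eps=1e-3):
    """Emulate the 'axis is kept' convention: reduce over every other axis."""
    keep = axis % x.ndim
    reduce_axes = tuple(i for i in range(x.ndim) if i != keep)
    mean = x.mean(axis=reduce_axes, keepdims=True)
    var = x.var(axis=reduce_axes, keepdims=True)
    out = (x - mean) / np.sqrt(var + eps)
    # Drop the reduced dimensions so the statistics are 1-D, one per kept index.
    return out, mean.squeeze(axis=reduce_axes), var.squeeze(axis=reduce_axes)

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(3, 4))  # hypothetical (samples, features)

out, mean, var = normalize_keeping_axis(data, axis=1)

print(out.shape)   # (3, 4): the input shape is preserved
print(mean.shape)  # (4,): one mean per feature
print(np.allclose(mean, data.mean(axis=0)))  # True
```

Keeping axis 1 here gives exactly the statistics that `data.mean(axis=0)` produces, which is why the two conventions look contradictory at first glance.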


I know this post is old, but I am still answering it because the confusion still lingers in the Keras documentation. I had to go through the code to figure this out:

  1. The axis variable, which is documented as being an integer, can actually be a list of integers denoting multiple axes. For example, if my input were an image in the NHWC or NCHW format, I would provide axis=[1,2,3] to perform BatchNormalization in the way that the OP wants (i.e. normalize across the batch dimension only).
  2. The axis list (or integer) should contain the axes that you do not want to reduce while calculating the mean and variance. In other words, it is the complement of the axes along which you want to normalize - quite the opposite of what the documentation appears to say if you go by the conventional definition of 'axes'. For example, if your input I had shape (N,H,W,C) or (N,C,H,W) - i.e. the first dimension was the batch dimension - and you only wanted the mean and variance to be calculated across the batch dimension, you should supply axis=[1,2,3]. This will cause Keras to calculate mean M and variance V tensors of shape (1,H,W,C) or (1,C,H,W) respectively - i.e. the batch dimension gets marginalized/reduced by the aggregation (the mean or variance is calculated across the first dimension). In later operations like (I-M) and (I-M)/sqrt(V + epsilon), the first dimension of M and V gets broadcast to all N samples of the batch.
  3. The BatchNorm layer ends up calling tf.nn.moments with axes=(0,) in this example! That is because tf.nn.moments uses the conventional definition of axes, i.e. the axes that are reduced.
  4. Similarly, tf.nn.moments calls tf.reduce_mean, where again the definition of axes is the conventional one (i.e. the opposite of tf.keras.layers.BatchNormalization).
  5. That said, the BatchNormalization paper suggests normalizing across the HxW spatial map in addition to the batch dimension (N). Hence, if one were to follow that advice, axis would only include the channel dimension (C), because that is the only remaining dimension that you do not want to reduce. The Keras documentation is probably alluding to this, although it is quite cryptic.
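The complement relationship in the points above can be sketched in plain NumPy. This is an illustration only and not the Keras source: the helper name `reduction_axes` and the input shape are hypothetical.

```python
import numpy as np

def reduction_axes(ndim, axis):
    """Return the axes that get reduced, given the axes that are kept."""
    kept = [axis] if isinstance(axis, int) else list(axis)
    kept = {a % ndim for a in kept}  # normalise negative indices
    return tuple(i for i in range(ndim) if i not in kept)

x = np.zeros((8, 5, 6, 3))  # hypothetical NHWC batch: N=8, H=5, W=6, C=3

# Keep H, W and C: statistics are taken across the batch dimension only.
batch_only = reduction_axes(x.ndim, [1, 2, 3])
print(batch_only)                                     # (0,)
print(x.mean(axis=batch_only, keepdims=True).shape)   # (1, 5, 6, 3)

# Keep only the channel axis: statistics are taken across N, H and W.
per_channel = reduction_axes(x.ndim, -1)
print(per_channel)                                    # (0, 1, 2)
print(x.mean(axis=per_channel, keepdims=True).shape)  # (1, 1, 1, 3)
```

The first case corresponds to points 2 and 3, the second to the per-channel scheme in point 5.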

If your mini-batch is an m x n matrix A, i.e. m samples and n features, the mean and variance should be taken along axis=0 (in the NumPy sense, over the samples). As you said, what we want is to normalize every feature individually. The default in Keras is axis=-1 because, when the layer is used after a convolution layer, the dimensions of an image dataset are usually (samples, width, height, channel), and the batch is normalized separately for each index of the channel axis (the last axis).