What is the difference between Flatten() and GlobalAveragePooling2D() in Keras?

That both seem to work doesn't mean they do the same thing.

Flatten takes a tensor of any shape and transforms it into a one-dimensional tensor (plus the samples dimension), keeping all values in the tensor. For example, a tensor of shape (samples, 10, 20, 1) is flattened to (samples, 10 * 20 * 1) = (samples, 200).
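In code, that example looks like this (a minimal sketch):

import tensorflow as tf

x = tf.random.uniform(shape=(4, 10, 20, 1))
print(tf.keras.layers.Flatten()(x).shape)  # (4, 200): all 200 values per sample are kept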

GlobalAveragePooling2D does something different. It averages over the spatial dimensions (the 2nd and 3rd, with channels last), so values are not kept as they are but averaged. By default the spatial dimensions are dropped from the output: a tensor of shape (samples, 10, 20, 1) comes out as (samples, 1). With keepdims=True the output would be (samples, 1, 1, 1) instead.
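And the corresponding shapes for GlobalAveragePooling2D (assuming TF 2.6+, where the keepdims argument is available):

gap = tf.keras.layers.GlobalAveragePooling2D()
print(gap(x).shape)  # (4, 1): spatial dimensions averaged away, channel axis kept

gap_keep = tf.keras.layers.GlobalAveragePooling2D(keepdims=True)
print(gap_keep(x).shape)  # (4, 1, 1, 1): spatial dimensions kept with size 1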


What a Flatten layer does

After convolutional operations, tf.keras.layers.Flatten will reshape a tensor into (n_samples, height*width*channels), for example turning (16, 28, 28, 3) into (16, 2352). Let's try it:

import tensorflow as tf

# a batch of 100 random "images" of shape 28 x 28 x 3
x = tf.random.uniform(shape=(100, 28, 28, 3), minval=0, maxval=256, dtype=tf.int32)

flat = tf.keras.layers.Flatten()

flat(x).shape
TensorShape([100, 2352])
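Since Flatten is just a reshape, we can confirm that every value is kept (a quick sanity check, not part of the original example):

# Flatten is equivalent to reshaping to (n_samples, -1): same values, new shape
manual = tf.reshape(x, (100, -1))
print(tf.reduce_all(manual == flat(x)).numpy())  # True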

What a GlobalAveragePooling2D layer does

After convolutional operations, tf.keras.layers.GlobalAveragePooling2D averages all the values along the spatial axes, keeping only the last (channel) axis. The resulting shape is (n_samples, n_channels). For instance, if your last convolutional layer had 64 filters, it would turn (16, 7, 7, 64) into (16, 64). Let's test it after a few convolutional operations:

import tensorflow as tf

# Conv2D expects floating-point inputs, hence the cast
x = tf.cast(
    tf.random.uniform(shape=(16, 28, 28, 3), minval=0, maxval=256, dtype=tf.int32),
    tf.float32)

gap = tf.keras.layers.GlobalAveragePooling2D()

# five 3x3 convolutions with 64 filters and valid padding;
# each one shrinks the spatial dimensions by 2
for i in range(5):
    conv = tf.keras.layers.Conv2D(64, 3)
    x = conv(x)
    print(x.shape)

print(gap(x).shape)
(16, 26, 26, 64)
(16, 24, 24, 64)
(16, 22, 22, 64)
(16, 20, 20, 64)
(16, 18, 18, 64)

(16, 64)
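Under the hood this is just a mean over the two spatial axes, which we can verify (a quick sanity check, not part of the original run):

# GlobalAveragePooling2D is equivalent to a mean over the height and width axes
manual = tf.reduce_mean(x, axis=[1, 2])
print(bool(tf.reduce_all(tf.abs(manual - gap(x)) < 1e-5)))  # True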

Which should you use?

Neither layer has trainable parameters on its own; what differs is the size of the Dense layer that follows. A Dense layer after Flatten will always have at least as many parameters as one after GlobalAveragePooling2D. If the final tensor shape before flattening is still large, for instance (16, 240, 240, 128), Flatten produces 240 * 240 * 128 = 7,372,800 features, and that huge number gets multiplied by the number of units in your next Dense layer! In that situation GlobalAveragePooling2D is usually preferred. If you used MaxPooling2D and Conv2D so much that your tensor shape before flattening is already something like (16, 1, 1, 128), it won't make a difference. And if you're overfitting, you might want to try GlobalAveragePooling2D.
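To see the effect on parameter counts, here is a sketch with a hypothetical (240, 240, 128) feature map and a 10-unit Dense head; the counts follow the usual Dense formula, features * units + units biases:

import tensorflow as tf

def head(layer):
    # a tiny model: the given pooling/flattening layer followed by a Dense layer
    return tf.keras.Sequential([
        tf.keras.Input(shape=(240, 240, 128)),
        layer,
        tf.keras.layers.Dense(10),
    ])

# Flatten: 240 * 240 * 128 * 10 + 10 = 73,728,010 parameters
print(head(tf.keras.layers.Flatten()).count_params())

# GlobalAveragePooling2D: 128 * 10 + 10 = 1,290 parameters
print(head(tf.keras.layers.GlobalAveragePooling2D()).count_params())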