How to normalize input data for models in TensorFlow

There are different ways of "normalizing data". Depending on which one you have in mind, it may or may not be easy to implement in your case.

1. Fixed normalization

If you know the fixed range(s) of your values (e.g. feature #1 has values in [-5, 5], feature #2 has values in [0, 100], etc.), you could easily pre-process your feature tensor in parse_example(), e.g.:

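import tensorflow as tf

def normalize_fixed(x, current_range, normed_range):
    # Convert the ranges to tensors first: plain Python lists do not support [:, 0] slicing.
    current_range = tf.convert_to_tensor(current_range, dtype=x.dtype)
    normed_range = tf.convert_to_tensor(normed_range, dtype=x.dtype)
    current_min, current_max = tf.expand_dims(current_range[:, 0], 1), tf.expand_dims(current_range[:, 1], 1)
    normed_min, normed_max = tf.expand_dims(normed_range[:, 0], 1), tf.expand_dims(normed_range[:, 1], 1)
    x_normed = (x - current_min) / (current_max - current_min)  # map to [0, 1]
    x_normed = x_normed * (normed_max - normed_min) + normed_min  # map to the target range
    return x_normed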

def parse_example(line_batch, 
                  fixed_range=[[-5, 5], [0, 100], ...],
                  normed_range=[[0, 1]]):
    # ...
    features = tf.transpose(features)
    features = normalize_fixed(features, fixed_range, normed_range)
    # ...
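
To make the arithmetic concrete, here is a small worked example in plain Python (no TensorFlow), assuming the two feature ranges mentioned above. The helper name rescale and the sample values are made up for illustration; it only mirrors the two formula lines of normalize_fixed for a single value.

```python
def rescale(value, current_range, normed_range=(0.0, 1.0)):
    """Linearly map `value` from current_range to normed_range."""
    cur_min, cur_max = current_range
    new_min, new_max = normed_range
    unit = (value - cur_min) / (cur_max - cur_min)  # position within [0, 1]
    return unit * (new_max - new_min) + new_min


# Hypothetical sample with two features whose known ranges are [-5, 5] and [0, 100].
sample = [2.5, 40.0]
ranges = [(-5.0, 5.0), (0.0, 100.0)]
normed = [rescale(v, r) for v, r in zip(sample, ranges)]
print(normed)  # [0.75, 0.4]
```

Both features end up on the same [0, 1] scale even though their raw ranges differ by an order of magnitude.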

2. Per-sample normalization

If your features are supposed to have approximately the same range of values, per-sample normalization could also be considered, i.e. normalizing each sample using the moments (mean, variance) of its own features:

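def normalize_with_moments(x, axes=[0, 1], epsilon=1e-8):
    # keepdims=True keeps the reduced axes, so mean/variance broadcast against x for any choice of axes
    mean, variance = tf.nn.moments(x, axes=axes, keepdims=True)
    x_normed = (x - mean) / tf.sqrt(variance + epsilon) # epsilon to avoid dividing by zero
    return x_normed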

def parse_example(line_batch):
    # ...
    features = tf.transpose(features)
    features = normalize_with_moments(features)
    # ...

3. Batch normalization

You could apply the same procedure over a complete batch instead of per-sample, which may make the process more stable:

data_batch = normalize_with_moments(data_batch, axes=[1, 2])

Similarly, you could use tf.nn.batch_normalization, which applies the same formula given a mean and variance (e.g. those returned by tf.nn.moments).
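
A minimal sketch of the tf.nn.batch_normalization route, assuming a 2-D batch of shape [batch_size, num_features] and no learned offset or scale. The function name normalize_batch and the choice of axis 0 as the batch axis are my assumptions, not something from the question.

```python
import tensorflow as tf


def normalize_batch(data_batch, epsilon=1e-8):
    # Moments over the batch axis (axis 0): one mean/variance per feature.
    mean, variance = tf.nn.moments(data_batch, axes=[0], keepdims=True)
    # offset=None and scale=None: plain standardization, nothing is learned.
    return tf.nn.batch_normalization(
        data_batch, mean, variance,
        offset=None, scale=None, variance_epsilon=epsilon)


# Hypothetical batch of 2 samples with 2 features on very different scales.
batch = tf.constant([[1.0, 10.0], [3.0, 30.0]])
print(normalize_batch(batch))
```

After this, each feature column has roughly zero mean and unit variance within the batch.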

4. Dataset normalization

Normalizing using the mean/variance computed over the whole dataset would be the trickiest, since as you mentioned it is a large, split one.

tf.data.Dataset isn't really meant for such global computation. A solution would be to use whatever tools you have to pre-compute the dataset moments, then use this information for your TF pre-processing.
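
One way to pre-compute the moments of a large, split dataset without loading it all at once is to compute (count, mean, M2) per shard and merge the results. The sketch below does this in plain Python for a single scalar feature, using Welford's update per shard and the standard pairwise merge formula. The function names and the toy shards are made up for illustration; in practice you would read each shard from your own files.

```python
def shard_moments(values):
    """Return (count, mean, M2) for one shard using Welford's algorithm."""
    count, mean, m2 = 0, 0.0, 0.0
    for v in values:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)  # running sum of squared deviations
    return count, mean, m2


def merge_moments(a, b):
    """Combine two (count, mean, M2) triples into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


# Hypothetical shards of one feature; each could be processed independently.
shards = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
total = (0, 0.0, 0.0)
for shard in shards:
    total = merge_moments(total, shard_moments(shard))

count, mean, m2 = total
variance = m2 / count  # population variance, the same convention as tf.nn.moments
print(mean, variance)
```

The resulting mean and variance can then be hard-coded or loaded as constants in your TF pre-processing, e.g. by applying (x - mean) / sqrt(variance + epsilon) inside parse_example().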


As mentioned by @MiniQuark, TensorFlow has a Transform library (tf.Transform) that you could use to preprocess your data. Have a look at its Get Started guide, or for instance at the tft.scale_to_z_score() method, which standardizes a feature using the mean and variance computed over the whole dataset.


Expanding on benjaminplanche's answer for "#4 Dataset normalization", there is actually a pretty easy way to accomplish this.

TensorFlow's Keras provides a preprocessing Normalization layer. As this is a layer, it is intended to be used within the model. However, you don't have to use it that way (more on that later).

The model usage is simple:

inputs = tf.keras.Input(shape=dataset.element_spec.shape)
# In TF 2.6+ the same layer is also available as tf.keras.layers.Normalization.
norm = tf.keras.layers.experimental.preprocessing.Normalization()
norm.adapt(dataset) # you can use dataset.take(N) if N samples are enough to estimate the mean & variance.
layer1 = norm(inputs)
...

The advantage of using it in the model is that the normalization mean & variance are saved as part of the model weights. So when you load the saved model, it'll use the same values it was trained with.

As mentioned earlier, if you don't want to use Keras models, you don't have to use the layer as part of one. If you'd rather use it in your dataset pipeline, you can do that too.

norm = tf.keras.layers.experimental.preprocessing.Normalization()
norm.adapt(dataset)
dataset = dataset.map(lambda t: norm(t))

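The disadvantage is that you now need to save and restore those weights manually (norm.get_weights() and norm.set_weights()). NumPy has convenient savez() and load() functions you can use here; savez() stores each weight array separately, which avoids packing weights of different shapes into a single ragged array.

import numpy as np
np.savez("norm_weights.npz", *norm.get_weights())  # stored as arr_0, arr_1, ...
with np.load("norm_weights.npz") as data:
    norm.set_weights([data[f"arr_{i}"] for i in range(len(data.files))])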

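After defining inputs, you can apply a LayerNormalization layer to them. Note that it normalizes each sample across its last axis rather than using statistics computed over the dataset: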

import tensorflow as tf
inputs = tf.keras.layers.LayerNormalization(
    axis=-1,
    center=True,
    scale=True,
    trainable=True,
    name='input_normalized',
)(inputs)

I inferred this from the TensorFlow API documentation (which has been updated since the answers above).
