Meaning of buffer_size in Dataset.map, Dataset.prefetch and Dataset.shuffle

Code

import tensorflow as tf  # TensorFlow 1.x API (tf.Session, initializable iterators)


def shuffle():
    ds = list(range(0, 1000))
    dataset = tf.data.Dataset.from_tensor_slices(ds)
    dataset = dataset.shuffle(buffer_size=500)
    dataset = dataset.batch(batch_size=1)
    iterator = dataset.make_initializable_iterator()
    next_element = iterator.get_next()
    init_op = iterator.initializer
    with tf.Session() as sess:
        sess.run(init_op)
        # Print the first 100 single-element batches of the shuffled dataset.
        for i in range(100):
            print(sess.run(next_element), end='')


shuffle()

Output

[298][326][2][351][92][398][72][134][404][378][238][131][369][324][35][182][441][370][372][144][77][11][199][65][346][418][493][343][444][470][222][83][61][81][366][49][295][399][177][507][288][524][401][386][89][371][181][489][172][159][195][232][160][352][495][241][435][127][268][429][382][479][519][116][395][165][233][37][486][553][111][525][170][571][215][530][47][291][558][21][245][514][103][45][545][219][468][338][392][54][139][339][448][471][589][321][223][311][234][314]


TL;DR Despite their similar names, these arguments have quite different meanings. The buffer_size in Dataset.shuffle() can affect the randomness of your dataset, and hence the order in which elements are produced. The buffer_size in Dataset.prefetch() only affects the time it takes to produce the next element.


The buffer_size argument in tf.data.Dataset.prefetch() and the output_buffer_size argument in tf.contrib.data.Dataset.map() provide a way to tune the performance of your input pipeline: both arguments tell TensorFlow to create a buffer of at most buffer_size elements, and a background thread to fill that buffer. (Note that we removed the output_buffer_size argument from Dataset.map() when it moved from tf.contrib.data to tf.data. New code should use Dataset.prefetch() after map() to get the same behavior.)

Adding a prefetch buffer can improve performance by overlapping the preprocessing of data with downstream computation. Typically it is most useful to add a small prefetch buffer (with perhaps just a single element) at the very end of the pipeline, but more complex pipelines can benefit from additional prefetching, especially when the time to produce a single element can vary.
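To make the idea concrete, here is a minimal pure-Python sketch of a bounded buffer that a background thread keeps filled. This is only a conceptual model and not how tf.data is implemented; the prefetch() helper and the _DONE sentinel are names I made up for the illustration, and error handling in the producer is left out.

```python
import queue
import threading

_DONE = object()  # hypothetical sentinel marking the end of the input


def prefetch(iterable, buffer_size=1):
    """Yield items from `iterable`, preparing up to `buffer_size` of them ahead of time."""
    buffer = queue.Queue(maxsize=buffer_size)  # holds at most buffer_size elements

    def fill():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(_DONE)

    # The producer runs in the background while the consumer does its own work.
    threading.Thread(target=fill, daemon=True).start()
    while True:
        item = buffer.get()  # blocks only if the producer has fallen behind
        if item is _DONE:
            return
        yield item


# The elements and their order are unchanged; only the timing of production differs.
result = list(prefetch(range(5), buffer_size=2))
```

Note that the buffer size has no effect on which elements come out or in what order, which is the key difference from shuffle().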

By contrast, the buffer_size argument to tf.data.Dataset.shuffle() affects the randomness of the transformation. We designed the Dataset.shuffle() transformation (like the tf.train.shuffle_batch() function that it replaces) to handle datasets that are too large to fit in memory. Instead of shuffling the entire dataset, it maintains a buffer of buffer_size elements, and randomly selects the next element from that buffer (replacing it with the next input element, if one is available). Changing the value of buffer_size affects how uniform the shuffling is: if buffer_size is at least as large as the number of elements in the dataset, you get a uniform shuffle; if it is 1 then you get no shuffling at all. For very large datasets, a typical "good enough" approach is to randomly shard the data into multiple files once before training, then shuffle the filenames uniformly, and then use a smaller shuffle buffer. However, the appropriate choice will depend on the exact nature of your training job.
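The buffer behaviour described above can be modelled in a few lines of plain Python. This is a rough sketch of the described algorithm, not TensorFlow's actual code; buffered_shuffle() and its seed parameter are hypothetical names used only for this illustration.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield `items` in an order produced by a fixed-size shuffle buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # pick a random buffered element...
        buffer[idx] = item       # ...and replace it with the next input element
    rng.shuffle(buffer)          # input exhausted: drain what is left in random order
    yield from buffer


no_shuffle = list(buffered_shuffle(range(10), buffer_size=1))
full_shuffle = list(buffered_shuffle(range(10), buffer_size=10, seed=0))
first_100 = list(buffered_shuffle(range(1000), buffer_size=500, seed=0))[:100]
```

With buffer_size=1 the input order comes back unchanged, and with a buffer covering the whole input the result is an ordinary permutation. In the 1000-element case with buffer_size=500, the k-th output (counting from 0) can never exceed 499 + k, because later elements have not entered the buffer yet. That is why the first 100 values printed in the question all stay below 600.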



Importance of buffer_size in shuffle()

I wanted to follow up on the previous answer from @mrry to stress the importance of buffer_size in tf.data.Dataset.shuffle().

Having a low buffer_size will not just give you inferior shuffling in some cases: it can mess up your whole training.


A practical example: cat classifier

Suppose for instance that you are training a cat classifier on images, and your data is organized in the following way (with 10000 images in each category):

train/
    cat/
        filename_00001.jpg
        filename_00002.jpg
        ...
    not_cat/
        filename_10001.jpg
        filename_10002.jpg
        ...

A common way to feed data with tf.data is to build a list of filenames and a list of corresponding labels, and use tf.data.Dataset.from_tensor_slices() to create the dataset:

filenames = ["filename_00001.jpg", "filename_00002.jpg", ..., 
             "filename_10001.jpg", "filename_10002.jpg", ...]
labels = [1, 1, ..., 0, 0...]  # 1 for cat, 0 for not_cat

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(buffer_size=1000)  # 1000 should be enough right?
dataset = dataset.map(...)  # transform to images, preprocess, repeat, batch...

The big issue with the code above is that the dataset will not actually be shuffled properly. For about the first half of an epoch we will see only cat images, and for the second half only non-cat images. This will hurt training a lot.
At the beginning of training, the dataset takes the first 1000 filenames, puts them in its buffer, and picks one at random among them. Since the first 1000 images are all images of cats, we will only pick cat images at the beginning.
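You can check this without TensorFlow by simulating the buffer on the labels alone. The sketch below is a simplified pure-Python model of a shuffle buffer (simulate_shuffle() is a hypothetical helper, not a tf.data function), run on the sorted labels from the example.

```python
import random


def simulate_shuffle(labels, buffer_size, seed=0):
    """Rough model of a fixed-size shuffle buffer; returns the emitted labels."""
    rng = random.Random(seed)
    buffer, output = [], []
    for label in labels:
        if len(buffer) < buffer_size:
            buffer.append(label)          # initial fill of the buffer
            continue
        idx = rng.randrange(buffer_size)
        output.append(buffer[idx])        # emit a random buffered label
        buffer[idx] = label               # replace it with the incoming one
    rng.shuffle(buffer)                   # drain the remainder at the end
    return output + buffer


labels = [1] * 10000 + [0] * 10000        # 1 for cat, 0 for not_cat, sorted as above
epoch = simulate_shuffle(labels, buffer_size=1000)

first_not_cat = epoch.index(0)            # position of the first non-cat example
cats_in_first_half = sum(epoch[:10000]) / 10000
print(first_not_cat, cats_in_first_half)
```

In this model the first non-cat label cannot show up before position 9001, since the first not_cat filename only enters the buffer once about 9000 cat images have already been emitted.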

The fix here is to make sure that buffer_size is at least 20000 (the total number of images), or to shuffle the filenames and labels in advance (using the same permutation for both, obviously).

Since storing all the filenames and labels in memory is not an issue, we can actually use buffer_size = len(filenames) to make sure that everything will be shuffled together. Make sure to call tf.data.Dataset.shuffle() before applying the heavy transformations (like reading the images, processing them, batching...).

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(buffer_size=len(filenames)) 
dataset = dataset.map(...)  # transform to images, preprocess, repeat, batch...
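For the other option, shuffling the filenames and labels in advance, the important part is to apply one permutation to both lists so that each filename keeps its label. A minimal sketch, using short made-up lists in place of the real ones:

```python
import random

# Hypothetical stand-in lists; the real ones would come from the directory listing.
filenames = ["filename_%05d.jpg" % i for i in range(1, 7)]
labels = [1, 1, 1, 0, 0, 0]  # 1 for cat, 0 for not_cat

# Zip the two lists so that a single shuffle moves each pair together.
pairs = list(zip(filenames, labels))
random.Random(0).shuffle(pairs)
shuffled_filenames, shuffled_labels = map(list, zip(*pairs))
```

Keep in mind that this fixes the order once. If you want a different order at every epoch, you still need a shuffle() in the pipeline.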

The takeaway is to always double check what the shuffling will do. A good way to catch these errors might be to plot the class distribution of batches over time (make sure that batches contain about the same distribution as the training set, half cat and half non-cat in our example).
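As a starting point for that check, here is a small sketch that computes the fraction of cat labels in each batch, given the labels in the order the pipeline produced them. The helper name and the batch size of 100 are my own choices for the illustration; the resulting list is what you would plot.

```python
import random


def cat_fraction_per_batch(labels, batch_size):
    """Return the fraction of cat labels (value 1) in each consecutive full batch."""
    return [
        sum(labels[i:i + batch_size]) / batch_size
        for i in range(0, len(labels) - batch_size + 1, batch_size)
    ]


sorted_labels = [1] * 10000 + [0] * 10000   # unshuffled order from the example
shuffled_labels = sorted_labels[:]
random.Random(0).shuffle(shuffled_labels)   # stand-in for a well-shuffled epoch

bad = cat_fraction_per_batch(sorted_labels, batch_size=100)
good = cat_fraction_per_batch(shuffled_labels, batch_size=100)
```

For the unshuffled labels every batch is either all cat or all non-cat, whereas a well-shuffled epoch should give values scattered around 0.5.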