PyTorch DataLoader extremely slow first epoch

Slavka,

TLDR: This is a caching effect.

I did not download the whole GLR2020 dataset, but I was able to observe this effect on an image dataset that I had locally (80,000 JPG images of approximately 400x400 pixels).

To find the reasons for the difference in performance I tried the following:

  1. reducing the augmentation to just resizing
  2. testing just the ImgDataset.__getitem__() function (see the timing sketch after this list)
  3. ImgDataset.__getitem__() without augmentation
  4. just loading the raw JPG image and passing it from the dataset without even NumPy conversion.
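
A minimal way to time step 2 in isolation - dataset below stands in for an ImgDataset instance (not shown in the question), so treat this as a sketch:

import time

def time_getitem(dataset, n=100):
    # Index the dataset directly (this calls __getitem__),
    # bypassing the DataLoader, to isolate per-sample loading cost.
    t0 = time.time()
    for i in range(n):
        _ = dataset[i]
    return (time.time() - t0) / n  # average seconds per sample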

It turns out that the difference comes from the image loading time. Python (or the OS itself) implements some kind of caching, which can be observed when loading an image multiple times in the following test:

import time
import cv2

filename = "img.jpg"  # placeholder: any local JPG

for i in range(5):
    t0 = time.time()
    data = cv2.imread(filename)
    print(time.time() - t0)
    
0.03395271301269531
0.0010004043579101562
0.0010004043579101562
0.0010008811950683594
0.001001119613647461

The same is observed when just reading the file into a variable:

for i in range(5):
    t0 = time.time()
    with open(filename, mode='rb') as file:
        data = file.read()
    print(time.time() - t0)

0.036234378814697266
0.0028831958770751953
0.0020024776458740234
0.0031833648681640625
0.0028734207153320312

One way to reduce the first-epoch loading time is to keep the data on a very fast local SSD. If size allows, try loading part of the dataset into RAM and writing a custom dataloader to feed from there; a sketch of that follows.
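
A minimal sketch of that idea, assuming a flat folder of JPGs whose decoded contents fit in RAM (InMemoryImgDataset and the folder layout are made up for illustration):

import glob
import cv2
from torch.utils.data import Dataset

class InMemoryImgDataset(Dataset):
    # Hypothetical dataset that pays the disk-read cost once, up front.
    def __init__(self, folder, transform=None):
        self.transform = transform
        self.images = [cv2.imread(p) for p in sorted(glob.glob(folder + "/*.jpg"))]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        if self.transform is not None:
            img = self.transform(img)
        return img

With everything decoded up front, the first and later epochs read only from memory; the trade-off is RAM usage and a slow construction step.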

BTW, based on my findings this effect should be reproducible with any dataset - check whether you used different drives across runs, or whether caching explains the difference.


It appears that the OS is caching I/O access to the dataset. To check whether this is definitely the problem, try running sync; echo 3 > /proc/sys/vm/drop_caches (on Ubuntu) after the first epoch, as shown below. If the second epoch is then equally slow, it is the caching which is making the subsequent reads so much faster.
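
Writing to /proc/sys/vm/drop_caches requires root, so the redirect has to run inside a root shell:

sudo sync
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'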

If you are using an HDD, then you may get significant speed improvements for your first epoch by co-locating all of your small image files on disk.

You can use SquashFS (it comes pre-installed with Ubuntu) to compress your whole dataset into a single file, then mount that file as a directory and access it just as before (except now the images are co-located on disk). The mounted directory is read-only.

e.g.

mksquashfs /path/to/data data.sqsh
mkdir /path/to/data_sqsh
sudo mount data.sqsh /path/to/data_sqsh -t squashfs -o loop

Then you can use /path/to/data_sqsh in precisely the same way you used /path/to/data. You will have to re-mount it after restarting your computer.
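
For example, assuming the usual one-folder-per-class layout, a torchvision ImageFolder can point straight at the mounted path (paths are the placeholders from above):

from torchvision import datasets, transforms

# The SquashFS mount behaves like any other (read-only) directory.
dataset = datasets.ImageFolder(
    "/path/to/data_sqsh",
    transform=transforms.Resize((400, 400)),
)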

See: https://tldp.org/HOWTO/SquashFS-HOWTO/creatingandusing.html