How does the PyTorch DataLoader handle variable-size data?

This is the way I do it:

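import torch

# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_fn_padd(batch):
    '''
    Pads a batch of variable-length sequences.

    note: it converts things to tensors manually here since the ToTensor transform
    assumes it takes in images rather than arbitrary tensors.
    '''
    ## get sequence lengths (size of the first dimension of each sample)
    lengths = torch.tensor([ t.shape[0] for t in batch ]).to(device)
    ## pad (torch.Tensor casts to float32; result has shape (max_len, batch, ...))
    batch = [ torch.Tensor(t).to(device) for t in batch ]
    batch = torch.nn.utils.rnn.pad_sequence(batch)
    ## compute mask (caveat: this also masks out genuine zeros in the data)
    mask = (batch != 0).to(device)
    return batch, lengths, mask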

Then I pass that to the DataLoader class as its collate_fn.
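For illustration, here is a minimal sketch of wiring such a function into a DataLoader. It assumes a toy dataset made of a plain list of 1-D tensors; the names `data` and `pad_collate` are made up for the example, and everything stays on the CPU.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical toy dataset: three sequences of lengths 3, 1 and 2.
data = [
    torch.tensor([1.0, 2.0, 3.0]),
    torch.tensor([4.0]),
    torch.tensor([5.0, 6.0]),
]

def pad_collate(batch):
    # `batch` is a list of samples drawn from the dataset.
    # Record the true lengths before padding.
    lengths = torch.tensor([t.shape[0] for t in batch])
    # batch_first=True gives shape (batch, max_len) instead of (max_len, batch).
    padded = pad_sequence(batch, batch_first=True, padding_value=0.0)
    return padded, lengths

loader = DataLoader(data, batch_size=3, shuffle=False, collate_fn=pad_collate)
padded, lengths = next(iter(loader))
print(padded.shape)  # shape of the padded batch
print(lengths)       # original sequence lengths
```

Each batch is padded only up to the longest sequence in that batch, so different batches can have different widths.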


There seems to be a giant list of different posts about this on the PyTorch forum, so let me link to all of them. They all have their own answers and discussions. It doesn't seem to me that there is one "standard way to do it", but if there is one from an authoritative reference, please share.

It would be nice if the ideal answer mentioned

  • efficiency, e.g. whether to do the processing on the GPU with torch inside the collate function or with numpy

and things of that sort.

List:

  • https://discuss.pytorch.org/t/how-to-create-batches-of-a-list-of-varying-dimension-tensors/50773
  • https://discuss.pytorch.org/t/how-to-create-a-dataloader-with-variable-size-input/8278
  • https://discuss.pytorch.org/t/using-variable-sized-input-is-padding-required/18131
  • https://discuss.pytorch.org/t/dataloader-for-various-length-of-data/6418
  • https://discuss.pytorch.org/t/how-to-do-padding-based-on-lengths/24442

Bucketing:

  • https://discuss.pytorch.org/t/tensorflow-esque-bucket-by-sequence-length/41284


So how do you handle the fact that your samples are of different lengths? torch.utils.data.DataLoader has a collate_fn parameter, which is used to turn a list of samples into a batch. By default it stacks the samples into a single tensor, which only works when they all have the same shape. You can write your own collate_fn that, for instance, zero-pads the inputs, truncates them to some predefined length, or applies any other operation of your choice.
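As a rough sketch of the padding and truncation options mentioned above (one possible approach, not a standard recipe): the function below truncates each sample to a cap, zero-pads to the longest remaining sample, and derives the mask from the lengths rather than from the values, so genuine zeros in the data are not mistaken for padding. The names `truncate_and_pad`, `MAX_LEN` and `samples` are hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

MAX_LEN = 4  # hypothetical cap on sequence length

def truncate_and_pad(batch, max_len=MAX_LEN):
    # Truncate every sample along its first dimension.
    batch = [torch.as_tensor(t)[:max_len] for t in batch]
    # Lengths after truncation.
    lengths = torch.tensor([t.shape[0] for t in batch])
    # Zero-pad to the longest sample in this batch: (batch, longest).
    padded = pad_sequence(batch, batch_first=True)
    # Position j of row i is real data iff j < lengths[i].
    positions = torch.arange(padded.shape[1])
    mask = positions[None, :] < lengths[:, None]
    return padded, lengths, mask

# Toy input: lengths 6, 2 and 3; the last sample consists of real zeros.
samples = [torch.arange(6), torch.arange(2), torch.zeros(3, dtype=torch.long)]
padded, lengths, mask = truncate_and_pad(samples)
print(padded)
print(lengths)
print(mask)
```

It would be passed to the loader as `DataLoader(dataset, collate_fn=truncate_and_pad)`. The sketch keeps everything on the CPU; when loading with several worker processes it is usually simpler to move the finished batch to the GPU in the training loop than inside the collate function.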