How can torch multiply two 10000×10000 matrices in almost zero time? Why does the measured time change so much, from 349 ms down to 999 µs?

There is already a discussion about this on Discuss PyTorch: Measuring GPU tensor operation speed.

I'd like to highlight two comments from that thread:

  • From @apaszke:

[...] the GPU executes all operations asynchronously, so you need to insert proper barriers for your benchmarks to be correct

  • From @ngimel:

I believe cublas handles are allocated lazily now, which means that first operation requiring cublas will have an overhead of creating cublas handle, and that includes some internal allocations. So there’s no way to avoid it other than calling some function requiring cublas before the timing loop.

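Following @ngimel's point, a common workaround is a warm-up call: issue one throwaway cuBLAS operation before the timing loop so the one-time handle creation is paid up front. A minimal sketch (the sizes match the examples below):

import torch

x = torch.randn(10000, 10000, device="cuda")
w = torch.randn(10000, 10000, device="cuda")

# warm-up: the first cuBLAS call creates the handle and does some
# internal allocations, so keep it outside the timed region
_ = x.mm(w.t())
torch.cuda.synchronize()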

Basically, you have to call torch.cuda.synchronize() to get a proper measurement:

import torch

x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
# ensure that context initialization finishes before you start measuring time
torch.cuda.synchronize()

%time y = x.mm(w.t()); torch.cuda.synchronize()

CPU times: user 288 ms, sys: 191 ms, total: 479 ms

Wall time: 492 ms

Running the same measurement a second time gives a consistent result:

x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
# ensure that context initialization finishes before you start measuring time
torch.cuda.synchronize()

%time y = x.mm(w.t()); torch.cuda.synchronize()

CPU times: user 237 ms, sys: 231 ms, total: 468 ms

Wall time: 469 ms


Docs say:

torch.cuda.synchronize()

Waits for all kernels in all streams on a CUDA device to complete.

In effect, this tells Python: stop and wait until the operation has fully finished.

Otherwise, %time returns immediately after issuing the command, before the GPU work has actually completed; that is why the naive measurement reports an almost-zero time.
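
To see this in isolation (a minimal sketch, assuming the x and w tensors from above in an IPython/Jupyter session):

%time y = x.mm(w.t())  # returns almost immediately: only the kernel launch is timed
# the GPU is still multiplying at this point, so the reported wall time is misleadingly small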

This would be the correct way to benchmark. Note that torch.cuda.synchronize() appears twice: the first call waits for the tensors to finish moving to CUDA, and the second waits until the command completes on the GPU.

import torch

x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
torch.cuda.synchronize()

%timeit -n 10 y = x.matmul(w.t()); torch.cuda.synchronize()  # 10 loops, best of 3: 531 ms per loop
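
An alternative is to measure time on the GPU itself with CUDA events. A sketch using the same x and w as above (start and end are just local names):

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = x.matmul(w.t())
end.record()

# block until both events have been recorded on the device
torch.cuda.synchronize()
print(start.elapsed_time(end))  # elapsed GPU time in milliseconds

Because the events are recorded in the CUDA stream, the measurement brackets only the kernel itself and excludes Python-side launch overhead.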