Sharing GPU memory between processes on the same GPU with PyTorch

The GPU itself has many threads. When performing an array/tensor operation, it uses each thread on one or more cells of the array. This is why it seems that an op that can fully utilize the GPU should scale efficiently without multiple processes -- a single GPU kernel is already massively parallelized.

In a comment you mentioned seeing better results with multiple processes in a small benchmark. I'd suggest running the benchmark with more jobs to ensure warmup; ten kernels seems like too small a test. If you find that a thorough, representative benchmark consistently runs faster, though, I'll trust good benchmarks over my intuition.
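
For what it's worth, here is a minimal sketch of the kind of benchmark I have in mind, with untimed warmup runs before the measured ones. The helper name `benchmark` and its parameters are my own invention, not part of any library. For GPU work you would pass `torch.cuda.synchronize` as `sync`, because CUDA kernel launches are asynchronous and the timer would otherwise mostly measure launch overhead.

```python
import statistics
import time


def benchmark(fn, warmup=10, iters=100, sync=None):
    """Time fn() over `iters` runs after `warmup` untimed runs.

    `sync` is an optional callable invoked before the clock is read; for
    CUDA work pass torch.cuda.synchronize so queued kernels have finished.
    Returns (median_seconds, list_of_all_timings).
    """
    def wait():
        if sync is not None:
            sync()

    # Warmup: gives lazy initialization and allocator caches time to settle.
    for _ in range(warmup):
        fn()
    wait()

    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        wait()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings


if __name__ == "__main__":
    # Hypothetical CPU stand-in workload; replace it with your model call.
    median, timings = benchmark(lambda: sum(range(10_000)), warmup=5, iters=50)
    print(f"median {median * 1e6:.1f} us over {len(timings)} runs")
```

Reporting the median rather than the mean keeps a few slow outliers from dominating the result. To compare one process against several, you would run the same harness in each setup and compare total throughput, not per-call latency.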

My understanding is that kernels launched on the default CUDA stream get executed sequentially. If you want them to run in parallel, I think you'd need multiple streams. Looking in the PyTorch code, I see code like getCurrentCUDAStream() in the kernels, which makes me think the GPU will still run any PyTorch code from all processes sequentially.
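
As a rough sketch of what using multiple streams looks like within a single process, assuming all tensors live on the current CUDA device: the function name `matmul_on_streams` is hypothetical, while `torch.cuda.Stream`, `torch.cuda.stream`, `torch.cuda.current_stream`, and `wait_stream` are the PyTorch calls I'm relying on. Separate streams only permit overlap. Whether the kernels actually run concurrently depends on whether one kernel leaves GPU resources free for another.

```python
import torch


def matmul_on_streams(pairs):
    """Multiply each (a, b) pair, using one CUDA stream per pair on GPU.

    Falls back to plain sequential execution for CPU tensors.
    """
    if not pairs or not pairs[0][0].is_cuda:
        return [a @ b for a, b in pairs]

    current = torch.cuda.current_stream()
    streams = [torch.cuda.Stream() for _ in pairs]
    results = []
    for (a, b), s in zip(pairs, streams):
        # The side stream must wait for whatever produced a and b.
        s.wait_stream(current)
        with torch.cuda.stream(s):
            # Kernels launched in this block are queued on stream s.
            results.append(a @ b)
    # Make later work on the current stream wait for every side stream.
    for s in streams:
        current.wait_stream(s)
    return results


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    pairs = [(torch.randn(512, 512, device=device),
              torch.randn(512, 512, device=device)) for _ in range(4)]
    outputs = matmul_on_streams(pairs)
    print(len(outputs), outputs[0].shape)
```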

This NVIDIA discussion suggests this is correct:

Newer GPUs may be able to run multiple kernels in parallel (using MPS?), but it seems like this is implemented with time slicing under the hood anyway, so I'm not sure we should expect higher total throughput:

How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?

If you do need to share memory from one model across two parallel inference calls, can you just use multiple threads instead of processes, and refer to the same model from both threads?
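
A minimal sketch of that idea, assuming a small stand-in model (`torch.nn.Linear` here is only a placeholder for yours, and `infer` is a hypothetical helper name). Both threads call the same module object, so only one copy of the weights lives in GPU memory. It falls back to CPU when no GPU is available.

```python
import threading

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model; a single copy of the weights is placed on the device.
model = torch.nn.Linear(16, 4).to(device).eval()


def infer(batch, out, idx):
    # Grad mode is thread-local, so disable it inside each thread.
    with torch.no_grad():
        out[idx] = model(batch.to(device)).cpu()


batches = [torch.randn(8, 16) for _ in range(2)]
outputs = [None] * len(batches)
threads = [
    threading.Thread(target=infer, args=(batch, outputs, i))
    for i, batch in enumerate(batches)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print([tuple(o.shape) for o in outputs])
```

This shares memory, but by itself it does not make the kernels overlap: as far as I can tell, both threads still enqueue work on the same default stream unless each thread selects its own.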

To actually get the GPU to run multiple kernels in parallel, you may be able to use nn.Parallel in PyTorch. See the discussion here: