Is it possible to access the hard disk directly from the GPU?

Trying to use this feature, I wrote a small example on Windows x64 to implement it. In the example, the kernel "directly" accesses disk space. Actually, as @RobertCrovella mentioned previously, the operating system does the real work (probably with some CPU involvement), but no supplemental coding is needed.

__global__ void kernel(int4* ptr)
{
    int4 val ;
    val.x = threadIdx.x ; val.y = blockDim.x ; val.z = blockIdx.x ; val.w = gridDim.x ;

    // one write near the start of the mapped file...
    ptr[threadIdx.x + blockDim.x * blockIdx.x] = val ;
    // ...and one 2.5 GB into it (160M int4 elements of 16 bytes each)
    ptr[160 * 1024 * 1024 + threadIdx.x + blockDim.x * blockIdx.x] = val ;
}

#include "Windows.h"

int main()
{
    // 4GB - larger than installed GPU memory
    size_t size = 256 * 1024 * 1024 * sizeof(int4) ;

    // create a 4GB backing file on disk and map a view of it into the process address space
    HANDLE hFile = ::CreateFileA ("GPU.dump", (GENERIC_READ | GENERIC_WRITE), 0, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL) ;
    if (INVALID_HANDLE_VALUE == hFile)
    {
        printf ("could not create file\n") ;
        return 1 ;
    }

    HANDLE hFileMapping = ::CreateFileMapping (hFile, 0, PAGE_READWRITE, (DWORD)(size >> 32), (DWORD)size, 0) ;
    if (NULL == hFileMapping)
    {
        printf ("could not create file mapping\n") ;
        return 1 ;
    }

    void* ptr = ::MapViewOfFile (hFileMapping, FILE_MAP_ALL_ACCESS, 0, 0, size) ;
    if (NULL == ptr)
    {
        printf ("could not map view of file\n") ;
        return 1 ;
    }

    // allow page-locked host memory to be mapped into the device address space
    ::cudaSetDeviceFlags (cudaDeviceMapHost) ;

    // pin the file-backed view and make it accessible from the GPU
    cudaError_t er = ::cudaHostRegister (ptr, size, cudaHostRegisterMapped) ;
    if (cudaSuccess != er)
    {
        printf ("could not register\n") ;
        return 1 ;
    }

    void* d_ptr ;
    er = ::cudaHostGetDevicePointer (&d_ptr, ptr, 0) ;
    if (cudaSuccess != er)
    {
        printf ("could not get device pointer\n") ;
        return 1 ;
    }

    // 65536 threads, each writing two int4 values through the mapped view
    kernel<<<256,256>>> ((int4*)d_ptr) ;

    if (cudaSuccess != ::cudaDeviceSynchronize())
    {
        printf ("error in kernel\n") ;
        return 1 ;
    }

    if (cudaSuccess != ::cudaHostUnregister (ptr))
    {
        printf ("could not unregister\n") ;
        return 1 ;
    }

    ::UnmapViewOfFile (ptr) ;

    ::CloseHandle (hFileMapping) ;
    ::CloseHandle (hFile) ; 

    ::cudaDeviceReset() ;

    printf ("DONE\n");

    return 0 ;
}

For anyone else looking for this, 'lazy unpinning' did more or less what I wanted.

Go through the following to see whether it can be helpful for you.

The most straightforward implementation using RDMA for GPUDirect would pin memory before each transfer and unpin it right after the transfer is complete. Unfortunately, this would perform poorly in general, as pinning and unpinning memory are expensive operations. The rest of the steps required to perform an RDMA transfer, however, can be performed quickly without entering the kernel (the DMA list can be cached and replayed using MMIO registers/command lists).

Hence, lazily unpinning memory is key to a high-performance RDMA implementation. What it implies is keeping the memory pinned even after the transfer has finished. This takes advantage of the fact that the same memory region is likely to be used for future DMA transfers, so lazy unpinning saves pin/unpin operations.

An example implementation of lazy unpinning would keep a set of pinned memory regions and only unpin some of them (for example the least recently used one) if the total size of the regions reached some threshold, or if pinning a new region failed because of BAR space exhaustion (see PCI BAR sizes).
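As a rough sketch of that caching idea, the fragment below keeps an LRU list of pinned regions and only evicts when a size budget is exceeded. It uses cudaHostRegister/cudaHostUnregister as stand-ins for the driver-level pin/unpin calls an actual GPUDirect RDMA implementation would use, and the PinCache class and its budget parameter are my own invention for illustration:

#include <cstddef>
#include <list>
#include <cuda_runtime.h>

// hypothetical LRU cache of pinned host-memory regions:
// pin on first use, keep pinned for reuse, evict the least recently used region when over budget
class PinCache
{
    struct Region { void* ptr ; size_t size ; } ;
    std::list<Region> lru ;      // front = most recently used
    size_t pinned ;
    size_t budget ;

public:
    explicit PinCache (size_t maxPinned) : pinned (0), budget (maxPinned) {}

    // returns true if [ptr, ptr+size) is pinned, either from the cache or freshly registered
    bool acquire (void* ptr, size_t size)
    {
        for (std::list<Region>::iterator it = lru.begin() ; it != lru.end() ; ++it)
        {
            if (it->ptr == ptr && it->size == size)
            {
                lru.splice (lru.begin(), lru, it) ;   // cache hit: move to front
                return true ;
            }
        }

        // evict least recently used regions until the new one fits in the budget
        while (!lru.empty() && pinned + size > budget)
        {
            Region victim = lru.back() ; lru.pop_back() ;
            ::cudaHostUnregister (victim.ptr) ;
            pinned -= victim.size ;
        }

        if (cudaSuccess != ::cudaHostRegister (ptr, size, cudaHostRegisterMapped))
            return false ;

        Region r ; r.ptr = ptr ; r.size = size ;
        lru.push_front (r) ;
        pinned += size ;
        return true ;
    }
} ;

The same structure applies whether the expensive operation is cudaHostRegister, as above, or pinning GPU pages for RDMA inside a kernel driver; only the pin/unpin calls change.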

Here is a link to an application guide and to the NVIDIA docs.


The real solution is on the horizon!

Early access: https://developer.nvidia.com/gpudirect-storage

GPUDirect® Storage (GDS) is the newest addition to the GPUDirect family. GDS enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. This direct path increases system bandwidth and decreases the latency and utilization load on the CPU.
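GDS is exposed through the cuFile API (Linux only as of this writing). A minimal, unverified sketch of reading from storage straight into GPU memory might look like the following; error handling is omitted and the file name is just a placeholder:

#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main()
{
    const size_t size = 1024 * 1024 ;                 // arbitrary 1 MB read

    cuFileDriverOpen() ;                              // initialize the GDS driver

    // O_DIRECT is required so reads bypass the page cache
    int fd = open ("GPU.dump", O_RDONLY | O_DIRECT) ;

    CUfileDescr_t descr = {} ;
    descr.handle.fd = fd ;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD ;

    CUfileHandle_t handle ;
    cuFileHandleRegister (&handle, &descr) ;

    void* d_ptr ;
    cudaMalloc (&d_ptr, size) ;
    cuFileBufRegister (d_ptr, size, 0) ;              // optional: pre-register the GPU buffer

    // DMA from file offset 0 directly into GPU memory (no CPU bounce buffer)
    ssize_t n = cuFileRead (handle, d_ptr, size, 0, 0) ;

    cuFileBufDeregister (d_ptr) ;
    cudaFree (d_ptr) ;
    cuFileHandleDeregister (handle) ;
    close (fd) ;
    cuFileDriverClose() ;

    return (n == (ssize_t)size) ? 0 : 1 ;
}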