1D Min-convolution in CUDA

An alternative which might be useful for large a and b would be to use one block per output entry in c. Using a whole block per output allows for memory coalescing, which matters in what is a memory-bandwidth-limited operation, and a fairly efficient shared-memory reduction can combine the per-thread partial results into a final per-block result. Probably the best strategy is to launch as many blocks per multiprocessor as will run concurrently and have each block emit multiple output points. This eliminates some of the scheduling overhead associated with launching and retiring many blocks with relatively low total instruction counts.

An example of how this might be done:

#include <math.h>

template<int bsz>
__global__ __launch_bounds__(512)
void minconv(const float *a, int sizea, const float *b, int sizeb, float *c)
{
    __shared__ volatile float buff[bsz];

    // One block per output entry; each block strides on to further entries.
    for(int i = blockIdx.x; i < (sizea + sizeb); i += gridDim.x) {

        // Each thread computes a partial minimum over a strided slice of a.
        float cval = INFINITY;
        for(int j = threadIdx.x; j < sizea; j += blockDim.x) {
            int t = i - j;
            if ((t >= 0) && (t < sizeb))
                cval = min(cval, a[j] + b[t]);
        }
        buff[threadIdx.x] = cval;
        __syncthreads();

        // Shared memory tree reduction of the per-thread partial minima.
        if (bsz > 256) {
            if (threadIdx.x < 256)
                buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+256]);
            __syncthreads();
        }
        if (bsz > 128) {
            if (threadIdx.x < 128)
                buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+128]);
            __syncthreads();
        }
        if (bsz > 64) {
            if (threadIdx.x < 64)
                buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+64]);
            __syncthreads();
        }

        // Final warp-synchronous stage: relies on the volatile declaration and
        // implicit warp synchronicity (valid on pre-Volta GPUs such as Fermi).
        if (threadIdx.x < 32) {
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+32]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+16]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+8]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+4]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+2]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+1]);
            if (threadIdx.x == 0) c[i] = buff[0];
        }

        // Ensure every thread is done with buff before the next output entry.
        __syncthreads();
    }
}

// Instances for all valid block sizes.
template __global__ void minconv<64>(const float *, int, const float *, int, float *);
template __global__ void minconv<128>(const float *, int, const float *, int, float *);
template __global__ void minconv<256>(const float *, int, const float *, int, float *);
template __global__ void minconv<512>(const float *, int, const float *, int, float *);
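
As a rough illustration of how this might be launched from the host (a sketch only; the helper name, the device pointer names and the block count of 112 are assumptions, the latter corresponding to roughly 8 resident blocks on each of the 14 SMs of a Fermi-class part such as the Tesla C2050 mentioned later):

// Hypothetical host-side launch: the block size must match the template
// parameter, and enough blocks are launched to keep every SM busy; the
// block-stride loop in the kernel then covers all output entries.
void launch_minconv(const float *deviceA, int sizea,
                    const float *deviceB, int sizeb, float *deviceC)
{
    const int blockSize = 256;   // == the bsz template parameter
    const int numBlocks = 112;   // e.g. ~8 concurrent blocks on each of 14 SMs
    minconv<256><<<numBlocks, blockSize>>>(deviceA, sizea, deviceB, sizeb, deviceC);
}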

[disclaimer: not tested or benchmarked, use at own risk]

This is single-precision floating point, but the same idea should work for double precision. For integer types, you would need to replace the C99 INFINITY macro with something like INT_MAX or LONG_MAX, but otherwise the principle remains the same.


A faster version:

__global__ void convAgB(double *a, double *b, double *c, int sa, int sb)
{
    // One thread per output element; c must be pre-initialised (e.g. to a
    // large sentinel value) because it supplies the starting minimum.
    int i = (threadIdx.x + blockIdx.x * blockDim.x);
    int idT = threadIdx.x;
    int out, j;

    __shared__ double c_local[512];

    c_local[idT] = c[i];

    // Tighten the loop bounds so that i-j stays inside b and j does not run past a.
    out = (i > sa) ? sa : i + 1;
    j   = (i > sb) ? i - sb + 1 : 1;

    for(; j < out; j++)
    {
       if(c_local[idT] > a[j] + b[i-j])
          c_local[idT] = a[j] + b[i-j];
    }

    c[i] = c_local[idT];
}
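
This version has no stride loop, so the launch configuration has to cover the whole output array with one thread per element; a hypothetical launch helper (helper and pointer names are assumptions) might look like:

// Hypothetical launch helper: blockDim.x must be 512 to match the __shared__
// tile above, and the block count is rounded up to cover all sa + sb outputs
// (strictly, the kernel would then also need an i < sa + sb bounds check).
void launch_convAgB(double *d_a, double *d_b, double *d_c, int sa, int sb)
{
    int sc = sa + sb;
    convAgB<<<(sc + 511) / 512, 512>>>(d_a, d_b, d_c, sa, sb);
}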

**Benchmark:**
Size A  Size B  Size C  Time (s)
1000    1000    2000    0.0008
10k     10k     20k     0.0051
100k    100k    200k    0.3436
1M      1M      1M      43.327

Old version: for sizes between 1000 and 100000, I tested with this naive version:

__global__ void convAgB(double *a, double *b, double *c, int sa, int sb)
{
    int size = sa + sb;

    int idT = (threadIdx.x + blockIdx.x * blockDim.x);
    int out, j;

    // Grid-stride loop, so any launch configuration covers the whole output.
    for(int i = idT; i < size; i += blockDim.x * gridDim.x)
    {
        if(i > sa) out = sa;
        else out = i + 1;

        if(i > sb) j = i - sb + 1;
        else j = 1;

        for(; j < out; j++)
        {
            if(c[i] > a[j] + b[i-j])
                c[i] = a[j] + b[i-j];
        }
    }
}

I populated the arrays a and b with some random double numbers and c with 999999 (just for testing). I validated the c array (on the CPU) using your function (without any modifications).
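
For what it's worth, a minimal sketch of such a test harness (the sizes and the 512-thread launch configuration are assumptions, it targets the grid-stride convAgB above, and error checking and timing are omitted):

// Hypothetical test harness: random a and b, c initialised to the 999999
// sentinel, copies to the device, a launch of the grid-stride convAgB above,
// and a copy back for validation against a CPU reference.
#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    const int sa = 10000, sb = 10000, sc = sa + sb;
    double *a = new double[sa], *b = new double[sb], *c = new double[sc];
    for (int i = 0; i < sa; i++) a[i] = rand() / (double)RAND_MAX;
    for (int i = 0; i < sb; i++) b[i] = rand() / (double)RAND_MAX;
    for (int i = 0; i < sc; i++) c[i] = 999999.0;

    double *da, *db, *dc;
    cudaMalloc(&da, sa * sizeof(double));
    cudaMalloc(&db, sb * sizeof(double));
    cudaMalloc(&dc, sc * sizeof(double));
    cudaMemcpy(da, a, sa * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, sb * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dc, c, sc * sizeof(double), cudaMemcpyHostToDevice);

    convAgB<<<(sc + 511) / 512, 512>>>(da, db, dc, sa, sb);  // assumed launch config

    cudaMemcpy(c, dc, sc * sizeof(double), cudaMemcpyDeviceToHost);
    // ... compare c against the CPU result here ...

    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] a; delete[] b; delete[] c;
    return 0;
}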

I also moved the conditionals out of the inner loop, so they are only evaluated once per output element.

I am not 100% sure, but I think the following modification makes sense. Since you had i - j >= 0, which is the same as i >= j, as soon as j > i the code will never enter this block 'X' again (because j only increases):

if(c[i] > a[j] + b[i-j])
   c[i] = a[j] + b[i-j];

So I precomputed the loop bound in the variable out: if i > sa, the loop finishes when j reaches sa; if i <= sa, it finishes (earlier) at j == i + 1 because of the condition i >= j.

The other condition, i - j < size(b), means that when i > sb the block 'X' will not start executing until j reaches i - sb + 1 (j always starts at 1, so the earlier iterations do nothing). So we can start j directly at that value:

if(i > sb) j = i - sb + 1;
else j = 1;
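
For completeness, a hypothetical CPU reference with the same tightened bounds (it keeps the conventions above: j starts at 1 and c holds the running minimum) could be used for the validation mentioned earlier:

// Hypothetical CPU reference with the same tightened loop bounds, usable to
// validate the GPU result. As above, j starts at 1 and c must be
// pre-initialised to a large sentinel value.
void minconv_cpu(const double *a, const double *b, double *c, int sa, int sb)
{
    for (int i = 0; i < sa + sb; i++)
    {
        int out = (i > sa) ? sa : i + 1;
        int j   = (i > sb) ? i - sb + 1 : 1;
        for (; j < out; j++)
            if (c[i] > a[j] + b[i-j])
                c[i] = a[j] + b[i-j];
    }
}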

See if you can test this version with real arrays of data, and give me feedback. Also, any improvements are welcome.

EDIT: Another optimization can be applied, but this one does not make much of a difference.

if(c[i] > a[j] + b[i-j])
    c[i] = a[j] + b[i-j];

we can eliminate the if, by:

double add;
...

 for(; j < out; j++)
 {
   add = a[j] + b[i-j];
   c[i] = (c[i] < add) * c[i] + (add <= c[i]) * add;
 }

Having:

if(a > b) c = b; 
else c = a; 

is the same as having c = (a < b) * a + (b <= a) * b.

If a > b, then c = 0 * a + 1 * b => c = b; if a <= b, then c = 1 * a + 0 * b => c = a.
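
As an aside, fmin() (a standard CUDA device function for doubles) expresses the same predicated minimum more directly, and the compiler will typically emit a branchless instruction for it; the inner loop of the kernel above could hypothetically be written as:

// Hypothetical alternative to the arithmetic trick: fmin() performs the same
// predicated minimum (c_local, idT, j and out are the names from the kernel above).
for (; j < out; j++)
    c_local[idT] = fmin(c_local[idT], a[j] + b[i-j]);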

**Benchmark:**
Size A  Size B  Size C  Time (s)
1000    1000    2000    0.0013
10k     10k     20k     0.0051
100k    100k    200k    0.4436
1M      1M      1M      47.327

I am measuring the time of copying from CPU to GPU, running the kernel and copying from GPU to CPU.

**GPU specifications:**
Device                       Tesla C2050
CUDA Capability Major/Minor  2.0
Global Memory                2687 MB
Cores                        448 CUDA Cores
Warp size                    32

I have used your algorithm. I think it will help you.

const int Length = 1000;

__global__ void OneD(float *Ad, float *Bd, float *Cd){
    int i = blockIdx.x;    // block index selects the output element
    int j = threadIdx.x;   // thread index selects the element of Ad
    Cd[i] = 99999.99;
    for(int k = 0; k < Length/500; k++){
        // Overwrite Cd whenever a smaller candidate value is found.
        while(((i-j) >= 0) && ((i-j) < Length) && Cd[i+k*Length] > Ad[j+k*Length] + Bd[i-j]){
            Cd[i+k*Length] = Ad[j+k*Length] + Bd[i-j];
        }
    }
}

I have used 500 threads per block and 500 blocks per grid. As the number of threads per block on my device is restricted to 512, I used 500 threads. I have taken the size of all the arrays as Length (= 1000).
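
A hypothetical host-side launch matching that configuration (Ad, Bd and Cd are assumed to be device pointers that have already been filled in):

// Hypothetical launch with the configuration described above:
// 500 blocks per grid, 500 threads per block.
OneD<<<500, 500>>>(Ad, Bd, Cd);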

Working:

  1. i stores the block index and j stores the thread index.

  2. The for loop is used because the number of threads is less than the size of the arrays.

  3. The while loop is used to update Cd[n].

  4. I have not used shared memory because I have used lots of blocks and threads, so the amount of shared memory required for each block is low.

PS: If your device supports more threads and blocks, replace k < Length/500 with k < Length/(supported number of threads).