Parallel Reduction

Basically, it is performing the operation shown in the picture below:

[Image: diagram of the reduction step]

This code means that only half as many threads are needed: each thread reads two values from global memory, adds them, and writes the sum to shared memory, as shown in the picture.

Suppose you launch a kernel to reduce some values and let only half of the running threads execute the code above. For example, with 4 blocks of 512 threads each and an input array g_idata[4*512], only the first two blocks execute it:

unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;  

sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];

So:

thread 0 of block 0 reads positions 0 and 512,
thread 1 of block 0 reads positions 1 and 513,
thread 511 of block 0 reads positions 511 and 1023,
thread 0 of block 1 reads positions 1024 and 1536,
thread 511 of block 1 reads positions 1535 and 2047.

The blockDim.x*2 is used because each thread accesses positions i and i+blockDim.x, so each block consumes 2*blockDim.x elements of g_idata. You need to multiply by 2 to guarantee that the threads of the next block do not read positions of g_idata that the previous block has already processed.
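
To see the indexing in context, here is a minimal, self-contained sketch of a kernel that uses it. Some parts are assumptions that are not in the lines quoted above: the kernel name `reduce_first_add`, the host helper `run_reduction`, the `int` element type, and the shared-memory tree reduction that follows the load (the usual continuation of this pattern). Error checking is omitted for brevity, and the input length is assumed to be exactly 2 * threads * blocks.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread adds two input elements while loading into shared memory,
// then the block reduces its shared-memory tile to one partial sum.
__global__ void reduce_first_add(const int* g_idata, int* g_odata) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // First add during load: positions i and i + blockDim.x.
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    // Assumed continuation: tree reduction (blockDim.x must be a power of two).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: copies the input to the device, launches the
// kernel, and returns one partial sum per block.
// Requires h_in.size() == 2 * threads * blocks.
std::vector<int> run_reduction(const std::vector<int>& h_in, int threads, int blocks) {
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(int), cudaMemcpyHostToDevice);

    // Dynamic shared memory: one int per thread.
    reduce_first_add<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out);

    std::vector<int> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main() {
    const int threads = 512;
    const int blocks = 2;                      // half of the 4 blocks in the example
    std::vector<int> h_in(4 * 512, 1);         // g_idata[4*512], filled with ones
    assert(h_in.size() == static_cast<size_t>(2 * threads * blocks));

    std::vector<int> partial = run_reduction(h_in, threads, blocks);
    for (int b = 0; b < blocks; ++b)
        std::printf("block %d partial sum: %d\n", b, partial[b]);
    return 0;
}
```

With an input of all ones, each block covers 1024 distinct elements, so each partial sum should equal 1024. Replacing blockDim.x*2 with blockDim.x in the index expression would make block 1 start at position 512, which thread 0 of block 0 has already read.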