CUDA Dot Product

Getting a reduction right using ad hoc CUDA code can be tricky, so here's an alternative solution using a Thrust algorithm, which is included with the CUDA Toolkit:

#include <thrust/inner_product.h>
#include <thrust/device_ptr.h>

double do_dot_product(int n, double *a, double *b)
{
  // wrap raw pointers to device memory with device_ptr
  thrust::device_ptr<double> d_a(a), d_b(b);

  // inner_product implements a mathematical dot product
  return thrust::inner_product(d_a, d_a + n, d_b, 0.0);
}

You are using the cuda_atomicAdd function incorrectly. This section of your kernel:

if (cacheIndex==0) {
    *dot_res=cuda_atomicAdd(dot_res,cache[0]);
}

is the culprit. Here, you atomically add to dot_res. then non atomically set dot_res with the result it returns. The return result from this function is the previous value of the location being atomically updated, and it supplied for "information" or local use of the caller only. You don't assign it to what you are atomically updated, that completely defeats the purpose of using atomic memory access in the first place. Do something like this instead:

if (cacheIndex==0) {
    double result=cuda_atomicAdd(dot_res,cache[0]);
}