How good is the OpenCV GPU library for matrix operations?

I find ArrayFire to be much faster and have started using it instead of the GPU kernels in OpenCV for image processing. Here are some benchmarks I found comparing ArrayFire (formerly available through a different interface called LibJacket) to OpenCV. My own benchmarking agrees: ArrayFire is 2-4X faster than the GPU functions in OpenCV. From what I hear, NVIDIA didn't write the GPU kernels in OpenCV but contracted them out to someone, which may be why they are so slow. Since I'm only using one GPU, I can use ArrayFire for free.

Update, given the new MATLAB code posted by @Alex: I ran the benchmark of this code on my system. On my machine, the Parallel Computing Toolbox gpuArray is slower than the CPU, but Jacket and ArrayFire kick butt. HW specs are:

Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
NVIDIA Tesla M2090

Results of CPU vs GPU using Parallel Computing Toolbox gpuArray (fully warmed up). CPU is faster than PCT gpuArray:

>> tic; sqEuclideanDist(gpuArray(rand(1581,3)),gpuArray(rand(189,3))); toc;
Elapsed time is 0.006859 seconds.
>> tic; sqEuclideanDist(rand(1581,3),rand(189,3)); toc;
Elapsed time is 0.005712 seconds.

Results of CPU vs GPU using Jacket (fully warmed up). Jacket beats PCT gpuArray by 3.7X and beats the CPU by 3X:

>> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
Elapsed time is 0.001876 seconds.

Here is the modified code that lets you run all that easily:

function K = sqEuclideanDist(P,Q)
% Vectorized method to compute pairwise squared Euclidean distance on GPU
% Returns K(i,j) = (P(i,:) - Q(j,:))*(P(i,:) - Q(j,:))'
% (P(i,:) and Q(j,:) are row vectors, so row * row' gives the scalar distance)

nP = size(P, 1);  % number of points in P
nQ = size(Q, 1);  % number of points in Q

pmag = sum(P .* P, 2);
qmag = sum(Q .* Q, 2);

K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P*Q';

end

Jacket does support BSXFUN on the GPU, and it does improve the speeds somewhat:

>> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
Elapsed time is 0.001420 seconds.

Note that the sizes used here are pretty small, so most CUDA code that attempts to run on these small sizes is likely to perform poorly. That's why I like to use AccelerEyes' stuff, because those guys have optimized the heck out of the GPU, unlike PCT gpuArray, Thrust, and OpenCV, each of which I've tried in the past.
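With runs this short (a millisecond or less), a single cold measurement is mostly noise, which is why the numbers in this answer are taken warmed up and the C++ version is averaged over many iterations. Here is a minimal sketch of that pattern in plain C++. The helper name `meanSecondsPerCall` and its parameters are mine, not part of any library, and it only measures wall-clock time on the host. With an asynchronous GPU library the callable (or the caller) still has to synchronize with the device, as the `af::sync()` calls do in the ArrayFire code further down.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not from any library): runs fn() `warmup` times
// untimed, then returns the mean wall-clock seconds per call over `iters`
// timed calls. Host-side timing only; fn must block until its work is done.
template <typename Fn>
double meanSecondsPerCall(Fn&& fn, std::size_t warmup, std::size_t iters)
{
    assert(iters > 0);

    // Warm-up calls absorb one-time costs (allocation, caching, JIT).
    for (std::size_t i = 0; i < warmup; ++i) {
        fn();
    }

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) {
        fn();
    }
    const auto stop = std::chrono::steady_clock::now();

    // Convert the elapsed interval to seconds and average over the timed calls.
    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iters);
}
```

You would call it as something like `meanSecondsPerCall([&] { work(); }, 10, 1000)`, where `work` stands for whatever you are benchmarking.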

Here are the ArrayFire Free C++ results:

Time:  0.0003577 seconds
Speedups:  19.2X faster than PCT gpuArray, 16X faster than the CPU, 5.2X faster
than Jacket in MATLAB original version, 4X faster than Jacket in MATLAB using
BSXFUN

Here is the ArrayFire code I wrote for this:

#include <arrayfire.h>
#include <cstdio>

using namespace af;

static array SqEuclideanDist(array P, array Q)
{
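    // ArrayFire dimensions are 0-based, so dim 1 sums across the columns
    // (one value per row); * on arrays is element-wise, like .* in MATLAB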
    array pmag = sum(P * P, 1);
    array qmag = sum(Q * Q, 1);

    int np = P.dims(0);
    int nq = Q.dims(0);

    array K = tile(qmag.T(), np, 1) + tile(pmag, 1, nq) - 2 * matmul(P, Q.T());
    return K;
}

int main(int argc, char **argv)
{
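    // Value-initialize the host buffers (note the trailing parentheses) so the
    // copy to the GPU reads defined data; the values don't affect the timing
    double *P_cpu = new double[1581 * 3]();
    double *Q_cpu = new double[189 * 3]();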

    array P = array(1581, 3, P_cpu);
    array Q = array(189 , 3, Q_cpu);
    af::sync();

    int iter = 1000;

    timer::tic();
    for (int i = 0; i < iter; i++) {
        array K = SqEuclideanDist(P, Q);
        af::eval(K);
    }

    af::sync();
    printf("Time taken: %2.4lfms\n", (1000 * timer::toc()) / iter);

    delete[] P_cpu;
    delete[] Q_cpu;
}
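If you want to check any of the versions above against a ground truth, here is a plain C++ sketch (no GPU library) that computes the distances both directly and through the expansion |p|^2 + |q|^2 - 2 p.q that the vectorized code relies on. The function names and the row-major layout are my own choices for illustration. ArrayFire stores arrays column-major, so you would need to account for the layout before comparing against its output.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference helpers. Storage is row-major: P[i * d + k].

// Direct definition: K(i,j) = sum_k (P(i,k) - Q(j,k))^2
std::vector<double> sqDistDirect(const std::vector<double>& P, std::size_t nP,
                                 const std::vector<double>& Q, std::size_t nQ,
                                 std::size_t d)
{
    assert(P.size() == nP * d && Q.size() == nQ * d);
    std::vector<double> K(nP * nQ, 0.0);
    for (std::size_t i = 0; i < nP; ++i) {
        for (std::size_t j = 0; j < nQ; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < d; ++k) {
                const double diff = P[i * d + k] - Q[j * d + k];
                acc += diff * diff;
            }
            K[i * nQ + j] = acc;
        }
    }
    return K;
}

// Expanded form: K(i,j) = |P(i,:)|^2 + |Q(j,:)|^2 - 2 * dot(P(i,:), Q(j,:))
std::vector<double> sqDistExpanded(const std::vector<double>& P, std::size_t nP,
                                   const std::vector<double>& Q, std::size_t nQ,
                                   std::size_t d)
{
    assert(P.size() == nP * d && Q.size() == nQ * d);

    // Squared magnitude of every row, computed once (pmag / qmag above).
    std::vector<double> pmag(nP, 0.0), qmag(nQ, 0.0);
    for (std::size_t i = 0; i < nP; ++i)
        for (std::size_t k = 0; k < d; ++k)
            pmag[i] += P[i * d + k] * P[i * d + k];
    for (std::size_t j = 0; j < nQ; ++j)
        for (std::size_t k = 0; k < d; ++k)
            qmag[j] += Q[j * d + k] * Q[j * d + k];

    std::vector<double> K(nP * nQ, 0.0);
    for (std::size_t i = 0; i < nP; ++i) {
        for (std::size_t j = 0; j < nQ; ++j) {
            double dot = 0.0;
            for (std::size_t k = 0; k < d; ++k)
                dot += P[i * d + k] * Q[j * d + k];
            K[i * nQ + j] = pmag[i] + qmag[j] - 2.0 * dot;
        }
    }
    return K;
}

// Largest absolute element-wise difference between two equally sized results.
double maxAbsDiff(const std::vector<double>& A, const std::vector<double>& B)
{
    assert(A.size() == B.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < A.size(); ++i) {
        const double diff = std::fabs(A[i] - B[i]);
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

The expanded form can lose a little precision to cancellation when two points are nearly identical, so compare with a small tolerance rather than exact equality.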