CUDA image processing

This is an extended comment on code style, rather than on your problem. Unfortunately I don't have a CUDA-capable card, so I can't test your actual issue.

Generally speaking, however, you will want to try and get away from procedural loops when writing Mathematica code:

  • Alternatives to procedural loops and iterating over lists in Mathematica
  • Why should I avoid the For loop in Mathematica?

For instance in your code, instead of:

Do[imagevar[count++] = Import[image], {image, images}];

you could use:

imagevar = Import/@images;

Note that this makes imagevar a list, rather than an indexed variable, but that will actually be helpful in this case, since you can carry out "vector" operations on lists quickly and cleanly. See e.g. Map, which is used above in its "shortcut" (infix) form /@.

Similarly, now that imagevar is a list, your code

For[i = 1, i < Length[images], i++, CUDAImageConvolve[imagevar[i], {{-1, -2, 3}}]]

can be converted to the following:

CUDAImageConvolve[#, {{-1, -2, 3}}]& /@ imagevar;

If you want more information on using pure functions (i.e. the # and & expressions), take a look at How to | Work with Pure Functions.

Timing might also be best done using AbsoluteTiming; see this section of the Date and Time Functions Tutorial. For instance, the following will directly return the time it took to carry out one instance of that operation:

AbsoluteTiming[CUDAImageConvolve[#, {{-1, -2, 3}}]& /@ imagevar;]

In this context, you might also be interested in RepeatedTiming for more accurate timing results.

Similarly with the last bit, to give you timing of the non-CUDA operation:

AbsoluteTiming[ImageConvolve[#, {{-1, -2, 3}}]& /@ imagevar;]

CUDAImageConvolve is slow due to memory tranfert between the different memories (CPU and GPU). One can decompose the action of CUDAImageConvolve in 6 phases and get de timing of each phase.

Here are the results, compared with the CPU ImageConvolve and the straight CUDAImageConvolve :

enter image description here

One can see that the convolution alone takes only 16 mS.

This information can be usefull if you have a sequence of operations to do without leaving the GPU memory.

Here is the code :

Needs["CUDALink`"]
CUDAQ[]

imagevar = ExampleData /@ {{"TestImage","House2"},
{"TestImage","Lena"},{"TestImage","Mandrill"}}


AbsoluteTiming[ImageConvolve[#, {{-1, -2, 3}}]& /@ imagevar]
AbsoluteTiming[CUDAImageConvolve[#, {{-1, -2, 3}}]& /@ imagevar]

enter image description here

Initialization :

AbsoluteTiming[memList=CUDAMemoryLoad[ImageData[#]]& /@ imagevar;]
AbsoluteTiming[CUDAMemoryCopyToDevice /@ memList;]
(*CUDAMemoryInformation[memList[[1]]]*)
AbsoluteTiming[outMemList=CUDAMemoryAllocate["Double",Append[ImageDimensions[#],3]]& /@ imagevar;]

{0.0467458, Null}
{0.0143745, Null}
{0.000671206, Null}

convolution :

AbsoluteTiming[(CUDAImageConvolve[#[[1]], {{-1, -2, 3}},"OutputMemory"-> #[[2]]])& /@ Transpose[{memList,outMemList}];]  

{0.0151647, Null}

restitution :

AbsoluteTiming[Image[CUDAMemoryGet[#]]& /@ outMemList]
AbsoluteTiming[CUDAMemoryUnload /@ Flatten[{memList,outMemList}];]

enter image description here

NOTE :

CUDAMemoryLoad[ImageData[#]]& /@ imagevaris used instead of CUDAMemoryLoad /@ imagevar because CUDAMemoryLoad[image] create a 2D array of {Byte,Byte,Byte} even if the image is a 2D array of {Float,Float,Float}. The consequence is that the convolution creates a overflow on the Bytes, and the resulting image is then :

enter image description here