Confused about GPU having hundreds of processors inside it

CPUs are SISD, GPUs are SIMD.

SISD is an acronym for Single Instruction, Single Data. CPUs are good at performing sequential operations: take this, do that, move it there, take another one, add them together, write to a device, read the response, and so on. They mostly execute simple operations that take one or two values and return one value.

SIMD is Single Instruction, Multiple Data: the same operation is executed on multiple datasets simultaneously. For example, take 128 values [X1…X128], take 128 values [Y1…Y128], multiply the corresponding values in pairs, and return 128 results. A SISD processor would have to execute 128 instructions (plus memory reads/writes) because it can only multiply two numbers at a time. A SIMD processor does this in a few steps, or even in one if all 128 numbers fit in its registers.
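
To make that concrete, here is a minimal CUDA sketch of the same idea (the kernel name and launch shape are mine, purely for illustration): every thread executes the same multiply instruction, each on its own pair of values.

```
// One thread per element: the "single instruction" is the multiply,
// the "multiple data" is each thread's own pair of values.
__global__ void multiply(const float* x, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = x[i] * y[i];
}

// For the 128-value example: multiply<<<1, 128>>>(d_x, d_y, d_out, 128);
```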

SISD CPUs work well for everyday computing because it is mostly sequential, but some tasks require crunching large amounts of data in a similar way - for example processing graphics, rendering video, cracking passwords, mining bitcoins, etc. GPUs allow for massive parallelization of computing, provided that all the data can be processed in the same way.

Okay, that's pure theory. In the real world, regular CPUs offer some SIMD instructions (SSE), so some multiple-data work can be done more efficiently on a regular CPU. At the same time, not all ALUs in a GPU have to work on the same thing, because they are grouped into batches (see Mokubai's answer). So CPUs are not purely SISD and GPUs are not purely SIMD.
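
To give a flavour of CPU-side SIMD, here is a tiny sketch using SSE intrinsics (plain host-side code, nothing GPU-specific; the function name is made up):

```
#include <immintrin.h>

// Multiply four float pairs with a single SSE instruction
// instead of four scalar multiplies.
void multiply4(const float* x, const float* y, float* out) {
    __m128 a = _mm_loadu_ps(x);            // load 4 floats
    __m128 b = _mm_loadu_ps(y);            // load 4 floats
    _mm_storeu_ps(out, _mm_mul_ps(a, b));  // 4 multiplies at once
}
```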

When is using the GPU for computation beneficial? When your calculations are really, really massively parallelizable. You have to consider that writing input into the GPU's memory takes time, and reading the results back takes time too. You get the biggest performance boost when you can build a processing pipeline that does a lot of calculations before leaving the GPU.
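
Schematically, the pipeline idea looks like this (a sketch, not a complete program; step1 through step3 are hypothetical kernels standing in for real processing stages): pay for the transfers once, then keep the data on the GPU.

```
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // slow: host -> GPU

step1<<<blocks, threads>>>(d_in,  d_tmp);  // all the heavy work happens here,
step2<<<blocks, threads>>>(d_tmp, d_tmp);  // with no host transfers in between
step3<<<blocks, threads>>>(d_tmp, d_out);

cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // slow: GPU -> host
```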


A modern graphics processor is a highly complex device and can have thousands of processing cores. The Nvidia GTX 970, for example, has 1664 cores. These cores are grouped into batches that work together.

For an Nvidia card, the cores are grouped in batches of 16 or 32, depending on the underlying architecture (Kepler or Fermi), and each core in a batch runs the same task.

The distinction between a batch and a core is an important one, though: while each core in a batch must run the same task, its dataset can be separate.
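
You can see that constraint in CUDA code: the threads in a batch share one instruction stream, so a data-dependent branch makes the batch step through both paths, with the non-matching threads masked off. A sketch, with a made-up kernel:

```
__global__ void branchy(const float* x, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] > 0.0f)
            out[i] = x[i] * 2.0f;  // threads failing the test sit idle here...
        else
            out[i] = 0.0f;         // ...while the passing threads sit idle here
    }
}
```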

Your central processing unit is large and has only a few cores because it is a highly generalised processor capable of large-scale decision making and flow control. The graphics card eschews a large amount of the control and switching logic in favour of the ability to run a massive number of tasks in parallel.

If you insist on having a picture to prove it, then the image below (from the GTX 660 Ti Direct CU II TOP review) shows 5 green areas which are largely similar and would contain several hundred cores each, for a total of 1344 active cores split across what looks to me to be 15 functional blocks:

[die shot of the GTX 660 Ti GPU]

Looking closely, each block appears to have 4 sets of control logic on the side, suggesting that each of the 15 larger blocks you can see contains 4 SMX units.

This gives us 15*4 processing blocks (60) with 32 cores each, for a grand total of 1920 cores; batches of them are disabled, either because they malfunctioned or simply to segregate the chips into different performance groups. This gives us the correct number of active cores.

A good source of information on how the batches map together is on Stack Overflow: https://stackoverflow.com/questions/10460742/how-do-cuda-blocks-warps-threads-map-onto-cuda-cores


Graphical data is ideal for parallel processing. Divide a 1024x1024-pixel picture into blocks of 16x16 and let each core process one such small block. Stitch the results back together, and the outcome won't differ from having one processor work through those blocks one by one.
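
In CUDA terms, that maps naturally onto a 64x64 grid of 16x16 thread blocks, one block per tile (the per-pixel brighten step here is made up, just to have something to compute):

```
// One 16x16 thread block handles one 16x16 tile of the 1024x1024 image.
__global__ void brighten(unsigned char* img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = min(img[y * width + x] + 10, 255);
}

// dim3 block(16, 16);            // one tile
// dim3 grid(1024/16, 1024/16);   // 64x64 = 4096 independent tiles
// brighten<<<grid, block>>>(d_img, 1024, 1024);
```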

The condition for this to work is that the results of one core don't influence the results of the other cores, and vice versa. Something like this could work for an Excel sheet as well, where the cells in column C add up the values of columns A and B: C1=A1+B1, C2=A2+B2, and rows 1 and 2 are independent of each other.

Graphics data processing is a highly specific task, and you can design a processor specifically for this kind of task - one which can be used for other tasks as well, like mining bitcoins. And apparently you can make a processing unit more efficient by using many cores next to each other instead of one big processor. More efficient means not only faster, but also that if you only need 20% of the processing cores, you can shut down the rest, which is energy efficient.

Disclaimer: the example above may not be technically correct; it's meant to show the principle. Actual data processing will be a lot more complex, I guess.