Why not make one big CPU core?

The problem lies with the assumption that CPU manufacturers can just add more transistors to make a single CPU core more powerful without consequence.

To make a CPU do more, you have to plan what doing more entails. There are really three options:

  1. Make the core run at a higher clock frequency - The trouble with this is that we are already hitting the limits of what we can do.

    Power usage, and hence thermal dissipation, increases with frequency: dynamic power scales roughly as P ≈ C × V² × f, so if you double the frequency you nominally double the power dissipation, and if you increase the voltage your power dissipation goes up with the square of the voltage.

    Interconnects and transistors also have propagation delays due to the non-ideal nature of the world. You can't just increase the number of transistors and expect to be able to run at the same clock frequency.

    We are also limited by external hardware - mainly RAM. To make the CPU faster, you have to increase the memory bandwidth by either running the RAM faster or widening the data bus.


  2. Add more complex instructions - Instead of running faster, we can add a richer instruction set - common tasks like encryption can be hardened into the silicon. Rather than taking many clock cycles to compute in software, we instead have hardware acceleration.

    This is already being done on Complex Instruction Set Computer (CISC) processors. See extensions like SSE2 and SSE3. A single CPU core today is far, far more powerful than a CPU core from even 10 years ago, even when run at the same clock frequency.

    The trouble is, as you add more complicated instructions, you add more complexity and the chip gets bigger. As a direct result the CPU gets slower - the achievable clock frequencies drop as propagation delays rise.

    These complex instructions also don't help you with simple tasks. You can't harden every possible use case, so inevitably large parts of the software you are running will not benefit from new instructions, and in fact will be harmed by the resulting clock rate reduction.

    You can also make the data buses wider to process more data at once, but again this makes the CPU larger, and you hit a trade-off between the throughput gained through wider data buses and the drop in clock rate. If you only have small data (e.g. 32-bit integers), having a 256-bit CPU doesn't really help you.


  3. Make the CPU more parallel - Rather than trying to do one thing faster, do multiple things at the same time. If the task you are doing lends itself to operating on several things at once, then you want either a single CPU that can perform multiple calculations per instruction (Single Instruction, Multiple Data, or SIMD) or multiple CPUs that can each perform one calculation.

    This is one of the key drivers for multi-core CPUs. If you have multiple programs running, or can split your single program into multiple tasks, then having multiple CPU cores lets you do more things at once.

    Because the individual CPU cores are effectively separate blocks (barring caches and memory interfaces), each individual core is smaller than the equivalent single monolithic core. Because the core is more compact, propagation delays reduce, and you can run each core faster.

    As to whether a single program can benefit from having multiple cores, that is entirely down to what that program is doing, and how it was written.
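The voltage and frequency scaling described in option 1 can be sketched numerically. This is an illustrative back-of-the-envelope calculation, not a model of any real chip - the capacitance and voltage values below are made up:

```python
# Dynamic (switching) power of a CMOS chip scales roughly as
# P ~ C * V^2 * f, where C is the switched capacitance, V the supply
# voltage, and f the clock frequency. All numbers here are hypothetical.

def dynamic_power(c_farads, v_volts, f_hertz):
    """Approximate dynamic power using P = C * V^2 * f."""
    return c_farads * v_volts**2 * f_hertz

base     = dynamic_power(1e-9, 1.0, 3e9)  # a made-up 3 GHz core at 1.0 V
double_f = dynamic_power(1e-9, 1.0, 6e9)  # double the frequency
higher_v = dynamic_power(1e-9, 1.2, 3e9)  # raise the voltage by 20%

print(round(double_f / base, 2))  # 2.0  -> doubling f doubles power
print(round(higher_v / base, 2))  # 1.44 -> power grows with V squared
```

In practice the two effects compound, because running at a higher frequency usually also requires a higher supply voltage.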
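The parallel approach in option 3 amounts to splitting independent work into chunks. A minimal Python sketch (threads are used only to keep the example portable and show the decomposition; CPU-bound Python code would need separate processes, or native SIMD, to see a real speedup):

```python
# Each element-wise addition A[i] + B[i] depends on nothing else,
# so the work can be divided freely among workers.

from concurrent.futures import ThreadPoolExecutor

def add_chunk(args):
    a_chunk, b_chunk = args
    return [a + b for a, b in zip(a_chunk, b_chunk)]

def parallel_add(a, b, workers=4):
    # Split both inputs into `workers` roughly equal chunks.
    size = (len(a) + workers - 1) // workers
    chunks = [(a[i:i + size], b[i:i + size]) for i in range(0, len(a), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(add_chunk, chunks)
    # Reassemble the per-chunk results in order.
    return [x for chunk in results for x in chunk]

A = list(range(8))       # 0..7
B = list(range(8, 16))   # 8..15
print(parallel_add(A, B))  # [8, 10, 12, 14, 16, 18, 20, 22]
```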


In addition to the other answers, there is another element: chip yields. A modern processor contains several billion transistors, and each and every one of them has to work perfectly for the whole chip to function properly.

By making multi-core processors, you can cleanly partition the transistors into groups. If a defect exists in one of the cores, you can disable that core and sell the chip at a reduced price according to the number of functioning cores. Likewise, you can assemble systems out of validated components, as in an SMP system.

Virtually every CPU you buy started life as the top-end premium model for its processor line. What you end up with depends on which portions of that chip are working incorrectly and have been disabled. Intel doesn't make any i3 processors: they are all defective i7s, with all the features that separate the product lines disabled because they failed testing. However, the portions that still work remain useful, and the chip can be sold for much cheaper. Anything worse becomes keychain trinkets.

And defects are not uncommon. Perfectly creating those billions of transistors is not an easy task. If you have no opportunities to selectively use portions of a given chip, the price of the result is going to go up, real fast.

With just a single über processor, manufacturing is all or nothing, resulting in a much more wasteful process. For some devices, like image sensors for scientific or military purposes where you need a huge sensor that all has to work, the costs are so enormous that only state-level budgets can afford them.
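The binning economics above can be sketched with a toy probability model. Assuming, purely for illustration, that each of four cores independently works with probability p:

```python
# Toy yield model: the defect probability is invented for illustration
# and is not based on any real process data.

p = 0.8  # hypothetical probability that any one core is defect-free

# A monolithic chip of the same total area: all four "quarters" must work.
monolithic_yield = p ** 4

# A 4-core chip is sellable if at least 3 cores work
# (4 good cores, or exactly 3 good plus 1 disabled).
sellable_yield = p ** 4 + 4 * p ** 3 * (1 - p)

print(round(monolithic_yield, 3))  # 0.41  -> most monolithic chips are scrap
print(round(sellable_yield, 3))    # 0.819 -> binning roughly doubles sellable chips
```

The exact numbers are meaningless, but the shape of the result is the point: partitioning turns many "all or nothing" failures into discounted products.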


Data dependency

It's fairly easy to add more operations per clock by making a chip "wider" - this has been the SIMD approach. The problem is that this doesn't help most use cases.

There are roughly two types of workload: independent and dependent. An example of an independent workload might be "given two sequences of numbers A1, A2, A3, ... and B1, B2, B3, ..., calculate (A1+B1), (A2+B2), and so on." This kind of workload is seen in computer graphics, audio processing, machine learning, and so on. Quite a lot of it has been handed to GPUs, which are designed especially to handle it.

A dependent workload might be "Given A, add 5 to it and look that up in a table. Take the result and add 16 to it. Look that up in a different table."

The advantage of the independent workload is that it can be split into lots of different parts, so more transistors helps with that. For dependent workloads, this doesn't help at all - more transistors can only make it slower. If you have to get a value from memory, that's a disaster for speed. A signal has to be sent out across the motherboard, travelling sub-lightspeed, the DRAM has to charge up a row and wait for the result, then send it all the way back. This takes tens of nanoseconds. Then, having done a simple calculation, you have to send off for the next one.
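The contrast can be made concrete with the dependent workload described above. In this sketch (the lookup tables are invented for illustration), each step consumes the previous step's output, so no amount of extra hardware can overlap them:

```python
# A dependent chain: step 2 cannot begin until step 1 has finished,
# no matter how many cores or execution units are available.
# The table contents are made up for illustration.

table_a = {i: (i * 3) % 32 for i in range(64)}
table_b = {i: (i * 7) % 64 for i in range(64)}

def dependent_chain(a):
    x = table_a[a + 5]   # step 1: add 5, look up in the first table
    y = table_b[x + 16]  # step 2: needs x, so it must wait for step 1
    return y

print(dependent_chain(3))  # 24
```

If each lookup misses the cache and costs tens of nanoseconds of DRAM latency, that latency is paid serially, once per step.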

Power management

Spare cores are turned off most of the time. In fact, on quite a lot of processors, you can't run all the cores all of the time without the thing catching fire, so the system will turn them off or downclock them for you.

Rewriting the software is the only way forward

The hardware can't automatically convert dependent workloads into independent workloads. Neither can software. But a programmer who's prepared to redesign their system to take advantage of lots of cores just might.
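As a sketch of that kind of redesign: a running sum looks like a dependent chain, but because addition is associative a programmer can restructure it into independent per-chunk sums with one cheap combine step. The function names here are illustrative:

```python
# Restructuring a dependent accumulation into independent chunks.

def serial_sum(xs):
    total = 0
    for x in xs:     # each iteration depends on the previous total
        total += x
    return total

def chunked_sum(xs, workers=4):
    # Each chunk's partial sum depends on nothing outside its chunk,
    # so the partial sums could run on separate cores.
    size = (len(xs) + workers - 1) // workers
    partials = [serial_sum(xs[i:i + size])
                for i in range(0, len(xs), size)]
    return serial_sum(partials)  # small final combine step

data = list(range(100))
print(serial_sum(data), chunked_sum(data))  # 4950 4950
```

The hardware couldn't have discovered this transformation on its own; it relies on the programmer knowing that the order of additions doesn't matter.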
