How can an FPGA outperform a CPU?

CPUs are sequential processing devices. They break an algorithm up into a sequence of operations and execute them one at a time.

FPGAs are (or can be configured as) parallel processing devices. An entire algorithm might be executed in a single tick of the clock, or, worst case, in far fewer clock ticks than a sequential processor needs. One of the costs of that increased logic complexity is typically a lower maximum clock rate.

Bearing the above in mind, FPGAs can outperform CPUs at certain tasks because they can do the same task in fewer clock ticks, albeit at a lower overall clock rate. The gains that can be achieved are highly dependent on the algorithm, but at least an order of magnitude is not atypical for something like an FFT.

Further, because you can build multiple parallel execution units into an FPGA, if you have a large volume of data that you want to pass through the same algorithm, you can distribute the data across those parallel execution units and obtain orders of magnitude more throughput than even a multi-core CPU can achieve.

The price you pay for the advantages is power consumption and $$$'s.


Markt has this mostly right, but I'm going to throw in my 2 cents here:

Imagine that I told you that I wanted to write a program which reversed the order of bits inside of a 32-bit integer. Something like this:

unsigned int reverseBits(unsigned int input) {
    unsigned int output = 0;
    for(int i = 0; i < 32; i++) {
        output = output << 1;       // make room for the next output bit
        // Check if the lowest bit of the input is set
        if((input & 1) != 0) {
            output = output | 1;    // set the lowest bit to match in the output!
        }
        input = input >> 1;         // move on to the next input bit
    }
    return output;
}
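
As a quick sanity check, here's how the function above could be exercised (the test values are arbitrary; compile this in the same file as reverseBits):

#include <stdio.h>

int main(void) {
    // Bit 0 should land in bit 31 and vice versa; 0x12345678 mirrors to 0x1e6a2c48.
    printf("%08x\n", reverseBits(0x00000001u)); // prints 80000000
    printf("%08x\n", reverseBits(0x80000000u)); // prints 00000001
    printf("%08x\n", reverseBits(0x12345678u)); // prints 1e6a2c48
    return 0;
}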

Now my implementation is not elegant, but I'm sure you agree that some number of operations is involved, and probably some sort of loop. This means that the CPU spends many more than one cycle implementing this operation.

In an FPGA, you can simply wire this up as a pair of registers. You clock your data into one register, then wire it into a second register in reverse bit order. This means the operation completes in a single clock cycle on the FPGA. So in one cycle, the FPGA has finished an operation that took your general-purpose CPU many dozens of cycles to complete! In addition, you can instantiate a few hundred of these register pairs in parallel, so if you can stream a few hundred values onto the FPGA, it will reverse all of them in that same single FPGA clock cycle.

A general-purpose CPU can do many things, but the trade-off is that it only offers generalized, simple instructions, so some tasks necessarily expand into long lists of those instructions. I could give the general-purpose CPU an instruction like "reverse the bit order of a 32-bit register" and thereby give it the same capability as the FPGA we just built, but there is an infinite number of such possibly useful instructions, and so we only put in the ones that warrant the cost in popular CPUs.
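
As it happens, some CPU designers did decide this particular instruction was worth the cost: ARM cores (ARMv6T2 and later) include an RBIT instruction, and Clang exposes it through a builtin. Here's a minimal sketch, assuming a Clang toolchain (the function name is just for illustration); other compilers fall back to the plain loop:

#include <stdint.h>

uint32_t reverseBitsHW(uint32_t x) {
#if defined(__clang__)
    // Compiles to a single bit-reverse instruction on targets that have one (e.g. RBIT on ARM).
    return __builtin_bitreverse32(x);
#else
    // Portable fallback: the same kind of loop as before.
    uint32_t out = 0;
    for(int i = 0; i < 32; i++) {
        out = (out << 1) | (x & 1u);
        x = x >> 1;
    }
    return out;
#endif
}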

FPGAs, CPLDs, and ASICs all give you access to the raw hardware, which allows you to define crazy operations like "decrypt these AES256-encrypted bytes with this key" or "decode this frame of h.264 video". These have latencies of more than one clock cycle in an FPGA, but they can be implemented much more efficiently than by spelling the operation out as millions of general-purpose instructions. This also makes a fixed-purpose FPGA/ASIC implementation of such an operation more power-efficient, because it doesn't have to do as much extraneous work!

Parallelism is the other part which markt pointed out, and while that is important as well, the main thing is when an FPGA parallelizes something which was already expensive in the CPU in terms of cycles needed to perform the operation. Once you start saying "I can perform in 10 FPGA cycles a task which takes my CPU 100,000 cycles, and I can do this task in parallel 4 items at a time," you can easily see why an FPGA could be a heck of a lot faster than a CPU!
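
To put rough numbers on that, here's a back-of-the-envelope calculation using the cycle counts above. The clock rates are illustrative assumptions, not measurements of any particular device:

#include <stdio.h>

int main(void) {
    // Illustrative assumptions, not measurements of any real hardware.
    const double cpu_clock_hz   = 3.0e9;    // 3 GHz CPU
    const double fpga_clock_hz  = 300.0e6;  // 300 MHz FPGA fabric
    const double cpu_cycles     = 100000.0; // CPU cycles per task (from the text)
    const double fpga_cycles    = 10.0;     // FPGA cycles per task (from the text)
    const double parallel_units = 4.0;      // tasks processed side by side (from the text)

    const double cpu_tasks_per_s  = cpu_clock_hz / cpu_cycles;
    const double fpga_tasks_per_s = parallel_units * fpga_clock_hz / fpga_cycles;

    printf("CPU : %.0f tasks/s\n", cpu_tasks_per_s);               // 30000
    printf("FPGA: %.0f tasks/s (about %.0fx faster)\n",
           fpga_tasks_per_s, fpga_tasks_per_s / cpu_tasks_per_s);  // 120000000, 4000x
    return 0;
}

Even though the FPGA is clocked 10x slower in this sketch, the combination of fewer cycles per task and parallel units leaves it thousands of times ahead on throughput.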

So why don't we use FPGAs, CPLDs, and ASICs for everything? Because, in general, you end up with a whole chip that does nothing but one operation. This means that although you can get a process to run many orders of magnitude faster in your FPGA/ASIC, you can't change it later when that operation is no longer useful. The reason you can't (generally) change an FPGA once it's in a circuit is that the wiring for the interface is fixed, and normally the circuit doesn't include the components that would allow you to reprogram the FPGA into a more useful configuration. There are some researchers trying to build hybrid FPGA-CPU modules, where there is a section of the CPU which is capable of being rewired/reprogrammed like an FPGA, allowing you to "load" an effective section of the CPU, but none of these have ever made it to market (as far as I'm aware).


All of the other popular answers presented here talk about literal differences between FPGAs and CPUs. They point out the parallel nature of the FPGA vs the sequential nature of a CPU, or give examples of why certain algorithms might work well on an FPGA. All of those are good and true, but I would suggest that there is a more fundamental difference between CPUs and FPGAs.

What’s the common denominator between an FPGA and a CPU? It is that they are both built on top of silicon. And in some cases literally the same silicon processes.

The fundamental difference is the abstractions that we pile on top of that silicon. It’s not possible for one human to understand the full detail of a single modern CPU design from silicon to packaged IC. So as part of the engineering process we divide that complex problem into smaller manageable problems that humans can wrap their heads around.

Consider what it takes to turn that silicon into a functioning CPU. Here’s a somewhat simplified view of the layers of abstraction necessary for that goal:

  1. First we have engineers who know how to create transistors from silicon. They know how to design tiny transistors that sip power and switch at rates of tens or even hundreds of gigahertz, and they know how to design beefy transistors that can drive signals with enough power to send them out of an IC package and across a PCB to another chip.

  2. Then we have digital logic designers who know how to put those transistors together into libraries with hundreds of different logic cells: logic gates, flip-flops, muxes, and adders, to name a few, all in a variety of configurations.

  3. Next we have various groups of engineers who know how to put those digital (and sometimes analog) blocks together to form higher-level functional blocks such as high-speed transceivers, memory controllers, branch predictors, ALUs, etc.

  4. Then we have CPU designers who architect high-end CPU designs by pulling those functional units together into a complete system.

And it doesn't stop there. At this point we have a working CPU which runs assembly code, but that's not a language most programmers write in these days.

  1. We might have a C compiler that compiles to assembly code (probably through some intermediate representation)
  2. We could add another abstraction on top of C to get an object-oriented language
  3. We might even write a virtual machine on top of C or C++ so that we can interpret things like Java bytecode

And the abstraction layers can go on from there. The important point here is that those abstraction layers combine to yield a CPU based system that scales massively and costs a tiny fraction of a custom silicon design.

HOWEVER, the important point to be made here is that each abstraction also carries a cost of its own. The transistor designer doesn't build the perfect transistor for every use case. He builds a reasonable library, and so sometimes a transistor is used that consumes a little more power or a little more silicon than is really needed for the job at hand. Similarly, the logic designers don't build every possible logic cell. They might build a 4-input NAND gate and an 8-input NAND gate, but what happens when another engineer needs a 6-input NAND? He uses an 8-input NAND gate and ties off 2 unused inputs, which results in lost silicon resources and wasted power. And so it goes up the chain of abstractions. Each layer gives us a way to handle the complexity, but at the same time charges us an additional incremental cost in terms of silicon and power.
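
To make that 6-input NAND situation concrete, here's a toy software model of the tie-off (in real life this happens in a standard-cell library, not in C):

#include <stdbool.h>
#include <stdio.h>

// The cell library only offers an 8-input NAND...
static bool nand8(bool a, bool b, bool c, bool d,
                  bool e, bool f, bool g, bool h) {
    return !(a && b && c && d && e && f && g && h);
}

// ...so a "6-input NAND" is built by tying the two unused inputs high.
// The hardware analogue of those tied-off inputs is wasted area and power.
static bool nand6(bool a, bool b, bool c, bool d, bool e, bool f) {
    return nand8(a, b, c, d, e, f, true, true);
}

int main(void) {
    printf("%d\n", nand6(true, true, true, true, true, true));  // 0: all inputs high
    printf("%d\n", nand6(true, false, true, true, true, true)); // 1: one input low
    return 0;
}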

Now compare those abstractions to what is needed for an FPGA. Essentially, the FPGA abstractions stop at #2 in the list above. The FPGA allows developers to work at the digital logic layer. It's somewhat more sophisticated than that because CPUs are 'hard coded' at this layer and FPGAs must be configured at run time (which, BTW, is why CPUs typically run at much higher frequencies), but the essential truth is that there are far fewer abstractions for FPGAs than for CPUs.

So, why can an FPGA be faster than a CPU? In essence it's because the FPGA uses far fewer abstractions than a CPU, which means the designer works closer to the silicon. He doesn't pay the costs of all the many abstraction layers which are required for CPUs. He codes at a lower level and has to work harder to achieve a given bit of functionality, but the reward is higher performance.

But of course there is a downside to fewer abstractions as well. All those CPU abstractions are there for good reason. They give us a much simpler coding paradigm, which means more people can easily develop for them. That in turn means that there are many more CPU designs in existence, and thus we have massive price/scale/time-to-market benefits from CPUs.

So there you have it. FPGAs have fewer abstractions, so they can be faster and more power-efficient, but they are harder to program for. CPUs have many abstractions designed to make them easy to develop for, scalable, and cheap. But they give up speed and power in trade for those benefits.
