How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows?

BSF/BSR performance is not data dependent on any modern CPUs. See https://agner.org/optimize/, https://uops.info/ (Intel only), or http://instlatx64.atw.hu/ for experimental timing results, as well as the https://gmplib.org/~tege/x86-timing.pdf you found.

On modern Intel, they decode to 1 uop with 3 cycle latency and 1/clock throughput, running only on port 1. Ryzen also runs them with 3c latency for BSF, 4c latency for BSR, but multiple uops. Earlier AMD is sometimes even slower.

Your "8 cycle" (latency and throughput) cost appears to be for 32-bit BSF on AMD K8, from Granlund's table that you linked. Agner Fog's table agrees, (and shows it decodes to 21 uops instead of having a dedicated bit-scan execution unit. But the microcoded implementation is presumably still branchless and not data-dependent). No clue why you picked that number; K8 doesn't have SMT / Hyperthreading so the opportunity for an ALU-timing side channel is much reduced.

Do note that they have an output dependency on the destination register, which they leave unmodified if the input was zero. AMD documents this behaviour, Intel implements it in hardware but documents it as an "undefined" result, so unfortunately compilers won't take advantage of it and human programmers maybe should be cautious. IDK if some ancient 32-bit only CPU had different behaviour, or if Intel is planning to ever change (doubtful!), but I wish Intel would document the behaviour at least for 64-bit mode (which excludes any older CPUs).

lzcnt/tzcnt and popcnt on Intel CPUs (but not AMD) have the same output dependency before Skylake and before Cannon Lake (respectively), even though architecturally the result is well-defined for all inputs. They all use the same execution unit. (How is POPCNT implemented in hardware?). AMD Bulldozer/Ryzen builds their bit-scan execution unit without the output dependency baked in, so BSF/BSR are slower than LZCNT/TZCNT (multiple uops to handle the input=0 case, and probably also setting ZF according to the input, not the result).

(Taking advantage of that with intrinsics isn't possible; not even with MSVC's _BitScanReverse64 which uses a by-reference output arg that you could set first. MSVC doesn't respect the previous value and assumes it's output-only. VS: unexpected optimization behavior with _BitScanReverse64 intrinsic)

The pseudocode in the manual is not the implementation

(i.e. it's not necessarily how hardware or microcode works).

It gives precisely the same result in all cases, so you can use it to understand exactly what will happen for any corner cases the text leaves you wondering about. That is all.

The point is to be simple and easy to understand, and that means modeling things in terms of simple 2-input operations which happen serially. C / Fortran / typical pseudocode doesn't have operators for many-input AND, OR, or XOR, but you can build that in hardware up to a point (limited by fan-in, the opposite of fan-out).

Integer addition can be modelled as bit-serial ripple carry, but that's not how it's implemented! Instead, we get single-cycle latency for 64-bit addition with far fewer than 64 gate delays using tricks like carry lookahead adders.

The actual implementation techniques used in Intel's bit-scan / popcnt execution unit are described in US Patent US8214414 B2.

Abstract

A merged datapath for PopCount and BitScan is described. A hardware circuit includes a compressor tree utilized for a PopCount function, which is reused by a BitScan function (e.g., bit scan forward (BSF) or bit scan reverse (BSR)).

Selector logic enables the compressor tree to operate on an input word for the PopCount or BitScan operation, based on a microprocessor instruction. The input word is encoded if a BitScan operation is selected.

The compressor tree receives the input word, operates on the bits as though all bits have same level of significance (e.g., for an N-bit input word, the input word is treated as N one-bit inputs). The result of the compressor tree circuit is a binary value representing a number related to the operation performed (the number of set bits for PopCount, or the bit position of the first set bit encountered by scanning the input word).

It's fairly safe to assume that Intel's actual silicon works similarly to this. Other Intel patents for things like out-of-order machinery (ROB, RS) do tend to match up with performance experiments we can perform.

AMD may do something different, but regardless we know from performance experiments that it's not data-dependent.

It's well known that fixed latency is a hugely beneficial thing for out-of-order scheduling, so it's very surprising when instructions don't have fixed latency. Sandybridge even went so far as to standardize latencies to simplify the scheduler and reduce the opportunities write-back conflicts (e.g. a 3-cycle latency uop followed by a 2-cycle latency uop to the same port would produce 2 results in the same cycle). This meant making complex-LEA (with all 3 components: [disp + base + idx*scale]) take 3 cycles instead of just 2 for the 2 additions like on previous CPUs. There are no 2-cycle latency uops on Sandybridge-family. (There are some 2-cycle latency instructions, because they decode to 2 uops with 1c latency each, but the scheduler schedules uops, not instructions).

One of the few exceptions to the rule of fixed latency for ALU uops is division / sqrt, which uses a not-fully-pipelined execution unit. Division is inherently iterative, unlike multiplication where you can make wide hardware that does the partial products and partial additions in parallel.

On Intel CPUs, variable-latency for L1d cache access can produce replays of dependent uops if the data wasn't ready when the scheduler optimistically hoped it would be.

Is there a penalty when base+offset is in a different page than the base?
Why does the number of uops per iteration increase with the stride of streaming loads?
Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?

The 80x86 manual has a good description of the expected behavior, but that has nothing to do with how it's actually implemented in silicon in any model from any manufacturer.

Let's say that there's been 50 different CPU designs from Intel, 25 CPU designs from AMD, then 25 more from other manufacturers (VIA, Cyrix, SiS/Vortex, NSC, ...). Out of those 100 different CPU designs, maybe there's 20 completely different ways that BSF has been implemented, and maybe 10 of them have fixed timing, 5 have timing that depends on every bit of the source operand, and 5 depend on groups of bits of the source operand (e.g. maybe like "if highest 32 bits of 64-bit operand are zeros { switch to 32-bit logic that's 2 cycles faster }").

I am confirming that these instructions have fixed cpu cycles. In other words, no matter what operand is given, they always take the same amount of time to process, and there is no "timing channel" behind. I cannot find corresponding specifications in Intel's official documents.

You can't. More specifically, you can test or research existing CPUs, but that's a waste of time because next week Intel (or AMD or VIA or someone else) can release a new CPU that has completely different timing.

As soon as you rely on "measured from existing CPUs" you're doing it wrong. You have to rely on "architectural guarantees" that apply to all future CPUs. There is no "architectural guarantee". You have to assume that there may be a timing side-channel (even if there isn't for current CPUs)

Then why it is possible? Apparently this is a "loop" or somewhat, at least high-levelly. What is the design decision behind? Easier for CPU pipelines?

Instead of doing a 64-bit BSF, why not split it into a pair of 32-bit pieces and do them in parallel, then merge the results? Why not split it into eight 8-bit pieces? Why not use a table lookup for each 8-bit piece?

The answers posted have explained well that the implementation is different from pseudocode. But if you are still curious why the latency is fixed and not data dependent or uses any loops for that matter, you need to see electronic side of things. One way you could implement this feature in hardware is by using a Priority encoder.

A priority encoder will accept n input lines that can be one or off (0 or 1) and give out the index of the highest priority line that is on. Below is a table from the linked Wikipedia article modified for a most significant set bit function.

input |  output  index of first set bit 
0000  |  xx      undefined
0001  |  00      0
001x  |  01      1
01xx  |  10      2
1xxx  |  11      3

x denotes the bit value does not matter and can be anything

If you see the circuit diagram on the article, there are no loops of any kind, it is all parallel.

How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows?

The pseudocode in the manual is not the implementation

Tags:

Performance

Assembly

X86

Intel

Cpu Architecture

Related

Recent Posts