What does `rep ret` mean?

As Trillian's answer points out, AMD K8 and K10 have a problem with branch prediction when ret is a branch target, or follows a conditional branch (as the fall-through target). That's because ret is only 1 byte long.

"repz ret: why all the hassle?" has some extra details about the specific micro-architectural reasons why that gives K8 and Barcelona a hard time.


Avoiding 1-byte ret as a possible branch target:

AMD's optimization guide for K10 (Barcelona) recommends the 3-byte ret 0 in those cases, which pops zero bytes from the stack as well as returning. That version is significantly worse than rep ret on Intel. Ironically, it's also worse than rep ret on later AMD processors (Bulldozer and onwards). So it's a good thing nobody switched to ret 0 based on AMD's Family 10h optimization guide update.
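For reference, here is a small sketch listing the machine-code bytes of the three return forms discussed above. The byte values are the standard x86 encodings; the helper name encoding is made up for this illustration.

```shell
# Machine-code encodings of the three return forms discussed above.
# encoding is a hypothetical helper; the byte values are the standard x86 opcodes.
encoding() {
    case "$1" in
        "ret")     echo "c3" ;;        # plain near return, 1 byte
        "rep ret") echo "f3 c3" ;;     # REP prefix + ret, 2 bytes
        "ret 0")   echo "c2 00 00" ;;  # ret imm16 with imm16 = 0, 3 bytes
    esac
}

for insn in "ret" "rep ret" "ret 0"; do
    bytes=$(encoding "$insn")
    set -- $bytes            # split the hex string into one word per byte
    printf '%-8s %-9s %d byte(s)\n' "$insn" "$bytes" "$#"
done
```

objdump prints f3 c3 as repz ret, which is why that spelling shows up in disassembly.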


The processor manuals warn that future processors could interpret a prefix differently when it's combined with an instruction it doesn't currently modify. That's true in theory, but nobody's going to make a CPU that can't run a lot of existing binaries.

gcc still uses rep ret by default (without -mtune=intel, or -march=haswell or something). So most Linux binaries have a repz ret in them somewhere.

gcc will probably stop using rep ret in a few years, once K10 is thoroughly obsolete. After another 5 or 10 years, almost all binaries will be built with a gcc newer than that. Another 15 years after that, a CPU manufacturer might think about repurposing the f3 c3 byte sequence as (part of) a different instruction.

There will still be legacy closed-source binaries using rep ret that don't have more recent builds available, and that someone needs to keep running, though. So whatever new feature f3 c3 != rep ret is part of would need to be disable-able (e.g. with a BIOS setting), and that setting would have to actually change the instruction-decoder behaviour to recognize f3 c3 as rep ret. If that backwards-compatibility for legacy binaries isn't possible (because it can't be done efficiently in terms of power and transistors), IDK what kind of time-frame you'd be looking at. Much longer than 15 years, unless this was a CPU for only part of the market.

So it's safe to use rep ret, because everyone else is already doing it. Using ret 0 is a bad idea. In new code, it may still be a good idea to use rep ret for another couple of years. There probably aren't too many AMD Phenom II CPUs still around, but they're slow enough without extra return-address mispredicts or whatever the problem is.


The cost is pretty small. It doesn't end up taking any extra space in most cases, because it's usually followed by nop padding anyway. However, in the cases where it does result in extra padding, it'll be the worst-case where 15B of padding is needed to reach the next 16B boundary. gcc may only align by 8B in that case. (with .p2align 4,,10; to align to 16B if it will take 10 or fewer nop bytes, then a .p2align 3 to always align to 8B. Use gcc -S -o- to produce asm output to stdout to see when it does this.)
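To make the alignment rule concrete, here is a minimal sketch of the padding arithmetic, assuming the next function is aligned with .p2align 4,,10 followed by .p2align 3. The helper name pad_after and the sample offsets are made up for illustration; the argument is the offset of the first byte after the function's last instruction.

```shell
# Padding emitted by `.p2align 4,,10` followed by `.p2align 3`,
# given the offset of the first byte after the function's last instruction.
# pad_after is a hypothetical helper, not part of any tool.
pad_after() {
    end=$1
    to16=$(( (16 - end % 16) % 16 ))   # bytes needed to reach a 16B boundary
    if [ "$to16" -le 10 ]; then
        echo "$to16"                   # .p2align 4,,10 applies
    else
        echo $(( (8 - end % 8) % 8 ))  # too far: only .p2align 3 applies
    fi
}

# A function whose plain ret ends exactly on a 16B boundary needs no padding;
# the extra prefix byte of rep ret pushes it one byte past the boundary.
echo "ret ends at 32:     $(pad_after 32) bytes of padding"
echo "rep ret ends at 33: $(pad_after 33) bytes of padding"
```

In the second case the total cost is 8 bytes: the prefix byte itself plus 7 bytes of padding to reach the next 8B boundary.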

So if we guesstimate that one in 16 rep ret instances ends up creating extra padding where a ret would have just hit the desired alignment, and that the extra padding goes to an 8B boundary, this means each rep ret has an average cost of 8 * 1/16 = half a byte.

rep ret isn't used often enough to add up to much of anything. For example, firefox with all the libraries it has mapped only has ~9k instances of rep ret. So that's about 4k bytes, across many files. (And less RAM than that, since many of those functions in dynamic libraries are never called.)
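The back-of-the-envelope estimate can be written out as follows. This is only a sketch of the arithmetic: the 1-in-16 fraction and the 8-byte penalty are the guesses from the text, and 9649 is the count reported by the objdump pipeline in this answer.

```shell
# Expected extra bytes from rep ret, using the guesses from the text:
# 1 in 16 instances costs 8 extra bytes (1 prefix byte + padding to 8B).
instances=9649        # count of 'repz ret' from the objdump run
penalty=8             # bytes lost when the prefix spills past the boundary
one_in=16             # assumed: one instance in this many hits that case

total=$(( instances * penalty / one_in ))   # integer division, rounds down
echo "estimated extra bytes: $total"        # prints 4824
```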

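# Disassemble the executable and every shared object mapped by one process.
# pgrep -o picks the oldest matching PID, in case several firefox processes are running.
ffproc=/proc/$(pgrep -o firefox)
objdump -d "$ffproc/exe" $(sudo ls -l "$ffproc"/map_files/ |
       awk  '/\.so/ {print $NF}' | sort -u) |
       grep -c 'repz ret'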
objdump: '(deleted)': No such file  # I forgot to restart firefox after the libexpat security update
9649

That counts rep ret in all the functions in all the libraries firefox has mapped, not just the functions it ever calls. This is somewhat relevant, because lower code density across functions means your calls are spread out over more memory pages. ITLB and L2-TLB only have a limited number of entries. Local density matters for L1I$ (and Intel's uop-cache). Anyway, rep ret has a very tiny impact.

It took me a minute to think of a reason that /proc/<pid>/map_files/ isn't accessible to the owner of the process, but /proc/<pid>/maps is. If a UID=root process (e.g. from a suid-root binary) mmap(2)s a 0666 file that's in a 0700 directory, then does setuid(nobody), anyone running that binary could bypass the access restriction imposed by the lack of x permission for "other" on the directory.


There's a whole blog named after this instruction. And the first post describes the reason behind it: http://repzret.org/p/repzret/

Basically, there was an issue in AMD's branch predictor when a single-byte ret immediately followed a conditional jump, as in the code you quoted (and in a few other situations), and the workaround was to add the rep prefix, which is ignored by the CPU but avoids the predictor penalty.


Apparently, some AMD processors' branch predictors behave badly when a branch's target or fallthrough is a ret instruction, and adding the rep prefix avoids this.

As to the meaning of rep ret, there is no mention of this instruction sequence in the Intel Instruction Set Reference, and the documentation of rep is not very helpful:

The behavior of the REP prefix is undefined when used with non-string instructions.

This means at least that the rep doesn't have to behave in a repeating manner.

Now, from the AMD instruction set reference (1.2.6 Repeat Prefixes):

The prefixes should only be used with such string instructions.

In general, the repeat prefixes should only be used in the string instructions listed in tables 1-6, 1-7, and 1-8 above [which do not contain ret].

So it really seems like undefined behavior, but one can assume that, in practice, processors just ignore rep prefixes on ret instructions.