GCC Assembly Optimizations - Why are these equivalent?

.cfi (call frame information) directives are used in gas (the GNU assembler) mainly for debugging. They allow the debugger to unwind the stack. To disable them, pass -fno-asynchronous-unwind-tables when you invoke the compilation driver.

If you want to play with the compiler in general, you can invoke the compilation driver like this: gcc -o <filename.S> -S -masm=intel -fno-asynchronous-unwind-tables <filename.C>, or just use godbolt's interactive compiler.


Thank you, Kin3TiX, for asking an asm-newbie question that wasn't just a code-dump of some nasty code with no comments, and a really simple problem. :)

As a way to get your feet wet with ASM, I'd suggest working with functions OTHER than main. e.g. just a function that takes two integer args, and adds them. Then the compiler can't optimize it away. You can still call it with constants as args, and if it's in a different file from main, it won't get inlined, so you can even single-step through it.
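A minimal sketch of such a function. The file name add.c and the function name add_two are made up for illustration; they are not from the question.

```c
/* add.c (hypothetical file name): keep this in a separate file from main()
 * so the compiler cannot inline it or constant-fold the call away
 * (assuming link-time optimization is not enabled). */
int add_two(int a, int b)
{
    return a + b;   /* the args are unknown at compile time, so an add must be emitted */
}
```

Compiling this with gcc -O2 -S add.c should leave you with a very short add.s to read, without any of the main-specific clutter.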

There's some benefit to understanding what's going on at the asm level when you compile main, but other than embedded systems, you're only ever going to write optimized inner loops in asm. IMO, there's little point using asm if you aren't going to optimize the hell out of it. Otherwise you probably won't beat compiler output from source which is much easier to read.

Other tips for understanding compiler output: compile with
gcc -S -fno-stack-check -fverbose-asm. The comments after each instruction are often nice reminders of what that load was for. Pretty soon it degenerates into a mess of temporaries with names like D.2983, but something like
movq 8(%rdi), %rcx # a_1(D)->elements, a_1(D)->elements will save you a round-trip to the ABI reference to see which function arg comes in in %rdi, and which struct member is at offset 8.
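As a rough illustration of the kind of source that produces a load like that, here is a sketch. The struct layout and the names array, length, and first_element are assumptions; only the member name elements is taken from the comment above, and the offset of 8 assumes a typical 64-bit target.

```c
#include <stddef.h>

/* Hypothetical struct: on x86-64, `length` occupies bytes 0..7,
 * so `elements` ends up at offset 8. */
struct array {
    size_t length;     /* offset 0 */
    int   *elements;   /* offset 8 on typical 64-bit targets */
};

/* Under the x86-64 System V calling convention the first pointer arg
 * arrives in %rdi, so reading a->elements is a load from 8(%rdi). */
int first_element(const struct array *a)
{
    return a->elements[0];
}
```

With -fverbose-asm, the load of a->elements is the instruction that gets a comment naming the struct member, which is what saves you the lookup.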

See also How to remove "noise" from GCC/clang assembly output?


What do the lines spanning from .cfi_startproc to call ___main even do?

    _main:
    LFB0:
    .cfi_startproc
    pushl   %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register 5

.cfi stuff is stack-unwind info for debuggers (and C++ exception handling) to unwind the stack. It won't be there if you look at asm from objdump -d output instead of gcc -S, or you can use -fno-asynchronous-unwind-tables.

The stuff with pushing %ebp and then setting it to the value of the stack pointer on function entry sets up what's called a "stack frame". This is why %ebp is called the base pointer. These insns won't be there if you compile with -fomit-frame-pointer, which gives code an extra register to work with. That's on by default at -O2. (This is huge for 32bit x86, since that takes you from 6 to 7 usable regs. %esp is still tied up being the stack pointer; stashing it temporarily in an xmm or mmx reg and then using it as another GP reg is possible in theory, but compilers will never do that and it makes async stuff like POSIX signals or Windows SEH unusable, as well as making debugging harder.)

The leave instruction before the ret is also part of this stack frame stuff.

Frame pointers are mostly historical baggage, but do make offsets into the stack frame consistent. With debug symbols, you can backtrace the call stack just fine even with -fomit-frame-pointer, and omitting the frame pointer is the default for amd64. (The amd64 ABI has alignment requirements for the stack, and is a LOT better in other ways, too. e.g. it passes args in regs instead of on the stack.)

    andl    $-16, %esp
    subl    $16, %esp

The and aligns the stack to a 16-byte boundary, regardless of what it was before. The sub reserves 16 bytes on the stack for this function. (Notice how it's missing from the optimized version, because it optimizes away any need for memory storage of any variables.)
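The arithmetic behind those two instructions can be sketched in C. The function names and the sample stack-pointer value in the usage note are made up for illustration; this models the register value only, not real stack manipulation.

```c
#include <stdint.h>

/* Round a 32-bit address down to a 16-byte boundary, as andl $-16, %esp does.
 * -16 is 0xFFFFFFF0 in 32-bit two's complement, so the AND clears the
 * low four bits and leaves the rest alone. */
uint32_t align_down_16(uint32_t sp)
{
    return sp & (uint32_t)-16;
}

/* Model of the two prologue instructions: align, then reserve 16 bytes. */
uint32_t after_prologue(uint32_t sp)
{
    sp = align_down_16(sp);  /* andl $-16, %esp */
    sp -= 16;                /* subl $16, %esp  */
    return sp;
}
```

For example, an incoming value of 0x0028FF4C becomes 0x0028FF40 after the and, and 0x0028FF30 after the sub. The stack grows down, so rounding down can only move %esp further into unused space.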

    call    ___main

__main (asm name = ___main) is part of Cygwin: it calls constructor / init functions for shared libraries (including libc). On GNU/Linux, this is handled by _start (before main is reached) and even dynamic-linker hooks that let libc initialize itself before the executable's own _start is even reached. I've read that dynamic-linker hooks (or _start from a static executable) instead of code in main would be possible under Cygwin, but they simply choose not to do it that way.

(This old mailing list message indicates __main is for constructors, but that main shouldn't have to call it on platforms that support getting the startup code to call it.)

    movb    $5, 15(%esp)
    movb    $10, 14(%esp)
    movsbl  15(%esp), %edx
    movsbl  14(%esp), %eax
    addl    %edx, %eax
    leave
    ret

Why is the initial output of GCC so much more verbose?

Without optimizations enabled, gcc maps C statements as literally as possible into asm. Doing anything else would take more compile time. Thus, the two movb instructions come from the initializers for your two variables. The return value is computed by doing two loads (with sign extension, because we need to upconvert to int BEFORE the add, to match the semantics of the C code as written, as far as overflow is concerned).
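To make the promotion point concrete, here is a sketch in C. The first function is a reconstruction (an assumption, not the asker's exact source) of code that would give the movb / movsbl / addl sequence above. The second uses made-up values to show why widening has to happen before the add.

```c
/* Assumed reconstruction of the kind of source behind the asm above. */
int sum_small(void)
{
    signed char a = 5;    /* movb $5, 15(%esp)  */
    signed char b = 10;   /* movb $10, 14(%esp) */
    return a + b;         /* both operands are promoted to int, then added */
}

/* The sum is computed in int, so 100 + 100 gives 200 instead of
 * wrapping around in 8 bits. An 8-bit add would get this wrong. */
int sum_promoted(void)
{
    signed char a = 100;
    signed char b = 100;
    return a + b;
}
```

This is why the unoptimized code uses movsbl (load a byte and sign-extend to 32 bits) followed by a 32-bit addl, rather than an 8-bit add.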

I cannot figure what the two subtraction operations are for.

There is only one sub instruction. It reserves space on the stack for the function's variables, before the call to __main. Which other sub are you talking about?

What do .section, .ident, .def, .p2align, etc. etc. do?

See the manual for the GNU assembler. Also available locally as info pages: run info gas.

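.ident and .def: .ident is gcc putting its stamp on the object file, so you can tell what compiler produced it. .def (closed by .endef) records COFF symbol / debug info, such as marking a symbol as a function. Not relevant to the code itself, ignore these.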

.section: determines what section of the ELF object file the bytes from all following instructions or data directives (e.g. .byte 0x00) go into, until the next .section assembler directive. Usually that's code (.text: read-only, shareable), data (.data: initialized read/write data, private), or bss (historically "block started by symbol": zero-initialized, doesn't take any space in the object file).

.p2align: Power of 2 Align. Pad with nop instructions until the desired alignment. On x86, .align 16 is the same as .p2align 4. Jump instructions are faster when the target is aligned, because of instruction fetch in chunks of 16B, not crossing a page boundary, or just not crossing a cache-line boundary. (32B alignment is relevant when code is already in the uop cache of an Intel Sandybridge and later.) See Agner Fog's docs, for example.
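The padding calculation is simple enough to sketch in C. The function name p2align_padding is hypothetical, and this only models the arithmetic; it says nothing about which nop encodings the assembler actually picks.

```c
#include <stddef.h>

/* Number of padding bytes needed to bring `offset` up to the next
 * 2^n-byte boundary, which is what .p2align n asks for. */
size_t p2align_padding(size_t offset, unsigned n)
{
    size_t alignment = (size_t)1 << n;   /* .p2align 4 means 16 bytes */
    size_t mask = alignment - 1;
    return (alignment - (offset & mask)) & mask;  /* 0 if already aligned */
}
```

For instance, code that currently ends at offset 0x13 needs 13 bytes of padding before a .p2align 4 target, while an offset of 0x20 needs none.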

The core of why I added this bit is to illustrate why I am confused that the 4 line version of this assembly code can effectively achieve the same effect as the others. It seems to me that GCC has added a lot of "stuff" whose purpose I cannot discern.

Put the code of interest in a function by itself. A lot of things are special about main.

You are correct that a mov-immediate and a ret are all that's needed to implement the function, but gcc apparently doesn't have shortcuts for recognizing trivial whole-programs and omitting main's stack frame or the call to __main. >.<

Good question, though. As I said, just ignore all that crap and worry about just the small part you want to optimize.