Does stack space required by a function affect inlining decisions in C/C++?

Yes, the decision to inline or not depends on the complexity of the function, its stack and registers usage and the context in which the call is made. The rules are compiler- and target platform-dependent. Always check the generated assembly when performance matters.

Compare this version with a 10000-char array not being inlined (GCC 8.2, x64, -O2):

inline int inlineme(int args) {
  char svar[10000];

  return stringyfunc(args, svar);
}

int test(int x) {
    return inlineme(x);
}

Generated assembly:

inlineme(int):
        sub     rsp, 10008
        mov     rsi, rsp
        call    stringyfunc(int, char*)
        add     rsp, 10008
        ret
test(int):
        jmp     inlineme(int)

with this one with a much smaller 10-char array, which is inlined:

inline int inlineme(int args) {
  char svar[10];

  return stringyfunc(args, svar);
}

int test(int x) {
    return inlineme(x);
}

Generated assembly:

test(int):
        sub     rsp, 24
        lea     rsi, [rsp+6]
        call    stringyfunc(int, char*)
        add     rsp, 24
        ret

Such as if I had a 10k automatic buffer on the stack, would that make the function less likely to be inlined?

Not necessarily in general. In fact, inline expansion can sometimes reduce stack space usage due to not having to set up space for function arguments.

Expanding a "wide" call into a single frame which calls other "wide" functions can be a problem though, and unless the optimiser guards against that separately, it may have to avoid expansion of "wide" functions in general.

In case of recursion: Most likely yes.

An example of LLVM source:

if (IsCallerRecursive &&
         AllocatedSize > InlineConstants::TotalAllocaSizeRecursiveCaller) {
       InlineResult IR = "recursive and allocates too much stack space";

From GCC source:

For stack growth limits we always base the growth in stack usage of the callers. We want to prevent applications from segfaulting on stack overflow when functions with huge stack frames gets inlined.

Controlling the limit, from GCC manual:

--param name=value

large-function-growth

  • Specifies maximal growth of large function caused by inlining in percents. For example, parameter value 100 limits large function growth to 2.0 times the original size.

large-stack-frame

  • The limit specifying large stack frames. While inlining the algorithm is trying to not grow past this limit too much.

large-stack-frame-growth

  • Specifies maximal growth of large stack frames caused by inlining in percents. For example, parameter value 1000 limits large stack frame growth to 11 times the original size.

Yes, partly because compilers do stack allocation for the whole function once in prologue/epilogue, not moving the stack pointer around as they enter/leave block scopes.


and each inlined call to inlineme() would need its own buffer.

No, I'm pretty sure compilers are smart enough to reuse the same stack space for different instances of the same function, because only one instance of that C variable can ever be in-scope at once.

Optimization after inlining can merge some of the operations of the inline function into calling code, but I think it would be rare for the compiler to end up with 2 versions of the array it wanted to keep around simultaneously.

I don't see why that would be a concern for inlineing. Can you give an example of how functions that require a lot of stack would be problematic to inline?

A real example of a problem it could create (which compiler heuristics mostly avoid):

Inlining if (rare_special_case) use_much_stack() into a recursive function that otherwise doesn't use much stack would be an obvious problem for performance (more cache and TLB misses), and even correctness if you recurse deep enough to actually overflow the stack.

(Especially in a constrained environment like Linux kernel stacks, typically 8kiB or 16kiB per thread, up from 4k on 32-bit platforms in older Linux versions. https://elinux.org/Kernel_Small_Stacks has some info and historical quotes about trying to get away with 4k stacks so the kernel didn't have to find 2 contiguous physical pages per task).

Compilers normally make functions allocate all the stack space they'll ever need up front (except for VLAs and alloca). Inlining an error-handling or special-case handling function instead of calling it in the rare case where it's needed will put a large stack allocation (and often save/restore of more call-preserved registers) in the main prologue/epilogue, where it affects the fast path, too. Especially if the fast path didn't make any other function calls.

If you don't inline the handler, that stack space will never be used if there aren't errors (or the special case didn't happen). So the fast-path can be faster, with fewer push/pop instructions and not allocating any big buffers before going on to call another function. (Even if the function itself isn't actually recursive, having this happen in multiple functions in a deep call tree could waste a lot of stack.)

I've read that the Linux kernel does manually do this optimization in a few key places where gcc's inlining heuristics make an unwanted decision to inline: break a function up into fast-path with a call to the slow path, and use __attribute__((noinline)) on the bigger slow-path function to make sure it doesn't inline.


In some cases not doing a separate allocation inside a conditional block is a missed optimization, but more stack-pointer manipulation makes stack unwinding metadata to support exceptions (and backtraces) more bloated (especially saving/restoring of call-preserved registers that stack unwinding for exceptions has to restore).

If you were doing a save and/or allocate inside a conditional block before running some common code that's reached either way (with another branch to decide which registers to restore in the epilogue), then there'd be no way for the exception handler machinery to know whether to load just R12, or R13 as well (for example) from where this function saved them, without some kind of insanely complicated metadata format that could signal a register or memory location to be tested for some condition. The .eh_frame section in ELF executables / libraries is bloated enough as is! (It's non-optional, BTW. The x86-64 System V ABI (for example) requires it even in code that doesn't support exceptions, or in C. In some ways that's good, because it means backtraces usually work, even passing an exception back up through a function would cause breakage.)

You can definitely adjust the stack pointer inside a conditional block, though. Code compiled for 32-bit x86 (with crappy stack-args calling conventions) can and does use push even inside conditional branches. So as long as you clean up the stack before leaving the block that allocated space, it's doable. That's not saving/restoring registers, just moving the stack pointer. (In functions built without a frame pointer, the unwind metadata has to record all such changes, because the stack pointer is the only reference for finding saved registers and the return address.)

I'm not sure exactly what the details are on why compiler can't / don't want to be smarter allocating large extra stack space only inside a block that uses it. Probably a good part of the problem is that their internals just aren't set up to be able to even look for this kind of optimization.


Related: Raymond Chen posted a blog about the PowerPC calling convention, and how there are specific requirements on function prologues / epilogues that make stack unwinding work. (And the rules imply / require the existence of a red zone below the stack pointer that's safe from async clobber. A few other calling conventions use red zones, like x86-64 System V, but Windows x64 doesn't. Raymond posted another blog about red zones)