How does stack allocation work in Linux?

It appears that the memory up to the stack limit is not allocated up front (it could not be anyway, with an unlimited stack). https://www.kernel.org/doc/Documentation/vm/overcommit-accounting says:

The C language stack growth does an implicit mremap. If you want absolute guarantees and run close to the edge you MUST mmap your stack for the largest size you think you will need. For typical stack usage this does not matter much but it's a corner case if you really really care

However, mmapping the stack would be the job of the compiler (if it has an option for that).

EDIT: After some tests on an x86_64 Debian machine, I've found that the stack grows without any system call (according to strace). So, this means that the kernel grows it automatically (this is what the "implicit" means above), i.e. without explicit mmap/mremap from the process.

It was quite hard to find detailed information confirming this. I recommend Understanding The Linux Virtual Memory Manager by Mel Gorman. I suppose that the answer is in Section 4.6.1 Handling a Page Fault, with the exception "Region not valid but is beside an expandable region like the stack" and the corresponding action "Expand the region and allocate a page". See also D.5.2 Expanding the Stack.

Other references about Linux memory management (but with almost nothing about the stack):

  • Memory FAQ
  • What every programmer should know about memory by Ulrich Drepper

EDIT 2: This implementation has a drawback: in corner cases, a stack-heap collision may go undetected, even when the stack would grow beyond its limit! The reason is that a write to a variable on the stack may land in already-allocated heap memory, in which case there is no page fault and the kernel cannot know that the stack needed to be extended. See my example in the discussion Silent stack-heap collision under GNU/Linux that I started on the gcc-help list. To avoid this, the compiler needs to add some checking code at function calls; with GCC this can be done with -fstack-check (see Ian Lance Taylor's reply and the GCC man page for details).


Linux kernel 4.2

  • mm/mmap.c#acct_stack_growth decides if it will segfault or not. It uses rlim[RLIMIT_STACK], which corresponds to the POSIX getrlimit(RLIMIT_STACK)
  • arch/x86/mm/fault.c#do_page_fault is the interrupt handler that starts a chain which ends up calling acct_stack_growth
  • arch/x86/entry/entry_64.S sets up the page fault handler. You need to know a bit about paging to understand that part: How does x86 paging work? | Stack Overflow

Minimal test program

We can then test it with a minimal NASM 64-bit program:

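global _start
_start:
    sub rsp, 0x7FF000   ; move rsp down by 8 MiB - 4 KiB
    mov [rsp], rax      ; touch the new stack top (page fault if unmapped)
    mov rax, 60         ; sys_exit
    mov rdi, 0          ; exit status 0
    syscall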

Make sure that you turn off ASLR, and that you clear the environment variables, since those are placed on the stack and take up space:

echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
env -i ./main.out

The limit is somewhere slightly below my ulimit -s (8 MiB for me). This appears to be because of the extra data that the System V ABI specifies is initially put on the stack, in addition to the environment: Linux 64 command line parameters in Assembly | Stack Overflow

If you are serious about this, TODO make a minimal initrd image that starts writing from the stack top and goes down, and then run it with QEMU + GDB. Put a dprintf on the loop printing the stack address, and a breakpoint at acct_stack_growth. It will be glorious.

Related:

  • https://softwareengineering.stackexchange.com/questions/207386/how-are-the-size-of-the-stack-and-heap-limited-by-the-os
  • Where is the stack memory allocated from for a Linux process? | Stack Overflow
  • What is the Linux Stack? | Stack Overflow
  • What is the maximum recursion depth in Python, and how to increase it? | Stack Overflow

By default, the maximum stack size is typically configured to be 8 MiB per process, but it can be changed using ulimit:

Show the current limit in KiB:

$ ulimit -s
8192

Set to unlimited:

ulimit -s unlimited

This affects the current shell, its subshells, and their child processes (ulimit is a shell builtin command).

On Linux, you can show the actual stack address range in use with:

grep -F '[stack]' /proc/$PID/maps