What is tail call optimization?

TCO (Tail Call Optimization) is the process by which a smart compiler can make a call to a function and take no additional stack space. The only situation in which this happens is if the last instruction executed in a function f is a call to a function g (Note: g can be f). The key here is that f no longer needs stack space - it simply calls g and then returns whatever g would return. In this case the optimization can be made that g just runs and returns whatever value it would have to the thing that called f.

This optimization can make recursive calls take constant stack space, rather than explode.

Example: this factorial function is not TCOptimizable:

from dis import dis

def fact(n):
    if n == 0:
        return 1
    return n * fact(n-1)


dis(fact)
  2           0 LOAD_FAST                0 (n)
              2 LOAD_CONST               1 (0)
              4 COMPARE_OP               2 (==)
              6 POP_JUMP_IF_FALSE       12

  3           8 LOAD_CONST               2 (1)
             10 RETURN_VALUE

  4     >>   12 LOAD_FAST                0 (n)
             14 LOAD_GLOBAL              0 (fact)
             16 LOAD_FAST                0 (n)
             18 LOAD_CONST               2 (1)
             20 BINARY_SUBTRACT
             22 CALL_FUNCTION            1
             24 BINARY_MULTIPLY
             26 RETURN_VALUE

This function does things besides call another function in its return statement.

This below function is TCOptimizable:

def fact_h(n, acc):
    if n == 0:
        return acc
    return fact_h(n-1, acc*n)

def fact(n):
    return fact_h(n, 1)


dis(fact)
  2           0 LOAD_GLOBAL              0 (fact_h)
              2 LOAD_FAST                0 (n)
              4 LOAD_CONST               1 (1)
              6 CALL_FUNCTION            2
              8 RETURN_VALUE

This is because the last thing to happen in any of these functions is to call another function.


Tail-call optimization is where you are able to avoid allocating a new stack frame for a function because the calling function will simply return the value that it gets from the called function. The most common use is tail-recursion, where a recursive function written to take advantage of tail-call optimization can use constant stack space.

Scheme is one of the few programming languages that guarantee in the spec that any implementation must provide this optimization, so here are two examples of the factorial function in Scheme:

(define (fact x)
  (if (= x 0) 1
      (* x (fact (- x 1)))))

(define (fact x)
  (define (fact-tail x accum)
    (if (= x 0) accum
        (fact-tail (- x 1) (* x accum))))
  (fact-tail x 1))

The first function is not tail recursive because when the recursive call is made, the function needs to keep track of the multiplication it needs to do with the result after the call returns. As such, the stack looks as follows:

(fact 3)
(* 3 (fact 2))
(* 3 (* 2 (fact 1)))
(* 3 (* 2 (* 1 (fact 0))))
(* 3 (* 2 (* 1 1)))
(* 3 (* 2 1))
(* 3 2)
6

In contrast, the stack trace for the tail recursive factorial looks as follows:

(fact 3)
(fact-tail 3 1)
(fact-tail 2 3)
(fact-tail 1 6)
(fact-tail 0 6)
6

As you can see, we only need to keep track of the same amount of data for every call to fact-tail because we are simply returning the value we get right through to the top. This means that even if I were to call (fact 1000000), I need only the same amount of space as (fact 3). This is not the case with the non-tail-recursive fact, and as such large values may cause a stack overflow.


Probably the best high level description I have found for tail calls, recursive tail calls and tail call optimization is the blog post

"What the heck is: A tail call"

by Dan Sugalski. On tail call optimization he writes:

Consider, for a moment, this simple function:

sub foo (int a) {
  a += 15;
  return bar(a);
}

So, what can you, or rather your language compiler, do? Well, what it can do is turn code of the form return somefunc(); into the low-level sequence pop stack frame; goto somefunc();. In our example, that means before we call bar, foo cleans itself up and then, rather than calling bar as a subroutine, we do a low-level goto operation to the start of bar. Foo's already cleaned itself out of the stack, so when bar starts it looks like whoever called foo has really called bar, and when bar returns its value, it returns it directly to whoever called foo, rather than returning it to foo which would then return it to its caller.

And on tail recursion:

Tail recursion happens if a function, as its last operation, returns the result of calling itself. Tail recursion is easier to deal with because rather than having to jump to the beginning of some random function somewhere, you just do a goto back to the beginning of yourself, which is a darned simple thing to do.

So that this:

sub foo (int a, int b) {
  if (b == 1) {
    return a;
  } else {
    return foo(a*a + a, b - 1);
  }

gets quietly turned into:

sub foo (int a, int b) {
  label:
    if (b == 1) {
      return a;
    } else {
      a = a*a + a;
      b = b - 1;
      goto label;
   }

What I like about this description is how succinct and easy it is to grasp for those coming from an imperative language background (C, C++, Java)


Let's walk through a simple example: the factorial function implemented in C.

We start with the obvious recursive definition

unsigned fac(unsigned n)
{
    if (n < 2) return 1;
    return n * fac(n - 1);
}

A function ends with a tail call if the last operation before the function returns is another function call. If this call invokes the same function, it is tail-recursive.

Even though fac() looks tail-recursive at first glance, it is not as what actually happens is

unsigned fac(unsigned n)
{
    if (n < 2) return 1;
    unsigned acc = fac(n - 1);
    return n * acc;
}

ie the last operation is the multiplication and not the function call.

However, it's possible to rewrite fac() to be tail-recursive by passing the accumulated value down the call chain as an additional argument and passing only the final result up again as the return value:

unsigned fac(unsigned n)
{
    return fac_tailrec(1, n);
}

unsigned fac_tailrec(unsigned acc, unsigned n)
{
    if (n < 2) return acc;
    return fac_tailrec(n * acc, n - 1);
}

Now, why is this useful? Because we immediately return after the tail call, we can discard the previous stackframe before invoking the function in tail position, or, in case of recursive functions, reuse the stackframe as-is.

The tail-call optimization transforms our recursive code into

unsigned fac_tailrec(unsigned acc, unsigned n)
{
TOP:
    if (n < 2) return acc;
    acc = n * acc;
    n = n - 1;
    goto TOP;
}

This can be inlined into fac() and we arrive at

unsigned fac(unsigned n)
{
    unsigned acc = 1;

TOP:
    if (n < 2) return acc;
    acc = n * acc;
    n = n - 1;
    goto TOP;
}

which is equivalent to

unsigned fac(unsigned n)
{
    unsigned acc = 1;

    for (; n > 1; --n)
        acc *= n;

    return acc;
}

As we can see here, a sufficiently advanced optimizer can replace tail-recursion with iteration, which is far more efficient as you avoid function call overhead and only use a constant amount of stack space.