Does Pharo provide tail-call optimisation?

There is no mystery in the execution model of Pharo. The recursive fragment

^ self * (self - 1) factorial

that happens inside the second ifTrue: compiles to the following sequence of bytecodes:

39 <70> self                  ; receiver of outer message *
40 <70> self                  ; receiver of inner message -
41 <76> pushConstant: 1       ; argument of self - 1
42 <B1> send: -               ; subtract
43 <D0> send: factorial       ; send factorial (nothing special here!) 
44 <B8> send: *               ; multiply
45 <7C> returnTop             ; return

Note that in line 43 nothing special happens: the code sends factorial exactly as it would send any other selector. In particular, we can see that there is no special manipulation of the stack here.

This doesn't mean that there can be no optimizations in the underlying native code, but that is a different discussion. It is the execution model that matters to the programmer, because any optimization beneath the bytecodes is meant to support this model at the conceptual level.
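You can check this yourself: a CompiledMethod can print its own bytecode listing. A minimal sketch (assuming the recursive version is installed as Integer>>#factorial; note that the exact printing selector varies between images — #symbolic exists in Squeak and older Pharo, while recent Pharo images offer #symbolicBytecodes):

```smalltalk
"Print the symbolic bytecode of the method to the Transcript"
Transcript show: (Integer >> #factorial) symbolic; cr.

"Or simply inspect the compiled method and browse its bytecode"
(Integer >> #factorial) inspect.
```

Either way, you will find a plain send of factorial followed by further sends, exactly as in the listing above.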

UPDATE

Interestingly, the non-recursive version

factorial2
  | f |
  f := 1.
  2 to: self do: [:i | f := f * i].
  ^f

is a little bit slower than the recursive one (in Pharo). The reason is presumably that the overhead of incrementing i is slightly greater than that of the recursive send mechanism.

Here are the expressions I tried:

[25000 factorial] timeToRun
[25000 factorial2] timeToRun

IMHO, the initial code, which is presumed to make a tail-recursive call to factorial

factorial
        "Answer the factorial of the receiver."

        self = 0 ifTrue: [^ 1].
        self > 0 ifTrue: [^ self * (self - 1) factorial].
        self error: 'Not valid for negative integers'

is not actually tail-recursive. The bytecode reported in Leandro's reply proves it:

39 <70> self                  ; receiver of outer message *
40 <70> self                  ; receiver of inner message -
41 <76> pushConstant: 1       ; argument of self - 1
42 <B1> send: -               ; subtract
43 <D0> send: factorial       ; send factorial (nothing special here!) 
44 <B8> send: *               ; multiply
45 <7C> returnTop             ; return

Before the returnTop there is a send of * instead of factorial, so the recursive call is not in tail position. I would have written the method with an accumulator, as

factorial: acc
    ^ self = 0
        ifTrue: [ acc ]
        ifFalse: [ self - 1 factorial: acc * self ]

which compiles so that the send of factorial: is the last send before the returnTop, i.e., a call in tail position.
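To try the accumulator version yourself, you can compile it into Integer from a Playground and call it with 1 as the initial accumulator (the selector factorial: is my own choice here, not a system selector):

```smalltalk
"Compile the hypothetical accumulator method into Integer"
Integer compile: 'factorial: acc
    ^ self = 0
        ifTrue: [ acc ]
        ifFalse: [ self - 1 factorial: acc * self ]'.

"Unary binds tighter than binary, binary tighter than keyword,
 so the recursion parses as (self - 1) factorial: (acc * self)"
5 factorial: 1. "evaluates to 120"
```

Note that even though the recursive send is syntactically in tail position, the VM still creates a fresh context per send — the precedence rules merely guarantee the send is the last operation before the return.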

Btw,

n := 10000.
[n slowFactorial] timeToRun.
[n factorial] timeToRun.
[n factorial: 1] timeToRun.

The first and second take 29 millisecs each; the last one takes 595 millisecs on a fresh Pharo 9 image. Why so slow?


It's a really deep stack. Or rather, no stack at all.

Pharo is a descendant of Squeak, which inherits its execution semantics directly from Smalltalk-80. There is no linear fixed-size stack; instead, every method call creates a new MethodContext object, which provides the space for arguments and temporary variables in each recursive call. It also points to the sending context (for later return), creating a linked list of contexts (which the debugger displays just like a stack). Context objects are allocated on the heap like any other object, so call chains can be very deep: all available memory can be used. You can inspect thisContext to see the currently active method context.
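Since each context answers its sender, walking the chain is just following a linked list. A small sketch you can evaluate in a Playground (variable names are my own):

```smalltalk
"Count the depth of the current call chain by walking sender links"
| context depth |
context := thisContext.
depth := 0.
[ context notNil ] whileTrue: [
    depth := depth + 1.
    context := context sender ].
depth "the number of contexts from here back to the process root"
```

Run inside a deep recursion, this count grows with every recursive send — which is exactly why the timings above are dominated by context creation.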

Allocating all these context objects is expensive. For speed, modern VMs (such as the Cog VM used in Pharo) do actually use a stack internally, which consists of linked pages, so it can be arbitrarily large as well. The context objects are only created on demand (e.g. while debugging) and refer to the hidden stack frames and vice versa. This machinery behind the scenes is quite complex, but fortunately hidden from the Smalltalk programmer.