How can a compiler compile itself?

The first version of a compiler can't be written in the language it compiles, because no compiler for that language exists yet to build it; your confusion is understandable. A later version of the compiler with more language features (with its source rewritten in the first version of the new language) can be built by the first compiler. That version can then compile the next compiler, and so on. Here's an example:

  1. The first CoffeeScript compiler is written in Ruby, producing version 1 of CoffeeScript
  2. The source code of the CS compiler is rewritten in CoffeeScript 1
  3. The original CS compiler compiles the new code (written in CS 1) into version 2 of the compiler
  4. Changes are made to the compiler source code to add new language features
  5. The second CS compiler (the first one written in CS) compiles the revised new source code into version 3 of the compiler
  6. Repeat steps 4 and 5 for each iteration

Note: I'm not sure exactly how CoffeeScript versions are numbered; the version numbers above are just for illustration.

This process is usually called bootstrapping. Another example of a bootstrapping compiler is rustc, the compiler for the Rust language.
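The staged process above can be modelled in a few lines of C. This is only a toy sketch, not how any real compiler is built: the names `CompilerSource`, `CompilerBinary`, `build` and the `FEAT_*` flags are hypothetical, and a "compiler" is reduced to the set of language features it accepts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical language features, one bit each. */
enum {
    FEAT_BASE  = 1u << 0,  /* what the very first compiler accepts */
    FEAT_NEW_1 = 1u << 1,  /* a feature added in a later iteration */
    FEAT_NEW_2 = 1u << 2   /* a feature nobody has implemented yet */
};

/* Source code of a compiler, reduced to two feature sets. */
typedef struct {
    unsigned uses;        /* features the compiler's own source is written with */
    unsigned implements;  /* features the built compiler will accept */
} CompilerSource;

/* A built compiler: only the set of features it accepts matters here. */
typedef struct {
    unsigned accepts;
} CompilerBinary;

/*
 * Build `src` with the existing compiler `cc`.
 * Fails (returns false, leaves *out untouched) if the source uses a
 * feature that `cc` does not accept yet.
 */
bool build(CompilerBinary cc, CompilerSource src, CompilerBinary *out)
{
    assert(out != NULL);
    if ((src.uses & ~cc.accepts) != 0)
        return false;
    out->accepts = src.implements;
    return true;
}
```

In this model, step 3 is `build(ruby_built, v2_src, &v2)` and step 5 is `build(v2, v3_src, &v3)`, where `v3_src` still uses only `FEAT_BASE` but implements `FEAT_BASE | FEAT_NEW_1`. A source that already uses a feature the current compiler lacks is rejected, which is why each new feature has to be implemented one iteration before the compiler's own source can rely on it.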


You have already gotten a very good answer, but I want to offer a different perspective that will hopefully be enlightening. Let's first establish two facts that we can both agree on:

  1. The CoffeeScript compiler is a program which can compile programs written in CoffeeScript.
  2. The CoffeeScript compiler is a program written in CoffeeScript.

I'm sure you can agree that both #1 and #2 are true. Now, look at the two statements. Do you see now that it is completely normal for the CoffeeScript compiler to be able to compile the CoffeeScript compiler?

The compiler doesn't care what it compiles. As long as it's a program written in CoffeeScript, it can compile it. And the CoffeeScript compiler itself just happens to be such a program. The CoffeeScript compiler doesn't care that it's the CoffeeScript compiler itself it is compiling. All it sees is some CoffeeScript code. Period.

How can a compiler compile itself, or what does this statement mean?

Yes, that's exactly what that statement means, and I hope you can see now how that statement is true.


In the paper Reflections on Trusting Trust, Ken Thompson, one of the originators of Unix, writes a fascinating (and easily readable) overview of how the C compiler compiles itself. Similar concepts can be applied to CoffeeScript or any other language.

The idea of a compiler that compiles its own code is vaguely similar to a quine: source code that, when executed, produces as output the original source code. Here is one example of a CoffeeScript quine. Thompson gave this example of a C quine:

char s[] = {
    '\t',
    '0',
    '\n',
    '}',
    ';',
    '\n',
    '\n',
    '/',
    '*',
    '\n',
    … 213 lines omitted …
    0
};

/*
 * The string s is a representation of the body
 * of this program from '0'
 * to the end.
 */

main()
{
    int i;

    printf("char\ts[] = {\n");
    for(i = 0; s[i]; i++)
        printf("\t%d,\n", s[i]);
    printf("%s", s);
}

Next, you might wonder how the compiler is taught that an escape sequence like '\n' represents ASCII code 10. The answer is that somewhere in the C compiler, there is a routine that interprets character literals, containing some conditions like this to recognize backslash sequences:

…
c = next();
if (c != '\\') return c;        /* A normal character */
c = next();
if (c == '\\') return '\\';     /* Two backslashes in the code means one backslash */
if (c == 'r')  return '\r';     /* '\r' is a carriage return */
…

So, we can add one condition to the code above…

if (c == 'n')  return 10;       /* '\n' is a newline */

… to produce a compiler that knows that '\n' represents ASCII 10. Interestingly, that compiler, and all subsequent compilers compiled by it, "know" that mapping, so in the next generation of the source code, you can change that last line into

if (c == 'n')  return '\n';

… and it will do the right thing! The 10 comes from the compiler, and no longer needs to be explicitly defined in the compiler's source code.[1]

That is one example of a C language feature that was implemented in C code. Now, repeat that process for every single language feature, and you have a "self-hosting" compiler: a C compiler that is written in C.
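To make the two generations concrete, here is a self-contained sketch of such a routine. It is not the code of any real C compiler: the function names, the `next_char` helper and the string-plus-position interface are assumptions made for illustration, and it assumes an ASCII execution character set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the next character, or 0 at the end of src. */
static int next_char(const char *src, size_t *pos)
{
    int c = (unsigned char)src[*pos];
    if (c != '\0')
        (*pos)++;
    return c;
}

/* Generation 1: the newline code is spelled out as the number 10. */
int decode_escape_gen1(const char *src, size_t *pos)
{
    int c;
    assert(src != NULL && pos != NULL);
    c = next_char(src, pos);
    if (c != '\\') return c;      /* A normal character */
    c = next_char(src, pos);
    if (c == '\\') return '\\';   /* Two backslashes mean one backslash */
    if (c == 'n')  return 10;     /* '\n' is a newline, given explicitly */
    return c;                     /* Unknown escapes pass through in this sketch */
}

/* Generation 2: identical, except that it relies on the compiler that
 * builds it to already know what '\n' means. */
int decode_escape_gen2(const char *src, size_t *pos)
{
    int c;
    assert(src != NULL && pos != NULL);
    c = next_char(src, pos);
    if (c != '\\') return c;
    c = next_char(src, pos);
    if (c == '\\') return '\\';
    if (c == 'n')  return '\n';   /* The value now comes from the host compiler */
    return c;
}
```

Built with any existing C compiler on an ASCII system, both functions decode the same input to the same values; the only difference is where the number 10 is written down.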


[1] The plot twist described in the paper is that since the compiler can be "taught" facts like this, it can also be mis-taught to generate trojaned executables in a way that is difficult to detect, and such an act of sabotage can persist in all compilers produced by the tainted compiler.


How can a compiler compile itself, or what does this statement mean?

It means exactly that. First of all, some things to consider. There are four objects we need to look at:

  • The source code of any arbitrary CoffeeScript program
  • The (generated) assembly of any arbitrary CoffeeScript program
  • The source code of the CoffeeScript compiler
  • The (generated) assembly of the CoffeeScript compiler

Now, it should be obvious that you can use the generated assembly - the executable - of the CoffeeScript compiler to compile any arbitrary CoffeeScript program, and generate the assembly for that program.

Now, the CoffeeScript compiler itself is just an arbitrary CoffeeScript program, and thus, it can be compiled by the CoffeeScript compiler.

It seems that your confusion stems from the fact that when you create your own new language, you don't yet have a compiler you can use to compile your compiler. This surely looks like a chicken-and-egg problem, right?

Enter the process called bootstrapping.

  1. You write a compiler in an already existing language (in the case of CoffeeScript, the original compiler was written in Ruby) that can compile a subset of the new language
  2. You write a compiler that can compile a subset of the new language in the new language itself. You can only use language features the compiler from the step above can compile.
  3. You use the compiler from step 1 to compile the compiler from step 2. This leaves you with an assembly that was originally written in a subset of the new language, and that is able to compile a subset of the new language.

Now you need to add new features. Say you have only implemented while-loops, but also want for-loops. This isn't a problem, since any for-loop can be rewritten as a while-loop. For now you can only use while-loops in the source code of your compiler, since the assembly you have at hand can only compile those. But you can write functions inside your compiler that parse and compile for-loops. Then you use the assembly you already have to compile the new compiler version. And now you have an assembly of a compiler that can also parse and compile for-loops! You can now go back to the source file of your compiler and rewrite any while-loops you don't want into for-loops.
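As a small illustration of that rewrite, here is the same loop in both forms. It is written in C rather than CoffeeScript to match the other code in this thread, and the function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* The loop as you would like to write it once for-loops are supported. */
int sum_with_for(const int *values, size_t count)
{
    int total = 0;
    assert(values != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}

/* The same loop rewritten by hand so that it only needs while-loops. */
int sum_with_while(const int *values, size_t count)
{
    int total = 0;
    size_t i = 0;                 /* the for-loop's initializer */
    assert(values != NULL || count == 0);
    while (i < count) {           /* the for-loop's condition */
        total += values[i];       /* the for-loop's body */
        i++;                      /* the for-loop's step */
    }
    return total;
}
```

Both functions return the same result for any input. Note that this mechanical rewrite only holds as-is for loop bodies without `continue`, which in the while form would skip the step.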

Rinse and repeat until all language features that are desired can be compiled with the compiler.

while and for were obviously only examples, but this works for any new language feature you want. And then you are in the situation CoffeeScript is in now: the compiler compiles itself.

There is much literature out there. Reflections on Trusting Trust is a classic everyone interested in that topic should read at least once.
