Is there a BNF grammar of the TeX language?

Parsing TeX is Turing complete

TeX can only be parsed by a complete Turing machine (modulo the finite space available), which precludes it from having a BNF. This comes from a combination of two features: first, TeX is Turing complete (if you need proof, this Turing machine simulator should suffice); and second, TeX can redefine macros (and their parsing rules) at runtime. Since TeX can require that macros be followed by specific characters, redefining a macro can mean redefining the syntax of TeX. Combining these facts means that we can write TeX code like the following, where \f is defined to be the algorithmic representation in TeX of some arbitrary (computable) function f : ℤ → ℤ:

\def\TuringCompleteness#1#2{%
  \edef\Output{#1{#2}}%
  \ifnum\Output=0
    \def\Required!{Bang!}%
  \else
    \def\Required?{Huh?}%
  \fi}
\TuringCompleteness{\f}{0}
\Required!

Will this TeX parse? That depends on the value of f(0) / \f{0}! And since TeX is Turing-complete, computing this value may require a full Turing machine, and therefore so will determining whether or not \Required! is valid TeX syntax.


It gets worse: category codes

Perhaps, though, you think that you can still perform a rough check for code that looks like a backslash, followed by letters, followed by braced stuff. This will already get inadequate quickly, but that's not all. If delimited macros weren't bad enough, things get even worse when you realize that the reason that the characters \, {, }, $, and so on have the behavior they do is simply because TeX specifies that they have particular "category codes". The ability to redefine these means that writing TeX like

\catcode`\~=0
\catcode`\]=1
\catcode`\[=2
\catcode`\}=12
\catcode`\{=12
\catcode`\\=12

means that all subsequent (La)TeX code must be written like so:

~textbf]This is bold-faced text.[  This text contains a literal backslash
(~texttt]\[) and literal curly braces (~texttt]{}[).

to get output like so:

This is bold-faced text. This text contains a literal backslash (\) and literal curly braces ({}).


Constructive Proof that TeX is (at least) Context-Sensitive

Additionally, inspired by TH.'s answer, here's a specific constructive proof that TeX is not context-free. And note that it doesn't have to rely on catcode trickery—just TeX's ability to define arbitrarily-delimited macros. (Yes, this is implied by the first section, but it's still nice to see a specific case in action.)

\newcount\nesting

\def\RecognizeABC#1{\begingroup
  \nesting=0
  \OneStep#1\relax
  \ifnum\nesting=1
    There was \the\nesting\ copy of each control sequence.\par
  \else
    There were \the\nesting\ copies of each control sequence.\par
  \fi
\endgroup}

\def\CheckFinished#1{%
  \ifx#1\relax
    \def\OneStep#1{\relax}%
  \fi
  \OneStep#1}

\def\OneStep\a#1\b#2\c#3\relax{%
  \advance\nesting by 1
  \CheckFinished#1#2#3\relax}

This code is set up to recognize what Wikipedia calls "the canonical non-context free language" of n copies of \a followed by n copies of \b followed by n copies of \c, where n is at least one. The first line just allocates a counter; we'll use this to keep track of how many copies of the control sequences we see, but it's just so we can display some text at the end, not for parsing. To attempt recognition of some text, call \RecognizeABC{...}. This simply sets \nesting to zero (for, again, output purposes), and calls \OneStep. \OneStep is set up to take three arguments delimited by \a, \b, and \c. The first argument must come between \a and \b, the second between \b and \c, and the third between \c and \relax. If there are the same number of each of \a, \b, and \c, then this will eventually be \OneStep\a\b\c\relax and all the arguments will be empty. With \OneStep, we just record the fact that there's one more of each \a, \b, and \c (which is, again, just for output), and then we see if we're done. \CheckFinished looks to see what the next token in the input stream is. If it's \relax, then the arguments were empty, and we're done; redefine \OneStep not to do anything. Now, call \OneStep—whether it's the parsing macro or nothing at all—on the remaining input stream, recursing. Once we're done, we finally finish \RecognizeABC, where we typeset how many copies of each control sequence we saw.

Phew! If you didn't follow all that, here's what a trace looks like, if we only look at the important macros:

[START] \RecognizeABC{\a\a\a\b\b\b\c\c\c}
    --> \OneStep\a\a\a\b\b\b\c\c\c\relax
                  -#1-  -#2-  -#3-
    --> \CheckFinished\a\a\b\b\c\c\relax
    --> \OneStep\a\a\b\b\c\c\relax
                  #1  #2  #3
    --> \CheckFinished\a\b\c\relax
    --> \OneStep\a\b\c\relax
          (All the arguments are empty.)
    --> \CheckFinished\relax
    --> ``There were 3 copies of each control sequence.''

Thus, the following TeX

\RecognizeABC{\a\b\c}
\RecognizeABC{\a\a\b\b\c\c}
\RecognizeABC{\a\a\a\b\b\b\c\c\c}

Yields the following output:

There was 1 copy of each control sequence.
There were 2 copies of each control sequence.
There were 3 copies of each control sequence.

But TeX such as

\RecognizeABC{}
\RecognizeABC{\a}
\RecognizeABC{\a\a\b\b\c\c\c}
\RecognizeABC{\a\b\c\garbage}

fails.


Edit 1: In addition to minor bugfixes/cleanup and breaking the answer into sections with headings, I expanded the rationale for TeX being unparsable by anything short of a Turing machine to hopefully make it slightly more clear. This involved removing the previous \weird example, replacing it with something that's more of a proof. If you'd rather have a concrete example (but non-proof) with a command which redefines itself, consider the following TeX, which was there before:

\def\weird[#1][#2]{%
  \ifx#1\relax
    \def\weird!{I was previously called with a relaxing first argument.}%
  \else\ifx#2\relax
    \def\weird(##1){I can only be called with one argument; I was given (##1).}%
  \fi\fi
  I was called with [#1] and [#2].}

To start, \weird must be called as \weird[stuff][other stuff]. But if its first argument is \relax, then it redefines itself to require being called as \weird!; if its second argument is nothing, it redefines itself to require being called as \weird(stuff); and it always typesets text. This can't be captured by a BNF grammar (I don't think). And of course, in the general case, we need a Turing machine.


Just to add another angle to the excellent posts by TH. and Antal, is that besides TeX being a Turing complete machine, it can also be coerced to do lambda calculus, the basis of all functional languages. On CTAN there is a package called lambda developed by Alan Jeffrey that implements, among other lambda concepts, unbounded lists. Not only does TeX do all these, but impressively, the routines were implemented in TeX's mouth, proving that TeX's mouth is as powerful as any computer anywhere.

To understand the relationship between the lambda calculus and Turing machines one needs to refer to the Entscheidungsproblem which is one of the famous 23 problems proposed by David Hilbert.

In 1936 and 1937 Alonzo Church and Alan Turing respectively, published independent papers showing that it is impossible to decide algorithmically whether statements in arithmetic are true or false, and thus a general solution to the Entscheidungsproblem is impossible.

This was done by Alonzo Church in 1936 with the concept of "effective calculability" based on his λ calculus and by Alan Turing in the same year with his concept of Turing machines. It was later recognized that these are equivalent models of computation. The work of both authors was heavily influenced by Kurt Gödel's earlier work on his incompleteness theorem - Wikipedia

So the lambda calculus, Turing machines and TeX are not just closely related but they are equivalent models of computation and this is one of many aspects that makes TeX and friends interesting!


It's extremely unlikely. TeX can change how it parses input while it's in the middle of reading the input. I'd have to think a bit longer, but I think that precludes TeX from having a context-free grammar and thus it could have no BNF.