Why does the TeX scanner process tokens for register numbers and macro names differently?

I don't think that your description of scanning for a macro name is accurate (although I do not know what you mean by typeset:false, so that may or may not be true).

Note that scanning for macro names happens before tokenisation (or rather during it), as it is the determination of whether to make a csname token. After a character (not token) that has catcode 0 is seen, if the next character has catcode 11 then all further catcode 11 characters are scanned as part of the csname. The first non-catcode-11 character ends the name: it is consumed and not tokenised if it would make a token of catcode 10 (space); otherwise it is put back into the input stream.
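To see the effect, here is a minimal plain TeX sketch (\foo is a made-up macro, not something from your question). \message shows on the terminal what is left after tokenisation and expansion:

```tex
% \foo is a hypothetical macro used only for this illustration
\def\foo{[foo]}
\message{\foo bar}   % the space ends the csname and is not tokenised
\message{\foo{}bar}  % { ends the csname and is put back into the input
\bye
```

The first line should show [foo]bar and the second [foo]{}bar: the space after \foo never becomes a token, while the brace does.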

Numbers are part of TeX's inner execution (what the TeXbook calls the stomach), so they are parsed after tokenisation and after macro expansion. If TeX is parsing for a <number>, it expands tokens until it finds an unexpandable token that is not part of the number syntax. That syntax covers digits of catcode 12, after " also the hex letters A-F (catcode 11 or 12), and other forms such as alphabetic character code numbers like `a or `\a. The first non-expandable token that is not part of the syntax for <number> is always returned (as a token) to the input stream unless it is a space token. In your example that token is the T, so it is put back and then typeset.
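A small plain TeX sketch of this, assuming a made-up macro \digits and the scratch register \count255 (neither is from your question):

```tex
\def\digits{23}        % hypothetical macro, only for illustration
\count255=1\digits 4T  % 1, then \digits expands to 2 3, then 4; T stops the scan
\message{\the\count255}
\bye
```

This should report 1234 on the terminal. The T is not part of the number, so it goes back to the input and ends up on the page.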


Your assumptions are wrong. When doing an assignment to a count register, TeX obeys the syntax

<count register><optional equal sign><number>

where <count register> can be an internal numeric register (for instance \hyphenpenalty), a \countdef token (after \newcount\foo or a direct \countdef\foo=1) or \count<number>.

The <optional equal sign> can be empty, a space token, or an = of category code 12 surrounded by (optional) space tokens (well, it's a bit more complicated, but not really relevant).

A <number> consists of an optional radix indicator

` ' "

and then a sequence of digits in the corresponding radix. For the case of a backquote, the “digit” is a single character or a control sequence of length one (alphabetic constant). For ' the digits are among 01234567 (with category code 12); for " the digits are 0123456789ABCDEF (where the letters can have category code 11 or 12, but 0123456789 must have category code 12). Without an indicator, decimal is assumed and the admissible digits are 0123456789 with category code 12.

The <number> ends when, after macro expansion, TeX finds a token that cannot be interpreted as a suitable digit for the chosen radix. If this token is a space, it is gobbled.
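A short plain TeX sketch of the radix indicators, using the scratch register \count255 (my choice, any count register would do) and \message to print the stored value:

```tex
\count255='17  \message{\the\count255}  % octal constant
\count255="1F  \message{\the\count255}  % hexadecimal constant
\count255=`a   \message{\the\count255}  % alphabetic constant, character code of a
\count255=`\a  \message{\the\count255}  % same, written as a one-character control sequence
\bye
```

The terminal should show 15, 31, 97 and 97. Note that \a does not need to be defined for the last line to work, because the token after the backquote is not expanded.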

Your example

\def\macro{123}

\count1=\macro
4

assigns to \count1 the integer 1234, because there is nothing between \macro and 4, as spaces are ignored during tokenization after a control word (and the end-of-line character gets converted to a space during the same tokenization phase). If you want to assign 123 and print 4, you have to say

\count1=\macro\space 4

(whether 4 is on the next line is irrelevant).
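You can check both variants yourself with a plain TeX file along these lines (the labels in \message are only there to tell the two results apart):

```tex
\def\macro{123}
\count1=\macro
4
\message{first: \the\count1}
\count1=\macro\space 4
\message{second: \the\count1}
\bye
```

The first message should report 1234 and the second 123; in the second case the 4 is typeset on the page.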

The syntax for \advance is similar (instead of = there is the optional by); so in

\advance\count1 by 1000T

the search for digits will stop at T, but it would stop at ! as well.

Some more words on the assignments above. With

\count1=\macro
4

TeX sees six tokens

\count • 1 • = • \macro • 4 • ⍽

(spaces are just for clarity, • separates tokens from each other). Exactly the same would be found with

\count1=\macro4
\count1=\macro 4

because of the rules about tokenization. There is no space token after \macro. The final ⍽ denotes the space token deriving from the end-of-line; it is tokenized because it follows the character 4 rather than a control word, so it is not skipped.

Since \count is unexpandable, TeX “executes” it, which means it has to perform an assignment, so a <number> (the register number) is searched for; the following token is 1, which is a valid digit; the next token is =, so the search for digits stops and TeX knows it has to assign a value to \count1. The = is gobbled as the optional equal sign, and TeX starts searching for the <number> to assign. The first token, \macro, is expandable, so TeX expands it, getting

1 • 2 • 3 • 4 • ⍽

Now the search for digits stops at the space token ⍽ and the assignment can be performed. This space token is gobbled by rule.

Let's examine

\count1=42
\advance\count1 by1000The

and divide it into tokens:

\count•1•=•4•2•⍽•\advance•\count•1•⍽•b•y•1•0•0•0•T•h•e

(no typographic space for compactness; • still separates tokens from each other). The assignment of 42 to \count1 is performed as before and the space token is gobbled. We remain with

\advance•\count•1•⍽•b•y•1•0•0•0•T•h•e

The execution of \advance makes TeX search for an appropriate register name, which it finds as \count1 (the space token is ignored after the lookup for the register's number). Next, by is gobbled because it is an optional keyword, and we remain with

1•0•0•0•T•h•e

TeX is looking for a <number>, so it proceeds as before, with macro expansion. However, there is no macro to expand, and the search for digits stops at the first token that is not a digit for radix 10, that is, at T. The addition is performed and \count1 now holds the value 1042. Next the token T is reexamined and it will start a paragraph.
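The whole walk-through can be checked with this plain TeX file (the \message line is my addition to print the result):

```tex
\count1=42
\advance\count1 by1000The
\message{\the\count1}
\bye
```

The terminal should show 1042, and the page should contain the word "The".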

Be careful because something like

\count1="1000At this point

will assign to \count1 the value 65546 and start a paragraph with t.

Always end constants with an unexpandable token that's not a digit. A space is good for this as it will be ignored.
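A sketch of the pitfall and two ways around it in plain TeX. The \relax variant is not mentioned above; I add it as another common unexpandable terminator, with the difference that \relax is not gobbled but simply does nothing:

```tex
\count1="1000At this point         % A is read as a hex digit
\message{\the\count1}
\count1="1000 At this point        % the space ends the constant and is gobbled
\message{\the\count1}
\count1="1000\relax At this point  % \relax ends the constant as well
\message{\the\count1}
\bye
```

The messages should read 65546, 4096 and 4096.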


What do you mean by "constructing a macro name"?

At the time of reading/tokenizing input it is not important whether the control sequence token in question will denote a macro or a primitive or an undefined control sequence.

At the time of reading and tokenizing input, there are no tokens yet, thus the names of control sequence tokens do not consist of tokens of any kind. They are formed by characters (not character tokens). \string can be used for obtaining a sequence of corresponding character tokens. \csname..\endcsname offers a means of denoting control sequence tokens by means of sequences of expandable tokens and character tokens.

Always take care about when it is appropriate to use the term "token" and when it is not: if the "object of contemplation" did not come into being as a result of tokenization or expansion, it is not a token.

Control sequence tokens come in two flavours:
1. Control symbol tokens.
2. Control word tokens.

A sequence of two characters in the input where the first character's category code is 0 (escape) and the second character's category code is not 11 (letter) is taken for the identification of a control symbol token and therefore the corresponding control symbol token will be "placed" into the "stream of tokens".

A sequence of characters in the input where the first character's category code is 0 (escape) and all trailing characters of that sequence have category code 11 (letter) is taken for the identification of a control word token and therefore the corresponding control word token will be "placed" into the "stream of tokens".

IIRC the "nameless" control sequence token, denotable via \csname\endcsname or via a character whose category code is 0 (escape) at the end of a line of input in situations where no endline-character gets attached, is also a control word token.

Control sequence tokens that come into being as a result of expanding \csname..\endcsname or \ifcsname..\endcsname and whose names form character-sequences other than just a single non-catcode-11-character are also control word tokens.

After reading/tokenizing a control word token, TeX's reading apparatus will switch to state S (skipping blanks). When unexpanded-writing a control word token, TeX will write/attach an additional space character.

After reading/tokenizing a control symbol token, TeX's reading apparatus will switch to state M (middle of line). When unexpanded-writing a control symbol token, TeX will not write/attach an additional space character.

(Be aware that treatment at writing-time of a one letter control sequence token either as a control word token or as a control symbol token depends on the category code of the letter/character in question at writing-time.)
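A minimal plain TeX sketch of the writing behaviour. It assumes plain TeX's definition of \% via \chardef, so that both \relax and \% are unexpandable and get written unexpanded by \message:

```tex
% \relax is a control word, \% a control symbol; the {} pairs only mark
% where each control sequence ends in the output
\message{\relax{}\%{}}
\bye
```

The terminal should show \relax {}\%{}: a space is attached after the control word but not after the control symbol.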


The last thing that will be gathered with a \count-assignment is a <number>.

The components of quantities subsumed under the term <number> are explained in Donald E. Knuth's The TeXbook, Chapter 24: Summary of Vertical Mode.

When gathering/reading the components of a <number>, expansion takes place unless TeX is in a situation where TeX does not expand tokens. These situations are listed in Chapter 20: Definitions of The TeXbook.

Roughly speaking, you can say that the only thing that might be "consumed" in some cases of gathering components of a <number> is <one optional space>.
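A plain TeX sketch that makes the "one" in <one optional space> visible. I use \space (which expands to a space token) because consecutive space characters in the input would already be collapsed during tokenization; \count255 is just a scratch register:

```tex
A\count255=12\space B        % the single space token is consumed by the <number>
\par
A\count255=12\space\space B  % only the first space token is consumed
\bye
```

The first paragraph should come out as AB and the second as A B.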