Conversion of space characters into space tokens

Usually TeX does process input line by line:

The whole line is read and the whole line is pre-processed.

  • One step of pre-processing the whole line is:
    All characters of the character-sequence which forms that line are converted from the computer-platform's character-representation-scheme to the TeX-engine's internal character-representation-scheme.
    The computer-platform's character-representation-scheme might be whatsoever character-encoding. With modern computers this often is Unicode (and often the transformation-format is UTF-8). With older machines, e.g., running under MS-DOS, this might be some 8bit encoding/byte-encoding whereof ASCII (American Standard Code for Information Interchange) is a subset; e.g., when running under Win95/Win98/NT, this might be, e.g., Windows-1252 or iso-8859-1/iso-8859-15 or whatever.
    With traditional TeX-engines the TeX-engine's internal character-representation-scheme is ASCII. With XeTeX- and with LuaTeX-enginges the TeX-engine's internal character-representation-scheme is Unicode (whereof ASCII is a subset).
  • Another step of pre-processing the whole line is:
    All spaces, i.e., all characters whose code-point has the number 32 in the TeX-engine's internal character-representation-scheme/in ASCII/in Unicode, that occur at the right end of the line, get removed.
  • Yet another step of pre-processing the whole line is:
    At the right end of the line a character gets appended whose code-point-number in the TeX engine's internal character-representation-schmeme equals the value of the integer-parameter \endlinechar.
  • The reading-apparatus is switched to state N(new line).

After pre-processing TeX starts tokenizing the pre-processed line.

This means TeX "looks" at the pre-processed line character by character and hereby takes the character-sequence as a set of directives for appending tokens to the token-stream. Hereby category-codes of characters play a rôle.

["Looking" at the pre-processed line character by character and appending tokens to the token-stream takes place "on demand", i.e., only when TeX needs tokens while the token-stream is empty. E.g., when the token-stream is empty while gathering macro-arguments or a ⟨balenaced text⟩, or when "looking" whether there is more work to do as no command for ending the job—something like (plain TeX) \bye or \end or (LaTeX) \stop or \end{document}—has been encountered yet.
On the one hand assigning another value to the integer-parameter \endlinechar does affect the pre-processing of lines of input. Thus an assignment to \endlinechar does not affect the line of input wherein it occurs (but only subsequent lines) because obviously that line is already pre-processed at the time when the assignment is carried out.
On the other hand changing category-codes may affect tokenization of things while tokenization takes place on demand after pre-processing. Therefore changing category-codes might affect the tokenizing of things that (even in the current line) appear right after the assignment for changing the category-codes.
Changing the category-code of the "endline-character" may affect how the (during the pre-processing of the current line already appended) "endline-character" of the current line gets tokenized.

You can, e.g., type "I must not talk in class!" ten times by assigning \endlinechar a nice value and making the corresponding character active and defining that active character to deliver a horizontal box holding the phrase "I must not talk in class!" and then adding ten empty lines to the .tex-input (by hitting return ten times while typeing the sourecode), yielding the insertion of ten endline-characters during compilation as each of these ten empty lines gets pre-processed—notice that the \endlinechar-assignment does not affect the line wherein it occurs (but only subsequent lines) because that line is already pre-processed at the time when that \endlinechar-assignment is carried out. Each of the ten inserted endline-characters in turn gets tokenized as the mentioned active character delivering the horizontal box with the phrase "I must not talk in class!" :

\begingroup
%  Let's make 'A' active:
\catcode`\A=13 %
% Let's have a scratch-counter for counting how many times
% the phrase "I must not talk in class!" is written:
\newcount\scratchcount
% Let's define the active-'A' to do some counting and to
% deliver the line "I must not talk in class!":
\def A{%
   % Ensure vertical mode:
   \ifvmode\else\par\fi
   % Increment the scratch-counter and place the line/
   % the horizontal box:
   \advance\scratchcount by 1 %
   \hbox{\number\scratchcount.\null\ I must not talk in class!}%
}%
% Make the character 'A' the endline-character:
\endlinechar=`\A\relax
% (The \endlinechar-assignment in the line above does not affect
% that line. It does affect subsequent lines only. It does not
% lead to appending the character 'A' to that line as at the time 
% of carrying out that assignment in TeX's stomach, that line is 
% already pre-processed with the old value of \endlinechar (which
% is 13, denoting the return-character) ). 
% 
% Now let's have ten empty lines, yielding ten endline-characters
% 'A' whereof each gets tokenized as active-'A' expanding to the
% directives for doing some counting and delivering the line with
% the phrase "I must not talk in class!".










\endgroup%
% The comment-char at the end of the line above must be as the line
% above obviously gets pre-processed _before_ carrying out \endgroup
% and thus it also will have an endlinechar-'A' appended. 
% Without the comment-char that 'A' would--as at the time of gathering
% the characters that form the name of the control-word-token '\endgr...'
% the  character 'A' is not of category-code 11(letter)--not be taken for 
% something that belongs to the name of that "\endgr..."-control-word-token
% and therefore would trigger termination of gathering the name of the
% '\endgr...'-control-word-token and would be put back into the input
% stream.
% After processing/carrying out the control-word-token '\endgroup', 'A'
% is of category-code 11(letter).
% Therefore processing/tokenizing the 'A' that was put back into the
% input-steam would yield an 'A'-character-token of category-code
% 11(letter), at some later stage of processing yielding a glyph 'A'
% within the output-file/within the .dvi- or .pdf-file.
%
% Now let's get the token '\bye' in a funny way:
\endlinechar=`e
\by

enter image description here

]

Let's look at your code:

Line 1:  \def\foo#1{(#1)\baz}%
Line 2:  \def\baz{baz}%
Line 3:  \foo{bla} Bar
Line 4:  \bye

Line 1 and 2 are code-lines without spaces, thus here no space-tokens come into being. We don't go into details here. Each of these lines ends with a percent-character while the percent-character has category-code 14(comment). With each of these lines due to the integer-parameter \endlinechar having the value 13 (13 denotes the return character in the TeX-engine's internal character-representation-scheme/in ASCII/in Unicode) a return-character will be appended behind that percent-character at the stage of pre-processing. But at the stage of tokenizing, characters of category-code 14(comment) (when not being taken for the name of a control-symbol-token) cause TeX to cease tokenizing the current line of input and to start processing the next line of input if present. Thus a percent-character within a line of input doesn't yield appending a token to the token-stream at all but causes TeX to silently "drop" it and that line of input's remainig characters. As the return-character appended due to \endlinechar also belongs to the remaining characters of that line of input, it is silently dropped as well.

Line 3 is pre-processed (by TeX's eyes) as follows:

The line is read and its single characters are converted to the TeX-engine's internal character-representation-scheme.

There are no spaces at the right end of the line. Thus there are no spaces at the right end of the line to remove.

Due to \endlinechar (usually) having the value 13 while 13 is the number of the code-point of the return-character in ASCII/in Unicode/in the TeX-engine's internal character-representation-scheme, (usually) a return-character is inserted behind the last character of the line, which is r. Usually the return-character has the category-code 5(end of line).

When TeX (in its mouth) begins tokenizing the pre-processed line, the reading-apparatus is switched to state N(new line).
(When the reading-apparatus is in state N(new line), then

  • space-characters do not yield appending tokens to the token-stream at all but are simply dropped, and
  • a character of category-code 5(end of line) yields appending the control-word-token \par to the token-stream and also causes TeX to cease tokenizing the remaining characters of the current line/and also causes TeX to drop the remaining characters of the current line and to start processing the next line of input if present.

)

Thus TeX's mouth by and by, i.e., whenever tokens are needed, tokenizes the pre-processed line/the pre-processed input-character-sequence (now converted to the TeX-engine's internal character-representation-scheme)

\foo{bla}⟨space-character⟩Bar⟨return-character⟩

as follows:

  • Control-word-token \foo. (After appending a control-word-token to the token-stream, the reading-apparatus is switched to state S(skipping blanks).)

    As \foo is a macro which processes an argument, the argument needs to be obtained by tokenizing some more input:

  • Explicit character-token { (opening curly brace) of category-code 1(begin group). (After appending an explicit character-token which is not of category-code 10(space) or after appending a control-symbol-token differing from control-space (), the reading-apparatus is switched to state M(middle of line).)

  • Explicit character-token b of category-code 11(letter). (After appending an explicit character-token which is not of category-code 10(space) or after appending a control-symbol-token differing from control-space (), the reading-apparatus is switched to state M(middle of line).)
  • Explicit character-token l of category-code 11(letter). (After appending an explicit character-token which is not of category-code 10(space) or after appending a control-symbol-token differing from control-space (), the reading-apparatus is switched to state M(middle of line).)
  • Explicit character-token a of category-code 11(letter). (After appending an explicit character-token which is not of category-code 10(space) or after appending a control-symbol-token differing from control-space (), the reading-apparatus is switched to state M(middle of line).)
  • Explicit character-token } (closing curly brace) of category-code 2(end group). (After appending an explicit character-token which is not of category-code 10(space) or after appending a control-symbol-token differing from control-space (), the reading-apparatus is switched to state M(middle of line).)

  • Thus the following tokens are now send down from TeX's mouth to TeX's stomach—on their way to the stomach the tokens go through TeX's gullet, where expansion takes place:
    \foo(control-word-token){1(begin-group)b11(letter)l11(letter)a11(letter)}2(end group)
    while TeX's mouth still holds the remaining pre-processed input-character-sequence
    ⟨space-character⟩Bar⟨return-character⟩.

  • Expansion of these tokens while passing TeX's gullet yields:

    \foo requires a non-delimited argument. Explicit space-tokens preceding a non-delimited macro argument get discarded while gathering the tokens that form the argument. (A non-delimited argument either is a single token (which neither is an explicit space-token nor is an explicit character-token of category-code 1(begin group) nor is an explicit character-token of category-code 2(end group) nor is an \outer-token) or consists of a pair of matching curly braces (opening brace and closing brace) wherein a brace-balanced set of non-\outer-tokens is nested. That brace-balanced set of tokens can be "empty".) If present, a pair of matching curly braces that surrounds an entire macro argument (be it a delimited or a non-delimited macro argument) gets discarded when delivering the macro's replacement-text.
    Expansion of \foo yields the following replacement:

    (12(other)b11(letter)l11(letter)a11(letter))12(other)\baz(control-word-token)

    The mouth still holds the remaining pre-processed input-character-sequence
    ⟨space-character⟩Bar⟨return-character⟩.

  • While these tokens are slipping down the gullet, the expandable control-word-token \baz gets expanded as well—the following tokens reach the stomach of TeX:

    (12(other)b11(letter)l11(letter)a11(letter))12(other)b11(letter)a11(letter)z11(letter)

    Processing these tokens in the stomach (where assignments take place and boxes are bulit and paragraphs are split across lines and lines are placed on pages etc) yields switching to horizontal mode and to adding the glyph-sequence
    (bla)baz
    to the horizontal list from which the next line of text for the output-file/the .pdf-file is to be constructed.

    TeX's mouth still holds the remaining pre-processed input-character-sequence
    ⟨space-character⟩Bar⟨return-character⟩.

  • There is no indication that the job is to be finished, so TeX keeps its digestive processes going:

    The reading-apparatus is neither in state N(new line) nor in state S(skipping blanks) but is in state M(middle of line) and TeX is not gathering the name of a control-symbol-token. So from the remaining pre-processed input-character-sequence in its mouth
    ⟨space-character⟩Bar⟨return-character⟩
    it tokenizes the ⟨space-character⟩ as an explicit space-token (character-code 32, category-code 10(space)) and appends that to the token-stream/sends that down its gullet towards the stomach.
    (After appending an explicit character-token of category-code 10(space) or after appending a control-space (), the reading-apparatus is switched to state S(skipping blanks).)
    As TeX is in horizontal mode, the space-token in the stomach causes TeX to add horizontal glue to the horizontal list which in turn (if not discarded for some reason) yields visible horizontal empty space in the .pdf-output-file.

    TeX's mouth holds the remaining pre-processed input-character-sequence
    Bar⟨return-character⟩.

  • There is no indication that the job is to be finished, so TeX keeps its digestive processes going:

    From the remaining pre-processed input-character-sequence in its mouth it tokenizes the explicit character-token B of category-code 11(letter) and sends that down its gullet toward the stomach. (After appending an explicit character-token which is not of category-code 10(space) or after appending a control-symbol-token differing from control-space (), the reading-apparatus is switched to state M(middle of line).)

    TeX's mouth holds the remaining pre-processed input-character-sequence
    ar⟨return-character⟩.

  • There is no indication that the job is to be finished, so TeX keeps its digestive processes going:

    From the remaining pre-processed input-character-sequence in its mouth it tokenizes the explicit character-token a of category-code 11(letter) and sends that down its gullet toward the stomach. (After appending an explicit character-token which is not of category-code 10(space) or after appending a control-symbol-token differing from control-space (), the reading-apparatus is switched to state M(middle of line).)

    TeX's mouth holds the remaining pre-processed input-character-sequence
    r⟨return-character⟩.

  • There is no indication that the job is to be finished, so TeX keeps its digestive processes going:

    From the remaining pre-processed input-character-sequence in its mouth it tokenizes the explicit character-token r of category-code 11(letter) and sends that down its gullet toward the stomach. (After appending an explicit character-token which is not of category-code 10(space) or after appending a control-symbol-token differing from control-space (), the reading-apparatus is switched to state M(middle of line).)

    TeX's mouth holds the remaining pre-processed input-character-sequence
    ⟨return-character⟩.

  • There is no indication that the job is to be finished, so TeX keeps its digestive processes going:

    As TeX is not gathering the name of a control-symbol-token and as the reading-apparatus is in state M(middle of line) while the return-character has category-code 5(end of line), TeX will append to the token-stream and send down its gullet an explicit space-token (character-code 32, category-code 10(space)).

    (If TeX were encountering a character of category-code 5(end of line) while the reading-apparatus was in state N(new line) and TeX was not gathering the name of a control-symbol-token, then TeX would append the control-word-token \par to the token-stream.
    That's why under normal circumstances

    • empty lines in the source-code and
    • lines in the source-code containing nothing but space-characters and
    • lines in the source-code containing nothing but a mixture of characters of category-code 9(ignore) and 10(space), that mixture probably trailed by some space-characters

    yield the control-word-token \par. (With each of these cases none of the characters (if present) in that line leads to inserting a token into the token-stream, thus the reading-apparatus still is in state N when encountering the return-character of category code 5(end of line) that got inserted due to the value of \endlinechar at the right end of the line at the stage of pre-processing the line.)

    If TeX were encountering a character of category-code 5(end of line) while the reading-apparatus was in state S(skipping blanks) and TeX was not gathering the name of a control-symbol-token, then TeX would not append a token at all to the token-stream.)

    When encountering a character of category-code 5(end of line) while not gathering the name of a control-symbol-token TeX in any case ceases tokenizing the current line, i.e., drops any remaining characters on the current line, and starts processing the next line if present.

  • There is no indication that the job is to be finished, so TeX keeps its digestive processes going:
    There are no more characters left in the mouth, so TeX's eyes start pre-processing the next line of input. The reading-apparatus is switched to state N(new line). The single characters of the pre-processed line go to TeX's mouth on demand where tokens are formed on demand. Tokens are sent from TeX's mouth towards TeX's stomach on demand. Hereby they pass TeX's gullet where expandable tokens get expanded/get replaced by their replacement-text. In the stomach assignments take place and boxes are bulit and paragraphs are split across lines and lines are placed on pages etc...


characters are normally tokenized into a character token, using the current catcode settings, but after a character of catcode 0 is seen, it is not tokenized and the following characters are used to make a csname token.

In this case the following character is b of catcode 11 so tex will read all the following catcode 11 characters upto and including the first non-catcode 11 character, or end of line.

So here the sequence of catcode 11 characters is baz and will make a csname token with name baz the non-catcode11 character that was used to terminate the csname scan is returned to the input stream (as a character, still untokenised) unless it it is catcode 10 space character in which case it is discarded, and tex goes into its skipping blanks state, so that any following spaces are also discarded. If the scan was terminated by end of line then tex goes straight to its beginning of line state without adding the token that usually produces a space at ends of lines, and all spaces at the beginning of the next line will be discarded as usual.

so in your case the characters after \baz are } in the first definition, { in the second definition so no special space handling is involved, just in your later suggested use of explicit (bla)\baz Bar the non-catcode 11 character is a space and discarded.

When macros are expanded the replacement texts are list of tokens so none of this character to token or catcode lookup is involved at all.


Let me modify your code

\def\foo#1{(#1)\baz}
\def\baz{baz}

\foo{bla} Bar\baz Gnu

\bye

The definitions are actually irrelevant. When TeX reads the input it tokenizes it; so let's count the tokens in the relevant line:

\foo{1b11l11a11}210 • B11a11r11\bazG11n11u1110

I also added the category codes, when possible; control sequence tokens have no category code. The last space token is generated by the end line.

There is no space token after \baz, because spaces are ignored after control words during the process of tokenization.

Now TeX starts expanding macros, starting from the left. Since \foo is a one-argument macro and is followed by {1, the argument is everything up to the matching }2. Thus TeX removes all these tokens and replaces them with the replacement text saved at definition time:

(12b11l11a11)12\baz10 • B11a11r11\bazG11n11u1110

The tokens up to \baz are passed to the next stage, leaving

\baz10 • B11a11r11\bazG11n11u1110

Now \baz is a no-argument macro, so no lookup for undelimited arguments is done, which would ignore spaces; the replacement leaves

b11a11z1110 • B11a11r11\bazG11n11u1110

Note that TeX is not doing tokenization at this stage, so spaces after control sequences are not ignored.

When the macro replacement is performed, TeX uses already formed tokens; so \baz at the beginning of the third shown token list is actually the “internal” representation of the token. A following space is not ignored.

This is necessary. Suppose you have

\def\foo#1{#1 is good}
\def\egreg{EG}

Then you want that \foo{EG} or \foo\egreg print the same, independently of what's the argument passed to \foo. The parameter in the definition is followed by a space, so also after macro replacement there will be a space.


Note, The description above is a simplification of what really happens. The line is not tokenized immediately: only the part of the line that's needed is scanned. So TeX actually starts tokenizing \foo and after having found a one-argument macro, it looks for what comes along, which is an open brace, so TeX tokenizes up to finding the matching closed brace. And so on. However, since there is no category code change involved, pretending that TeX tokenizes at once the whole line is not the truth, but a good approximation to it for the task at hand.

What would be the problem in immediately tokenizing a line? Consider

\catcode`?=\active ?

If the line were tokenized immediately, the ? would be assigned category code 12 and not 13. Instead, tokenizing when the need arises solves the problem. The second ? is tokenized after the category code assignment has been performed.