Incompatibility between verbatim and tabu? (Danger of using \scantokens in a package)

UPDATE 2019-01-14

An equivalent patch has been applied in tabu 2.9, which has been submitted to CTAN.


EDIT: After discussion in comments, it turns out that I had underestimated the problem, and that David's answer was closer to being the correct answer than mine.

Three catcode régimes are involved here. In chronological order, those are:

  1. catcodes in force when the code of the tabu package is read (tokenized);
  2. catcodes in force when the preamble of the tabu environment is tokenized;
  3. catcodes in force when the tabu is performed (here, in the verbatim environment).

The tabu package (v2.8) assumes that catcode régimes 2 and 3 are identical. (It also misuses \scantokens, which should only ever be used in combination with \everyeof; see below for the appropriate code.) Specifically, it tries to parse the preamble (which has catcodes in régime 2) using macros delimited by | in régime 3. When used as in the question, the tabu preamble is saved early on (with normal catcodes), and the tabu is performed when verbatim catcodes are in force. In that case, catcode régime 2 actually coincides with catcode régime 1, hence David's suggestion of disabling \scantokens is correct, since tabu then parses the preamble with a macro delimited by a régime 1 |.
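
A minimal sketch (not taken from tabu; \barother and \baractive are hypothetical names) of why the catcode of | matters: TeX treats two | tokens as equal only if both the character code and the category code agree, and the same comparison is used to match the delimiter of a delimited macro.

\documentclass{article}
\begin{document}
\def\barother{|}% here | has catcode 12 (other)
\begingroup
  \catcode`\|=\active
  \gdef\baractive{|}% same character, now catcode 13 (active)
\endgroup
% \ifx compares the two replacement texts token by token;
% the tokens differ because their catcodes differ.
\ifx\barother\baractive same\else different\fi
\end{document}

This should typeset "different". A macro delimited by the first kind of | would likewise run straight past the second kind without stopping.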

In general, however, both solutions may fail if the three catcode régimes are distinct, which happens for instance if | is declared as a shorthand character for verbatim. In that case, the simplest approach is to use David's suggestion while making sure that the tabu preamble is tokenized with the category codes in place when the tabu package code is read, hence normal category codes. For example, if you remove the \DeleteShortVerb (and subsequent \MakeShortVerb) lines from the code below, compilation fails because tabu does not recognize the active | in the preamble.

\documentclass{article}
\usepackage{verbatim}
\usepackage{tabu}
\usepackage{shortvrb}
\MakeShortVerb{\|}

\begin{document}

We first input the file \jobname.tex with
|\verbatiminput{\jobname.tex}|:

\verbatiminput{\jobname.tex}%

Then redefine |\verbatim@processline|
%
\makeatletter
\DeleteShortVerb{\|}
\renewcommand\verbatim@processline
{{\let\scantokens\@firstofone
  \begin{tabu}to\textwidth{|[5pt]l|X[-1,l]|}%
    foo&\the\verbatim@line%
  \end{tabu}%
}\par}
\MakeShortVerb{\|}
\makeatother
%
and input the file again with the same command:

\verbatiminput{\jobname.tex}%

\end{document}

The fully correct fix would be to completely change the way a tabu preamble is parsed, replacing the current approach (which comes from LaTeX2e's * through array's \newcolumntype) by an approach which reads characters in the preamble from left to right, ignores their catcode, checks if they are a "primitive" column type or should be expanded to something else, checks for arguments for those column types, and when it is done, goes to the next token in the preamble.
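
As a rough illustration of the "ignores their catcode" step (a sketch only, not tabu's code; \IfBarChar is a hypothetical helper), a single character token can be tested by character code alone using TeX's alphabetic-constant notation, which does not look at the catcode:

\documentclass{article}
\makeatletter
% Hypothetical helper: #1 must be a single character token (or a
% one-character control sequence); #2 is used if its character code
% is that of |, #3 otherwise.
\newcommand\IfBarChar[1]{%
  \ifnum`#1=`\| %
    \expandafter\@firstoftwo
  \else
    \expandafter\@secondoftwo
  \fi}
\makeatother
\begin{document}
\IfBarChar|{bar}{not bar}% | with catcode 12: first branch

\IfBarChar l{bar}{not bar}% a letter: second branch

\begingroup
  \catcode`\|=\active
  \IfBarChar|{bar}{not bar}% active |: still the first branch
\endgroup
\end{document}

This only covers the character test. A real parser would also have to cope with brace groups, multi-letter control sequences (for which the test above raises an error) and column types that take arguments.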


The eTeX primitive \scantokens is very tricky to use properly, and tabu misuses it in many places. This is clearly a bug in tabu, and it is fixable.

Rather than

\scantokens{\def\:{|}} % bad

which is risky because \def\: is also rescanned (and braces too), it is better to do

\everyeof{\noexpand}
\edef\:{\expandafter\noexpand\scantokens{|}}

namely, put only the part that needs to be rescanned in the brace group. The \edef ensures that \scantokens is expanded, and setting \everyeof to \noexpand prevents the end-of-file marker at the end of \scantokens from wreaking havoc. The additional \expandafter\noexpand construction is only needed to support the case where | is currently active. The code would still break if | were a macro parameter character, or a begin-group or end-group character, but that is probably unavoidable. Of course, to use \scantokens properly, one also needs to take care of the \endlinechar (which tabu does) and the \newlinechar (in case that is set to |); hence the correct fix for your situation is

\renewcommand{\tabu@textbar}[1]%
  {%
    \begingroup
      \newlinechar \m@ne % I'm just paranoid.
      \endlinechar \m@ne
      \everyeof{\noexpand}%
      \edef\:{\expandafter\noexpand\scantokens{|}}%
      \expandafter
    \endgroup
    \expandafter #1%
    \:%
  }

Now, in my solution I make use of the fact that tabu's author only wants to rescan a single character here. What should he do when rescanning a full token list? Well, this is trickier, again because TeX inserts a marker at the end of every file (including the \scantokens file), which behaves as an \outer "thing", preventing, for instance, a macro that appears in one file from taking its argument from a different file. The answer can be found in the implementation of \tl_set_rescan:Nnn in LaTeX3, or in one of Heiko Oberdiek's packages (dunno which one, reference welcome). Build a marker that cannot appear when rescanning (e.g., two @ with different catcodes), and have \everyeof insert it just before the end-of-file marker. Then define a macro with an argument delimited by that marker, to collect the rescanned token list. For instance,

\def\tabu@tmp#1%
  {%
    \long\def\tabu@gdef@rescan@##1#1%
      {\expandafter{##1}}%
    \long\def\tabu@gdef@rescan##1##2%
      {%
        \begingroup
          \newlinechar\m@ne
          \endlinechar\m@ne
          \everyeof{#1\noexpand}%
          \xdef##1%
            {%
              \unexpanded
                \expandafter\tabu@gdef@rescan@
                \expandafter\empty
                \scantokens{##2}%
            }%
        \endgroup
      }%
  }
\expandafter\tabu@tmp\expandafter{\string @@}

UPDATE 2019-01-14

A patch equivalent to the code in Bruno's answer has been applied in tabu 2.9, which has been submitted to CTAN, so the workaround suggested in this answer should not be needed.



tabu uses \scantokens while parsing the preamble, which means it picks up the local verbatim catcode settings and goes wrong. Since the argument is just \def\:{|}, it is enough to disable \scantokens so that these tokens are used as they were read, with the normal catcodes. You also need a \par, or it all comes out on one line.

\documentclass{article}
\usepackage{verbatim}
\usepackage{tabu}

\makeatletter
\renewcommand\verbatim@processline
{{\let\scantokens\@firstofone
  \begin{tabu}to\textwidth{|l|X[-1,l]|}%
    foo&\the\verbatim@line%
  \end{tabu}%
}\par}
\makeatother

\begin{document}
%\tracingall
\verbatiminput{test.tex}%

\end{document}

As noted in the discussion in the comments on Bruno's answer, disabling \scantokens here is only a partial fix for the special case of verbatim usage.

There are several catcode regimes that come into play in code such as this.

  1. The catcodes in force at the time the array preamble is saved in the user's macro.
  2. The catcodes in force during the body of the table (verbatim settings in this case).
  3. The catcodes in force when the tabu internals are read.

Disabling \scantokens only works if the first and last of these are the same, which is the usual case. tabu tries to normalise the preamble using \scantokens, but this assumes that the preamble was saved with the catcodes in force when the table is executed, which is not the case if the table preamble is stored in a macro rather than written inline in the document.

Ideally, table preamble parsing code ought to be agnostic about catcodes (that is, accept | as a vertical rule specification whatever its catcode), or, if it uses \scantokens, it should probably normalise the entire array preamble under a safe catcode regime.