Security in autogenerated latex scripts. How to avoid LaTeX Injection?

What you are describing is basically verbatim input with a delimiter that the user cannot produce. You can define commands with verbatim parameters, e.g. with xparse (compile with LuaLaTeX to avoid encoding issues):

\documentclass{article}
\usepackage{xparse}
\NewDocumentCommand\untrustedInput{+v}{#1}
\begin{document}
    \untrustedInput+You don't have to trust this input. This can be \something_evil
    
    
    and everything is just interpreted as text.+
\end{document}

In this example, the verbatim block is delimited by +, which of course would be insecure because the untrusted data might contain a +. But you can use any codepoint you want as the delimiter, so you just have to choose one that cannot occur in your input. A good candidate is an invalid Unicode codepoint like U+D800 (which would be UTF-8 encoded as 0xED 0xA0 0x80). U+D800 is a UTF-16 high surrogate and is never allowed in UTF-8 data, so first scan your input for this byte sequence: if it appears, the encoding is invalid and you can issue an error right away. Otherwise, put the three bytes 0xED 0xA0 0x80 on both sides of your input and pass the result as the argument of \untrustedInput to LuaTeX. (LuaTeX doesn't care that U+D800 is invalid as long as you don't try to actually typeset it.)
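The scan-and-wrap step happens in whatever program generates the .tex file. Here is a minimal sketch in Python, assuming the untrusted data arrives as raw bytes; the function name wrap_untrusted and the choice of Python are mine, not something LuaTeX or xparse prescribes:

```python
# U+D800 written out as the three bytes a UTF-8 encoder would produce
# if surrogates were allowed. Valid UTF-8 can never contain them.
DELIMITER = b"\xed\xa0\x80"


def wrap_untrusted(data: bytes, command: bytes = b"\\untrustedInput") -> bytes:
    """Return `command` followed by `data` enclosed in the delimiter bytes.

    Raises ValueError if the delimiter already occurs in `data`, because
    then the input is not valid UTF-8 and must be rejected.
    """
    if DELIMITER in data:
        raise ValueError("input contains 0xED 0xA0 0x80 and is not valid UTF-8")
    return command + DELIMITER + data + DELIMITER
```

The result is bytes, so the generated file has to be written in binary mode (a text-mode writer in Python would refuse to encode the surrogate). For the macro-defining variant you would pass something like b"\\defineWithUntrustedInput\\theText" as the command.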

The command \untrustedInput cannot be used inside the argument of another command. This cannot be avoided directly, because the enclosing argument would be read first with the normal category codes, potentially interpreting dangerous characters. But you can use a verbatim argument to save your untrusted text into a macro, which can then be used freely (example again with +):

\documentclass{article}
\usepackage{xparse}
\NewDocumentCommand\defineWithUntrustedInput{m +v}{\newcommand#1{#2}}
\begin{document}
  \defineWithUntrustedInput\theText+You don't have to trust this input. This can be \something_evil


  and everything is just interpreted as text.+
  \textit{\theText}
\end{document}

A naive starter with LuaLaTeX:

Please note that this has quite a few caveats. Passing -2 as the first argument of tex.print means every character is printed with catcode 12 (other), except spaces, which get catcode 10. As you will see in this example, paragraph breaks are ignored. Whether a given character shows up in the output depends on the font you use. But again, this is intended as a starter.

% arara: lualatex
\documentclass{article}

\newcommand\getmyevildatabase{%
  \directlua{
    local file = io.open("evil.txt")
    if file then
      local content = file:read("*all")
      file:close()
      tex.print(-2, content)
    end
  }}

\begin{document}
Test here

\getmyevildatabase

Another test
\end{document}

with evil.txt:

This \bye test is evil ? ^ ²³¼ þ Þ ’¢“„ % {quack}

¿?
