Are shells allowed to ignore NUL bytes in scripts?
As per POSIX:
The input file shall be a text file, except that line lengths shall be unlimited¹
NUL characters² in the input make it non-text, so the behaviour is unspecified as far as POSIX is concerned:
sh implementations can do whatever they want (and a POSIX-compliant script must not contain NULs).
There are some shells that scan the first few bytes for 0s and refuse to run the script on the assumption that you tried to execute a non-script file by mistake.
That's useful because the likes of
find -exec... are required to call a shell to interpret a command if the system returns ENOEXEC upon
execve(). So if you try to execute a command built for the wrong architecture, it's better to get a "won't execute a binary file" error from your shell than to have the shell try to make sense of it as a shell script.
That is allowed by POSIX:
If the executable file is not a text file, the shell may bypass this command execution.
Which in the next revision of the standard will be changed to:
The shell may apply a heuristic check to determine if the file to be executed could be a script and may bypass this command execution if it determines that the file cannot be a script. In this case, it shall write an error message, and shall return an exit status of 126.
Note: A common heuristic for rejecting files that cannot be a script is locating a NUL byte prior to a <newline> byte within a fixed-length prefix of the file. Since sh is required to accept input files with unlimited line lengths, the heuristic check cannot be based on line length.
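As a rough illustration of the heuristic described in that note, here is a minimal sketch in sh. The function name looks_binary and the 128-byte default prefix are assumptions made for the example, not what any particular shell does internally, and head -c is assumed to be available (it is widespread, but only recently standardised).

```shell
# Hypothetical helper: succeeds (exit 0) if a NUL byte occurs before the
# first newline within the first $2 bytes (default 128) of file $1.
looks_binary() {
  head -c "${2:-128}" < "$1" |
    LC_ALL=C od -An -v -t x1 |
    tr -s ' ' '\n' |
    awk '
      $0 == "0a" { exit }            # newline reached first: stop scanning
      $0 == "00" { nul = 1; exit }   # NUL seen before any newline
      END { exit !nul }              # exit 0 only if a NUL was found
    '
}
```

With this check, a file starting with an ELF header is rejected, while a script whose first line ends before any NUL is accepted even if binary data follows later.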
That behaviour can get in the way of self-extracting shell archives though, which contain a shell header followed by binary data¹.
The zsh shell supports NUL in its input, though note that NULs can't be passed in the arguments of
execve(), so you can only use them in the arguments or names of builtin commands or functions:
    $ printf '\0() echo zero; \0\necho \0\n' | zsh | hd
    00000000  7a 65 72 6f 0a 00 0a                              |zero...|
    00000007
(here defining and calling a function with NUL as its name and passing a NUL character as argument to the builtin echo command)
Some shells will strip them, which is also a sensible thing to do.
NULs are sometimes used as padding. They are ignored by terminals, for instance (they were sometimes sent to terminals to give them time to process complex control sequences, like carriage return (literally)). Holes in files appear as being filled with NULs, etc.
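For a shell that does not strip NULs itself, the same effect can be approximated by filtering the script before it is parsed. This is only a sketch of the idea; strip_nuls is a hypothetical helper name, and it removes every NUL in the file, including any in trailing binary data.

```shell
# Hypothetical helper: print file $1 with all NUL bytes removed.
strip_nuls() {
  tr -d '\000' < "$1"
}

# Possible usage: feed the filtered script to sh on stdin.
#   strip_nuls script-with-nuls | sh
```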
Note that non-text is not limited to NUL bytes. It also covers sequences of bytes that don't form valid characters in the locale. For instance, the 0xc1 byte value cannot occur in UTF-8 encoded text. So in locales using UTF-8 as the character encoding, a file that contains such a byte is not a valid text file and therefore not a valid sh script³.
yash is the only shell I know that will complain about such invalid input.
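With other shells, one way to check a script for such invalid byte sequences yourself is to ask iconv to decode it. The sketch below assumes an iconv utility is installed; is_valid_text is a hypothetical helper name, and its optional second argument (an addition for the example) overrides the encoding that would otherwise be taken from locale charmap. Note that iconv passes NUL bytes through, so this only covers the encoding side of "text file".

```shell
# Hypothetical helper: succeeds if file $1 decodes cleanly in encoding $2
# (default: the current locale's charmap).
is_valid_text() {
  charmap=${2:-$(locale charmap)} || return
  # Converting from an encoding to itself fails on undecodable input.
  iconv -f "$charmap" -t "$charmap" < "$1" > /dev/null 2>&1
}
```

For example, is_valid_text script UTF-8 should fail for a file containing the 0xc1 byte mentioned above.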
¹ In the next revision of the standard, it is going to change to:
The input file may be of any type, but the initial portion of the file intended to be parsed according to the shell grammar (XREF to XSH 2.10.2 Shell Grammar Rules) shall consist of characters and shall not contain the NUL character. The shell shall not enforce any line length limits.
explicitly requiring shells to support input that starts with a syntactically valid section without NUL bytes, even if the rest contains NULs, to account for self-extracting archives.
² Characters are meant to be decoded as per the locale's character encoding (see the output of
locale charmap), and on POSIX systems, the NUL character (whose encoding is always byte 0) is the only character whose encoding contains the byte 0. In other words, UTF-16 is not among the character encodings that can be used in a POSIX locale.
³ There is however the question of the locale changing within the script (like when locale-related variables such as
LOCPATH are assigned) and at which point the change takes effect for the shell interpreting the input.