head eats extra characters

Is it correct for the head utility to consume more characters from the input stream than it was asked?

Yes, it’s allowed (see below).

Is there some kind of standard for Unix utilities?

Yes, POSIX volume 3, Shell & Utilities.

And if there is, does it specify this behavior?

It does, in its introduction:

When a standard utility reads a seekable input file and terminates without an error before it reaches end-of-file, the utility shall ensure that the file offset in the open file description is properly positioned just past the last byte processed by the utility. For files that are not seekable, the state of the file offset in the open file description for that file is unspecified.

head is one of the standard utilities, so a POSIX-conforming implementation has to implement the behaviour described above.

GNU head does try to leave the file descriptor in the correct position, but it’s impossible to seek on pipes, so in your test it fails to restore the position. You can see this using strace:

$ echo -e "aaa\nbbb\nccc\nddd\n" | strace head -n 1
...
read(0, "aaa\nbbb\nccc\nddd\n\n", 8192) = 17
lseek(0, -13, SEEK_CUR)                 = -1 ESPIPE (Illegal seek)
...

The read returns 17 bytes (all the available input), head processes four of those and then tries to move back 13 bytes, but it can’t. (You can also see here that GNU head uses an 8 KiB buffer.)
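A quick way to see the effect without strace is to let a second command share the same pipe. This is only a sketch: the result on a pipe is unspecified by POSIX, so the comment describes what GNU head typically does, and other implementations may differ.

```shell
# head and cat inherit the same pipe as standard input.
# With GNU head, cat typically finds nothing left to read, because head's
# single buffered read already drained the pipe and could not seek back.
out=$(printf 'aaa\nbbb\nccc\nddd\n' | { head -n 1; cat; })
printf '%s\n' "$out"
```

Only the first line of output is something you can rely on here; whatever follows depends on how much head happened to read.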

When you tell head to count bytes (with -c, which older editions of POSIX did not specify), it knows how many bytes to read, so it can (if implemented that way) limit its read accordingly. This is why your head -c 5 test works: GNU head only reads five bytes and therefore doesn’t need to seek back to restore the file offset.

If you write the document to a file, and use that instead, you’ll get the behaviour you’re after:

$ echo -e "aaa\nbbb\nccc\nddd\n" > file
$ (while true; do head -n 1; head -n 1 >/dev/null; done) < file
aaa
ccc
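The same point can be checked more directly by letting cat pick up wherever head left off. This is a minimal sketch, assuming a head that repositions the offset as POSIX requires (GNU head does) and that mktemp is available; the temporary file is just a stand-in for the file above.

```shell
# Seekable input: head must leave the offset just past the line it printed,
# so cat continues with the second line.
tmp=$(mktemp)
printf 'aaa\nbbb\nccc\nddd\n' > "$tmp"
out=$({ head -n 1; cat; } < "$tmp")
rm -f "$tmp"
printf '%s\n' "$out"
```

Here all four lines come out in order, because nothing head buffered beyond the first line is lost.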

From POSIX:

The head utility shall copy its input files to the standard output, ending the output for each file at a designated point.

It doesn't say anything about how much head must read from the input. Requiring it to read byte by byte would be silly, as that would be extremely slow in most cases.

This is, however, addressed for the read builtin/utility: all shells I can find read from pipes one byte at a time, and the standard text can be interpreted to mean that this must be done, so that read consumes just that one single line:

The read utility shall read a single logical line from standard input into one or more shell variables.

In case of read, which is used in shell scripts, a common use case would be something like this:

read someline
if something ; then 
    someprogram ...
fi

Here, the standard input of someprogram is the same as that of the shell, and it can be expected that someprogram gets to read everything that comes after the first input line consumed by read, not whatever was left over after a buffered read. Using head as in your example, on the other hand, is far less common.
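That expectation is easy to try out on a pipe. In this sketch cat is only a stand-in for someprogram, and first_line is an illustrative variable name:

```shell
# read takes exactly one line from the pipe; cat, sharing the same
# standard input, then receives everything after that line.
out=$(printf 'first\nsecond\nthird\n' | {
    IFS= read -r first_line
    printf 'read got: %s\n' "$first_line"
    cat
})
printf '%s\n' "$out"
```

Nothing goes missing between the two commands, which is the opposite of what happens with head -n 1 on a pipe.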

If you really want to delete every other line, it would be better (and faster) to use some tool that can handle the whole input in one go, e.g.

$ seq 1 10 | sed -ne '1~2p'   # GNU sed
$ seq 1 10 | sed -e 'n;d'     # works in GNU sed and the BSD sed on macOS

$ seq 1 10 | awk 'NR % 2' 
$ seq 1 10 | perl -ne 'print if $. % 2'
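If you would rather keep the loop shape from your question, read can take the place of head, since it does not consume more than one line from a pipe. This is a sketch rather than a recommendation: every_other is a made-up function name, and for large inputs it will be much slower than the one-liners above.

```shell
# Print lines 1, 3, 5, ... of standard input using only the read builtin.
every_other() {
    while IFS= read -r keep; do
        printf '%s\n' "$keep"
        # Consume and discard the following line; stop at end of input.
        IFS= read -r skip || break
    done
}

seq 1 10 | every_other
```

Because read stops at the newline, the loop also works when the input is a pipe, where the head-based version loses data.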
