Why do some shells' `read` builtins fail to read the whole line from a file in `/proc`?

The problem is that those /proc files on Linux appear as regular files as far as stat()/fstat() is concerned, but do not behave as such.

Because the data is dynamic, you can only do one read() system call on them (for some of them at least). Doing more than one could return two chunks from two different versions of the content, so instead a second read() seems to just return nothing (meaning end-of-file), unless you lseek() back to the beginning (and to the beginning only).

The read utility needs to read the content of files one byte at a time to be sure not to read past the newline character. That's what dash does:

 $ strace -fe read dash -c 'read a < /proc/sys/fs/file-max'
 read(0, "1", 1)                         = 1
 read(0, "", 1)                          = 0

Some shells like bash have an optimisation to avoid having to do so many read() system calls. They first check whether the file is seekable, and if so, read in chunks as then they know they can put the cursor back just after the newline if they've read past it:

$ strace -e lseek,read bash -c 'read a' < /proc/sys/fs/file-max
lseek(0, 0, SEEK_CUR)                   = 0
read(0, "1628689\n", 128)               = 8
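Both strategies exist to honour the same contract: whatever follows the newline must still be unread for the next consumer of the input. A minimal sketch of that contract follows; `rest_after_read` is a hypothetical helper name, not something from the shells discussed here. On a pipe the shell cannot seek back, so it has to avoid over-reading; on a regular file it may read a chunk and seek back. Either way the remainder should come out the same.

```shell
# rest_after_read: hypothetical helper; consumes one line with `read`,
# then prints whatever is still unread on standard input.
rest_after_read() {
    read -r first_line
    cat
}

tmp=$(mktemp)
printf '1\n2\n3\n' > "$tmp"

# Non-seekable input (pipe): no way to seek back after reading too much.
from_pipe=$(printf '1\n2\n3\n' | rest_after_read)

# Seekable input (regular file): a chunked read is fine if followed by a seek back.
from_file=$(rest_after_read < "$tmp")

rm -f "$tmp"
printf '%s\n' "$from_pipe" "$from_file"
```

Running this under strace with different shells should show which of the two strategies each one picks for the file case.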

With bash, you'd still have problems with proc files that are larger than 128 bytes and can only be read in one read() system call.

bash also seems to disable that optimisation when the -d option is used.

ksh93 takes the optimisation so far that it becomes bogus. ksh93's read does seek back, but it remembers the extra data it has read and reuses it for the next read, so the next read (or any of its other builtins that read data, like cat or head) doesn't even try to read the data again, even if that data has been modified by other commands in between:

$ seq 10 > a; ksh -c 'read a; echo test > a; read b; echo "$a $b"' < a
1 2
$ seq 10 > a; sh -c 'read a; echo test > a; read b; echo "$a $b"' < a
1 st

If you are interested in knowing why this is so, you can see the answer in the kernel sources here:

    if (!data || !table->maxlen || !*lenp || (*ppos && !write)) {
            *lenp = 0;
            return 0;
    }

Basically, seeking (*ppos not 0) is not implemented for reads (!write) of sysctl values that are numbers. Whenever a read is done from /proc/sys/fs/file-max, the routine in question, __do_proc_doulongvec_minmax(), is called via the entry for file-max in the configuration table in the same file.

Other entries, such as /proc/sys/kernel/poweroff_cmd are implemented via proc_dostring() which does allow seeks, so you can do dd bs=1 on it and read from your shell with no problems.
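The `dd bs=1` test mentioned above can be turned into a quick check of whether a given file tolerates reads at a non-zero offset. This is only a sketch: `bytes_via` and `survives_small_reads` are hypothetical helper names, and the 4096-byte block size is an arbitrary choice assumed to be larger than the value being read.

```shell
# bytes_via: hypothetical helper; prints how many bytes dd obtains from
# file $2 when it issues read() calls of $1 bytes each.
bytes_via() {
    dd if="$2" bs="$1" 2>/dev/null | wc -c | tr -d '[:space:]'
}

# survives_small_reads: hypothetical helper; succeeds when one-byte reads
# yield as many bytes as 4096-byte reads.
survives_small_reads() {
    [ "$(bytes_via 1 "$1")" -eq "$(bytes_via 4096 "$1")" ]
}

# Demo on an ordinary file, which supports reads at any offset.
sample=$(mktemp)
printf '1628689\n' > "$sample"
small=$(bytes_via 1 "$sample")
large=$(bytes_via 4096 "$sample")
rm -f "$sample"
echo "bs=1: $small bytes, bs=4096: $large bytes"
```

On a kernel that behaves as described, `survives_small_reads /proc/sys/fs/file-max` would be expected to fail, while the same check on /proc/sys/kernel/poweroff_cmd would be expected to succeed.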

Note that since kernel 2.6, most /proc reads have been implemented via a newer API called seq_file, which supports seeks, so e.g. reading /proc/stat should not cause problems. The /proc/sys/ implementation, as we can see, does not use this API.


At first glance, this looks like a bug in the shells that return less than a real Bourne Shell or its derivatives (sh, bosh, ksh, heirloom) return.

The original Bourne Shell tries to read a block of 64 bytes; newer Bourne Shell variants read 128 bytes. Both start reading again if there is no newline character in what they got.

Background: /procfs and similar implementations (e.g. the mounted /etc/mtab virtual file) have dynamic content and a stat() call does not cause the re-creation of the dynamic content first. For this reason, the size of such a file (from reading until EOF) may differ from what stat() returns.

Given that the POSIX standard requires utilities to expect short reads at any time, software that treats a read() returning fewer bytes than requested as an EOF indication is broken. A correctly implemented utility calls read() again when it gets less than expected, until 0 is returned. In the case of the read builtin, it would of course be sufficient to read until EOF or until a NL is seen.

If you run truss or a truss clone, you should be able to verify that incorrect behavior for the shells that only return 6 in your experiment.

In this special case, it seems to be a Linux kernel bug, see:

$ sdd -debug bs=1 if= /proc/sys/fs/file-max 
Simple copy ...
readbuf  (3, 12AC000, 1) = 1
writebuf (1, 12AC000, 1)
8readbuf  (3, 12AC000, 1) = 0

sdd: Read  1 records + 0 bytes (total of 1 bytes = 0.00k).
sdd: Wrote 1 records + 0 bytes (total of 1 bytes = 0.00k).

The Linux kernel returns 0 with the second read and this is of course incorrect.

Conclusion: Shells that first try to read a large enough chunk of data do not trigger this Linux kernel bug.
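Following that conclusion, one way to sidestep the problem in a script is to let a utility that reads in large blocks fetch the content, and only then hand it to the shell. A minimal sketch, assuming your cat requests far more than the file's size per read() (common implementations appear to); `first_line` is a hypothetical helper name.

```shell
# first_line: hypothetical helper; prints the first line of file $1.
# cat does the read() on the file itself; the shell's read then only
# sees a pipe, where byte-at-a-time reading is harmless.
first_line() {
    cat -- "$1" | {
        IFS= read -r line
        printf '%s\n' "$line"
    }
}

# Demo on an ordinary file standing in for a /proc entry.
demo=$(mktemp)
printf '1628689\nnot part of the first line\n' > "$demo"
value=$(first_line "$demo")
rm -f "$demo"
echo "$value"
```

Typical use would be something like `max=$(first_line /proc/sys/fs/file-max)`.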