Understanding "IFS= read -r line"

In POSIX shells, read without any option doesn't read a line: it reads words from a (possibly backslash-continued) line, where words are $IFS-delimited and a backslash can be used to escape the delimiters (or continue lines).

The generic syntax is:

read word1 word2... remaining_words

read reads stdin one byte at a time¹ until it finds an unescaped newline character (or end-of-input), splits that according to complex rules and stores the result of that splitting into $word1, $word2... $remaining_words.

For instance on an input like:

  <tab> foo bar\ baz   bl\ah   blah\
whatever whatever

and with the default value of $IFS, read a b c would assign:

$a ⇐ foo
$b ⇐ bar baz
$c ⇐ blah   blahwhatever whatever
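As a rough check of those assignments, here is a minimal sketch that feeds the same two-line input to read through a here-document. The helper variable input is only there for the illustration. It should print foo, bar baz and the joined remainder, with the run of spaces inside $c preserved.

```shell
# Build the sample input: <tab>, a space, then the words; the first
# line ends with a backslash, so read (without -r) joins the next line.
input=$(printf '\t foo bar\\ baz   bl\\ah   blah\\\nwhatever whatever')

# Default IFS (space, tab, newline); backslash processing is active.
read a b c <<EOF
$input
EOF

printf 'a=[%s]\nb=[%s]\nc=[%s]\n' "$a" "$b" "$c"
```

The text substituted for $input inside the here-document is not rescanned by the shell, so the backslashes reach read untouched.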

Now if passed only one argument, that doesn't become read line. It's still read remaining_words. Backslash processing is still done, IFS whitespace characters are still removed from the beginning and end.

The -r option removes the backslash processing. So that same command above with -r would instead assign:

$a ⇐ foo
$b ⇐ bar\
$c ⇐ baz   bl\ah   blah\
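The same kind of sketch with -r, again using a hypothetical input helper variable, shows that only the first physical line is consumed and that every backslash is kept as an ordinary character.

```shell
input=$(printf '\t foo bar\\ baz   bl\\ah   blah\\\nwhatever whatever')

# -r: backslashes are ordinary characters, so "bar\" ends at the space
# and the trailing backslash no longer continues the line.
read -r a b c <<EOF
$input
EOF

printf 'a=[%s]\nb=[%s]\nc=[%s]\n' "$a" "$b" "$c"
```

The second line (whatever whatever) is left unread on the input, ready for a later read.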

Now, for the splitting part, it's important to realise that there are two classes of characters for $IFS: the IFS whitespace characters (namely space and tab (and newline, though here that doesn't matter unless you use -d), which also happen to be in the default value of $IFS) and the others. The treatment for those two classes of characters is different.

With IFS=: (: not being an IFS whitespace character), an input like :foo::bar:: would be split into "", "foo", "", "bar" and "" (and an extra "" with some implementations, though that doesn't matter except for read -a). If we replace that : with a space, the input is split into only foo and bar. That is, leading and trailing ones are ignored, and sequences of them are treated as one. There are additional rules when whitespace and non-whitespace characters are combined in $IFS. Some implementations can add/remove the special treatment by doubling the characters in IFS (IFS=:: or IFS='  ').
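To see the two classes side by side, here is a small sketch. It deliberately uses a simpler input than the one above (foo::bar and foo  bar, both picked for this illustration) so that the result does not depend on how a shell treats trailing delimiters.

```shell
# Non-whitespace separator: every ":" delimits a field, so the empty
# field between the two colons is kept.
IFS=: read -r a b c <<'EOF'
foo::bar
EOF
printf 'colon: a=[%s] b=[%s] c=[%s]\n' "$a" "$b" "$c"

# IFS whitespace separator: a run of spaces counts as one separator.
IFS=' ' read -r x y z <<'EOF'
foo  bar
EOF
printf 'space: x=[%s] y=[%s] z=[%s]\n' "$x" "$y" "$z"
```

In the first case $b ends up empty and $c holds bar; in the second, bar lands in $y and $z is left empty.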

So here, if we don't want the leading and trailing unescaped whitespace characters to be stripped, we need to remove those IFS whitespace characters from IFS.

Even with IFS-non-whitespace characters, if the input line contains one (and only one) of those characters and it's the last character in the line (like IFS=: read -r word on an input like foo:), POSIX shells (not zsh nor some pdksh versions) consider that input to be a single foo word, because in those shells the characters of $IFS are treated as terminators, so word will contain foo, not foo:.
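A quick way to observe the terminator behaviour, assuming bash or another POSIX-style shell (zsh and some pdksh versions differ, as said above). The variable names word and whole are just for the illustration.

```shell
# A single trailing ":" is treated as a terminator, not as the start of
# an empty second field, so it is dropped.
IFS=: read -r word <<'EOF'
foo:
EOF
printf 'with IFS=:    [%s]\n' "$word"

# With an empty IFS nothing is a delimiter and the colon is kept.
IFS= read -r whole <<'EOF'
foo:
EOF
printf 'with empty IFS [%s]\n' "$whole"
```

This is one more reason to empty IFS rather than set it to some character that is merely unlikely to appear in the input.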

So, the canonical way to read one line of input with the read builtin is:

IFS= read -r line

(note that for most read implementations, that only works for text lines as the NUL character is not supported except in zsh).

Using var=value cmd syntax makes sure IFS is only set differently for the duration of that cmd command.
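In practice that command usually sits in a while loop. The sketch below is only an illustration: the here-document content and the names old_ifs, count and first are made up for the example. It reads two lines, keeps the first one verbatim and checks that $IFS is untouched once the loop is done.

```shell
old_ifs=$IFS   # remember the current value to compare afterwards
count=0
first=

# IFS is empty only for each read call; -r keeps backslashes literal.
while IFS= read -r line; do
    count=$((count + 1))
    if [ "$count" -eq 1 ]; then
        first=$line
    fi
done <<'EOF'
  indented \t text\
second line
EOF

printf 'lines=%s first=[%s]\n' "$count" "$first"
if [ "$IFS" = "$old_ifs" ]; then
    echo 'IFS unchanged after the loop'
fi
```

The leading spaces and both backslashes of the first line survive, and the trailing backslash does not swallow the second line.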

History note

The read builtin was introduced by the Bourne shell and was already meant to read words, not lines. There are a few important differences with modern POSIX shells.

The Bourne shell's read didn't support a -r option (which was introduced by the Korn shell), so there's no way to disable backslash processing other than pre-processing the input with something like sed 's/\\/&&/g' there.
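The sed workaround can be tried in a modern shell by simply leaving -r out. This is a sketch only: the sample string C:\dir\file and the variable names are assumptions made for the illustration, and it is run here with a POSIX shell rather than a real Bourne shell.

```shell
# Double every backslash first, so that read's own backslash
# processing (no -r here) turns each pair back into one.
escaped=$(printf '%s\n' 'C:\dir\file' | sed 's/\\/&&/g')

IFS= read line <<EOF
$escaped
EOF

printf '%s\n' "$line"
```

Since a trailing backslash gets doubled as well, line continuation is disabled along the way.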

The Bourne shell didn't have that notion of two classes of characters (which again was introduced by ksh). In the Bourne shell all characters undergo the same treatment as IFS whitespace characters do in ksh, that is IFS=: read a b c on an input like foo::bar would assign bar to $b, not the empty string.

In the Bourne shell, with:

var=value cmd

If cmd is a built-in (like read is), var remains set to value after cmd has finished. That's particularly critical with $IFS because in the Bourne shell, $IFS is used to split everything, not only the expansions. Also, if you remove the space character from $IFS in the Bourne shell, "$@" no longer works.

In the Bourne shell, redirecting a compound command causes it to run in a subshell (in the earliest versions, even things like read var < file or exec 3< file; read var <&3 didn't work), so it was rare in the Bourne shell to use read for anything but user input on the terminal (where that line continuation handling made sense).

Some Unices (like HP-UX; there's also one in util-linux) still have a line command to read one line of input (that used to be a standard UNIX command up until the Single UNIX Specification version 2).

That's basically the same as head -n 1 except that it reads one byte at a time to make sure it doesn't read more than one line. On those systems, you can do:

line=`line`

Of course, that means spawning a new process, executing a command and reading its output through a pipe, so it is a lot less efficient than ksh's IFS= read -r line, but still a lot more intuitive.

¹ though on seekable input, some implementations can revert to reading by blocks and seeking back afterwards as an optimisation. ksh93 goes even further and remembers what was read and uses it for the next read invocation, though that's currently broken.

The Theory

There are two concepts in play here:

IFS is the Internal Field Separator, which means the string read will be split based on the characters in IFS. By default, IFS contains space, tab and newline, which is why unquoted expansions on a command line are split at whitespace.
Doing something like VAR=value command means "modify the environment of command so that VAR will have the value value". Basically, the command command will see VAR as having the value value, but any command executed after that will still see VAR as having its previous value. In other words, that variable will be modified only for that statement.
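The second concept is easy to check with any variable. Here is a small sketch using a made-up variable named GREETING and a child sh process as the command:

```shell
unset GREETING

# The assignment only applies to the environment of this one command.
seen=$(GREETING=hello sh -c 'printf %s "$GREETING"')

# Back in the calling shell the variable is still unset.
after=${GREETING-unset}

printf 'inside=[%s] after=[%s]\n' "$seen" "$after"
```

The child process sees hello, while the calling shell never has GREETING set at all.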

In this case

So when doing IFS= read -r line, what you are doing is setting IFS to an empty string (no character will be used to split, therefore no splitting will occur) so that read will read the entire line and see it as one word that will be assigned to the line variable. The changes to IFS only affect that statement, so that any following commands won't be affected by the change.

As a side note

While the command is correct and will work as intended, setting IFS in this case might¹ not be necessary. As written in the read builtin section of the bash man page:

One line is read from the standard input [...] and the first word is assigned to the first name, the second word to the second name, and so on, with leftover words and their intervening separators assigned to the last name. If there are fewer words read from the input stream than names, the remaining names are assigned empty values. The characters in IFS are used to split the line into words. [...]

Since you only have the line variable, every word will be assigned to it anyway, so if you don't need any of the leading and trailing whitespace characters¹ you could just write read -r line and be done with it.

¹ Just as an example of how an unset or default $IFS value will cause read to strip leading/trailing IFS whitespace, you might try:

echo ' where are my spaces? ' | { 
    unset IFS
    read -r line
    printf %s\\n "$line"
} | sed -n l

Run it and you will see that the leading and trailing whitespace characters don't survive when IFS is unset (or left at its default value). Furthermore, some strange things could happen if $IFS were modified somewhere earlier in the script.

You should read that statement in two parts: the first one clears the value of the IFS variable, i.e. is equivalent to the more readable IFS="", and the second one, read -r line, reads a line from stdin into the line variable.

What is specific to this syntax is that the IFS assignment is transient and only valid for the read command.

Unless I'm missing something, in that particular case clearing IFS has no effect on splitting: whatever IFS is set to, the whole line is read into the line variable. Splitting would only behave differently if more than one variable had been passed to the read instruction.

Edit:

The -r is there so that input ending with \ is not processed specially, i.e. the backslash is included in the line variable instead of acting as a continuation character that allows multi-line input.

$ read line; echo "[$line]"   
abc\
> def
[abcdef]
$ read -r line; echo "[$line]"  
abc\
[abc\]

Clearing IFS has the side effect of preventing read from trimming leading and trailing space or tab characters, e.g.:

$ echo "   a b c   " | { IFS= read -r line; echo "[$line]" ; }   
[   a b c   ]
$ echo "   a b c   " | { read -r line; echo "[$line]" ; }     
[a b c]

Thanks to rici for pointing out that difference.
