sed: read whole file into pattern space without failing on single-line input

There are all kinds of reasons why reading a whole file into pattern space can go wrong. The logic problem in the question surrounding the last line is a common one. It is related to sed's line cycle - when there are no more lines and sed encounters EOF, it is done - it quits processing. And so if you are on the last line and you instruct sed to get another, it is going to stop right there and do no more.

That said, if you really need to read a whole file into pattern space, then it is probably worth considering another tool anyway. The fact is, sed is eponymously the stream editor - it is designed to work a line - or a logical data block - at a time.

There are many similar tools that are better equipped to handle full file blocks. ed and ex, for example, can do much of what sed can do and with similar syntax - and much else besides - but rather than operating only on an input stream while transforming it to output as sed does, they also maintain temporary backup files in the file-system. Their work is buffered to disk as needed, and they do not quit abruptly at end of file (and tend to implode a lot less often under buffer strain). Moreover they offer many useful functions which sed does not - of the sort that simply do not make sense in a stream context - like line marks, undo, named buffers, join, and more.

sed's primary strength is its ability to process data as soon as it reads it - quickly, efficiently, and in stream. When you slurp a file you throw that away, and you tend to run into edge-case difficulties like the last line problem you mention, and buffer overruns, and abysmal performance - as the buffer it parses grows, the time a regexp engine spends enumerating matches can grow far faster than the data does.

Regarding that last point, by the way: while I understand the example s/a/A/g case is very likely just a naive example and is probably not the actual script you want to gather input for, you might find it worth your while to familiarize yourself with y///. If you often find yourself globally substituting a single character for another, then y could be very useful for you. It is a transformation as opposed to a substitution and is far quicker as it does not imply a regexp. This latter point can also make it useful when attempting to preserve and repeat empty // addresses because it does not affect them but can be affected by them. In any case, y/a/A/ is a simpler means of accomplishing the same - and swaps are possible as well, like y/aA/Aa/, which would interchange every upper- and lowercase a on a line.
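As a rough illustration, the sketch below runs both forms of y on one sample line. The sample text is made up for the example.

```shell
# Sample input is hypothetical; any text containing "a" and "A" will do.
line='banana Aardvark'

# y/a/A/ maps every "a" to "A" without involving a regexp.
upper=$(printf '%s\n' "$line" | sed 'y/a/A/')

# y/aA/Aa/ swaps the two characters for each other in a single pass.
swapped=$(printf '%s\n' "$line" | sed 'y/aA/Aa/')

printf '%s\n' "$upper" "$swapped"
# bAnAnA AArdvArk
# bAnAnA aArdvArk
```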

You should also note that the behavior you describe is really not what is supposed to happen anyway.

From GNU's info sed in the COMMONLY REPORTED BUGS section:

  • N command on the last line

    • Most versions of sed exit without printing anything when the N command is issued on the last line of a file. GNU sed prints pattern space before exiting unless of course the -n command switch has been specified. This choice is by design.

    • For example, the behavior of sed N foo bar would depend on whether foo has an even or an odd number of lines. Or, when writing a script to read the next few lines following a pattern match, traditional implementations of sed would force you to write something like /foo/{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N } instead of just /foo/{ N;N;N;N;N;N;N;N;N; }.

    • In any case, the simplest workaround is to use $d;N in scripts that rely on the traditional behavior, or to set the POSIXLY_CORRECT variable to a non-empty value.

The POSIXLY_CORRECT environment variable is mentioned because POSIX specifies that if sed encounters EOF when attempting an N it should quit without output, but the GNU version intentionally breaks with the standard in this case. Note also that even as the behavior is justified above the assumption is that the error case is one of stream-editing - not slurping a whole file into memory.

The standard defines N's behavior thus:

  • N

    • Append the next line of input, less its terminating \newline, to the pattern space, using an embedded \newline to separate the appended material from the original material. Note that the current line number changes.

    • If no next line of input is available, the N command verb shall branch to the end of the script and quit without starting a new cycle or copying the pattern space to standard output.
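The two behaviors can be compared with a small sketch. It assumes GNU sed for the first two commands, and the three-line sample input is made up. The last command guards N with $!, so it never asks for a line that is not there and should give the same result under either rule.

```shell
# Three lines, so the second N is issued on the last line of input.
input=$(printf '%s\n' one two three)

# GNU sed (assumed): the leftover last line is printed before exiting.
gnu_default=$(printf '%s\n' "$input" | sed N)

# GNU sed with POSIXLY_CORRECT set follows the standard instead and
# quits without copying the pattern space to standard output.
posix_mode=$(printf '%s\n' "$input" | POSIXLY_CORRECT=1 sed N)

# $!N runs N only when another line exists, so no line is lost.
guarded=$(printf '%s\n' "$input" | sed '$!N')

printf '%s\n' '--- sed N' "$gnu_default"
printf '%s\n' '--- POSIXLY_CORRECT=1 sed N' "$posix_mode"
printf '%s\n' '--- sed $!N' "$guarded"
```

Other sed implementations may print different results for the first two commands, which is exactly the portability concern described above.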

On that note, there are some other GNU-isms demonstrated in the question - particularly the use of the :label, branch, and { function-context brackets }. As a rule of thumb any sed command which accepts an arbitrary parameter is understood to delimit at a \newline in the script. So the commands...

:arbitrary_label_name; ...
b to_arbitrary_label_name; ...
//{ do arbitrary list of commands } ...

...are all very likely to perform erratically depending on the sed implementation that reads them. Portably they should be written:

...;:arbitrary_label_name
...;b to_arbitrary_label_name
//{ do arbitrary list of commands
}
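For instance, a slurp-and-join loop might be spelled as follows. This is only a sketch: the label name a, the sample input, and the s/\n/,/g command are stand-ins chosen for the example.

```shell
# The label, the branch, and the closing brace each end at a newline, so
# nothing after them can be mistaken for part of their argument.
joined=$(printf '%s\n' one two three | sed ':a
$!{
N
ba
}
s/\n/,/g')

printf '%s\n' "$joined"
# one,two,three
```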

The same holds true for r, w, t, a, i, and c (and possibly a few more that I'm forgetting at the moment). In almost every case they might also be written:

sed -e :arbitrary_label_name -e b\ to_arbitrary_label_name -e \
    "//{ do arbitrary list of commands" -e \}

...where the new -execution statement stands in for the \newline delimiter. So where the GNU info text suggests a traditional sed implementation would force you to do:

/foo/{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N }

...it should rather be...

/foo/{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N
}

...of course, that isn't true either. Writing the script in that way is a little silly. There are much simpler means of doing the same, like:

printf %s\\n foo . . . . . . |
sed -ne 'H;/foo/h;x;//s/\n/&/3p;tnd
         //!g;x;$!d;:nd' -e 'l;$a\' \
     -e 'this is the last line' 

...which prints:

foo
.
.
.
foo\n.\n.\n.$
.$
this is the last line

...because the test command - like most sed commands - depends on the line cycle to refresh its return register and here the line cycle is permitted to do most of the work. That's another tradeoff you make when you slurp a file - the line cycle doesn't refresh ever again, and so many tests will behave abnormally.

The above command doesn't risk over-reaching input because it just does some simple tests to verify what it reads as it reads it. With Hold all lines are appended to the hold space, but if a line matches /foo/ it overwrites hold space. The buffers are next exchanged, and a conditional s///ubstitution is attempted if the contents of the buffer match the //last pattern addressed. In other words //s/\n/&/3p attempts to replace the third newline in hold space with itself and print the results if hold space currently matches /foo/. If that test succeeds, the script branches to the not delete label - which does a look and wraps up the script.

In the case that both /foo/ and a third newline cannot be matched together in hold space though, then //!g will overwrite the buffer if /foo/ is not matched, or, if it is matched, it will overwrite the buffer if a \newline is not matched (thereby replacing /foo/ with itself). This little subtle test keeps the buffer from filling up unnecessarily for long stretches of no /foo/ and ensures the process stays snappy because the input does not pile on. Following on in a no /foo/ or //s/\n/&/3p fail case the buffers are again swapped and every line but the last is there deleted.

That last - the last line $!d - is a simple demonstration of how a top-down sed script can be made to handle multiple cases easily. When your general method is to prune away unwanted cases starting with the most general and working toward the most specific then edge cases can be more easily handled because they are simply allowed to fall through to the end of the script with your other wanted data and when it all wraps you're left with only the data you want. Having to fetch such edge cases out of a closed loop can be far more difficult to do, though.

And so here's the last thing I have to say: if you really must pull in an entire file, then you can stand to do a little less work by relying on the line cycle to do it for you. Typically you would use Next and next for lookahead - because they advance ahead of the line cycle. Rather than redundantly implementing a closed loop within a loop - as the sed line cycle is just a simple read loop anyway - if your purpose is only to gather input indiscriminately, then it is probably easier to do:

sed 'H;1h;$!d;x;...'

...which will gather the entire file or go bust trying.
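As a hedged illustration, with s/\n/,/g standing in for the elided commands and with made-up sample input, the same script handles multi-line and single-line input alike:

```shell
# H appends each line to hold space; 1h replaces hold space on line 1 so it
# does not begin with an empty line; $!d ends every cycle but the last;
# x then brings the accumulated file into pattern space.
slurp='H;1h;$!d;x;s/\n/,/g'

many=$(printf '%s\n' one two three | sed "$slurp")
single=$(printf '%s\n' only | sed "$slurp")

printf '%s\n' "$many" "$single"
# one,two,three
# only
```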


A side note about N and last line behavior...

While I do not have the tools available to test this, consider that N, when reading and editing in place, behaves differently if the file being edited is also the script file for the next read-through.


It fails because the N command comes before the pattern match $! (not last line) and sed quits before doing any work:

N

Add a newline to the pattern space, then append the next line of input to the pattern space. If there is no more input then sed exits without processing any more commands.

This can be easily fixed to work with single-line input as well (and indeed to be more clear in any case) by simply grouping the N and b commands after the pattern:

sed ':a;$!{N;ba}; [commands...]'

It works as follows:

  1. :a create a label named 'a'
  2. $! if not the last line, then
  3. N append the next line to the pattern space (or quit if there is no next line) and ba branch (go to) label 'a'
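To see the difference on single-line input, here is a sketch that assumes GNU sed. It also assumes the failing script resembled ':a;N;$!ba;s/a/A/g'; the question itself is not reproduced here, so that form is a guess.

```shell
# Hypothetical reconstruction of the failing form: N runs before $! is
# checked, so on one-line input sed quits before reaching s/a/A/g.
# GNU sed prints the untouched line; a strictly POSIX sed may print nothing.
before=$(printf '%s\n' banana | sed ':a;N;$!ba;s/a/A/g')

# Fixed form: N and ba run only when another line exists, so s/a/A/g
# is still reached on the last (here, the only) line.
after=$(printf '%s\n' banana | sed ':a;$!{N;ba};s/a/A/g')

printf 'before: %s\nafter:  %s\n' "$before" "$after"
# the "after" line reads: bAnAnA
```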

Unfortunately, it's not portable (as it relies on GNU extensions), but the following alternative (suggested by @mikeserv) is portable:

sed 'H;1h;$!d;x; [commands...]'
