Delete all but the last comment line for each comment block

The problem is that .* is greedy, so sed -z -e 's/#.*\n#/#/g' matches from the first # in the file up to the last line starting with #. This happens only because of the -z flag, which slurps the whole file into the pattern space at once (assuming the text file contains no null bytes).
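To see the greedy match in action, here is a minimal sketch on a made-up input with two comment blocks (the sample text and the variable name greedy_out are my own; GNU sed is assumed for -z):

```shell
# Hypothetical input: two comment blocks, each followed by a code line.
# With -z the whole text is one record, so "#.*\n#" stretches from the
# first "#" to the last newline-plus-"#" and swallows everything between.
greedy_out=$(printf '#a\n#b\nx\n#c\n#d\ny\n' | sed -z -e 's/#.*\n#/#/g')
printf '%s\n' "$greedy_out"
```

Only the last comment line of the whole file and the text after it should survive, which is not what was wanted.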

The Sed script to solve your problem is

sed -n '/^#/N;/\n#/D;p' file
  • /^#/N If the line begins with #, append the next line to the pattern space.
  • /\n#/D If the pattern space contains a newline followed by #, delete all up to the newline and start a new cycle.
  • p Print the pattern space if this command is reached.
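As a quick check of the three commands above, here is a small sketch run on a made-up input (the function name keep_last_comment and the sample text are my own, not part of the question):

```shell
# Wrap the script in a function that filters stdin.
keep_last_comment() {
    sed -n '/^#/N;/\n#/D;p'
}

# Hypothetical input: a two-line comment block, a code line,
# a one-line comment block, and another code line.
nd_out=$(printf '#a\n#b\nx\n#c\ny\n' | keep_last_comment)
printf '%s\n' "$nd_out"
```

The first block should be reduced to #b while everything else passes through unchanged. One caveat, as far as I can tell: because of -n, a comment on the very last line of the input is not printed, since N finds no next line and sed exits.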

Useful links

  • POSIX specification of Sed commands
  • GNU Sed manual, multiline techniques
  • Grymoire, more multiline examples

You apparently want to remove every comment line that is followed by another comment line. The sed call fails because regular expression quantifiers are "greedy" by default (i.e. they consume as much as possible), which cannot easily be changed in sed.

So I will add an awk-based solution to the stated goal:

awk '/^#/{buf=$0;next} {if (buf) {print buf; buf=""}}1' "${InputP}"

or, slightly more compact:

awk '/^#/{buf=$0;next} buf{print buf; buf=""}1' "${InputP}"
  • This will print all lines that are not comment lines unchanged (the 1 outside the rule blocks means "print the current line, including all modifications made so far" - which is none in this case).
  • If a comment line is encountered (the line matches the pattern /^#/), its content is stored in a buffer buf, but not yet printed. The next command skips execution to the next line, so the remaining code only applies to non-comment lines.
  • If a non-comment line is encountered, the buffer content is printed first (if any) and the buffer emptied (to prevent multiple printout) before the actual line content is printed.
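A small sketch of the buffering behaviour on a made-up input (the function name awk_keep_last and the sample text are my own; the awk program is the compact one from above, reading stdin instead of "${InputP}"):

```shell
# Same awk program as above, wrapped in a function that filters stdin.
awk_keep_last() {
    awk '/^#/{buf=$0;next} buf{print buf; buf=""}1'
}

# Hypothetical input that also ends in a comment line.
awk_out=$(printf '#a\n#b\nx\n#c\ny\n#tail\n' | awk_keep_last)
printf '%s\n' "$awk_out"
```

Note that the buffer is only flushed when a non-comment line arrives, so a comment block at the very end of the input (here #tail) is dropped. If that block should be kept, an END rule that prints buf would be one way to do it.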

Using GNU sed with slurp mode -z and utilizing extended regexes -E we can do as shown:

$ sed -Ez '
    s/(^|\n)(#[^\n]*\n)+$/\1/
    s/(^|\n)(#[^\n]*\n)+/\1\2/g
' file
  • Remove a trailing comment block.
  • Remove all comment blocks but keep the last line in each.
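To illustrate the two substitutions, here is a sketch on a made-up input that ends in a comment block (the function name slurp_keep_last and the sample text are my own; GNU sed is assumed):

```shell
# The two-step slurp-mode script from above, filtering stdin.
slurp_keep_last() {
    sed -Ez '
        s/(^|\n)(#[^\n]*\n)+$/\1/
        s/(^|\n)(#[^\n]*\n)+/\1\2/g
    '
}

# Hypothetical input: leading comment block, code line, trailing comment block.
slurp_out=$(printf '#a\n#b\nx\n#c\n#d\n' | slurp_keep_last)
printf '%s\n' "$slurp_out"
```

The first command should drop the trailing block (#c and #d), and the second should reduce the leading block to #b.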

The GNU sed model is as follows:

  • Sed reads its input record by record. By default a record is a line terminated by a newline \n; with -z the record separator is the NUL character \0 instead, so a text file without NUL bytes is read as a single record.
  • After reading in a record, the trailing record separator is clipped and the resulting string is stored in the pattern space register. The pattern space is where all the sed commands operate.
  • Now let's say there are 5 commands in our sed script. The first one is applied to the pattern space and modifies it, the next command is applied to this modified pattern space, and so forth sequentially until the last. Then the pattern space is printed to stdout unless -n is in effect. After this the next record is read in and the same sequence of commands is applied to it.

Please note that the above is a very simplified narrative, valid when no flow control commands are used in the sed script.
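A minimal sketch of that model, using made-up input and three throwaway substitutions (the variable names are my own):

```shell
# Each command sees the pattern space as left by the previous one:
# "one" -> "0ne" -> "Xne" -> "Xne!", then the result is auto-printed.
model_out=$(printf 'one\ntwo\n' | sed -e 's/o/0/' -e 's/0/X/' -e 's/$/!/')
printf '%s\n' "$model_out"

# With -n the automatic print at the end of each cycle is suppressed.
quiet_out=$(printf 'one\ntwo\n' | sed -n -e 's/o/0/')
printf '[%s]\n' "$quiet_out"
```

The second command can only replace 0 because the first one put it there, which shows that the commands work on the shared pattern space rather than on the original line.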

Yes, you are right: in slurp mode the $ signifies the end of the file as well as the end of the pattern space, since there is just one pattern space.

When you have the construct (regex)+, the parentheses hold the text matched by the last repetition: the greedy + repeats the group as often as possible, and each repetition overwrites what the group captured before.
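A tiny sketch of that behaviour, on a made-up string (the variable name group_out is my own):

```shell
# ([a-c])+ matches "a", "b" and "c" in turn; after the match the
# group holds only the text of its final repetition.
group_out=$(printf 'abc\n' | sed -E 's/([a-c])+/<\1>/')
printf '%s\n' "$group_out"
```

This is why \2 in the slurp-mode script ends up holding just the last comment line of a block.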

Alternatively, it can also be done as

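$ sed -e '
    # Comment line: save it in the hold space and start the next cycle.
    /^#/{h;d;}
    # Other line: append it to the held comment (if any) and print both.
    H;z;x;s/^\n//
' file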