sed: Ignore leading whitespace when substituting globally

$ sed 's/\>[[:blank:]]\{1,\}/ /g' file
     This is an indented paragraph. The indentation should not be changed.
This is the second line of the paragraph.

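The expression I used matches one or more [[:blank:]] characters (spaces or tabs) after a word, and replaces these with a single space. The \> matches the zero-width boundary between a word character and a following non-word character, that is, the end of a word. The indentation at the start of a line is not preceded by a word, so it is left untouched.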

This was tested with OpenBSD's native sed, and GNU sed supports \> as well. GNU sed also accepts \b, which matches at either edge of a word.

You could also use sed -E to shorten this to

sed -E 's/\>[[:blank:]]+/ /g' file

Again, if \> doesn't work for you with GNU sed, use \b instead.
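
A quick way to check the word-boundary form without a file is to feed it one line through printf. This is only a sketch: the sample line and the variable names are made up for illustration, and it assumes a sed that understands \> (GNU and the BSDs do, but it is an extension rather than part of POSIX regular expressions).

```shell
# Sample input: five leading spaces, then words separated by runs of spaces.
input='     This is     an indented      paragraph.'

# Collapse blanks that follow a word character. The leading indentation has
# no word character in front of it, so it is not touched.
result=$(printf '%s\n' "$input" | sed 's/\>[[:blank:]]\{1,\}/ /g')

printf '%s\n' "$result"
```

The five leading spaces should survive while every run between words shrinks to one space.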


Note that although the above handles your example text correctly, it does not remove extra spaces after punctuation (a full stop is not a word character, so \> does not match after it), as after the first sentence in

     This is     an indented      paragraph.        The   indentation   should not be changed.
This is the     second   line  of the    paragraph.

For that, a slightly more complicated variant would do the trick:

$ sed -E 's/([^[:blank:]])[[:blank:]]+/\1 /g' file
     This is an indented paragraph. The indentation should not be changed.
This is the second line of the paragraph.

This replaces any non-blank character followed by one or more blank characters with the non-blank character and a single space.
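
To see the difference between the two approaches side by side, here is a small sketch that runs both on the same line. The input text and variable names are hypothetical, and the first command again assumes a sed with \> support.

```shell
# Hypothetical sample line: a long run of blanks follows the full stop.
input='     One.        Two   three'

# Word-boundary form: "." is not a word character, so the run after it stays.
by_boundary=$(printf '%s\n' "$input" | sed -E 's/\>[[:blank:]]+/ /g')

# Capture-group form: any non-blank character may precede the run.
by_group=$(printf '%s\n' "$input" | sed -E 's/([^[:blank:]])[[:blank:]]+/\1 /g')

printf '%s\n' "$by_boundary" "$by_group"
```

Only the second result should have a single space after "One."; both keep the indentation.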

Or, using standard sed (and a very tiny optimization in that it will only do the substitution if there are two or more spaces/tabs after the non-space/tab),

$ sed 's/\([^[:blank:]]\)[[:blank:]]\{2,\}/\1 /g' file
     This is an indented paragraph. The indentation should not be changed.
This is the second line of the paragraph.
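
One side effect of requiring two or more blanks is worth knowing about: a single tab between words is left as a tab, whereas the one-or-more form turns it into a space. A minimal sketch with a made-up input line:

```shell
# Made-up input: one tab after "a", four spaces after "b".
input=$(printf 'a\tb    c')

# One or more blanks: the lone tab is replaced by a space.
one_or_more=$(printf '%s\n' "$input" | sed 's/\([^[:blank:]]\)[[:blank:]]\{1,\}/\1 /g')

# Two or more blanks: the lone tab does not match and is kept as it is.
two_or_more=$(printf '%s\n' "$input" | sed 's/\([^[:blank:]]\)[[:blank:]]\{2,\}/\1 /g')

printf '%s\n' "$one_or_more" "$two_or_more"
```

If the input may contain single tabs that should become spaces, the \{1,\} form is the safer choice.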

POSIXly:

sed 's/\([^[:space:]]\)[[:space:]]\{1,\}/\1 /g; s/[[:space:]]*$//'

This replaces any sequence of one or more whitespace characters following a non-whitespace character with that character and a single space, and then removes trailing whitespace. The second substitution covers blank lines and lines with trailing whitespace (including the CRs found at the end of lines coming from Microsoft text files).
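
The effect on trailing whitespace is hard to see on a terminal, so this sketch appends a "|" marker to each output line with a second sed. The marker step and the sample input are only there for display and are not part of the command above.

```shell
# Two input lines: the first ends in two spaces and a CR (as in a CRLF file),
# the second consists of blanks only.
result=$(printf '  a   b  \r\n \t \n' |
  sed 's/\([^[:space:]]\)[[:space:]]\{1,\}/\1 /g; s/[[:space:]]*$//' |
  sed 's/$/|/')   # display only: mark where each line ends

printf '%s\n' "$result"
```

The first line should come out as two spaces, "a b" and the marker, and the second as the marker alone.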
