Replace non-printable characters in perl and sed

That's a typical job for tr:

LC_ALL=C tr '\0-\10\13\14\16-\37' '[ *]' < in > out

It doesn't work with sed in your case because you're in a locale where those ranges don't make sense. If you want to work with byte values rather than characters, with ranges ordered by the numerical value of those bytes, your best bet is to use the C locale. Your code would have worked with LC_ALL=C and GNU sed, but using sed (let alone perl) is overkill here, and those \xXX escapes are not portable across sed implementations, whereas this tr approach is POSIX.
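To see which bytes that tr command touches, here is a minimal sketch run on a made-up input. The function name strip_ctrl and the sample bytes are only illustrative assumptions, not something from the question.

```shell
# strip_ctrl is a hypothetical helper name wrapping the tr command above.
# It replaces bytes 0-8, 11, 12 and 14-31 (octal \0-\10, \13, \14, \16-\37)
# with spaces and leaves TAB (\11), LF (\12) and CR (\15) alone.
strip_ctrl() {
  LC_ALL=C tr '\0-\10\13\14\16-\37' '[ *]'
}

# Sample input: NUL, BEL and ESC mixed with text, a tab and a newline.
tr_result=$(printf 'a\0b\7c\33d\te\n' | strip_ctrl)
printf '%s\n' "$tr_result"
```

The NUL, BEL and ESC bytes should each come out as a single space, while the tab and the trailing newline pass through unchanged.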

You can also trust your locale's idea of what printable characters are with:

tr -c '[:print:]\t\r\n' '[ *]'

But with GNU tr (as typically found on Linux-based systems), that only works in locales where characters are single-byte (so typically, not UTF-8).

In the C locale, that would also replace DEL (0x7f) and all byte values above it (which are not in ASCII), as none of those are considered printable there.
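As a rough illustration of that C-locale behaviour, the sketch below feeds the tr -c command a made-up sample containing the two bytes of a UTF-8 encoded é followed by DEL. The variable name and the sample string are assumptions chosen for the demonstration.

```shell
# Sample: "caf", the UTF-8 bytes of é (octal \303 \251), DEL (\177), "!".
# In the C locale only bytes 0x20-0x7e count as printable, so each of the
# three other bytes is replaced by its own space.
c_result=$(printf 'caf\303\251\177!\n' | LC_ALL=C tr -c '[:print:]\t\r\n' '[ *]')
printf '%s\n' "$c_result"
```

So with LC_ALL=C this variant is stricter than the octal-range command at the top, which leaves DEL and the bytes above it untouched.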

In UTF-8 locales, you could use GNU sed which doesn't have the problem GNU tr has:

sed 's/[^[:print:]\r\t]/ /g' < in > out

Note that \r and \t are not standard inside bracket expressions, and GNU sed won't recognize them if POSIXLY_CORRECT is in the environment: as POSIX requires, it will then treat them as a backslash, r and t being part of the set.

That sed command would not convert bytes that don't form valid characters, though, if there are any.
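Regarding the non-standard \r and \t in the sed bracket expression: one possible way around them is to let the shell insert the literal characters. This is only a sketch; the variable names cr and tab and the function name keep_print are made up for the example.

```shell
# Literal CR and TAB characters produced by printf, whose \r and \t
# escapes are specified by POSIX.
cr=$(printf '\r')
tab=$(printf '\t')

# keep_print is a hypothetical helper name; the bracket expression now
# contains the actual characters instead of the \r and \t escapes.
keep_print() {
  sed "s/[^[:print:]${cr}${tab}]/ /g"
}

# ASCII-only sample so the result does not depend on the locale:
# BEL (\7) becomes a space, the tab stays.
sed_result=$(printf 'a\7b\tc\n' | keep_print)
printf '%s\n' "$sed_result"
```

Since sed never sees a backslash inside the brackets this way, the command no longer depends on whether POSIXLY_CORRECT is set.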
