tr analog for unicode characters?

GNU sed does work with multi-byte characters. So:

$ echo é½Æ | sed 'y/é½Æ/ABŒ/'
ABŒ

The issue is not so much that GNU tr hasn't been internationalised as that it doesn't support multi-byte characters (such as the non-ASCII ones in UTF-8 locales). GNU tr would work with Æ and Œ as long as they were single-byte, as in the ISO8859-15 character set.

More on that at "How to make tr aware of non-ascii(unicode) characters?"

In any case, that has nothing to do with Linux; it's about the tr implementation on the system. Whether that system uses Linux as a kernel, or whether tr is built for Linux or uses the Linux kernel API, is not relevant, as that part of the tr functionality takes place in user space.

busybox tr and GNU tr are the implementations most commonly found on distributions of software built for Linux, and neither supports multi-byte characters. There are others that have been ported to Linux and that do, like the tr of the heirloom toolchest (ported from OpenSolaris) or that of ast-open.

Note that sed's y doesn't support ranges like a-z. Also note that if the script containing sed 'y/é½Æ/ABŒ/' is written in the UTF-8 charset, it will no longer work as expected when called in a locale where the charset is not UTF-8.
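
Since y has no range syntax, the usual workaround is to spell both lists out in full. A minimal sketch, where upcase_ascii is a hypothetical helper name and only ASCII letters are handled so that it behaves the same in any locale:

```shell
# Hypothetical helper: uppercase ASCII letters with sed's y command.
# y has no a-z range syntax, so both lists are written out in full
# and must contain the same number of characters.
upcase_ascii() {
  sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/'
}

echo 'hello world' | upcase_ascii   # prints HELLO WORLD
```

Multi-byte characters such as é or Æ can be appended to both lists in the same way with GNU sed, subject to the locale caveat above.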

An alternative could be to use perl:

perl -Mopen=locale -Mutf8 -pe 'y/a-zé½Æ/A-ZABŒ/'

Above, the perl code is expected to be in UTF-8, but it will process the input in the locale's encoding (and output in that same encoding). If called in a UTF-8 locale, it will transliterate a UTF-8 Æ (0xc3 0x86) to a UTF-8 Œ (0xc5 0x92); in an ISO8859-15 locale, it will do the same but with 0xc6 -> 0xbc.

In most shells, having those UTF-8 characters inside the single quotes should be OK even if the script is called in a locale where UTF-8 is not the charset (an exception is yash, which would complain if those bytes don't form valid characters in the locale). If you're using quoting other than single quotes, however, it could cause problems. For instance,

perl -Mopen=locale -Mutf8 -pe "y/♣\`/&'/"

would fail in a locale where the charset is BIG5-HKSCS because the encoding of \ (0x5c) also happens to be contained in some other characters there (like α: 0xa3 0x5c, and the UTF-8 encoding of ♣ happens to end in 0xa3).

In any case, don't expect things like

perl -Mopen=locale -Mutf8 -pe 'y/Á-Ź/A-Z/'

to remove acute accents. The above is actually equivalent to

perl -Mopen=locale -Mutf8 -pe 'y/\x{c1}-\x{179}/\x{41}-\x{5a}/'

That is, the range is based on the Unicode codepoints. So ranges won't be useful outside of well-defined sequences that happen to be in the "right" order in Unicode, like A-Z or 0-9.

If you want to remove acute accents, you'd have to use more advanced tools like:

perl -Mopen=locale -MUnicode::Normalize -pe '
  $_ = NFKD($_); s/\x{301}//g; $_ = NFKC($_)'

That is, use Unicode normalisation forms to decompose characters, remove the acute accents (here the combining form U+0301) and recompose.

Another useful tool to transliterate Unicode is uconv from ICU. For instance, the above could also be written as:

uconv -x '::NFKD; \u0301>; ::NFKC;'

Though that would only work on UTF-8 data. You'd need:

iconv -t utf-8 | uconv -x '::NFKD; \u0301>; ::NFKC;' | iconv -f utf-8

to be able to process data in the user's locale.
