Raku Is there a fast method to find and remove/replace non-ASCII or malformed utf8 characters?

Hopefully someone will have a better answer. In the meantime...


There are several very different things going on in your question.

Is there a fast method to find and remove/replace non-ASCII or malformed utf8 characters?

There is supposed to be a nice, obvious, fairly simple one:

say .decode: replacement => '�'
given $buf-that's-supposed-to-be-utf8

This should decode the same way a plain slurp does, except that, instead of just giving up on the decode when it encounters "Malformed UTF-8", it should just replace malformed data with the replacement character you've specified and continue as best it can.

Unfortunately, as far as I know, this doesn't work due to bugs in Rakudo/MoarVM, as outlined in my answer to "decode with replacement does not seem to work".

I did not file an issue at the time I wrote that SO answer. Your new SO question has prompted me to file two bug reports:

  • .decode's replacement option didn't work in Rakudo v2019.03.01 and presumably still doesn't #3509

  • decoder replacement options didn't work in Rakudo v2019.03.01 and presumably still don't #1245


Some other options are given in the answers to "error message: Malformed UTF-8".

I see in your REPL examples you've tried .decode('utf8-c8'). This may be your best bet within Raku as it stands.
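To make the difference concrete, here is a minimal sketch of a strict decode next to a utf8-c8 decode. The sample bytes are made up for illustration, and I'm assuming utf8-c8 behaves as documented (bad bytes are carried along rather than rejected, so they survive a round trip):

```raku
# Made-up sample data: "hi", then 0xFF (a byte that is never valid
# in UTF-8), then "!".
my $buf = Buf.new(0x68, 0x69, 0xFF, 0x21);

# A strict decode gives up when it reaches the bad byte.
my $strict = try $buf.decode('utf8');
say $strict.defined
    ?? "decoded: $strict"
    !! "strict decode failed: { $!.message }";

# utf8-c8 decodes anyway; something stands in for the bad byte.
my $lenient = $buf.decode('utf8-c8');

# Encoding with utf8-c8 again should give back the original bytes.
say $lenient.encode('utf8-c8').list.join(' ') eq $buf.list.join(' ');
```

Note that this keeps the malformed data rather than replacing it, so it is not the same thing as the replacement option discussed above.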


If none of the above is helpful, I think you're stuck for now with using an external tool to preprocess files before they get to Raku.

Is there a predefined character class for all good utf-8 chars

UTF-8 data is not characters. It's just bytes. The data encodes characters, or at least it's supposed to, but it's very important to keep encodings and characters separate in your mind.

If you know how old-fashioned telegrams work, it's like that. There's a message in characters. And then there's Morse code for transmitting it. They're very different things.

When you see "Malformed UTF-8" or similar, it means the decoder is choking on some part of the data (the bytes). They make no sense to it as characters. It's like Morse code that doesn't follow the rules for Morse code.
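A small sketch of the characters-versus-bytes distinction, using a made-up one-character string. Chopping its encoding in half leaves bytes that no longer follow the rules, which is what the decoder complains about:

```raku
my $text  = "\xE9";                  # one character: é (U+00E9)
my $bytes = $text.encode('utf8');    # its UTF-8 encoding: two bytes

say $text.chars;     # 1 (characters)
say $bytes.elems;    # 2 (bytes)
say $bytes.list;     # the bytes 0xC3 0xA9, shown in decimal

# Keep only the first byte. What's left is an incomplete sequence,
# so a strict decode has no character to give back.
my $broken  = Buf.new($bytes[0]);
my $decoded = try $broken.decode('utf8');
say $decoded.defined ?? "decoded" !! "malformed: decode failed";
```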

Such data is considered to be confusing crap at best and dangerous crap at worst. The Unicode standard requires that it is entirely eliminated before you can do anything with it.

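The obvious friendly solution is to replace crap with a user-specified replacement character, as you asked. In contrast, a regex character class is both the wrong tool and too late: a regex matches characters, and malformed data never makes it through the decoder to become characters.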
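To illustrate why the cleanup has to happen on the bytes, here is a crude sketch that works before any decoding takes place. It is not the decoder's replacement option, and it assumes you only care about ASCII: every byte of 0x80 or above becomes "?", which throws away valid non-ASCII characters along with the malformed data. The sample bytes are made up:

```raku
# Made-up sample data: "hi", a bad byte (0xFF), a valid two-byte
# character (é as 0xC3 0xA9), then "!".
my $buf = Buf.new(0x68, 0x69, 0xFF, 0xC3, 0xA9, 0x21);

# Replace every non-ASCII byte with "?" (0x3F). This is byte-level
# filtering; no characters exist yet at this point.
my $ascii-only = Buf.new($buf.list.map({ $_ < 0x80 ?? $_ !! 0x3F }));

# What's left is pure ASCII, so decoding can't fail.
say $ascii-only.decode('ascii');    # hi???!
```

Note that é turns into two question marks, one per byte. That's the bytes-versus-characters distinction showing through.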

Example: from REPL

This is another whole ball of wax.

There's:

  • The encoding used by your (terminal on your) local system;

  • The characters you see rendered, and the indication of the cursor, when you use your local system;

  • What's in your cut/paste buffer when you copy from your repl display;

  • What your browser does with that buffer when you paste into the edit window for an SO question;

  • What SO's servers do with the contents of the edit window when you click the Post your question button and when SO renders your question;

  • What my local system, browser, terminal, cut/paste buffer, etc. are doing when I look at your SO question;

  • Etc.

This complexity exists even if both our systems, and both you and I, are doing what we're supposed to be doing. So, sure, something is amiss with the cursor and other issues, but I'm not going to try to nail that down in this answer because, unlike the first part of your question, which I answered above, it doesn't really have to do with Raku or Rakudo.
