Detect if a string was double-encoded in UTF-8

In principle you can't, especially once you allow for cat-on-the-keyboard garbage.

You don't say what the original character encoding of the data was before it was UTF-8 encoded once or twice. I'll assume CP1251 (or at least that CP1251 is one of the possibilities), because it's quite a tricky case.

Take a non-ASCII character. UTF-8 encode it. You get some bytes, and all those bytes are valid characters in CP1251 unless one of them happens to be 0x98, the only hole in CP1251.

So, if you convert those bytes from CP1251 to UTF-8, the result is exactly the same as if you'd correctly UTF-8 encoded a CP1251 string consisting of those CP1251 characters. There's no way to tell whether the result comes from incorrectly double-encoding one character or from correctly single-encoding two characters.
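
As a minimal PHP sketch of that ambiguity (it assumes the mbstring extension with Windows-1251 support and a UTF-8 source file; the particular characters are just an illustration):

$once = 'Я';    // U+042F, correctly UTF-8 encoded once: bytes D0 AF

// A faulty layer now treats those bytes as CP1251 and converts them to UTF-8 again.
$twice = mb_convert_encoding( $once, 'UTF-8', 'Windows-1251' );    // bytes D0 A0 D0 87

// But D0 A0 D0 87 is also the correct single encoding of the two-character
// CP1251 string "РЇ" (bytes D0 and AF in CP1251).
$legit = 'РЇ';

var_dump( $twice === $legit );    // bool(true): byte-for-byte identical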

If you have some control over the original data, you could put a BOM at the start of it. Then when it comes back to you, inspect the initial bytes to see whether you have a UTF-8 BOM, or the result of incorrectly double-encoding a BOM. But I guess you probably don't have that kind of control over the original text.
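
A rough sketch of that BOM check, assuming CP1251 was the faulty interpretation ($value here is just a placeholder for the text as it came back to you):

$utf8Bom = "\xEF\xBB\xBF";    // a correct UTF-8 BOM
$mangledBom = mb_convert_encoding( $utf8Bom, 'UTF-8', 'Windows-1251' );    // the BOM after a second, faulty encode: "п»ї"

if ( strncmp( $value, $mangledBom, strlen( $mangledBom ) ) === 0 ) {
    // The BOM itself was re-encoded, so the rest of the string was too
} elseif ( strncmp( $value, $utf8Bom, strlen( $utf8Bom ) ) === 0 ) {
    // The BOM survived intact: the string was encoded exactly once
}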

In practice you can guess - UTF-8 decode it and then:

(a) look at the character frequencies, character-pair frequencies, and the number of non-printable characters. This might allow you to tentatively declare the text nonsense, and hence possibly double-encoded. With enough non-printable characters it may be so nonsensical that you couldn't realistically type it even by mashing at the keyboard, unless maybe your ALT key was stuck.

(b) attempt the second decode. That is, starting from the Unicode code points that you got by decoding your UTF-8 data, first encode them to CP1251 (or whatever) and then decode the result from UTF-8. If either step fails (due to invalid sequences of bytes), then it definitely wasn't double-encoded, at least not using CP1251 as the faulty interpretation.

This is more or less what you do if you have some bytes that might be UTF-8 or might be CP1251, and you don't know which.
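
Here's a rough PHP sketch of that guess, folding a crude version of (a) into the round trip from (b). The function name is just illustrative, CP1251 is assumed as the candidate faulty encoding, and the control-character count is only a blunt stand-in for real frequency analysis:

// Returns true if $value *could* be the result of double-encoding via
// $faulty; false means it definitely wasn't (at least not via $faulty).
function looksDoubleEncoded( string $value, string $faulty = 'Windows-1251' ): bool
{
    // The text must at least be valid UTF-8 to begin with.
    if ( !mb_check_encoding( $value, 'UTF-8' ) ) {
        return false;
    }

    // (b) Reverse one encoding step: map the decoded code points back to the
    // candidate legacy encoding. iconv() returns false if any code point has
    // no equivalent there, which rules out that round trip.
    $reversed = @iconv( 'UTF-8', $faulty, $value );
    if ( $reversed === false ) {
        return false;
    }

    // The recovered bytes must themselves be valid UTF-8, otherwise the
    // string was only encoded once.
    if ( !mb_check_encoding( $reversed, 'UTF-8' ) ) {
        return false;
    }

    // (a), very crudely: after undoing one encoding, sensible text shouldn't
    // be full of control characters.
    return preg_match_all( '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', $reversed ) === 0;
}

A true result only means the string could have been double-encoded (plain ASCII passes trivially, for example); a false result is definite.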

You'll get some false positives for single-encoded cat-garbage that's indistinguishable from double-encoded data, and maybe a very few false negatives for data that was double-encoded but that, by fluke, still looked like Russian after the first encode.

If your original encoding has more holes in it than CP1251 then you'll have fewer false negatives.

Character encodings are hard.


Here's a PHP algorithm that worked for me.

It's better to fix your data, but if you can't, here's a trick:

require __DIR__ . '/vendor/autoload.php';    // composer require neitanod/forceutf8

// utf8_decode() undoes one layer of UTF-8 encoding (interpreting the result as
// ISO-8859-1). If what's left still scans as valid UTF-8, the string had been
// encoded twice. The 'UTF-8', true arguments make the detection strict.
$decoded = utf8_decode( $value );
if ( mb_detect_encoding( $decoded, 'UTF-8', true ) === 'UTF-8' ) {
    // Double encoded, or bad encoding
    $value = $decoded;
}

$value = \ForceUTF8\Encoding::toUTF8( $value );

The library I'm using is: https://github.com/neitanod/forceutf8/