PHP: Convert any string to UTF-8 without knowing the original character set, or at least try

What you're asking for is extremely hard. If possible, getting the user to specify the encoding is the best. Preventing an attack shouldn't be much easier or harder that way.

However, you could try doing this:

iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);

Setting it to strict might help you get a better result.

In motherland Russia we have 4 popular encodings, so your question is in great demand here.

Only by char codes of symbols you can not detect encoding, because code pages intersect. Some codepages in different languages have even full intersection. So, we need another approach.

The only way to work with unknown encodings is working with probabilities. So, we do not want to answer the question "what is encoding of this text?", we are trying to understand "what is most likely encoding of this text?".

One guy here in popular Russian tech blog invented this approach:

Build the probability range of char codes in every encoding you want to support. You can build it using some big texts in your language (e.g. some fiction, use Shakespeare for english and Tolstoy for russian, lol ). You will get smth like this:

    encoding_1:
    190 => 0.095249209893009,
    222 => 0.095249209893009,
    ...
    encoding_2:
    239 => 0.095249209893009,
    207 => 0.095249209893009,
    ...
    encoding_N:
    charcode => probabilty

Next. You take text in unknown encoding and for every encoding in your "probability dictionary" you search for frequency of every symbol in unknown-encoded text. Sum probabilities of symbols. Encoding with bigger rating is likely the winner. Better results for bigger texts.

If you are interested, I can gladly help you with this task. We can greatly increase the accuracy by building two-charcodes probabilty list.

Btw. mb_detect_encoding certanly does not work. Yes, at all. Please, take a look of mb_detect_encoding source code in "ext/mbstring/libmbfl/mbfl/mbfl_ident.c".

You've probably tried this to but why not just use the mb_convert_encoding function? It will attempt to auto-detect char set of the text provided or you can pass it a list.

Also, I tried to run:

$text = "fiancée";
echo mb_convert_encoding($text, "UTF-8");
echo "<br/><br/>";
echo iconv(mb_detect_encoding($text), "UTF-8", $text);

and the results are the same for both. How do you see that your text is truncated to 'fianc'? is it in the DB or in a browser?

PHP: Convert any string to UTF-8 without knowing the original character set, or at least try

Tags:

Php

Character Encoding

Utf 8

Related

Recent Posts