PHP unserialize fails with non-encoded characters?

Lionel Chan answer modified to work with PHP >= 5.5 :

function mb_unserialize($string) {
    $string2 = preg_replace_callback(
        '!s:(\d+):"(.*?)";!s',
        function($m){
            $len = strlen($m[2]);
            $result = "s:$len:\"{$m[2]}\";";
            return $result;

        },
        $string);
    return unserialize($string2);
}    

This code uses preg_replace_callback as preg_replace with the /e modifier is obsolete since PHP 5.5.


I know this was posted like one year ago, but I just have this issue and come across this, and in fact I found a solution for it. This piece of code works like charm!

The idea behind is easy. It's just helping you by recalculating the length of the multibyte strings as posted by @Alix above.

A few modifications should suits your code:

/**
 * Mulit-byte Unserialize
 *
 * UTF-8 will screw up a serialized string
 *
 * @access private
 * @param string
 * @return string
 */
function mb_unserialize($string) {
    $string = preg_replace('!s:(\d+):"(.*?)";!se', "'s:'.strlen('$2').':\"$2\";'", $string);
    return unserialize($string);
}

Source: http://snippets.dzone.com/posts/show/6592

Tested on my machine, and it works like charm!!


The reason why unserialize() fails with:

$ser = 'a:2:{i:0;s:5:"héllö";i:1;s:5:"wörld";}';

Is because the length for héllö and wörld are wrong, since PHP doesn't correctly handle multi-byte strings natively:

echo strlen('héllö'); // 7
echo strlen('wörld'); // 6

However if you try to unserialize() the following correct string:

$ser = 'a:2:{i:0;s:7:"héllö";i:1;s:6:"wörld";}';

echo '<pre>';
print_r(unserialize($ser));
echo '</pre>';

It works:

Array
(
    [0] => héllö
    [1] => wörld
)

If you use PHP serialize() it should correctly compute the lengths of multi-byte string indexes.

On the other hand, if you want to work with serialized data in multiple (programming) languages you should forget it and move to something like JSON, which is way more standardized.