MD5 hash with different results

Let us use Python 2 (in a terminal that sends UTF-8) to understand this.

>>> '123456çñ'
'123456\xc3\xa7\xc3\xb1'
>>> 'ç'
'\xc3\xa7'
>>> 'ñ'
'\xc3\xb1'

In the above output, we see the UTF-8 encoding of 'ç' and 'ñ': each one takes two bytes.

>>> from hashlib import md5
>>> md5('123456çñ').digest().encode('hex')
'66f561bb6b68372213dd9768e55e1002'

So, when we compute the MD5 hash of the UTF-8 encoded data, we get the first result.

>>> u'ç'
u'\xe7'
>>> u'ñ'
u'\xf1'

Here, we see the Unicode code points of 'ç' and 'ñ'.

>>> md5('123456\xe7\xf1').digest().encode('hex')
'9e6c9a1eeb5e00fbf4a2cd6519e0cfcb'

So, when we compute the MD5 hash of the data with each character stored as a single byte equal to its Unicode code point (which is exactly what ISO-8859-1 encoding produces for these characters), we get the second result.

So, the first website is computing the hash of the UTF-8 encoded data while the second one is not.
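For anyone on Python 3, where text and bytes are separate types, the same comparison can be sketched roughly like this. It is only an illustration of the idea, assuming the second site really uses ISO-8859-1 (Latin-1); the variable names are mine.

```python
import hashlib

text = "123456çñ"

# Two different byte representations of the same characters.
utf8_bytes = text.encode("utf-8")      # ç and ñ become two bytes each
latin1_bytes = text.encode("latin-1")  # ç and ñ become one byte each

# Latin-1 maps every code point below 256 to the byte with the same value,
# so building the bytes straight from the code points gives the same result.
code_point_bytes = bytes(ord(ch) for ch in text)

utf8_digest = hashlib.md5(utf8_bytes).hexdigest()
latin1_digest = hashlib.md5(latin1_bytes).hexdigest()

print(utf8_bytes, utf8_digest)
print(latin1_bytes, latin1_digest)
print(code_point_bytes == latin1_bytes)
```

The two printed digests differ because the input bytes differ, even though the text is the same.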


The problem, I guess, is in different text encodings. The string you show can't be represented in plain ASCII - it needs an encoding such as ISO-8859-1, UTF-8 or UTF-16. Each of these gives a different byte representation of the same string, and that produces different hashes.

Remember, MD5 hashes bytes, not characters - it's up to you how to encode those characters as bytes before feeding bytes to MD5. If you want to interoperate with other systems you have to use the same encoding as those systems.
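One way to keep this under control is to make the encoding an explicit argument wherever you hash text. A minimal Python 3 sketch, where `md5_hex` is a hypothetical helper name and not part of any library:

```python
import hashlib


def md5_hex(text, encoding):
    """Return the MD5 hex digest of `text` after encoding it with `encoding`."""
    return hashlib.md5(text.encode(encoding)).hexdigest()


text = "123456çñ"

# Same characters, four byte representations, four different digests.
for encoding in ("utf-8", "latin-1", "utf-16-le", "utf-16-be"):
    print(encoding, md5_hex(text, encoding))

# Pure ASCII text encodes to the same bytes in UTF-8 and Latin-1,
# which is why the mismatch only shows up with characters like ç and ñ.
print(md5_hex("abc", "utf-8") == md5_hex("abc", "latin-1"))
```

Note that even "UTF-16" is not a single choice: byte order matters, and Python's plain `"utf-16"` codec also prepends a byte order mark, so it hashes differently from `"utf-16-le"`.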
