Hashing different tuples in Python gives identical results

Until Python 3.8, the hash of a tuple was based on the hashes of its contents, using the following formula (from the tuplehash() function):

static Py_hash_t
tuplehash(PyTupleObject *v)
{
    Py_uhash_t x;  /* Unsigned for defined overflow behavior */
    Py_hash_t y;
    Py_ssize_t len = Py_SIZE(v);
    PyObject **p;
    Py_uhash_t mult = _PyHASH_MULTIPLIER; /* defined as 1000003UL == 0xf4243 */
    x = 0x345678UL;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        x = (x ^ y) * mult;
        /* the cast might truncate len; that doesn't change hash stability */
        mult += (Py_hash_t)(82520UL + len + len);
    }
    x += 97531UL;
    if (x == (Py_uhash_t)-1)
        x = -2;
    return x;
}

This is a variant of the FNV (Fowler / Noll / Vo) hash method; the xor-then-multiply step matches the FNV-1a ordering, with an extra per-position tweak to the multiplier.
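If you want to play with this outside of C, here is the same algorithm as a plain Python sketch, assuming a 64-bit build (so Py_uhash_t is 64 bits wide); the name old_tuplehash is mine:

def old_tuplehash(items):
    """Pre-3.8 tuplehash(), re-implemented for a 64-bit build."""
    MASK = 2 ** 64 - 1        # Py_uhash_t arithmetic wraps modulo 2 ** 64
    mult = 1000003            # _PyHASH_MULTIPLIER
    x = 0x345678
    length = len(items)
    for item in items:
        length -= 1           # mirrors --len in the C loop
        y = hash(item)
        x = ((x ^ (y & MASK)) * mult) & MASK
        mult = (mult + 82520 + length + length) & MASK
    x = (x + 97531) & MASK
    if x == MASK:             # (Py_uhash_t)-1 is reserved for errors
        x = MASK - 1          # which is -2 when read as a signed value
    return x - 2 ** 64 if x >= 2 ** 63 else x  # reinterpret as signed

For tuples of integers and floats this reproduces the built-in tuple hash of a 64-bit Python 3.7; old_tuplehash((1, -1, 0)) gives the -2528505496374624146 value shown below.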

As it happens, that formula produces the exact same output for (1, 0, -1) and (1, -1, 0):

>>> hash((1, -1, 0))
-2528505496374624146
>>> hash((1, 0, -1))
-2528505496374624146

because the hashes for the 3 contained integers are 1, 0 and -2:

>>> hash(1)
1
>>> hash(0)
0
>>> hash(-1)
-2

and swapping the 0 and the -2 has no influence on the outcome here. (As an aside: hash(-1) is -2 because a return value of -1 signals an error in the C-level hash API, so it is remapped to -2.)

The same carries over to the nested tuples from the question: the hashes of the 3 inner tuples don't change between the two examples, so the hash of the outer tuple doesn't change either.

At the time, this looked like little more than a coincidence to me; I'd have expected that in practice it doesn't happen all that often, and dictionaries and sets can handle collisions just fine anyway.
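In hindsight, it isn't even much of a coincidence; there is a small bit-twiddling reason the swap is harmless (an aside of mine, not part of the original discussion): in these examples the accumulator x is odd from the first round onwards, XOR with 0 is a no-op, and XOR with -2 negates an odd value modulo 2**64, a negation that carries through the remaining multiplications the same way in either order:

MASK = 2 ** 64 - 1

# XOR with -2 (0xFF...FE in 64 bits) flips every bit except the lowest;
# for an odd value that is exactly two's-complement negation mod 2 ** 64.
for x in (3, 0x345679, 12345678901234567891):  # arbitrary odd samples
    assert x ^ (-2 & MASK) == (-x) & MASK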

However, a few years after writing my original answer, it turned out that my expectation was misplaced! The above tuplehash() implementation had been in place for 14 years when someone demonstrated that there was a genuine problem with the scheme: certain value combinations (such as 4 and -4, or 0.25 and 0.5) drastically reduce the number of distinct hash values the method can produce:

>>> import sys; from itertools import product
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=7, releaselevel='final', serial=0)
>>> values = (0.25, 0.5)
>>> sum(1 for _ in product(values, repeat=20))  # 20 elements in each tuple
1048576
>>> len(set(map(hash, product(values, repeat=20))))
32

The above creates all 1048576 (2 ** 20 == 1024 ** 2) possible 20-element tuples combining 0.25 and 0.5. Ideally, they would all have different hash values, or at the very least a large number of different hash values. But the above tuplehash() function produced only 32 unique values, each shared by 32768 (2 ** 15) of those combinations:

>>> from collections import Counter
>>> Counter(Counter(map(hash, product(values, repeat=20))).values())
Counter({32768: 32})
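Part of what makes these two floats so pathological is visible in their hashes (on a 64-bit build): each is a single bit set high up in the 64-bit range, so their hashes differ only in the top bits, and the multiplication step only ever moves bit differences towards even higher bits, where they soon overflow out of the 64-bit result:

>>> hash(0.5), hash(0.25)  # 2 ** 60 and 2 ** 59: single high bits
(1152921504606846976, 576460752303423488)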

This is actually quite a big problem! The same issue also comes into play for 1, -1 and 0; it just isn't as pronounced. Testing here with 3 ** 12 == 531441 combinations:

>>> values = (1, -1, 0)
>>> sum(1 for _ in product(values, repeat=12))
531441
>>> len(set(map(hash, product(values, repeat=12))))
238605
>>> Counter(Counter(map(hash, product(values, repeat=12))).values())
Counter({1: 153005, 2: 51006, 4: 21730, 8: 8424, 16: 3012, 32: 994, 64: 314, 128: 92, 256: 20, 512: 6, 1024: 2})

so 153005 of those hash values are used by only a single 12-element tuple each, while the remaining 85600 values are shared between 2 or more tuples, in the worst case by 1024 tuples at once.

So in Python 3.8, the implementation was switched from the FNV-1a-style scheme to an adaptation of xxHash, a fast digest scheme. See the new tuplehash() function implementation for details.
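As a rough Python sketch of the new scheme, again assuming a 64-bit build (the constants are the xxHash64 primes; the name new_tuplehash is mine, and this is an illustration rather than a reference implementation):

def new_tuplehash(items):
    """Sketch of the 3.8+ xxHash-based tuplehash() for a 64-bit build."""
    MASK = 2 ** 64 - 1
    P1 = 11400714785074694791  # _PyHASH_XXPRIME_1
    P2 = 14029467366897019727  # _PyHASH_XXPRIME_2
    P5 = 2870177450012600261   # _PyHASH_XXPRIME_5
    acc = P5
    for item in items:
        lane = hash(item) & MASK
        acc = (acc + lane * P2) & MASK
        acc = ((acc << 31) | (acc >> 33)) & MASK  # rotate left by 31 bits
        acc = (acc * P1) & MASK
    # add the tuple length, mangled, as the C implementation does
    acc = (acc + (len(items) ^ (P5 ^ 3527539))) & MASK
    if acc == MASK:            # (Py_uhash_t)-1 is reserved for errors
        return 1546275796
    return acc - 2 ** 64 if acc >= 2 ** 63 else acc  # as signed hash

Note how each input hash is multiplied by a large odd prime and the accumulator is rotated before the next multiplication; the rotation is what moves high bits back down into the low half, which is exactly what the old xor-and-multiply scheme lacked.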

This new method performs great on the examples from your question:

>>> sys.version_info
sys.version_info(major=3, minor=8, micro=1, releaselevel='final', serial=0)
>>> hash((1, -1, 0))
426056430309831993
>>> hash((1, 0, -1))
-7823806182320511195
>>> hash(((1, -1, 0), (1, 0, 0), (1, 0, -1)))
-6252168277346219339
>>> hash(((1, 0, -1), (1, 0, 0), (1, -1, 0)))
-5221381175350594014

as well as the pathological cases I discussed above:

>>> values = (0.25, 0.5)
>>> len(set(map(hash, product(values, repeat=20))))
1048576
>>> values = (1, -1, 0)
>>> len(set(map(hash, product(values, repeat=12))))
531441