Why does Visual Studio add "-1937169414" to a generated hash code computation?

If you look for -1521134295 in Microsoft's repositories, you'll see that it appears quite a number of times:

  • https://github.com/search?q=org%3Amicrosoft+%22-1521134295%22+OR+0xa5555529&type=Code
  • https://github.com/search?q=org%3Adotnet++%22-1521134295%22+OR+0xa5555529&type=Code

Most of the search results are in GetHashCode functions, and they all have the following form:

int hashCode = SOME_CONSTANT;
hashCode = hashCode * -1521134295 + field1.GetHashCode();
hashCode = hashCode * -1521134295 + field2.GetHashCode();
// ...
return hashCode;

The first hashCode * -1521134295 = SOME_CONSTANT * -1521134295 is pre-multiplied, either at generation time by the generator or at compile time by CSC (constant folding). That's where the -1937169414 in your code comes from.
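To see that folding concretely, here is a minimal sketch in Java, whose 32-bit int arithmetic wraps on overflow exactly like C#'s unchecked context. The seed below is hypothetical, since the real SOME_CONSTANT depends on the type's member names:

```java
public class FoldDemo {
    public static void main(String[] args) {
        final int hashFactor = -1521134295; // == 0xa5555529 as a signed int

        // Hypothetical seed; the real SOME_CONSTANT is derived from the
        // symbol names, so it differs from class to class.
        int seed = 2;

        // The generator (or CSC) folds seed * hashFactor into a single
        // constant; the multiplication silently wraps in two's complement.
        int folded = seed * hashFactor;
        System.out.println(folded); // 1252698706 for this hypothetical seed
    }
}
```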

Digging deeper into the results reveals the code-generation part, which can be found in the function CreateGetHashCodeMethodStatements:

const int hashFactor = -1521134295;

var initHash = 0;
var baseHashCode = GetBaseGetHashCodeMethod(containingType);
if (baseHashCode != null)
{
    initHash = initHash * hashFactor + Hash.GetFNVHashCode(baseHashCode.Name);
}

foreach (var symbol in members)
{
    initHash = initHash * hashFactor + Hash.GetFNVHashCode(symbol.Name);
}

As you can see, the hash depends on the symbol names. In that function the constant is also called permuteValue, probably because the multiplication is supposed to permute the bits around somehow:

// -1521134295
var permuteValue = CreateLiteralExpression(factory, hashFactor);

There are some patterns if we view the value in binary: 101001 010101010101010 101001 01001 or 10100 1010101010101010 10100 10100 1. But if we multiply an arbitrary value by it, there are lots of overlapping carries, so I couldn't see how it works. The output may also have a different number of set bits, so it's not really a permutation.
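That last observation is easy to check programmatically. A quick Java demonstration (again, Java ints wrap like C#'s unchecked ints):

```java
public class PermuteCheck {
    public static void main(String[] args) {
        final int hashFactor = 0xa5555529; // == -1521134295 as a signed int

        System.out.println(Integer.toBinaryString(hashFactor)); // 10100101010101010101010100101001
        System.out.println(Integer.bitCount(hashFactor));       // 15 set bits

        // If the multiplication merely permuted the bits of the input,
        // the number of set bits would be preserved. It is not:
        System.out.println(Integer.bitCount(3));              // 2
        System.out.println(Integer.bitCount(3 * hashFactor)); // 29
    }
}
```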

You can find another generator in Roslyn's AnonymousTypeGetHashCodeMethodSymbol, which names the constant HASH_FACTOR

//  Method body:
//
//  HASH_FACTOR = 0xa5555529;
//  INIT_HASH = (...((0 * HASH_FACTOR) + GetFNVHashCode(backingFld_1.Name)) * HASH_FACTOR
//                                     + GetFNVHashCode(backingFld_2.Name)) * HASH_FACTOR
//                                     + ...
//                                     + GetFNVHashCode(backingFld_N.Name)
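The comment above can be sketched as running code. As far as I can tell, Roslyn's Hash.GetFNVHashCode is a 32-bit FNV-1a hash over the string's characters; the backing-field names below are placeholders, so the resulting INIT_HASH is only illustrative:

```java
public class InitHashDemo {
    static final int HASH_FACTOR = 0xa5555529; // -1521134295
    static final int FNV_OFFSET = (int) 2166136261L;
    static final int FNV_PRIME = 16777619;

    // 32-bit FNV-1a over the string's chars, which is what Roslyn's
    // Hash.GetFNVHashCode appears to do.
    static int getFnvHashCode(String text) {
        int hash = FNV_OFFSET;
        for (int i = 0; i < text.length(); i++) {
            hash = (hash ^ text.charAt(i)) * FNV_PRIME;
        }
        return hash;
    }

    public static void main(String[] args) {
        // Placeholder names; INIT_HASH changes whenever they do.
        String[] fieldNames = { "backingFld_1", "backingFld_2" };

        int initHash = 0;
        for (String name : fieldNames) {
            initHash = initHash * HASH_FACTOR + getFnvHashCode(name);
        }
        System.out.println(initHash);
    }
}
```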

The real reason for choosing that particular value is still unclear.


As GökhanKurt explained in the comments, the number changes based upon the property names involved. If you rename the property to Halue, the number becomes 387336856 instead. I had tried it with different classes but didn't think of renaming the property.

Gökhan's comment made me understand its purpose: it offsets hash values by a deterministic but randomly distributed amount. This way, combining hash values for different classes, even with a simple addition, is still somewhat resistant to hash collisions.

For instance, if you have two classes with similar GetHashCode implementations:

public class A
{
    public int Value { get; set; }
    public override int GetHashCode() => Value;
}

public class B
{
    public int Value { get; set; }
    public override int GetHashCode() => Value;
}

and if you have another class that contains references to these two:

public class C
{
    public A ValueA { get; set; }
    public B ValueB { get; set; }
    public override int GetHashCode()
    {
        return ValueA.GetHashCode() + ValueB.GetHashCode();
    }
}

a poor combination like this would be prone to hash collisions, because the resulting hash codes would accumulate around the same area whenever the values of ValueA and ValueB are close to each other. It doesn't really matter whether you use multiplication or bitwise operations to combine them; without an evenly distributed offset they would still be prone to collisions. Since many integer values used in programming accumulate around 0, it makes sense to use such an offset.
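To make that concrete, here is a small Java sketch (Java's 32-bit int wraps on overflow just like C#'s unchecked arithmetic) contrasting plain addition with the multiply-then-add pattern:

```java
public class CollisionDemo {
    public static void main(String[] args) {
        final int hashFactor = -1521134295;

        // Plain addition: swapping the two component hashes collides.
        System.out.println((1 + 2) == (2 + 1)); // true: a collision

        // Multiplying the running hash by the factor before adding the
        // next component makes the combination order-sensitive...
        System.out.println((1 * hashFactor + 2) == (2 * hashFactor + 1)); // false

        // ...and scatters nearby inputs across the 32-bit range:
        for (int value = 0; value < 4; value++) {
            System.out.println(value + " -> " + value * hashFactor);
        }
        // prints 0 -> 0, 1 -> -1521134295, 2 -> 1252698706, 3 -> -268435589
    }
}
```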

Apparently, it's good practice to use a random-looking offset with good bit patterns.

I'm still not sure why they don't use completely random offsets; probably it's to avoid breaking code that relies on GetHashCode() being deterministic, but it would be great to receive a comment from the Visual Studio team about this.