Hash collision: "NO" means "YES"

32-bit Python 2.x (19)

hash(w*9)%537105043

RSA uses a semiprime modulus, and that makes it secure, so using one with my hash algorithm should surely make it even better!1

This is a pure math function, works for all strings (hell, works for any hashable Python object), and doesn't contain any conditionals or special-casing! 32-bit Python can typically be called as python-32 on most systems that have both installed2.

I've tested this, and it returns 18,278 different values for the 18,279 3-letter-or-less uppercase strings. Assigning this to a function takes 11 more bytes:

h=lambda w:hash(w*9)%537105043

and h('YES') == h('NO') == 188338253.

64-bit Python 2.x (19)

hash(w*2)%105706823

Same deal as above.


To come up with these numbers, a little bit of modular math was used. I was looking for a function f and a modulus n such that hash(f('YES')) % n == hash(f('NO')) % n. This is equivalent to testing that n divides d = hash(f('YES')) - hash(f('NO')), i.e. we only have to check the factors of d for suitable values of n.

The ideal n is in the neighbourhood of 20000**2 to reduce the chance of a birthday paradox collision. Finding a suitable n turns out to be a bit of trial and error, playing with all the factors of d (there usually aren't many) and different choices for the function f. Notice though that the trial and error is only needed because I wanted to make n as small as possible (for golfing). If that wasn't a requirement, I could just choose d as my modulus, which is usually sufficiently large.

Note too that you can't pull this trick off using just f(s) = s (the identity function) because the rightmost character of the string has essentially a linear relationship (actually an XOR relationship) with the final hash (the other characters contribute in a much more nonlinear way). The repetition of the string therefore ensures that the differences between the strings are amplified to eliminate the effect of changing only the rightmost character.


1 This is patent nonsense.
2 Python string hashing depends on major version (2 vs 3) and bitness (32-bit vs 64-bit). It does not depend on platform AFAIK.


Perl, 53 49 40 bytes

sub h{hex(unpack H6,pop)-20047||5830404}

Test:

h('YES') = 5830404
h('NO')  = 5830404
Keys:   18279
Values: 18278

The hash values for YES and NO are the same and there are 18279 strings ^[A-Z]{0,3}$, which are collision free except for the only collision for YES and NO.

Ungolfed:

sub h {
    hex(unpack("H6", pop())) - 20047 || 5830404;
    # The argument is the first and only element in the argument array @_.
    # "pop" gets the argument from array @_ (from the end).
    # The first three bytes of the argument or less, if the argument
    # is shorter, are converted to a hex string, examples:
    #   "YES" -> "594553"
    #   "NO"  -> "4e4f"
    # Then the hex string is converted to a number by function "hex":
    #   0x594553 = 5850451
    #   0x4e4f   =   20047
    # The value for "NO" is subtracted, examples:
    #   case "YES": 5850451 - 20047 = 5830404
    #   case "NO":    20047 - 20047 =       0
    # If the argument is "NO", the subtraction is zero, therefore
    # 5830404 is returned, the result of "YES".
}

# Test
my %cache;
sub addcache ($) {$cache{$_[0]} = h($_[0])}

# Check entries 'YES' and 'NO'
addcache 'YES';
addcache 'NO';
print "h('YES') = $cache{'YES'}\n";
print "h('NO')  = $cache{'NO'}\n";

# Fill cache with all strings /^[A-Z]{0-3}$/
addcache '';
for my $one (A..Z) {
    addcache $one;
    for (A..Z) {
        my $two = "$one$_";
        addcache $two;
        for (A..Z) {
            my $three = "$two$_";
            addcache $three;
        }
    }
}
# Compare number of keys with number of unique values
my $keys = keys %cache;
my %hash;
@hash{values %cache} = 1 x $keys;
$values = keys %hash;
print "Keys:   $keys\n";
print "Values: $values\n";

Older version, 49 bytes

Since the new algorithm is slightly different, I keep the old version.

sub h{($_=unpack V,pop."\0"x4)==20302?5457241:$_}

Test:

h('YES') = 5457241
h('NO')  = 5457241
Keys:   18279
Values: 18278

Ungolfed:

sub h {
    $_ = unpack('V', pop() . ($" x 4);
        # pop():  gets the argument (we have only one).
        # $" x 4: generates the string "    " (four spaces);
        #   adding the four spaces ensures that the string is long
        #   enough for unpack's template "V".
        # unpack('V', ...): takes the first four bytes as
        #   unsigned long 32-bit integer in little-endian ("VAX") order.
    $_ == 20302 ? 5457241 : $_;
        # If the hash code would be "NO", return the value for "YES".
}

Edits:

  • Using "\0" as fill byte saves 4 bytes in comparison to $".

Python: 63

An incredibly lame solution:

def h(s):
 try:r=int(s,36)
 except:r=0
 return(r,44596)[r==852]

It works by interpreting alphanumeric strings as base-36 numbers, and returning 0 for everything else. There's an explicit special case to check for a return value of 852 (NO), and return 44596 (YES) instead.