Simple hash of PIL image

I'm guessing your goal is to perform image hashing in Python, which is quite different from classic hashing because the byte representation of an image depends on its format, resolution and so on.

One image hashing technique is average hashing. Note that it is not 100% accurate, but it works fine in most cases.


First we simplify the image by reducing its size and colors. Reducing the complexity of the image contributes massively to the accuracy of comparisons with other images:

Reducing size:

img = img.resize((10, 10), Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter

Reducing colors:

img = img.convert("L")

Then we find the average pixel value of the image, which is the central component of average hashing:

pixel_data = list(img.getdata())
avg_pixel = sum(pixel_data)/len(pixel_data)

Finally the hash is computed: we compare each pixel in the image to the average pixel value. If a pixel is greater than or equal to the average we get 1, otherwise 0. Then we convert these bits to a base-16 representation:

bits = "".join(['1' if (px >= avg_pixel) else '0' for px in pixel_data])
hex_representation = format(int(bits, 2), "025X")  # 100 bits -> 25 hex digits, zero-padded so every hash has the same length
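
Putting the steps above together, a minimal sketch might look like the following. The function name `average_hash` and its `size` parameter are my own choices, not part of Pillow, and I am assuming a Pillow version that provides `Image.LANCZOS`. It zero-pads the hex string to a fixed width so that hashes of different images always have the same length.

```python
from PIL import Image


def average_hash(img, size=10):
    """Return an upper-case hex average hash of a PIL image (illustrative sketch)."""
    # Shrink the image, then drop the colour information.
    small = img.resize((size, size), Image.LANCZOS).convert("L")
    # For an "L" image, tobytes() yields one byte (0-255) per pixel.
    pixel_data = list(small.tobytes())
    avg_pixel = sum(pixel_data) / len(pixel_data)
    # One bit per pixel: 1 if the pixel is at least as bright as the average.
    bits = "".join("1" if px >= avg_pixel else "0" for px in pixel_data)
    # Zero-pad to a fixed number of hex digits (4 bits per digit).
    width = (size * size + 3) // 4
    return format(int(bits, 2), "0{}X".format(width))
```

You would call it as `average_hash(Image.open("some_file.png"))`, where the filename is only a placeholder.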

If you want to compare this image to other images, perform the steps above on each of them and measure the similarity between the hexadecimal representations of their average hashes. You can use something as simple as Hamming distance, or more complex algorithms such as Levenshtein distance, Ratcliff/Obershelp pattern recognition (SequenceMatcher), cosine similarity, etc.
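
As an illustration of the simplest option, here is a rough sketch of a bitwise Hamming distance between two hex hashes. The helper names `hamming_distance` and `similarity` are hypothetical, and the sketch assumes both hashes have the same number of hex digits (for example because they were zero-padded).

```python
def hamming_distance(hash_a, hash_b):
    """Count the differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 wherever the two hashes disagree; count those bits.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def similarity(hash_a, hash_b):
    """Fraction of matching bits, from 0.0 (all differ) to 1.0 (identical)."""
    total_bits = 4 * len(hash_a)  # each hex digit encodes 4 bits
    return 1 - hamming_distance(hash_a, hash_b) / total_bits
```

A small distance suggests the images look alike; what threshold counts as "similar" is something you would have to tune for your own images.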


Recognising what you say about timestamps, ImageMagick has exactly such a feature. First, an example.

Here I create two images with identical pixels but timestamps at least 1 second apart:

convert -size 600x100 gradient:magenta-cyan 1.png
sleep 2
convert -size 600x100 gradient:magenta-cyan 2.png


If I checksum them on macOS, it tells me they are different because of the embedded timestamp:

md5 -r [12].png

c7454aa225e3e368abeb5290b1d7a080 1.png
66cb4de0b315505de528fb338779d983 2.png

But if I checksum just the pixels with ImageMagick (where %# is the pixel-wise checksum), it knows the pixels are identical and I get:

identify -format '%# - %f\n' 1.png 2.png
70680e2827ad671f3732c0e1c2e1d33acb957bc0d9e3a43094783b4049225ea5 - 1.png
70680e2827ad671f3732c0e1c2e1d33acb957bc0d9e3a43094783b4049225ea5 - 2.png

And, in fact, if I make a TIFF file with the same image contents, whether with Motorola or Intel byte order, or a NetPBM PPM file:

convert -size 600x100 gradient:magenta-cyan -define tiff:endian=msb 3motorola.tif
convert -size 600x100 gradient:magenta-cyan -define tiff:endian=lsb 3intel.tif
convert -size 600x100 gradient:magenta-cyan 3.ppm

ImageMagick knows they are the same, despite the different file formats, byte orders and timestamps:

identify -format '%# - %f\n' 1.png 3.ppm 3{motorola,intel}.tif

70680e2827ad671f3732c0e1c2e1d33acb957bc0d9e3a43094783b4049225ea5 - 1.png
70680e2827ad671f3732c0e1c2e1d33acb957bc0d9e3a43094783b4049225ea5 - 3.ppm
70680e2827ad671f3732c0e1c2e1d33acb957bc0d9e3a43094783b4049225ea5 - 3motorola.tif
70680e2827ad671f3732c0e1c2e1d33acb957bc0d9e3a43094783b4049225ea5 - 3intel.tif

So, in answer to your question, I suggest you shell out to ImageMagick from Python with the subprocess module and use its pixel-wise checksum.
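
A minimal sketch of that, assuming the `identify` program is on your PATH (with ImageMagick 7 you may need `magick identify` instead). The function names `identify_command` and `pixel_checksum` are just my own for illustration:

```python
import subprocess


def identify_command(path):
    """Build the identify command that prints only the pixel checksum (%#)."""
    return ["identify", "-format", "%#", path]


def pixel_checksum(path):
    """Return ImageMagick's pixel-wise checksum for one image file."""
    # check=True raises CalledProcessError if identify fails,
    # e.g. because the file is missing or is not an image.
    result = subprocess.run(
        identify_command(path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

With the files created above, `pixel_checksum("1.png") == pixel_checksum("2.png")` should then be true even though their MD5 sums differ.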