Files with same content but with different md5sums when gzip'd?

According to RFC 1952, the gzip file header includes the modification time of the original file (field MTIME). You can display the header information in plain text (see note 1) with gzip -lv renew.log.gz:

method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla 64263ac7 Jun 21 17:59                 314                 597  52.1% renew.log

So, if you really want to compare the gzip'd files themselves, compress them with the -n option, so that the original file name and time stamp are not saved,

gzip -n renew.log s3/renew.log

and their md5sums should be identical.

Otherwise you could use

md5sum <(zcat renew.log.gz) <(zcat s3/renew.log.gz)

to calculate the md5sum of the decompressed files.
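
To check the effect of -n yourself, something like the following sketch should do. The temporary directory, the file content and the two time stamps are made up for illustration; gzip -c is used only so that the input files survive both runs.

```shell
# Two copies with identical content and name, but different mtimes.
tmp=$(mktemp -d)
mkdir "$tmp/s3"
printf 'same content\n' > "$tmp/renew.log"
cp "$tmp/renew.log" "$tmp/s3/renew.log"
touch -t 201306211759 "$tmp/renew.log"       # illustrative time stamp
touch -t 201307112234 "$tmp/s3/renew.log"    # a different one

# Default: MTIME (and the name) go into the header, so the output differs.
default_a=$(gzip -c "$tmp/renew.log"    | md5sum)
default_b=$(gzip -c "$tmp/s3/renew.log" | md5sum)

# With -n: neither name nor time stamp is stored, so the output matches.
plain_a=$(gzip -nc "$tmp/renew.log"    | md5sum)
plain_b=$(gzip -nc "$tmp/s3/renew.log" | md5sum)

echo "default: $default_a / $default_b"
echo "with -n: $plain_a / $plain_b"
rm -rf "$tmp"
```

The two "default" sums should differ, while the two "with -n" sums should be the same.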


1) However, the date and time displayed are not taken from the header; they are the current values of the compressed file itself. The same goes for the file name:

$ gzip renew.log 
$ mv renew.log.gz foo.gz
$ gzip -lv foo.gz -------- uncompressed name is taken from current name ---v
method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla 6c721644 Jul 11 22:34                 580                1586  65.7% foo
$ hexdump -C foo.gz | head -n 2
00000000  1f 8b 08 08 f0 16 df 51  00 03 72 65 6e 65 77 2e  |.......Q..renew.|
00000010  6c 6f 67 00 8d 93 dd 6e  9b 30 18 86 8f 89 94 7b  |log....n.0.....{|
                                                             ^^^-------^^^^^
                                                  original filename is stored in the header
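
If you want the time stamp that is really stored in the header, you can read the MTIME field directly: RFC 1952 puts it in bytes 4-7, least significant byte first. A minimal sketch follows; gzip_mtime is a hypothetical helper name, not part of gzip, and the demo simply replays the first eight header bytes from the hexdump above.

```shell
# Print the MTIME field of a gzip file as seconds since the Unix epoch.
# RFC 1952 stores it in header bytes 4-7, least significant byte first.
gzip_mtime() {
    # od emits the four bytes as unsigned decimals, e.g. "240 22 223 81"
    set -- $(od -A n -t u1 -j 4 -N 4 "$1")
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

# Demo on the first eight header bytes shown in the hexdump
# (1f 8b 08 08 f0 16 df 51), written in octal for printf.
hdr=$(mktemp)
printf '\037\213\010\010\360\026\337\121' > "$hdr"
gzip_mtime "$hdr"    # prints 1373574896
rm -f "$hdr"
```

1373574896 is 20:34:56 UTC on 11 Jul 2013, which would match the 22:34 shown by gzip -lv on a machine two hours ahead of UTC. On a file compressed with -n the helper should print 0.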

Why do you expect the compressed versions of the same file to be identical? The compression program (gzip) may include a timestamp in the header, or may use a randomized algorithm.

And that is exactly the case here: the gzip header contains the timestamp. If you want the compressed files to be identical, the input files must have the same timestamp (and the same name, which is stored in the header as well).

So, when you copy a file, always do it as cp -p file1 file2, not just cp file1 file2, which does not preserve the timestamp - that is actually a bad habit!
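
A small experiment along these lines might look as follows. The directory names and the time stamp are invented for the example, and the copies deliberately keep the same base name, because gzip stores the name in the header too.

```shell
# Compare a plain copy with a cp -p copy after compression.
tmp=$(mktemp -d)
mkdir "$tmp/plain" "$tmp/preserved"
printf 'same content\n' > "$tmp/file1"
touch -t 201306211759 "$tmp/file1"           # give the original an old mtime

cp    "$tmp/file1" "$tmp/plain/file1"        # mtime becomes "now"
cp -p "$tmp/file1" "$tmp/preserved/file1"    # mtime is carried over

orig=$(gzip -c "$tmp/file1"           | md5sum)
plain=$(gzip -c "$tmp/plain/file1"    | md5sum)
kept=$(gzip -c "$tmp/preserved/file1" | md5sum)

echo "original:   $orig"
echo "cp copy:    $plain"
echo "cp -p copy: $kept"
rm -rf "$tmp"
```

Only the cp -p copy should compress to the same bytes as the original.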


Just use gzip with the '-n' flag:

tiagocruz@stark:~$ gzip -n Yippie-Ki-Yay.mp3 bla/Yippie-Ki-Yay.mp3 

tiagocruz@stark:~$ sha1sum Yippie-Ki-Yay.mp3.gz bla/Yippie-Ki-Yay.mp3.gz 
b44b21c5f414935f1ced1187bfafd989704474a5  Yippie-Ki-Yay.mp3.gz
b44b21c5f414935f1ced1187bfafd989704474a5  bla/Yippie-Ki-Yay.mp3.gz

Source: https://unix.stackexchange.com/questions/31008/why-does-the-gzip-version-of-files-produce-a-different-md5-checksum

Tags: Linux, Md5, Gzip