How to calculate a hash of a file that is 1 terabyte or larger?

If you have a 1 million MB (1 TB) file and your system can read it at 100 MB/s, then

  • 1 TB * 1000 (GB/TB) = 1000 GB
  • 1000 GB * 1000 (MB/GB) = 1 million MB
  • 1 million MB / 100 (MB/s) = 10 thousand seconds
  • 10000 s / 3600 (s/hr) = 2.77... hr
  • Therefore, a 100 MB/s system needs at least 2.77... hours just to read the file, before any additional time spent computing the hash.

Your expectations are probably unrealistic - don't try to calculate a faster hash until you can perform a faster file read.
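As a sanity check, the same back-of-the-envelope arithmetic can be scripted. This is a minimal sketch using the 100 MB/s example figure above, not a measured throughput:

#!/usr/bin/perl
# Estimate the read-time floor for hashing a file, given its size and an
# assumed sequential read speed. Values are the example figures from above.
use strict;
use warnings;

my $size_bytes    = 1_000_000_000_000;   # 1 TB (decimal)
my $read_mb_per_s = 100;                 # assumed sequential read speed

my $seconds = ($size_bytes / 1_000_000) / $read_mb_per_s;
printf "Read-time floor: %u seconds (%.2f hours)\n", $seconds, $seconds / 3600;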


Old and already answered, but you may try to hash only selected chunks of the file.

There is a Perl solution I found somewhere that seems effective; the code is not mine:

#!/usr/bin/perl

use strict;
use Time::HiRes qw[ time ];
use Digest::MD5;

sub quickMD5 {
    my $fh  = shift;
    my $md5 = Digest::MD5->new;    # fixed: was "new Digest::MD5->new"

    # Mix the file size into the digest so files of different lengths
    # cannot produce the same result from the sampled blocks alone.
    $md5->add( -s $fh );

    my $pos = 0;
    until ( eof $fh ) {
        seek $fh, $pos, 0;                       # jump to the next sample point
        read( $fh, my $block, 4096 ) or last;    # hash only the first 4 KB there
        $md5->add( $block );
        $pos += 2048**2;                         # advance 4 MB (2048**2 bytes)
    }
    return $md5;
}

open FH, '<', $ARGV[0] or die $!;
printf "Processing %s : %u bytes\n", $ARGV[0], -s FH;

my $start = time;
my $qmd5 = quickMD5( *FH );
printf "Partial MD5 took %.6f seconds\n", time() - $start;
print "Partial MD5: ", $qmd5->hexdigest, "\n";

Basically the script performs MD5 on the first 4KB of every 4MB block in the file (the original version sampled every 1MB).
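For a sense of scale, here is a small sketch of how much data that sampling actually reads, reusing the 1 TB example size from the first answer and the script's 4 KB sample per 4 MB stride. Hashing 4 KB out of every 4 MB touches roughly 0.1% of the file, about 1 GB of reads for a 1 TB file, which is why it is so much faster than a full hash; the trade-off is that any change outside a sampled 4 KB window goes undetected.

#!/usr/bin/perl
# Rough coverage of the sampled hash, assuming a 1 TB file and the
# 4 KB-per-4 MB sampling used by the script above.
use strict;
use warnings;

my $file_size = 1_000_000_000_000;   # 1 TB, the example size from the first answer
my $stride    = 2048**2;             # 4 MB between sample points
my $sample    = 4096;                # 4 KB hashed per sample
my $blocks    = int($file_size / $stride);
printf "Samples: %u, bytes hashed: %u (%.3f%% of the file)\n",
       $blocks, $blocks * $sample, 100 * $sample / $stride;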