Using pv with md5sum

The pv utility is a "fancy cat", which means that you may use pv in most situations where you would use cat.

Using cat with md5sum, you can compute the MD5 checksum of a single file with

cat file | md5sum

or, with pv,

pv file | md5sum

Unfortunately though, md5sum then reads from standard input and never sees the filename, so it prints - in its place in the output.
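To see the difference, here is a minimal sketch, assuming GNU md5sum and a scratch file created with mktemp (the variable names are illustrative only). It hashes the same data once through a pipe and once by name:

```shell
# Hypothetical scratch file, used only for illustration.
tmpfile=$(mktemp)
printf 'hello\n' > "$tmpfile"

# Through a pipe: md5sum only sees standard input, so the name field is "-".
piped=$(cat "$tmpfile" | md5sum)

# By name: md5sum opens the file itself and prints the name after the hash.
direct=$(md5sum "$tmpfile")

printf '%s\n%s\n' "$piped" "$direct"
rm -f "$tmpfile"
```

Both lines carry the same hash; only the name field differs.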

Now, fortunately, pv is a really fancy cat, and on some systems (Linux), it's able to watch the data being passed through another process. This is done by using its -d option with the process ID of that other process.

This means that you can do things like

md5sum dir/* | sort >sums &
sleep 1
pv -d "$(pgrep -n md5sum)"

This would allow pv to watch the md5sum process. The sleep is there to allow md5sum, which is running in the background, to properly start. pgrep -n md5sum would return the PID of the most recently started md5sum process that you own. pv will exit as soon as the process that it is watching terminates.

I've tested this particular way of running pv a few times and it seems to generally work well, but sometimes it seems to stop outputting anything as md5sum switches to the next file. Sometimes, it seems to spawn spurious background tasks in the shell.

It would probably be safest to run it as

md5sum dir/* >sums &
sleep 1
pv -W -d "$!"
sort -o sums sums

The -W option makes pv wait until data is actually being transferred before it shows any progress, although this does not always seem to work reliably either.


The data that you are feeding through the pipe is not the data of the files that md5sum is processing, but the md5sum output, which, for every file, consists of one line comprising the MD5 hash, two spaces, and the file name. Since we know this in advance, we can inform pv accordingly, so that it can display an accurate progress indicator. There are two ways of doing so.

The first, preferred method (suggested by frostschutz) makes use of the fact that md5sum generates one line per processed file, and the fact that pv has a line mode that counts lines rather than bytes. In this mode pv will only move the progress bar when it encounters a newline in the throughput, i.e. per file finished by md5sum. In Bash, this first method can look like this:

set -- *.iso; md5sum "$@" | pv --line-mode -s $# | sort

The set builtin is used to set the positional parameters to the files to be processed (the *.iso shell pattern is expanded by the shell). md5sum is then told to process these files ($@ expands to the positional parameters), and pv in line mode will move the progress indicator each time a file has been processed / a line is output by md5sum. Notably, pv is informed of the total number of lines it can expect (-s $#), as the special shell parameter $# expands to the number of positional arguments.
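Why -s $# gives pv the right total can be checked without pv itself: the number of positional parameters set by set -- should equal the number of lines md5sum emits. A minimal sketch, assuming GNU md5sum and three hypothetical scratch files in a temporary directory:

```shell
# Hypothetical scratch directory with three small files, for illustration only.
workdir=$(mktemp -d)
for name in a b c; do
    printf '%s\n' "$name" > "$workdir/$name.iso"
done

# Same idiom as above: the shell expands the pattern into positional parameters.
set -- "$workdir"/*.iso
expected_lines=$#

# md5sum prints one line per file, so the counts should agree.
actual_lines=$(md5sum "$@" | wc -l)

printf 'expected=%s actual=%s\n' "$expected_lines" "$actual_lines"
rm -rf "$workdir"
```

Note that if the pattern matches nothing, Bash leaves it unexpanded by default, so $# would be 1 and md5sum would complain about a missing file.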

The second method is not line-based but byte-based. With md5sum this is unnecessarily complicated, but some other program may produce, for instance, continuous data rather than lines, and then this approach may be more practical. I illustrate it with md5sum anyway. The idea is to calculate the amount of data that md5sum (or some other program) will produce, and use this to inform pv. In Bash, this could look as follows:

os=$(( $( ls -1 | wc -c ) + $( ls -1 | wc -l ) * 34 ))
md5sum * | pv -s $os | sort

The first line calculates an estimate of the output size (os): the first term is the number of bytes needed for the filenames (including one newline each), and the second term is the number of bytes used for the MD5 hashes (32 bytes each) plus the two separating spaces, i.e. 34 bytes per file. In the second line, we tell pv that the expected amount of data is os bytes, so that it can show an accurate progress indicator leading up to 100% (the indicator is updated each time md5sum finishes a file).
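The estimate can be checked against what md5sum really writes. A minimal sketch, assuming GNU md5sum and a hypothetical scratch directory holding two regular files with plain names:

```shell
# Hypothetical scratch directory, for illustration only.
workdir=$(mktemp -d)
printf 'one\n' > "$workdir/first.txt"
printf 'two\n' > "$workdir/second.txt"

# Estimate: bytes of all names (one newline each) plus 34 bytes per file
# (32 hex digits of the hash and 2 separating spaces).
os=$(( $(ls -1 "$workdir" | wc -c) + $(ls -1 "$workdir" | wc -l) * 34 ))

# Actual size of the md5sum output, run from inside the directory so that
# the names match what ls printed.
actual=$(cd "$workdir" && md5sum * | wc -c)

printf 'estimate=%s actual=%s\n' "$os" "$actual"
rm -rf "$workdir"
```

The estimate assumes that every directory entry is a regular file with an ordinary name. Subdirectories, or names that md5sum has to escape (such as those containing a backslash or a newline), would likely make the two numbers disagree.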

Obviously, both methods are only practical when multiple files are to be processed. Also, since the amount of output md5sum produces is unrelated to the time it spends crunching the underlying data, the progress indicator may be considered somewhat misleading. E.g., in the second method, the file with the shortest name will yield the smallest progress update, even though it may actually be the largest file. Then again, if all files have similar sizes and names, this shouldn't matter much.


Here's a dirty hack to get progress per file:

for f in iso/*
do
    pv "$f" | (
        cat > /dev/null &
        md5sum "$f"
        wait
    )
done

What it looks like:

4.15GiB 0:00:32 [ 130MiB/s] [================================>] 100%            
0db0b36fc7bad7b50835f68c369e854c  iso/KNOPPIX_V7.6.1DVD-2016-01-16-EN.iso
 792MiB 0:00:06 [ 130MiB/s] [================================>] 100%            
97537db63e61d20a5cb71d29145b2937  iso/archlinux-2016.10.01-dual.iso
 843MiB 0:00:06 [ 129MiB/s] [================================>] 100%            
1b5dc31e038499b8409f7d4d720e3eba  iso/lubuntu-16.04-desktop-i386.iso
 259MiB 0:00:02 [ 130MiB/s] [=========>                        ] 30% ETA 0:00:04
...

Now, this makes several assumptions. Firstly, that reading data is slower than hashing it. Secondly, that the OS will cache the I/O, so the data won't be (physically) read twice even though pv and md5sum are completely independent readers.

The nice thing about such a dirty, dirty hack is that you can easily adapt it to make a progress bar across all the data, not just one file. And still do weird stuff like sort the output afterwards.

pv iso/* | (
    cat > /dev/null &
    md5sum iso/* | sort
    wait
)

What it looks like (ongoing):

15.0GiB 0:01:47 [ 131MiB/s] [===========================>      ] 83% ETA 0:00:21

What it looks like (finished):

18.0GiB 0:02:11 [ 140MiB/s] [================================>] 100%            
0db0b36fc7bad7b50835f68c369e854c  iso/KNOPPIX_V7.6.1DVD-2016-01-16-EN.iso
155603390e65f2a8341328be3cb63875  iso/systemrescuecd-x86-4.2.0.iso
1b5dc31e038499b8409f7d4d720e3eba  iso/lubuntu-16.04-desktop-i386.iso
1b6ed6ff8d399f53adadfafb20fb0d71  iso/systemrescuecd-x86-4.4.1.iso
25715326d7096c50f7ea126ac20eabfd  iso/openSUSE-13.2-KDE-Live-i686.iso
...

So much for the hacks. Check other answers for proper solutions. ;-)
