Check data integrity after copying thousands of files

Using MD5 sums is a good approach, but the canonical way to use them is:

  1. cd to the directory of the source files and issue:

    md5sum * >/path/to/the/checksumfile.md5
    

If you have directories with many levels, you can use shopt -s globstar (a bash option) and replace * with **/*.

Notice that the file specs in the MD5 file are exactly as provided in the command line (relative paths unless your pattern starts with a /).

  2. cd to the directory of the copied files and issue:

    md5sum -c /path/to/the/checksumfile.md5
    

With -c, md5sum reads the file specs in the provided MD5 file, computes the MD5 of these files, and compares them to the values from the MD5 file (which is why the file specs are usually better left relative, so you can re-use the MD5 file on files in various directories).

Using MD5 sums this way immediately tells you about MD5 differences, and also about missing files.
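The two steps above can be sketched end to end on a throwaway tree. This is only an illustration under some assumptions: GNU md5sum, sort and xargs are available, all paths are temporary directories invented for the example, and find is swapped in for the globstar pattern because **/* also matches directories, which md5sum cannot checksum.

```shell
#!/usr/bin/env bash
# Sketch: build a tiny source tree, copy it, damage the copy, then verify.
# Every path here is a throwaway temp directory made up for the example.
work=$(mktemp -d)
mkdir -p "$work/src/sub"
printf 'alpha\n' > "$work/src/a.txt"
printf 'beta\n'  > "$work/src/sub/b.txt"
printf 'gamma\n' > "$work/src/sub/c.txt"

# Step 1: record checksums with relative paths, from inside the source directory.
# find lists regular files only, so directories never reach md5sum.
(cd "$work/src" && find . -type f -print0 | sort -z | xargs -0 md5sum > "$work/sums.md5")

# Copy, then simulate one corrupted file and one missing file.
cp -r "$work/src" "$work/dst"
printf 'BETA\n' > "$work/dst/sub/b.txt"
rm "$work/dst/sub/c.txt"

# Step 2: verify from inside the copy; --quiet suppresses the per-file OK lines.
report=$(cd "$work/dst" && md5sum -c --quiet "$work/sums.md5" 2>&1)
status=$?   # non-zero when any file fails or cannot be read

printf '%s\n' "$report"
echo "md5sum exit status: $status"
rm -rf "$work"
```

The exit status makes this easy to use in a script: zero means every listed file was found and matched.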


Unmount, eject, and remount the device, so that the comparison reads the data back from the device rather than from the system's cache. Then use

diff -r source destination

In case you used rsync to do the copy, rsync -n -c (with -i or -v added, so that it actually lists what differs) might be very convenient, and it is nearly as good as diff. It doesn't do a bit-for-bit comparison though; it uses an MD5 checksum.
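As a rough illustration of the diff approach, the sketch below runs diff -r on a small throwaway tree. The directory names are made up for the example, and -q (report only which files differ, not how) is an addition that assumes GNU or BSD diff.

```shell
#!/usr/bin/env bash
# Sketch: compare a copy against its source with diff -r.
# The directories live under a temp directory created just for this example.
work=$(mktemp -d)
mkdir -p "$work/source/d"
printf 'one\n'   > "$work/source/d/a"
printf 'two\n'   > "$work/source/d/b"
printf 'three\n' > "$work/source/c"

cp -r "$work/source" "$work/destination"
printf 'TWO\n' > "$work/destination/d/b"   # simulate a corrupted file
rm "$work/destination/c"                   # simulate a file that was never copied

# -r recurses into subdirectories; -q only names the files that differ.
diff_report=$(diff -rq "$work/source" "$work/destination")
diff_status=$?   # 0 = identical, 1 = differences found, 2 = trouble

printf '%s\n' "$diff_report"
echo "diff exit status: $diff_status"
rm -rf "$work"
```

Unlike the checksum-file method, diff also reports files that exist only in the destination.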


There are some similar answers with other details at: Verifying a large directory after copy from one hard drive to another


rsync -rc original-dir/ copied-dir/

-c causes rsync to compare files by MD5 checksum (without it, it normally uses only the timestamp and size for quicker comparisons).

This will also cause rsync to copy whatever it sees as different in, or missing from, the destination. To avoid that, you can also use -n and -i. The former ensures that rsync doesn't make any changes and only compares, and the latter causes it to display the differences that it sees.

For example, I have the following dirs:

$ find dir1/ dir2/
dir1/
dir1/d
dir1/d/a
dir1/d/b
dir1/c
dir2/
dir2/d
dir2/d/a
dir2/d/b

And this:

$ rsync -rcni dir1/ dir2/
>f+++++++++ c
>fc.T...... d/b

Tells me, by way of all those +s, that file c does not exist in dir2, and file d/b does, but is different (indicated by the c in the first column). The T says that its time would be updated (had we not used -n).

The format of -i's output is described in the manpage for rsync. You can man rsync and get to the part that explains that output by typing /--itemize-changes$ (and hitting Enter).
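Since the dry run prints nothing when the trees match, its output can double as a pass/fail check. The sketch below is one way to wrap that. The function name verify_copy and the temp directories are invented for the example, and it assumes rsync is installed.

```shell
#!/usr/bin/env bash
# Sketch: use rsync's dry-run itemized output as a pass/fail check.
# verify_copy is a hypothetical helper name, not an rsync feature.
verify_copy() {
    # -r recurse, -c compare by checksum, -n dry run, -i itemize changes.
    # The trailing slashes make rsync compare the directories' contents.
    local changes
    changes=$(rsync -rcni "$1/" "$2/") || return 2   # rsync itself failed
    if [ -n "$changes" ]; then
        printf '%s\n' "$changes"
        return 1                                     # something would be copied
    fi
    return 0                                         # nothing to copy
}

work=$(mktemp -d)
mkdir -p "$work/dir1/d"
printf 'one\n'   > "$work/dir1/d/a"
printf 'two\n'   > "$work/dir1/d/b"
printf 'three\n' > "$work/dir1/c"
cp -rp "$work/dir1" "$work/dir2"

verify_copy "$work/dir1" "$work/dir2"; clean_status=$?

printf 'TWO\n' > "$work/dir2/d/b"   # same size, different content
rm "$work/dir2/c"                   # file missing from the copy
damaged_report=$(verify_copy "$work/dir1" "$work/dir2"); damaged_status=$?

printf '%s\n' "$damaged_report"
echo "clean: $clean_status, damaged: $damaged_status"
rm -rf "$work"
```

Note that this only reports what rsync would copy from the source. Files that exist only in the destination are not listed unless --delete is added to the dry run.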