Counter of unique files in a directory

I'm a big fan of GNU datamash (https://www.gnu.org/software/datamash/). Here's sample output from a mocked-up set of files I created and ran this command on:

$ md5sum * | datamash -W -s -g 1 count 2 -f
5591dadf0051bee654ea41d962bc1af0    junk1   27
9c08c31b951a1a1e0c3a38effaca5863    junk2   17
f1e5cbfade7063a0c4fa5083fd36bf1a    junk3   7

There are 27 files with the hash 5591..., and one of them is "junk1". (Similarly, there are 17 files identical to "junk2", and 7 identical to "junk3".)

The -W says to use whitespace as the field delimiter. The -s -g 1 says to sort and group by field 1 (the hash). The field given to count could have been either 1 or 2; it doesn't matter which.

The -f says "print the entire input line". This has a quirk: when printing aggregated results, it prints the full line only for the first line it found in each group. In this case that works out fine, because it gives us one of the filenames involved in each dup-set instead of all of them.
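
For comparison, here's a sketch of the same command without -f, on the same mocked-up files; it prints only each group's hash and its count, which is enough if you don't need a representative filename:

$ md5sum * | datamash -W -s -g 1 count 2
5591dadf0051bee654ea41d962bc1af0    27
9c08c31b951a1a1e0c3a38effaca5863    17
f1e5cbfade7063a0c4fa5083fd36bf1a    7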


Expanding slightly on @Isaac's solution ....

Assuming bash syntax, and given:

$ find test -type f
test/AA
test/A
test/C
test/CC
test/B
test/D

where files A and AA are identical, as are C and CC.

This is an incrementally more effective command pipeline:

$ find test -maxdepth 1 -type f -exec bash -c "md5sum < {}" \; |
    sort -k1,1 |
    uniq --count
      2 102f2ac1c3266e03728476a790bd9c11  -
      1 4c33d7f68620b7b137c0ca3385cb6597  -
      1 88178a003e2305475e754a7ec21d137d  -
      2 c7a739d5538cf472c8e87310922fc86c  -
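
The md5sum < {} redirection is deliberate: hashing each file via stdin keeps the filename out of md5sum's output (hence the trailing "-"), so identical files produce identical lines that sort and uniq --count can group. For contrast, here's a sketch of hashing a pair of identical files by name (assuming, per the hash table built further down, that A and AA carry the first hash):

$ md5sum test/A test/AA
102f2ac1c3266e03728476a790bd9c11  test/A
102f2ac1c3266e03728476a790bd9c11  test/AA

Because the filename makes every line unique, uniq --count would have nothing to collapse.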

The remaining problem now is that the md5 hashes don't tell you which files are A, B, C or D. That can be solved, although it's a bit fiddly.

First, move your files into a subdirectory, or move your PWD up one directory if that's more convenient. In my example, I'm working in . and the files are in test/.

I'll propose that you identify one example each of the four file types, and copy them to files A, B, C and D (and beyond if you need to, up to Z):

$ cp -p test/file1002 ./A
...
$ cp -p test/file93002 ./N

etc. We can now build a hash table that maps the md5 hash of each unique output file A-Z back to its name:

$ for file in [A-Z]; do
      printf "s/%s/%s/\n" "$(md5sum < "$file")" "$file";
done
s/102f2ac1c3266e03728476a790bd9c11  -/A/
s/4c33d7f68620b7b137c0ca3385cb6597  -/B/
s/c7a739d5538cf472c8e87310922fc86c  -/C/
s/88178a003e2305475e754a7ec21d137d  -/D/

Notice that the hash table looks like sed syntax. Here's why:
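
Each line of that table is a plain sed s/old/new/ substitution. As a quick sketch, applying the first one to a single line of the earlier uniq output swaps the hash (and the trailing "  -") for the prototype name:

$ echo '      2 102f2ac1c3266e03728476a790bd9c11  -' |
    sed 's/102f2ac1c3266e03728476a790bd9c11  -/A/'
      2 A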

Let's run the same find ... md5sum pipeline above:

$ find test -maxdepth 1 -type f -exec bash -c "md5sum < {}" \; |
    sort -k1,1 |
    uniq --count

... and pipe it through a sed process that uses the hash table above to replace the hash values with the prototype file names. The sed command on its own would be:

sed -f <(
    for file in [A-Z]; do 
        printf "s/%s/%s/\n" "$(md5sum < "$file")" "$file"; 
    done
)

So to connect it all together:

$ find test -maxdepth 1 -type f -exec bash -c "md5sum < {}" \; |
    sort -k1,1 |
    uniq --count |
    sed -f <(
        for file in [A-Z]; do 
            printf "s/%s/%s/\n" "$(md5sum < "$file")" "$file"; 
        done
    )
  2 A
  1 B
  1 D
  2 C

If you see output like this:

  2 A
  1 B
  1 5efa8621f70e1cad6aba9f8f4246b383  -
  1 D
  2 C

That means there is a file in test/ whose MD5 value doesn't match any of your files A-D. In other words, there is an E output file format out there somewhere. Once you find it (md5sum test/* | grep 5efa8621f70e1cad6aba9f8f4246b383) you can copy it to E and re-run:
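
That hunt would look something like this (a sketch; the matching filename will of course differ on your system):

$ md5sum test/* | grep 5efa8621f70e1cad6aba9f8f4246b383
5efa8621f70e1cad6aba9f8f4246b383  test/file09876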

$ cp -p test/file09876 ./E
$ find test -maxdepth 1 -type f -exec bash -c "md5sum < {}" \; |
    sort -k1,1 |
    uniq --count |
    sed -f <(
        for file in [A-Z]; do 
            printf "s/%s/%s/\n" "$(md5sum < "$file")" "$file"; 
        done
    )
  2 A
  1 B
  1 E
  1 D
  2 C