Search and delete duplicate files with different names

There is such a program, and it's called rdfind:

SYNOPSIS
   rdfind [ options ] directory1 | file1 [ directory2 | file2 ] ...

DESCRIPTION
   rdfind  finds duplicate files across and/or within several directories.
   It calculates checksum only if necessary.  rdfind  runs  in  O(Nlog(N))
   time with N being the number of files.

   If  two  (or  more) equal files are found, the program decides which of
   them is the original and the rest are considered  duplicates.  This  is
   done  by  ranking  the  files  to each other and deciding which has the
   highest rank. See section RANKING for details.

It can delete the duplicates, or replace them with symbolic or hard links.
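A minimal session might look like the following sketch (the ~/Music path is just a placeholder; -dryrun, -makesymlinks and -deleteduplicates are rdfind's true/false options, so double-check the man page for the exact behaviour you want):

    # Preview what rdfind would classify as duplicates, without changing anything
    rdfind -dryrun true ~/Music

    # Replace each duplicate with a symbolic link to the highest-ranked copy
    rdfind -makesymlinks true ~/Music

    # Or delete the duplicates outright
    rdfind -deleteduplicates true ~/Music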


Hmmph. I just developed a one-liner to list all duplicates, for a question that turned out to be a duplicate of this. How meta. Well, shame to waste it, so I'll post it, though rdfind sounds like a better solution.

This at least has the advantage of being the "real" Unix way to do it ;)

find -name '*.mp3' -print0 | xargs -0 md5sum | sort | uniq -Dw 32

Breaking the pipeline down:

find -name '*.mp3' -print0 finds all mp3 files in the subtree starting at the current directory, printing the names NUL-separated.

xargs -0 md5sum reads the NUL-separated list and computes an MD5 checksum (32 hex characters) of each file.

You know what sort does: here it puts lines with identical hashes next to each other, which is what uniq needs.

uniq -Dw 32 compares the first 32 characters of the sorted lines and prints only the ones that have the same hash.

So you end up with a list of all duplicates. You can then whittle that down manually to the ones you want to delete, remove the hashes, and pipe the list to rm.
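For instance, assuming you saved the edited list to a hypothetical dupes.txt and none of the filenames contain newlines, something along these lines would finish the job (the 35 skips the 32-character hash plus the two spaces md5sum prints before each name):

    # Keep only the filename part of each line, then delete those files
    cut -c 35- dupes.txt | tr '\n' '\0' | xargs -0 rm --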


I'm glad you got the job done with rdfind.

Next time you could also consider rmlint. It's extremely fast and offers a few different options to help determine which file is the original in each set of duplicates.
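If memory serves, a typical run looks something like this (-S picks the ranking criterion, with m meaning the oldest copy counts as the original; check rmlint --help rather than trusting my recollection of the flags):

    # Scan and rank; nothing is deleted yet -- rmlint writes a review script instead
    rmlint -S m ~/Music

    # Inspect the generated script, then run it to actually remove the duplicates
    less rmlint.sh
    ./rmlint.sh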