How to remove duplicate files using bash

You can identify duplicate files using the following command:

md5sum * | sort -k1 | uniq -w 32 -d

I'm working on Linux, where the command is md5sum, which outputs:

> md5sum *
d41d8cd98f00b204e9800998ecf8427e  file_1
d41d8cd98f00b204e9800998ecf8427e  file_10
d41d8cd98f00b204e9800998ecf8427e  file_2
d41d8cd98f00b204e9800998ecf8427e  file_3
d41d8cd98f00b204e9800998ecf8427e  file_4
d41d8cd98f00b204e9800998ecf8427e  file_5
d41d8cd98f00b204e9800998ecf8427e  file_6
d41d8cd98f00b204e9800998ecf8427e  file_7
d41d8cd98f00b204e9800998ecf8427e  file_8
d41d8cd98f00b204e9800998ecf8427e  file_9
b026324c6904b2a9cb4b88d6d61c81d1  other_file_1
31d30eea8d0968d6458e0ad0027c9f80  other_file_10
26ab0db90d72e28ad0ba1e22ee510510  other_file_2
6d7fce9fee471194aa8b5b6e47267f03  other_file_3
48a24b70a0b376535542b996af517398  other_file_4
1dcca23355272056f04fe8bf20edfce0  other_file_5
9ae0ea9e3c9c6e1b9b6252c8395efdc1  other_file_6
84bc3da1b3e33a18e8d5e1bdd7a18d7a  other_file_7
c30f7472766d25af1dc80b3ffc9a58c7  other_file_8
7c5aba41f53293b712fd86d08ed5b36e  other_file_9
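
For reference, adding sort and uniq -w 32 -d to the listing above should collapse it to one representative line per hash that occurs more than once, something like:

> md5sum * | sort -k1 | uniq -w 32 -d
d41d8cd98f00b204e9800998ecf8427e  file_1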

Now, using awk and xargs, the command would be:

md5sum * | \
sort | \
awk 'BEGIN{lasthash = ""} $1 == lasthash {print $2} {lasthash = $1}' | \
xargs rm

The awk part initializes lasthash with the empty string, which will not match any hash, and then checks for each line whether the hash in lasthash is the same as the hash (first column) of the current file (second column). If it is, the filename is printed. At the end of every line it sets lasthash to the hash of the current file (you could limit this to only happen when the hashes differ, but that is a minor optimization, especially if you do not have many matching files). The filenames awk prints are then fed to rm via xargs.
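
One caveat: xargs splits its input on whitespace, so the pipeline above will misbehave on filenames containing spaces. As a minimal sketch of the same idea that tolerates such names (assuming bash 4+ for associative arrays and GNU md5sum), you could loop over the files yourself:

#!/usr/bin/env bash
# Keep the first file seen for each hash, report (or delete) the rest.
declare -A seen
for f in *; do
    [ -f "$f" ] || continue                  # skip directories and other non-files
    hash=$(md5sum < "$f" | awk '{print $1}')
    if [ -n "${seen[$hash]}" ]; then
        echo "duplicate: $f (same content as ${seen[$hash]})"
        # rm -- "$f"                          # uncomment to actually delete
    else
        seen[$hash]="$f"
    fi
done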

You probably need to filter out directories before running md5sum *, since md5sum cannot hash a directory.
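
For example, a find-based sketch restricted to regular files in the current directory avoids the "Is a directory" warnings:

find . -maxdepth 1 -type f -exec md5sum {} + | \
sort | \
awk 'BEGIN{lasthash = ""} $1 == lasthash {print $2} {lasthash = $1}' | \
xargs rm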

Edit:

Using Marcin's method, you could also use this one:

comm -2 -3 \
  <(ls | sort) \
  <(md5sum * | \
    sort -k1 | \
    uniq -w 32 | \
    awk '{print $2}' | \
    sort) | \
xargs rm

This subtracts from the file list obtained by ls the first filename of each unique hash obtained by md5sum * | sort -k1 | uniq -w 32 | awk '{print $2}': comm -2 -3 prints only the lines that appear in the first input but not in the second, i.e. the duplicates, which are then passed to rm.
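
As a quick sanity check of what comm -2 -3 does (it keeps only the lines unique to its first input), consider:

> comm -2 -3 <(printf 'a\nb\nc\n') <(printf 'a\nc\n')
b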


I ran across fdupes as an answer to this similar question: https://superuser.com/questions/386199/how-to-remove-duplicated-files-in-a-directory

I was able to apt-get install fdupes on Ubuntu. You will definitely want to read the man page. In my case, I was able to get the desired results like so:

fdupes -qdN -r /ops/backup/

Which says "look recursively through /ops/backup and find all duplicate files: keep the first copy of any given file, and quietly remove the rest." This make it very easy to keep several dumps of an infrequent-write database.