Find duplicate files

fdupes can do this. From man fdupes:

Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.

In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.

To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.
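fdupes prints each set of duplicates as a group of paths separated by a blank line (add -S to also show the size of each set). A hypothetical run, with invented paths, just to illustrate the output format:

$ fdupes -r ~/photos
/home/user/photos/img_0042.jpg
/home/user/photos/backup/img_0042.jpg

/home/user/photos/img_0107.jpg
/home/user/photos/old/img_0107.jpg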

As asked in the comments, you can get the largest duplicates by doing the following:

fdupes -r . | {
    while IFS= read -r file; do
        [[ $file ]] && du "$file"    # blank lines separate fdupes groups; skip them
    done
} | sort -n

This will break if your filenames contain newlines.
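The same caveat applies to this variant with human-readable sizes, a minimal sketch assuming GNU du and sort (for their -h options):

fdupes -r . | {
    while IFS= read -r file; do
        [[ $file ]] && du -h "$file"
    done
} | sort -h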


Another good tool is fslint:

fslint is a toolset to find various problems with filesystems, including duplicate files and problematic filenames etc.

Individual command-line tools are available in addition to the GUI; to access them, on a standard install, you can change to, or add to $PATH, the /usr/share/fslint/fslint directory. Each of the commands in that directory has a --help option which further details its parameters.

   findup - find DUPlicate files

On Debian-based systems, you can install it with:

sudo apt-get install fslint
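Once installed, a sketch of invoking the duplicate finder directly, using the standard install path quoted above (the search path is a placeholder; adjust to your system):

/usr/share/fslint/fslint/findup /path/to/search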

You can also do this manually if you don't want to, or cannot, install third-party tools. The way most such programs work is by calculating file checksums: files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:

find / -type f -exec md5sum {} + > md5sums    # '+' batches many files per md5sum call
awk '{print $1}' md5sums | sort | uniq -d > dupes    # hashes that occur more than once
while read -r d; do echo "---"; grep "^$d " md5sums | cut -d ' ' -f 2-; done < dupes

Sample output (the file names in this example are the same, but it will also work when they are different):

$ while read -r d; do echo "---"; grep "^$d " md5sums | cut -d ' ' -f 2-; done < dupes
---
 /usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
 /usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
---
 /usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
 /usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
---
 /usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
 /usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
---

This will be much slower than the dedicated tools already mentioned, but it will work.
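Since comparing sizes is cheap and hashing is not, you can mimic the size-first pass that fdupes uses and only hash candidate files. A rough sketch assuming GNU find (for -printf); like the pipeline above, it breaks on filenames containing tabs or newlines:

find / -type f -printf '%s\t%p\n' |
  awk -F'\t' 'c[$1]++ == 1 {print s[$1]} c[$1] >= 2 {print $2} {s[$1] = $2}' |
  while IFS= read -r f; do md5sum -- "$f"; done > md5sums

The awk step prints every path whose size occurs more than once (including the first one seen), so md5sum only runs on files that could possibly have a duplicate; the dupes/grep steps above then work unchanged on the resulting md5sums file.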


Short answer: yes.

Longer version: have a look at the Wikipedia fdupes entry, which sports quite a nice list of ready-made solutions. Of course you can write your own; it's not that difficult. Standard tools like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.
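For example, with GNU tools, a one-liner along those lines (sha256 hex digests are 64 characters, so -w64 makes uniq compare on the hash alone; like the manual pipeline above, it hashes every file, so it is not fast):

find . -type f -exec sha256sum {} + | sort | uniq -w64 --all-repeated=separate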