Are there any deduplication scripts that use btrfs CoW as dedup?

I wrote bedup for this purpose. It combines incremental btree scanning with CoW-deduplication. Best used with Linux 3.6, where you can run:

sudo bedup dedup

I tried bedup. While it's good (and has some useful differentiating features that may make it the best choice for many), it seems to scan the entirety of all target files for checksums.

Which is painfully slow.

Other programs, on the other hand, such as rdfind and rmlint, scan differently.

rdfind has an "experimental" feature for using btrfs reflink. (And "solid" options for hardlinks, symlinks, etc.)

rmlint has "solid" options for btrfs clone, reflink, regular hardlinks, symlinks, delete, and your own custom commands.

But more importantly, rdfind and rmlint are significantly faster. As in, orders of magnitude. Rather than scanning all target files for checksums, they do, approximately, this:

  • Scan the whole target filesystem, gathering just paths and filesizes.
  • Remove from consideration files with unique filesizes. This alone saves scads of time and disk activity. ("Scads" is some inverse exponential function or something.)
  • Of the remaining candidates, scan the first N bytes. Remove from consideration those with the same filesize but different first N bytes.
  • Do the same for the last N bytes.
  • Only for the (usually tiny) fraction remaining, scan for checksums. (A rough sketch of this whole funnel, in code, follows this list.)
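
To make that funnel concrete, here is a rough sketch of the idea in Python. This is my own illustration, not rdfind's or rmlint's actual code; the sample size N and the choice of sha256 are arbitrary:

    import hashlib
    import os
    from collections import defaultdict

    N = 4096  # bytes to sample from each end of a file; arbitrary choice for this sketch

    def head(path):
        with open(path, "rb") as f:
            return f.read(N)

    def tail(path):
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            f.seek(max(0, size - N))
            return f.read(N)

    def sha256sum(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

    def refine(groups, key):
        # Split each candidate group by `key`, dropping anything left without a twin.
        out = []
        for group in groups:
            buckets = defaultdict(list)
            for path in group:
                buckets[key(path)].append(path)
            out.extend(g for g in buckets.values() if len(g) > 1)
        return out

    def find_duplicate_groups(root):
        # 1. Walk the tree, gathering only paths and sizes (no file contents read yet).
        by_size = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                if os.path.isfile(p) and not os.path.islink(p):
                    by_size[os.path.getsize(p)].append(p)
        # 2. A file with a unique size can't have a duplicate; drop it now.
        groups = [g for g in by_size.values() if len(g) > 1]
        # 3./4. Compare the first, then the last, N bytes of what's left.
        groups = refine(groups, head)
        groups = refine(groups, tail)
        # 5. Full checksums only for the (usually tiny) remainder.
        return refine(groups, sha256sum)

The real tools are considerably more careful (hardlinks, zero-length files, I/O errors, incremental hashing, and so on), but the staging is the same: each step only touches the files the previous step couldn't rule out.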

Other advantages of rmlint I'm aware of:

  • You can specify the checksum algorithm. md5 too scary? Try sha256. Or sha512. Or bit-for-bit comparison. Or your own hashing function.
  • It gives you the option of Btrfs "clone" and "reflink", rather than just reflink. "cp --reflink=always" is a bit risky, in that it's not atomic, it's not aware of what else is going on for that file in the kernel, and it doesn't always preserve metadata. "Clone", OTOH (which is a shorthand term...I'm blanking on the official API-related name), is a kernel-level call that is atomic and preserves metadata. The two almost always produce the same result, but clone is a tad more robust and safe. (Most programs are also smart enough not to delete the duplicate file if they can't first successfully make a temp reflink to the other--see the sketch after this list.)
  • It has a ton of options for many use-cases (which is also a drawback).
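
On the atomicity point above: my understanding (an assumption on my part, not something I've verified against rmlint's docs) is that "reflink" corresponds to the kernel's FICLONE/FICLONERANGE calls, and "clone" to the FIDEDUPERANGE call, which verifies the ranges really are identical and shares the extents in one kernel-side operation. Either way, the safe pattern is the one described above: don't touch the duplicate until a clone of the keeper has definitely succeeded. A minimal sketch of that pattern in Python, using FICLONE for brevity (the ioctl constant and the temp-file name here are assumptions for illustration):

    import fcntl
    import os

    FICLONE = 0x40049409  # assumed value of the FICLONE ioctl from <linux/fs.h> on Linux

    def reflink_replace(keep, dup):
        # Clone `keep` into a temp file next to `dup`, so the duplicate is only
        # replaced after the clone has definitely succeeded.
        tmp = dup + ".reflink-tmp"  # hypothetical temp name, for this sketch only
        with open(keep, "rb") as src, open(tmp, "wb") as dst:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())  # share extents; CoW from here on
        # A real tool would also restore dup's ownership/permissions/timestamps onto
        # tmp here (or dedupe in place instead) before swapping the files.
        os.replace(tmp, dup)  # atomic rename over the duplicate

If the ioctl fails (different filesystem, no CoW support, etc.), the duplicate is never replaced--which is exactly the "don't delete unless the reflink worked" behavior mentioned above.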

I compared rmlint with duperemove--which also blindly scans all of every target file for checksums. Duperemove took several days (4, I think) to complete on my volume, going full-tilt. rmlint took a few hours to identify duplicates, then less than a day to dedup them with Btrfs clone.

(That said, anyone making the effort to write and support quality, robust software and give it away for free, deserves major kudos!)

Btw: You should avoid using regular hardlinks as a "general" dedup solution, at all costs.

While hardlinks can be extremely handy in certain targeted use cases (e.g. for individual files, with a tool that can scan for specific file types exceeding some minimum size, or as part of many free and commercial backup/snapshot solutions), they can be disastrous for "deduplication" on a large general-use filesystem. The reason is that most users have thousands of files on their filesystem that are binary-identical but functionally completely different.

For example, many programs generate template and/or hidden settings files (sometimes in every single folder they can see) that are initially identical--and most remain so until you, the user, need them not to be.

As a specific illustration: Photo thumbnail cache files, which countless programs generate in the folder containing the photos (and for good reason--portability), can take hours or days to generate but then make using a photo app a breeze. If those initial cache files are all hardlinked together, and you later open the app on one directory and it rebuilds a large cache...then guess what: EVERY folder with a previously hardlinked cache now has the wrong cache. Potentially with disastrous consequences, up to and including accidental data destruction. And potentially in a way that explodes a backup solution that isn't hardlink-aware.

Furthermore, hardlinking can ruin entire snapshots. The whole point of snapshots is that the "live" version can continue to change, with the ability to roll back to a previous state. If everything is hardlinked together, though...you "roll back" to the same thing.

The good news, though, is that deduping with Btrfs clone/reflink can undo that damage (I think--since during the scan, it should see hardlinked files as identical...unless it has logic to not consider hardlinks. It probably depends on the specific utility doing the deduping.)
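
For what it's worth, the "logic to not consider hardlinks" usually comes down to an inode check: two paths are hardlinks of the same file if and only if they share the same device and inode numbers. A tiny sketch of that test (my own illustration; the helper name is hypothetical):

    import os

    def same_inode(a, b):
        # Two paths are hardlinks of one another iff they are on the same device
        # and share an inode number.
        sa, sb = os.stat(a), os.stat(b)
        return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)

A deduper that wants to leave hardlinks alone prunes such pairs during the scan; one that wants to undo the damage instead treats them as candidates to split into independent, reflinked copies.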