Can the "find" command work more efficiently to delete many files?

The reason why the find command is slow

That is a really interesting issue... or, honestly, malicious:

The command

find . -mindepth 2 -mtime +5 -print -delete

is very different from the usual tryout variant that leaves out the dangerous part, -delete:

find . -mindepth 2 -mtime +5 -print

The tricky part is that the action -delete implies the option -depth. The command including -delete is really

find . -depth -mindepth 2 -mtime +5 -print -delete

and should be tested with

find . -depth -mindepth 2 -mtime +5 -print

That is closely related to the symptoms you see: the option -depth changes the traversal of the filesystem tree from a pre-order depth-first search to a post-order depth-first search.
Before, each file or directory was handled as soon as it was reached, and then forgotten about; find was using the tree itself to find its way. With -depth, find needs to collect all directories that could still contain files or directories to be processed, before deleting the files in the deepest directories first. For this, it has to do the work of planning and remembering the traversal steps itself, and - that's the point - in a different order than the filesystem tree naturally supports. So, indeed, it needs to collect data over many files before it can do the first piece of output work.
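To see the difference in traversal order, you can try it on a small throwaway tree (/tmp/demo is just an example path):

mkdir -p /tmp/demo/a/b
touch /tmp/demo/a/f1 /tmp/demo/a/b/f2
find /tmp/demo -print            # pre-order: each directory is printed before its contents
find /tmp/demo -depth -print     # post-order: contents are printed before their directory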

Find has to keep track of the directories it still needs to visit later, which is not a problem for a few directories, but it can become one with many directories, for various degrees of "many".
Also, performance problems outside of find become noticeable in this kind of situation, so it is possible that it is not even find that is slow, but something else.

The performance and memory impact of that depends on your directory structure etc.


The relevant sections from man find:

See the "Warnings":

ACTIONS
    -delete
           Delete  files;  true if removal succeeded.  If the removal failed,
           an error message is issued.  If -delete fails, find's exit  status
           will  be nonzero (when it eventually exits).  Use of -delete auto‐
           matically turns on the -depth option.

           Warnings: Don't forget that the find command line is evaluated  as
           an  expression,  so  putting  -delete  first will make find try to
           delete everything below the starting points you  specified.   When
           testing  a  find  command  line  that you later intend to use with
           -delete, you should explicitly specify -depth in  order  to  avoid
           later  surprises.  Because -delete implies -depth, you cannot use‐
           fully use -prune and -delete together.
    [ ... ]

And, from a section further up:

 OPTIONS
    [ ... ]
    -depth Process each directory's contents  before  the  directory  itself.
           The -delete action also implies -depth.


The faster solution to delete the files

You do not really need to delete the directories in the same run as the files, right? If we are not deleting directories, we do not need the whole -depth mechanism; we can just find a file, delete it, and go on to the next one, as you proposed.

This time we can use the simple print variant for testing the find command, with the implicit -print.

We want to find only plain files, no symlinks, directories, special files etc:

find . -mindepth 2 -mtime +5 -type f

We use xargs to delete more than one file per rm process started, taking care of odd filenames by using a null byte as separator:

Testing this command - note the echo in front of the rm, so it prints what will be run later:

find . -mindepth 2 -mtime +5 -type f -print0 | xargs -0 echo rm

The lines will be very long and hard to read; for an initial test it can help to get readable output with only three files per line, by adding -n 3 as the first arguments of xargs.
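For example:

find . -mindepth 2 -mtime +5 -type f -print0 | xargs -0 -n 3 echo rm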

If all looks good, remove the echo in front of the rm and run again.
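Without the echo, the full command is:

find . -mindepth 2 -mtime +5 -type f -print0 | xargs -0 rm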

That should be a lot faster.


In case we are talking about millions of files - you wrote it's 600 million files in total - there is something more to take into account:

Most programs, including find, read directories using the library call readdir(3). That usually uses a buffer of 32 KB to read directories; this becomes a problem when the directories are big, containing huge lists of possibly long filenames.
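If you want to see the buffer size in use on your system, strace can show the underlying getdents64 calls (the syscall may be called getdents on other platforms, and /some/dir is just a placeholder):

strace -e trace=getdents64 find /some/dir -maxdepth 1 > /dev/null

The last argument shown for each getdents64 call is the buffer size, commonly 32768 bytes with glibc.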

The way to work around that is to use the system call for reading directory entries, getdents(2), directly, and to handle the buffering in a more suitable way.

For details, see "You can list a directory containing 8 million files! But not with ls.".


(It would be interesting if you could add details to your question on the typical numbers of files per directory, directories per directory, and the maximum depth of paths; also, which filesystem is used.)

(If it is still slow, you should check for filesystem performance problems.)
