Why is looping over find's output bad practice?

The simple answer is:

Because filenames can contain any character.

Therefore, there is no printable character you can reliably use to delimit filenames.


Newlines are often used (incorrectly) to delimit filenames, because it is unusual to include newline characters in filenames.

However, if you build your software around arbitrary assumptions, you at best simply fail to handle unusual cases, and at worst open yourself up to malicious exploits that give away control of your system. So it's a question of robustness and safety.

If you can write software in two different ways, and one of them handles edge cases (unusual inputs) correctly, but the other one is easier to read, you might argue that there is a tradeoff. (I wouldn't. I prefer correct code.)

However, if the correct, robust version of the code is also easy to read, there is no excuse for writing code that fails on edge cases. This is the case with find and the need to run a command on each file found.


Let's be more specific: On a UNIX or Linux system, filenames may contain any character except for a / (which is used as a path component separator), and they may not contain a null byte.

A null byte is therefore the only correct way to delimit filenames.
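
Here is a minimal demonstration of the problem (run in an empty scratch directory; the filename is contrived for illustration):

touch "$(printf 'two\nthree')"   # one file, with a newline in its name
find . -print | while IFS= read -r f; do printf '<%s>\n' "$f"; done

This prints <.>, <./two> and <three>: the single file appears as two entries, and nothing downstream of find can tell the difference.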


Since GNU find includes a -print0 primary which will use a null byte to delimit the filenames it prints, GNU find can safely be used with GNU xargs and its -0 flag (and -r flag) to handle the output of find:

find ... -print0 | xargs -r0 ...

However, there is no good reason to use this form, because:

  1. It adds a dependency on GNU findutils which doesn't need to be there, and
  2. find is designed to be able to run commands on the files it finds.

Also, GNU xargs requires -0 and -r, whereas FreeBSD xargs only requires -0 (and has no -r option), and some xargs implementations don't support -0 at all. So it's best to just stick to POSIX features of find (see next section) and skip xargs.

As for point 2—find's ability to run commands on the files it finds—I think Mike Loukides said it best:

find's business is evaluating expressions -- not locating files. Yes, find certainly locates files; but that's really just a side effect.

--Unix Power Tools


POSIX-specified uses of find

What's the proper way to run one or more commands for each of find's results?

To run a single command for each file found, use:

find dirname ... -exec somecommand {} \;

To run multiple commands in sequence for each file found, where the second command should only be run if the first command succeeds, use:

find dirname ... -exec somecommand {} \; -exec someothercommand {} \;

To run a single command on multiple files at once:

find dirname ... -exec somecommand {} +

find in combination with sh

If you need to use shell features in the command, such as redirecting the output or stripping an extension off the filename or something similar, you can make use of the sh -c construct. You should know a few things about this:

  • Never embed {} directly in the sh code. This allows for arbitrary code execution from maliciously crafted filenames (see the sketch after this list). Also, it's actually not even specified by POSIX that it will work at all. (See next point.)

  • Don't use {} multiple times, or use it as part of a longer argument. This isn't portable. For example, don't do this:

    find ... -exec cp {} somedir/{}.bak \;

    To quote the POSIX specifications for find:

    If a utility_name or argument string contains the two characters "{}", but not just the two characters "{}", it is implementation-defined whether find replaces those two characters or uses the string without change.

    ... If more than one argument containing the two characters "{}" is present, the behavior is unspecified.

  • The arguments following the shell command string passed to the -c option are set to the shell's positional parameters, starting with $0. Not starting with $1.

    For this reason, it's good to include a "dummy" $0 value, such as find-sh, which will be used for error reporting from within the spawned shell. Also, this allows use of constructs such as "$@" when passing multiple files to the shell, whereas omitting a value for $0 would mean the first file passed would be set to $0 and thus not included in "$@".
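
To illustrate the first bullet above, here is a sketch of the unsafe embedding (somecommandwith stands in for whatever you want to run; a file named $(reboot) is a hypothetical malicious example):

find . -exec sh -c 'somecommandwith {}' \;   # UNSAFE: never do this

With implementations that substitute {} inside a larger string, that file name turns the inline code into somecommandwith ./$(reboot), and the command substitution is executed. The safe forms, which pass the filename to the shell as data in a positional parameter, follow below.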


To run a single shell command per file, use:

find dirname ... -exec sh -c 'somecommandwith "$1"' find-sh {} \;

However, it will usually give better performance to handle the files in a shell loop, so that you don't spawn a shell for every single file found:

find dirname ... -exec sh -c 'for f do somecommandwith "$f"; done' find-sh {} +

(Note that for f do is equivalent to for f in "$@"; do and handles each of the positional parameters in turn—in other words, it uses each of the files found by find, regardless of any special characters in their names.)


Further examples of correct find usage:


  • Filter files generated by `find` by parsed output of `file` command
  • substring removal in find -exec
  • How to Do this List Comparison with Find?
  • Using literal empty curly braces {} inside sed command from find -exec
  • How do I delete file by filename that are set as dates?
  • bash: Deleting directories not containing given strings
  • Grep word within a file then copy the file
  • Remove certain types of files except in a folder

The problem

for f in $(find .)

combines two incompatible things.

find prints a list of file paths delimited by newline characters, while the split+glob operator invoked when you leave that $(find .) unquoted in that list context splits it on the characters of $IFS (which by default includes space, tab and newline, plus NUL in zsh) and performs globbing on each resulting word (except in zsh), and even brace expansion in ksh93 or pdksh derivatives!
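
A quick demonstration of that split+glob, assuming an empty scratch directory containing two deliberately awkward names:

touch 'foo bar' '*'   # one name with a space, one literally named *
for f in $(find .); do printf '<%s>\n' "$f"; done

Here 'foo bar' is split into <./foo> and <bar>, and the literal * is glob-expanded against the directory contents.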

Even if you make it:

IFS='
' # split on newline only
set -o noglob # disable glob (also disables brace expansion in pdksh
              # but not ksh93)
for f in $(find .) # invoke split+glob

That's still wrong, as the newline character is as valid as any in a file path. The output of find -print is simply not post-processable reliably (except with some convoluted tricks).

That also means the shell needs to store the output of find fully, and then split+glob it (which implies storing that output a second time in memory) before starting to loop over the files.

Note that find . | xargs cmd has similar problems (there, blanks, newlines, single quotes, double quotes and backslashes (and, with some xargs implementations, bytes not forming part of valid characters) are a problem).
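
For instance, a single quote in a file name is enough to break it (a sketch; the error message shown is GNU xargs', other implementations vary):

touch "it's.txt"
find . -name '*.txt' | xargs echo
# xargs: unmatched single quote; by default quotes are special to
# xargs unless you use the -0 option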

More correct alternatives

The only way to use a for loop on the output of find would be to use zsh, which supports IFS=$'\0':

IFS=$'\0'
for f in $(find . -print0)

(replace -print0 with -exec printf '%s\0' {} + for find implementations that don't support the non-standard (but quite common nowadays) -print0).

Here, the correct and portable way is to use -exec:

find . -exec something with {} \;

Or if something can take more than one argument:

find . -exec something with {} +

If you do need that list of files to be handled by a shell:

find . -exec sh -c '
  for file do
    something < "$file"
  done' find-sh {} +

(beware it may start more than one sh).

On some systems, you can use:

find . -print0 | xargs -r0 something with

though that has little advantage over the standard syntax and means something's stdin is either the pipe or /dev/null.

One reason you may want to use that form is the -P option of GNU xargs for parallel processing. The stdin issue can also be worked around with GNU xargs' -a option, in shells supporting process substitution:

xargs -r0n 20 -P 4 -a <(find . -print0) something

for instance, to run up to 4 concurrent invocations of something each taking 20 file arguments.

With zsh or bash, another way to loop over the output of find -print0 is with:

while IFS= read -rd '' file <&3; do
  something "$file" 3<&-
done 3< <(find . -print0)

read -d '' reads NUL delimited records instead of newline delimited ones.

bash-4.4 and above can also store files returned by find -print0 in an array with:

readarray -td '' files < <(find . -print0)

The zsh equivalent (which has the advantage of preserving find's exit status):

files=(${(0)"$(find . -print0)"})

With zsh, you can translate most find expressions to a combination of recursive globbing with glob qualifiers. For instance, looping over find . -name '*.txt' -type f -mtime -1 would be:

for file (./**/*.txt(ND.m-1)) cmd $file

Or

for file (**/*.txt(ND.m-1)) cmd -- $file

(beware of the need for --: with **/*, file paths do not start with ./, so they may start with - for instance).

ksh93 and bash eventually added support for **/ (though not more advanced forms of recursive globbing), but still not the glob qualifiers, which makes the use of ** very limited there. Also beware that bash prior to 4.3 follows symlinks when descending the directory tree.
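
For reference, a rough bash (4.3+) sketch of the last zsh loop above: the glob qualifiers have to become tests inside the loop, and the m-1 check needs a reference file (GNU touch assumed for -d):

shopt -s globstar dotglob nullglob            # **/, include dot files
ref=$(mktemp) && touch -d '1 day ago' "$ref"  # reference for the m-1 check
for file in ./**/*.txt; do
  [ -f "$file" ] && [ ! -L "$file" ] &&       # the "." qualifier (regular files)
    [ "$file" -nt "$ref" ] &&                 # the "m-1" qualifier
    cmd "$file"
done
rm -f "$ref"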

As with looping over $(find .), that also means storing the whole list of files in memory [1]. That may be desirable though in some cases when you don't want your actions on the files to have an influence on the finding of files (like when you add more files that could end up being found themselves).

Other reliability/security considerations

Race conditions

Now, if we're talking of reliability, we have to mention the race conditions between the time find/zsh finds a file and checks that it meets the criteria and the time it is being used (TOCTOU race).

Even when descending a directory tree, one has to make sure not to follow symlinks and to do that without a TOCTOU race. find (GNU find at least) does that by opening the directories using openat() with the right O_NOFOLLOW flags (where supported) and keeping a file descriptor open for each directory; zsh/bash/ksh don't do that. So in the face of an attacker being able to replace a directory with a symlink at the right time, you could end up descending the wrong directory.

Even if find does descend the directory properly, with -exec cmd {} \; and even more so with -exec cmd {} +, once cmd is executed, for instance as cmd ./foo/bar or cmd ./foo/bar ./foo/bar/baz, by the time cmd makes use of ./foo/bar, the attributes of bar may no longer meet the criteria matched by find, but even worse, ./foo may have been replaced by a symlink to some other place (and the race window is made a lot bigger with -exec {} + where find waits to have enough files to call cmd).

Some find implementations have a (not yet standard) -execdir predicate to alleviate the second problem.

With:

find . -execdir cmd -- {} \;

find chdir()s into the parent directory of the file before running cmd. Instead of calling cmd -- ./foo/bar, it calls cmd -- ./bar (cmd -- bar with some implementations, hence the --), so the problem with ./foo being changed to a symlink is avoided. That makes using commands like rm safer (it could still remove a different file, but not a file in a different directory), but not commands that may modify the files unless they've been designed to not follow symlinks.

-execdir cmd -- {} + sometimes also works but with several implementations including some versions of GNU find, it is equivalent to -execdir cmd -- {} \;.

-execdir also has the benefit of working around some of the problems associated with too deep directory trees.

In:

find . -exec cmd {} \;

the size of the path given to cmd will grow with the depth of the directory the file is in. If that size gets bigger than PATH_MAX (something like 4k on Linux), then any system call that cmd does on that path will fail with an ENAMETOOLONG error.

With -execdir, only the file name (possibly prefixed with ./) is passed to cmd. File names themselves on most file systems have a much lower limit (NAME_MAX) than PATH_MAX, so the ENAMETOOLONG error is less likely to be encountered.
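
Both limits can be queried with getconf (the values vary with the system and file system):

getconf PATH_MAX /    # e.g. 4096 on Linux
getconf NAME_MAX /    # e.g. 255 on most file systems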

Bytes vs characters

Also often overlooked when considering security around find, and when handling file names in general, is the fact that on most Unix-like systems, file names are sequences of bytes (any byte value but 0 may appear in a file path, and on most systems (ASCII-based ones; we'll ignore the rare EBCDIC-based ones for now) 0x2f is the path delimiter).

It's up to the applications to decide if they want to consider those bytes as text. They generally do, but the translation from bytes to characters is generally done based on the user's locale, as derived from the environment.

What that means is that a given file name may have different text representations depending on the locale. For instance, the byte sequence 63 f4 74 e9 2e 74 78 74 would be côté.txt for an application interpreting that file name in a locale where the character set is ISO-8859-1, and cєtщ.txt in a locale where the charset is ISO-8859-5 instead.

Worse: in a locale where the charset is UTF-8 (the norm nowadays), 63 f4 74 e9 2e 74 78 74 simply couldn't be mapped to characters at all!
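
You can see all three readings of those same bytes with iconv (a sketch, assuming a terminal in a UTF-8 locale):

printf 'c\364t\351.txt\n' | iconv -f ISO-8859-1   # côté.txt
printf 'c\364t\351.txt\n' | iconv -f ISO-8859-5   # cєtщ.txt
printf 'c\364t\351.txt\n' | iconv -f UTF-8        # iconv: illegal input sequence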

find is one such application that considers file names as text for its -name/-path predicates (and more, like -iname or -regex with some implementations).

What that means is that, for instance, with several find implementations (including GNU find),

find . -name '*.txt'

would not find our 63 f4 74 e9 2e 74 78 74 file above when called in a UTF-8 locale as * (which matches 0 or more characters, not bytes) could not match those non-characters.

LC_ALL=C find... would work around the problem as the C locale implies one byte per character and (generally) guarantees that all byte values map to a character (albeit possibly undefined ones for some byte values).
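
A sketch of that workaround, using the file from above:

touch "$(printf 'c\364t\351.txt')"   # the ISO-8859-1 encoded name
find . -name '*.txt'                 # may miss it in a UTF-8 locale
LC_ALL=C find . -name '*.txt'        # matches: one byte per character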

Now when it comes to looping over those file names from a shell, that byte vs character distinction can also become a problem. We typically see 4 main types of shells in that regard:

  1. The ones that are still not multi-byte aware, like dash. For them, a byte maps to a character. For instance, in UTF-8, côté is 4 characters, but 6 bytes. In a locale where UTF-8 is the charset, in

    find . -name '????' -exec dash -c '
      name=${1##*/}; echo "${#name}"' sh {} \;
    

    find will successfully find the files whose name consists of 4 characters encoded in UTF-8, but dash would report lengths ranging between 4 and 24.

  2. yash: the opposite. It only deals with characters. All the input it takes is internally translated to characters. It makes for the most consistent shell, but it also means it cannot cope with arbitrary byte sequences (those that don't translate to valid characters). Even in the C locale, it can't cope with byte values above 0x7f.

    find . -exec yash -c 'echo "$1"' sh {} \;
    

    in a UTF-8 locale will fail on our ISO-8859-1 côté.txt from earlier for instance.

  3. Those like bash or zsh where the multi-byte support has been progressively added. Those will fall back to considering bytes that can't be mapped to characters as if they were characters. They still have a few bugs here and there especially with less common multi-byte charsets like GBK or BIG5-HKSCS (those being quite nasty as many of their multi-byte characters contain bytes in the 0-127 range (like the ASCII characters)).

  4. Those like the sh of FreeBSD (11 at least) or mksh -o utf8-mode that support multi-byte characters, but only in UTF-8.

Notes

[1] For completeness, we could mention a hacky way in zsh to loop over files using recursive globbing without storing the whole list in memory:

process() {
  something with $REPLY
  false
}
: **/*(ND.m-1+process)

+cmd is a glob qualifier that calls cmd (typically a function) with the current file path in $REPLY. The function returns true or false to decide if the file should be selected (and may also modify $REPLY or return several files in a $reply array). Here we do the processing in that function and return false so the file is not selected.


This answer is for very large result sets and concerns performance mainly, for example when getting a list of files over a slow network. For small numbers of files (say a few hundred, or maybe even a thousand, on a local disk), most of this is moot.

Parallelism and memory usage

Aside from the other answers given, related to separation problems and such, there is another issue with

for file in `find . -type f -name ...`; do smth with ${file}; done

The part inside the backticks has to be evaluated fully first, before being split on the linebreaks. This means that if you get a huge number of files, it may choke on whatever size limits are there in the various components, or you may run out of memory if there are no limits; and in any case you have to wait until the whole list has been output by find and then parsed by for before even running your first smth.

The preferred Unix way is to work with pipes, which inherently run in parallel, and which also do not need arbitrarily huge buffers in general. That means: you would much prefer the find to run in parallel to your smth, and only keep the current file name in RAM while it hands that off to smth.

One at least partly OKish solution for that is the aforementioned find -exec smth. It removes the need to keep all the file names in memory and runs nicely in parallel. Unfortunately, it also starts one smth process per file. If smth can only work on one file, then that's the way it has to be.

If at all possible, the optimal solution would be find -print0 | smth, with smth being able to process file names on its STDIN. Then you only have one smth process no matter how many files there are, and you need to buffer only a small amount of bytes (whatever intrinsic pipe buffering is going on) between the two processes. Of course, this is rather unrealistic if smth is a standard Unix/POSIX command, but might be an approach if you are writing it yourself.
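
As a minimal sketch of such a self-written smth, assuming bash (read -d '' is not POSIX):

#! /bin/bash -
# smth: read NUL-delimited file names from stdin; one process handles them all
while IFS= read -rd '' file; do
  printf 'processing %s\n' "$file"   # real work goes here
done

which you would then invoke as find . -print0 | ./smth.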

If that is not possible, then find -print0 | xargs -0 smth is, likely, one of the better solutions. As @dave_thompson_085 mentioned in the comments, xargs does split up the arguments across multiple runs of smth when system limits are reached (by default, in the range of 128 KB or whatever limit is imposed by exec on the system), and has options to influence how many files are given to one call of smth, hence finding a balance between number of smth processes and initial delay.
