Limit number of matches of find command

As you're not using find for much more than walking the directory tree, I'd suggest using the shell directly to do this instead. See variations for both zsh and bash below.


Using the zsh shell

mv ./**/*(-.D[1,1000]) /path/to/collection1    # move first 1000 files
mv ./**/*(-.D[1,1000]) /path/to/collection2    # move next 1000 files

The globbing pattern ./**/*(-.D[1,1000]) would match all regular files (or symbolic links to such files) in or under the current directory, and then return the 1000 first of these. The -. restricts the match to regular files or symbolic links to these, while D acts like dotglob in bash (matches hidden names).

This assumes that the command would not grow too long (beyond the system's argument-length limit) when the globbing pattern is expanded in the call to mv.

The above is quite inefficient, as it rescans the directory tree and re-expands the glob for each collection. You may therefore want to store the pathnames in an array and then move slices of that:

pathnames=( ./**/*(-.D) )

mv $pathnames[1,1000]    /path/to/collection1
mv $pathnames[1001,2000] /path/to/collection2

To randomise the pathnames array when you create it (you mentioned wanting to move random files):

pathnames=( ./**/*(-.Doe['REPLY=$RANDOM']) )

You could do a similar thing in bash (except that you can't easily shuffle the result of a glob match in bash, apart from possibly feeding the results through shuf, so I'll skip that bit):

shopt -s globstar dotglob nullglob

pathnames=()
for pathname in ./**/*; do
    [[ -f $pathname ]] && pathnames+=( "$pathname" )
done

mv "${pathnames[@]:0:1000}"    /path/to/collection1
mv "${pathnames[@]:1000:1000}" /path/to/collection2
mv "${pathnames[@]:2000:1000}" /path/to/collection3
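If the number of collections isn't fixed, the slicing can be generalised with a loop. A minimal sketch (bash), run here in a throwaway directory with made-up file and directory names so it is self-contained:

```shell
#!/usr/bin/env bash
# Demo: move files in batches into numbered collection directories.
# The sandbox, filenames, and batch size are illustrative only.
work=$(mktemp -d); cd "$work" || exit 1
mkdir -p src/sub
touch src/a src/b src/.hidden src/sub/c src/sub/d   # 5 sample files

shopt -s globstar dotglob nullglob

pathnames=()
for pathname in src/**/*; do
    [[ -f $pathname ]] && pathnames+=( "$pathname" )
done

batch=2   # use 1000 for the original question
n=1
for (( i = 0; i < ${#pathnames[@]}; i += batch, n++ )); do
    mkdir -p "collection$n"
    mv "${pathnames[@]:i:batch}" "collection$n"
done
```

With 5 files and a batch size of 2, this creates collection1 and collection2 with two files each and collection3 with one.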

You can implement new tests for find using -exec:

seq 1 1000 |
find . -exec read \; -exec mv -t /path/to/collection1 {} +

will move the first 1000 files found to /path/to/collection1. (With the + form of -exec, {} must come immediately before the +, so GNU mv's -t option is used to name the destination directory first.)

This works as follows:

  • seq 1 1000 outputs 1000 lines, piped into find;
  • -exec read consumes one line, failing once seq's output has been exhausted (end of input);
  • if the previous -exec succeeds, -exec mv ... performs the move.

-exec ... + works as you’d expect: read will run once per iteration, but find will accumulate matched files and call mv as few times as possible.

This relies on the fact that find’s -exec succeeds or fails based on the executed command’s exit status: when read succeeds, find continues processing the actions given above (because the default operator is “and”), and when it fails, find stops.

If your find supports the -quit action, you can use that to improve the efficiency:

seq 1 1000 |
find . \( -exec read \; -o -quit \) -exec mv -t /path/to/collection1 {} +

Without that, find will test every single file, even though it will only keep 1000 for mv.

I’m assuming that read is available as an external command, and implements the POSIX specification for read; if that’s not the case, sh -c read can be used instead. In both cases, find will start a separate process for each file it checks.
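As a small self-contained illustration of the technique, using sh -c 'read line' as the portable fallback mentioned above and GNU mv's -t option (the directory names below are made up for the demo):

```shell
#!/usr/bin/env bash
# Demo of the read-counting trick in a throwaway directory.
work=$(mktemp -d); cd "$work" || exit 1
mkdir -p src dest
touch src/f1 src/f2 src/f3 src/f4 src/f5   # 5 sample files

# Move at most 3 of them: each matched file consumes one line of seq's
# output via 'read'; once those 3 lines are gone, read fails and the
# mv action no longer runs for subsequent files.
seq 1 3 |
find src -type f -exec sh -c 'read line' \; -exec mv -t dest {} +
```

Which 3 of the 5 files move depends on find's traversal order, but the counts are deterministic: 3 files end up in dest and 2 remain in src.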


I don't think it can be done with just find. You can use something like:

find [... your parameters ...] -print0 | head -z -n 1000 | xargs -0 mv -t /path/to/collection

-print0, -z, and -0 make every stage of the pipeline NUL-delimited, so everything works even with newlines in filenames. (Note that -z requires GNU head, and -t requires GNU mv.)
