Speed up copying 1000000 small files

Assuming that

  • entries returned by readdir are not sorted by inode number
  • reading files in inode order reduces the number of seek operations
  • the contents of most files are in the initial 8k allocation (an ext4 optimization), which should also yield fewer seek operations

you can try to speed things up by copying the files in inode order.

That means using something like this:

$ cd /mnt/src
$ ls -U -i | sort -k1,1 -n | cut -d' ' -f2- > ~/clist
$ xargs cp -t /mnt2/dst < ~/clist

GNU tar - in the pax tradition - handles hardlinks on its own.

cd "$srcdir" ; tar --hard-dereference -cf - ./* |
    tar -C"${tgtdir}" -vxf -

That way you have only the two tar processes, and you don't need to invoke cp over and over again.


In a similar vein to @maxschlepzig's answer, you can parse the output of filefrag to sort the files in the order in which their first fragments appear on disk:

find . -maxdepth 1 -type f |
  xargs -d'\n' filefrag -v |
  sed -n '
    /^   0:        0../ {
      s/^.\{28\}\([0-9][0-9]*\).*/\1/
      h
      }
    / found$/ {
      s/:[^:]*$//
      H
      g
      s/\n/ /p
      }' |
  sort -nk 1,1 |
  cut -d' ' -f 2- |
  cpio -p dest_dir

Mileage may vary with the above sed script, so be sure to test thoroughly.

Otherwise, whatever you do, filefrag (part of e2fsprogs) will be much faster to use than hdparm, as it can take multiple file arguments; just the cost of invoking hdparm 1,000,000 times would add a lot of overhead on its own.

Also, it probably wouldn't be too difficult to write a perl script (or C program) that does a FIEMAP ioctl for each file, builds a sorted array of the blocks to be copied together with the files they belong to, and then copies everything in order by reading the corresponding block-sized chunks from each file (be careful not to run out of file descriptors, though).
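
To illustrate the C route, here is a minimal sketch (the name first-extent.c is made up for this example, and it is Linux-specific). It only does the FIEMAP ioctl per file and prints the physical offset of each file's first extent, leaving sorting and copying to the usual tools rather than building the full per-block copy plan described above:

/*
 * first-extent.c (hypothetical name) - for each file argument, ask the
 * kernel for its extent map via the FIEMAP ioctl and print
 * "<physical byte offset of first extent> <path>", one file per line.
 * This covers only the mapping half of the idea above; sorting and the
 * actual copy are left to sort(1) and cpio(1).
 *
 * Build on Linux with e.g.:  cc -O2 -o first-extent first-extent.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s file...\n", argv[0]);
        return 1;
    }

    for (int i = 1; i < argc; i++) {
        int fd = open(argv[i], O_RDONLY);
        if (fd < 0) {
            perror(argv[i]);
            continue;
        }

        /* Header plus room for a single extent record. */
        struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
        if (!fm) {
            perror("calloc");
            close(fd);
            return 1;
        }
        fm->fm_start = 0;
        fm->fm_length = FIEMAP_MAX_OFFSET;  /* map the whole file */
        fm->fm_flags = 0;                   /* or FIEMAP_FLAG_SYNC to flush
                                               delayed allocation first */
        fm->fm_extent_count = 1;            /* only the first extent is needed */

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0 || fm->fm_mapped_extents == 0) {
            /* Empty file, inline data or unsupported filesystem:
             * emit offset 0 so the file still makes it into the list. */
            printf("0 %s\n", argv[i]);
        } else {
            printf("%llu %s\n",
                   (unsigned long long)fm->fm_extents[0].fe_physical, argv[i]);
        }

        free(fm);
        close(fd);
    }
    return 0;
}

You could then do something like

find . -maxdepth 1 -type f -exec ./first-extent {} + |
  sort -n | cut -d' ' -f 2- | cpio -p dest_dir

with the same caveats about awkward filenames (newlines in particular) as the pipelines above.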