Parallelise rsync using GNU Parallel

The following steps did the job for me:

  1. Run rsync with --dry-run first to get the list of files that would be affected:
$ rsync -avzm --stats --safe-links --ignore-existing --dry-run \
    --human-readable /data/projects REMOTE-HOST:/data/ > /tmp/transfer.log
  2. Then I fed the contents of /tmp/transfer.log to parallel in order to run 5 rsyncs in parallel, as follows:
$ cat /tmp/transfer.log | \
    parallel --will-cite -j 5 rsync -avzm --relative \
      --stats --safe-links --ignore-existing \
      --human-readable {} REMOTE-HOST:/data/ > result.log

Here, the --relative option ensured that the directory structure for the affected files, at the source and destination, remains the same (inside the /data/ directory), so the command must be run in the source folder (in this example, /data/projects).
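One caveat: the dry-run output also contains rsync's header line and the --stats summary, which would be handed to parallel as bogus paths. A minimal sketch that keeps only the file-list lines (here, paths under projects/); the grep pattern is an assumption and may need adjusting for your rsync version's output:

$ grep '^projects/' /tmp/transfer.log | \
    parallel --will-cite -j 5 rsync -avzm --relative \
      --stats --safe-links --ignore-existing \
      --human-readable {} REMOTE-HOST:/data/ > result.log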


I personally use this simple one:

\ls -1 | parallel rsync -a {} /destination/directory/

This is only useful when you have more than a few non-nearly-empty directories; otherwise almost every rsync will terminate early and the last one will do all the work alone.

Note the backslash before ls, which bypasses any shell alias (e.g. ls --color=auto), ensuring that the output is plain file names.
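If file or directory names may contain spaces or other unusual characters, a null-delimited variant is safer; a minimal sketch, assuming GNU find and GNU parallel:

find . -mindepth 1 -maxdepth 1 -print0 | parallel -0 rsync -a {} /destination/directory/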


I would strongly discourage anybody from using the accepted answer; a better solution is to crawl the top-level directory and launch a proportional number of rsync operations.

I have a large zfs volume and my source was a cifs mount. Both are linked at 10G, and in some benchmarks can saturate the link. Performance was evaluated using zpool iostat 1.

The source drive was mounted like:

mount -t cifs -o username=,password=,vers=3.0 //static_ip/70tb /mnt/Datahoarder_Mount/

Using a single rsync process:

rsync -h -v -r -P -t /mnt/Datahoarder_Mount/ /StoragePod

the io meter reads (columns: alloc, free, read/write ops, read/write bandwidth):

StoragePod  30.0T   144T      0  1.61K      0   130M
StoragePod  30.0T   144T      0  1.61K      0   130M
StoragePod  30.0T   144T      0  1.62K      0   130M

In synthetic benchmarks (CrystalDiskMark), sequential write performance approaches 900 MB/s, which means the link is saturated. 130 MB/s is not very good, and it is the difference between waiting a weekend and waiting two weeks.

So, I built the file list and tried to run the sync again (I have a 64-core machine):

cat /home/misha/Desktop/rsync_logs_syncs/Datahoarder_Mount.log | \
    parallel --will-cite -j 16 rsync -avzm --relative \
      --stats --safe-links --size-only --human-readable \
      {} /StoragePod/ \
      > /home/misha/Desktop/rsync_logs_syncs/Datahoarder_Mount_result.log

and it had the same performance!

StoragePod  29.9T   144T      0  1.63K      0   130M
StoragePod  29.9T   144T      0  1.62K      0   130M
StoragePod  29.9T   144T      0  1.56K      0   129M
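For reference, the file list can be produced the same way as in the accepted answer, with a dry run executed from the mount point so the logged paths stay relative (a sketch; the exact flags used here are an assumption):

cd /mnt/Datahoarder_Mount
rsync -avzm --stats --safe-links --size-only --dry-run \
    --human-readable Mikhail /StoragePod/ \
    > /home/misha/Desktop/rsync_logs_syncs/Datahoarder_Mount.log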

As an alternative, I simply ran rsync on the top-level folders:

rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/Marcello_zinc_bone /StoragePod/Marcello_zinc_bone
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/fibroblast_growth /StoragePod/fibroblast_growth
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/QDIC /StoragePod/QDIC
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/sexy_dps_cell /StoragePod/sexy_dps_cell

This actually boosted performance:

StoragePod  30.1T   144T     13  3.66K   112K   343M
StoragePod  30.1T   144T     24  5.11K   184K   469M
StoragePod  30.1T   144T     25  4.30K   196K   373M

In conclusion, as @Sandip Bhattacharya brought up, write a small script to get the directories and run rsync on them in parallel. Alternatively, pass a file list to rsync. But don't create new rsync instances for each file.
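A minimal sketch of such a script, assuming GNU find and GNU parallel ({/} is parallel's basename replacement string; the job count is an assumption to be tuned):

#!/bin/bash
# Launch one rsync per top-level directory, a few at a time.
# SRC and DEST match the paths above; tune -j to your setup.
SRC=/mnt/Datahoarder_Mount/Mikhail
DEST=/StoragePod

find "$SRC" -mindepth 1 -maxdepth 1 -type d -print0 |
    parallel -0 --will-cite -j 4 \
        rsync -h -r -P -t {}/ "$DEST"/{/}/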