Parallelize Bash script with maximum number of processes

Depending on what you want to do, xargs can also help (here: converting documents with pdf2ps):

cpus=$( ls -d /sys/devices/system/cpu/cpu[[:digit:]]* | wc -w )

find . -name \*.pdf | xargs --max-args=1 --max-procs="$cpus" pdf2ps

From the docs:

--max-procs=max-procs
-P max-procs
       Run up to max-procs processes at a time; the default is 1.
       If max-procs is 0, xargs will run as many processes as possible at a
       time.  Use the -n option with -P; otherwise chances are that only one
       exec will be done.
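
On GNU systems, nproc reports the CPU count more directly, and NUL-separated filenames make the pipeline safe for names containing spaces (a minimal variant of the example above, assuming GNU coreutils and findutils):

cpus=$(nproc)                      # number of available processing units

# -print0 / -0 keep filenames with spaces or newlines intact
find . -name '*.pdf' -print0 | xargs -0 --max-args=1 --max-procs="$cpus" pdf2ps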

With GNU Parallel (http://www.gnu.org/software/parallel/) you can write:

some-command | parallel do-something
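
For example, the pdf2ps conversion from above becomes (a sketch; GNU Parallel runs one job per CPU core by default, and -print0/-0 keep unusual filenames intact):

find . -name '*.pdf' -print0 | parallel -0 pdf2ps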

GNU Parallel also supports running jobs on remote computers. This runs one job per CPU core on each remote computer, even if the machines have different numbers of cores:

some-command | parallel -S server1,server2 do-something

A more advanced example: here we have a list of files that we want my_script to run on. Each file has an extension (say, .jpeg). We want the output of my_script to be put next to each file in basename.out (e.g. foo.jpeg -> foo.out). We want to run my_script once per CPU core, on the local computer as well as the remote ones. For the remote computers, the file to be processed is transferred to the given computer; when my_script finishes, foo.out is transferred back, and foo.jpeg and foo.out are then removed from the remote computer:

cat list_of_files | \
parallel --trc {.}.out -S server1,server2,: \
"my_script {} > {.}.out"

Here --trc file is shorthand for --transfer --return file --cleanup, {.} is the input line with its extension removed, and the trailing : in the -S list adds the local computer to the set of servers.

GNU Parallel makes sure the output from each job does not mix, so you can use the output as input for another program:

some-command | parallel do-something | postprocess
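
For example (hypothetical .log files, with wc -l standing in for real per-file processing):

# each job's output arrives unmixed, so the combined stream sorts cleanly
find . -name '*.log' | parallel 'wc -l {}' | sort -rn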

See the GNU Parallel videos for more examples: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
