Executing piped commands in parallel

A problem with split --filter is that the output can get mixed up, so that you get half a line from process 1 followed by half a line from process 2.

GNU Parallel guarantees there will be no mixup.

So assume you want to do:

 A | B | C

But B is terribly slow, so you want to parallelize it. Then you can do:

A | parallel --pipe B | C

By default, GNU Parallel splits the input on \n into blocks of about 1 MB. This can be adjusted with --recend and --block.
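As a rough illustration of adjusting the block size, the sketch below uses seq and wc -l as stand-ins for A and B (they are assumptions for the example, not part of the original pipeline). It passes --block 10k so each job gets roughly 10 kB of input, and -k (--keep-order) so the results come out in the order the blocks were read. The guard skips the example when GNU Parallel is not installed.

```shell
# Stand-ins: seq plays A, "wc -l" plays B. Requires GNU Parallel.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # --block 10k: each job receives about 10 kB, cut at record (line) boundaries.
  # -k: print the jobs' output in the order the blocks were read.
  # awk adds up the per-block line counts printed by the wc jobs.
  total=$(seq 1 100000 | parallel --pipe --block 10k -k wc -l |
          awk '{ s += $1 } END { print s }')
  echo "lines counted across all blocks: $total"
fi
```

Because every block ends on a line boundary, the per-block counts should add up to the number of input lines.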

You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/

You can install GNU Parallel in just 10 seconds with:

$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
   fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 67bd7bc7dc20aff99eb8f1266574dadb
12345678 67bd7bc7 dc20aff9 9eb8f126 6574dadb
$ md5sum install.sh | grep b7a15cdbb07fb6e11b0338577bc1780f
b7a15cdb b07fb6e1 1b033857 7bc1780f
$ sha512sum install.sh | grep 186000b62b66969d7506ca4f885e0c80e02a22444
6f25960b d4b90cf6 ba5b76de c1acdf39 f3d24249 72930394 a4164351 93a7668d
21ff9839 6f920be5 186000b6 2b66969d 7506ca4f 885e0c80 e02a2244 40e8a43f
$ bash install.sh

Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1


When you write A | B, both processes already run in parallel. If you see them using only one core, that's probably either because of CPU affinity settings (on Linux, taskset can start a process with a different affinity) or because one process isn't enough to occupy a whole core, and the system "prefers" not to spread the computation out.
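A quick way to convince yourself that both sides of a pipe run at the same time is to make each side sleep and time the whole pipeline. This is only a sketch; the two brace groups are hypothetical stand-ins for A and B.

```shell
# Both sides of the pipe sleep for 2 seconds. Run one after the other they
# would need about 4 seconds; run concurrently they need about 2.
start=$(date +%s)
{ sleep 2; echo "from A"; } | { sleep 2; cat; }
end=$(date +%s)
elapsed=$((end - start))
echo "elapsed: ${elapsed}s"
```

The elapsed time should be close to 2 seconds rather than 4, because the shell starts both commands of the pipeline at once.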

To run several B's with one A, you need a tool such as split with the --filter option:

A | split [OPTIONS] --filter="B"

This, however, is liable to mess up the order of lines in the output, because the B jobs won't all run at the same speed. If this is an issue, you might need to redirect the output of the i-th B to an intermediate file and stitch the files together at the end using cat. This, in turn, may require considerable disk space.
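The intermediate-file approach could look roughly like the sketch below. It assumes GNU coreutils split, which exports $FILE (the name the chunk would have had) to each filter. seq and tr are stand-ins for A and B, and the workdir and part. names are made up for the example.

```shell
# Stand-ins: seq plays A, tr plays B. Requires GNU split (for --filter).
workdir=$(mktemp -d)

# Single quotes keep the calling shell from expanding $FILE; split sets it
# to .../part.aa, .../part.ab, ... separately for each worker.
seq 1 1000 | split -n r/4 -u --filter='tr 0-9 a-j > "$FILE.out"' - "$workdir/part."

# Stitch the per-worker files together; the glob expands in lexical order.
cat "$workdir"/part.*.out > "$workdir/result.txt"
wc -l < "$workdir/result.txt"
```

With round-robin distribution, the stitched file is grouped by worker (lines 1, 5, 9, ... first, then 2, 6, 10, ... and so on) rather than restored to the input order, but no line is ever cut in half.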

Other options exist (e.g. you could limit each instance of B to a single line-buffered output, wait until a whole "round" of B's has finished, run the equivalent of a reduce to split's map, and cat the temporary output together), with varying levels of efficiency. The "round" option just described, for example, has to wait for the slowest instance of B to finish, so it depends heavily on how much buffering is available to B; [m]buffer might help, or it might not, depending on what the operations are.

Examples

Generate the first 1000 numbers and count the lines in parallel:

seq 1 1000 | split -n r/10 -u --filter="wc -l"
100
100
100
100
100
100
100
100
100
100

If we were to "mark" the lines, we'd see that they are dealt out round-robin: lines 1, 11, 21, ... go to process #1, lines 5, 15, 25, ... go to process #5, and so on. Moreover, in the time it takes split to spawn the second process, the first is already a good way into its quota:

seq 1 1000 | split -n r/10 -u --filter="sed -e 's/^/$RANDOM - /g'" | head -n 10
19190 - 1
19190 - 11
19190 - 21
19190 - 31
19190 - 41
19190 - 51
19190 - 61
19190 - 71
19190 - 81
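Note that $RANDOM inside double quotes is expanded once by the calling shell, so every worker would print the same prefix. To tag each line with the worker that actually handled it, one option is to use $FILE, which GNU split exports to each filter. The sketch below assumes GNU coreutils; the w. prefix is an arbitrary name chosen for the example, and no files are created because --filter replaces the output files.

```shell
# Single quotes delay expansion, so each worker's own shell substitutes
# its own $FILE (w.aa, w.ab, w.ac, w.ad) into the sed expression.
# Sorting on the original number makes the (otherwise racy) output stable.
marked=$(seq 1 12 | split -n r/4 -u --filter='sed "s/^/$FILE /"' - w. | sort -k2,2n)
printf '%s\n' "$marked"
```

After sorting, the tags cycle through the four workers: line 1 carries w.aa, line 2 w.ab, line 5 w.aa again, and so on.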

When executing on a 2-core machine, seq, split and the wc processes share the cores; but looking closer, the system leaves the first two processes on CPU0, and divides CPU1 among the worker processes:

%Cpu0  : 47.2 us, 13.7 sy,  0.0 ni, 38.1 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 15.8 us, 82.9 sy,  0.0 ni,  1.0 id,  0.0 wa,  0.3 hi,  0.0 si,  0.0 st
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 5314 lserni    20   0  4516  568  476 R 23.9  0.0   0:03.30 seq
 5315 lserni    20   0  4580  720  608 R 52.5  0.0   0:07.32 split
 5317 lserni    20   0  4520  576  484 S 13.3  0.0   0:01.86 wc
 5318 lserni    20   0  4520  572  484 S 14.0  0.0   0:01.88 wc
 5319 lserni    20   0  4520  576  484 S 13.6  0.0   0:01.88 wc
 5320 lserni    20   0  4520  576  484 S 13.3  0.0   0:01.85 wc
 5321 lserni    20   0  4520  572  484 S 13.3  0.0   0:01.84 wc
 5322 lserni    20   0  4520  576  484 S 13.3  0.0   0:01.86 wc
 5323 lserni    20   0  4520  576  484 S 13.3  0.0   0:01.86 wc
 5324 lserni    20   0  4520  576  484 S 13.3  0.0   0:01.87 wc

Notice especially that split is eating a considerable amount of CPU. This will decrease in proportion to A's needs; i.e., if A is a heavier process than seq, the relative overhead of split will decrease. But if A is a very lightweight process and B is quite fast (so that you need no more than 2-3 B's to keep up with A), then parallelizing with split (or pipes in general) might well not be worth it.