using parallel to process unique input files to unique output files

GNU Parallel is designed for this kind of task:

parallel customScript -c 33 -I -file {} -a -v 55 '>' {.}.output ::: *.input

or:

ls | parallel customScript -c 33 -I -file {} -a -v 55 '>' {.}.output

It will run one job per CPU core.
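
If you want a different degree of parallelism, the -j option sets the number of simultaneous jobs (here wc -l just stands in for your real command):

parallel -j 4 'wc -l {} > {.}.count' ::: *.input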

You can install GNU Parallel simply by:

wget https://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
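
A quick sanity check that the download works:

./parallel --version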

Watch the intro videos for GNU Parallel to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1


The standard way to do this is to set up a queue and spawn any number of workers that know how to pull something from the queue and process it. You can use a fifo (aka named pipe) for communication between these processes.

Below is a naive example to demonstrate the concept.

A simple queue script:

#!/bin/sh
mkfifo /tmp/location-queue
for i in inputfiles/*; do
  # each write blocks until a worker reads one filename off the fifo
  echo "$i" > /tmp/location-queue
done
rm /tmp/location-queue

And a worker:

#!/bin/sh
# the fifo is reopened for each item; once the queue script removes it,
# the read fails and the loop ends
while read file < /tmp/location-queue; do
  process_file "$file"
done

process_file could be defined somewhere in your worker, and it can do whatever you need it to do.
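
For example, a minimal process_file might look like this (gzip is just a placeholder for whatever your real processing step is, and outputfiles/ is an assumed destination directory):

process_file() {
  # placeholder work: compress each input into the assumed outputfiles/ directory
  gzip -c "$1" > "outputfiles/$(basename "$1").gz"
}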

Once you have those two pieces, you can have a simple monitor that starts up the queue process and any number of worker processes.

Monitor script:

#!/bin/sh
queue.sh &
num_workers="$1"
i=0
while [ "$i" -lt "$num_workers" ]; do
  worker.sh &
  echo $! >> /tmp/worker.pids
  i=$((i+1))
done
monitor_workers
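
monitor_workers is not defined above; a minimal sketch, assuming /tmp/worker.pids starts empty and all you want is to wait for the recorded workers and clean up afterwards:

monitor_workers() {
  # wait for each recorded worker to exit, then remove the pid file
  while read pid; do
    wait "$pid" 2>/dev/null
  done < /tmp/worker.pids
  rm -f /tmp/worker.pids
}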

There you have it. If you actually do this, it's better to set up the fifo in the monitor and pass its path to both the queue and the workers, so they aren't coupled to a specific location for the fifo. I set it up this way in the answer specifically so that it's clear what you're using as you read it.
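
A sketch of that decoupled setup, assuming queue.sh and worker.sh are changed to take the fifo path as their first argument:

#!/bin/sh
# monitor.sh: create the fifo in a temporary location and hand it to the others
fifo="$(mktemp -u /tmp/location-queue.XXXXXX)"
mkfifo "$fifo"
queue.sh "$fifo" &
worker.sh "$fifo" &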


Another example:

ls *.txt | parallel 'sort {} > {.}.sorted.txt'

I found the other examples unnecessarily complex; in most cases, the above is probably what you were searching for.