Parallelising GIS operations in PyQGIS?

If you change your program to read the file name from the command line and split your input file into smaller chunks, you can do something like this using GNU Parallel:

parallel my_processing.py {} /path/to/polygon_file.shp ::: input_files*.shp

This will run 1 job per core.
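
For that to work, my_processing.py has to take its inputs as arguments. A minimal sketch of such a script, assuming the first argument is the input chunk and the second is the polygon file. The function names and the output naming scheme are made up for illustration, and the actual GIS processing is left as a placeholder:

```python
import os
import sys


def output_path_for(input_file):
    """Derive a per-chunk output name (hypothetical naming scheme)."""
    base, ext = os.path.splitext(input_file)
    return base + "_processed" + ext


def process(input_file, polygon_file):
    """Placeholder for the real work; returns the path it would write to."""
    # The actual PyQGIS/OGR processing of input_file against polygon_file
    # would go here.
    return output_path_for(input_file)


def main(argv):
    """Expect exactly two arguments: the input chunk and the polygon file."""
    if len(argv) != 2:
        sys.stderr.write("usage: my_processing.py INPUT_FILE POLYGON_FILE\n")
        return
    print(process(argv[0], argv[1]))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Because every job derives its own output name from its input chunk, jobs running at the same time never write to the same file.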

All new computers have multiple cores, but most programs are serial in nature and will therefore not use the multiple cores. However, many tasks are extremely parallelizable:

  • Run the same program on many files
  • Run the same program for every line in a file
  • Run the same program for every block in a file

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have SSH access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

[Figure: simple scheduling]

GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:

[Figure: GNU Parallel scheduling]
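
The difference between the two strategies can be illustrated with a toy simulation. This is only a sketch: the job durations are invented, and the dynamic variant mimics the behaviour described above rather than GNU Parallel's actual implementation.

```python
import heapq


def static_makespan(durations, workers):
    """Give each worker one fixed batch of consecutive jobs up front."""
    per_worker = -(-len(durations) // workers)  # ceiling division
    batches = [durations[i:i + per_worker]
               for i in range(0, len(durations), per_worker)]
    # The run is over when the slowest batch is done.
    return max(sum(batch) for batch in batches)


def dynamic_makespan(durations, workers):
    """Start the next job on whichever worker becomes free first."""
    free_at = [0] * workers  # time at which each worker is next idle
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# 32 made-up jobs: eight slow ones (8 time units) and 24 quick ones (1 unit).
durations = [8] * 8 + [1] * 24
print(static_makespan(durations, 4))
print(dynamic_makespan(durations, 4))
```

With these invented durations, the fixed split takes 64 time units because one CPU receives all eight slow jobs, while starting a new job whenever a CPU is free finishes in 22.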

Installation

If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel


Rather than using GNU Parallel, you could use Python's multiprocessing module to create a pool of worker processes and hand the tasks to them. I don't have access to a QGIS setup to test it on, but multiprocessing was added in Python 2.6, so provided that you are using 2.6 or later it should be available. There are a lot of examples online on using this module.
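
As an untested sketch of that idea, applied to a tile-clipping task like the one discussed on this page: the pool below only launches external ogr2ogr processes, so the workers never touch the QGIS API. The file layout (tile_N.shp, output_N.shp) and the function names are assumptions for illustration, and whether multiprocessing behaves inside QGIS's embedded Python has not been checked here.

```python
import os
import subprocess
from multiprocessing import Pool


def build_command(index, input_file, folder):
    """Build the ogr2ogr call that clips input_file with tile number index."""
    return ["ogr2ogr", "-skipfailures",
            "-clipsrc", os.path.join(folder, "tile_%d.shp" % index),
            os.path.join(folder, "output_%d.shp" % index),
            input_file]


def run_command(command):
    """Run one command and return its exit code (0 means success)."""
    return subprocess.call(command)


def clip_all(input_file, folder, tile_count, processes=None):
    """Run one ogr2ogr job per tile; processes=None means one worker per core."""
    commands = [build_command(i, input_file, folder)
                for i in range(1, tile_count + 1)]
    pool = Pool(processes)
    try:
        # map() hands the next command to whichever worker is free.
        return pool.map(run_command, commands)
    finally:
        pool.close()
        pool.join()
```

On platforms that start workers by spawning a fresh interpreter (Windows, for example), clip_all should be called from inside an if __name__ == "__main__": guard so that the workers do not re-run the calling script.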


Here is the GNU Parallel solution. With some care, most embarrassingly parallel Linux-based OGR or SAGA algorithms could be made to run with it inside your QGIS installation.

Obviously this solution requires GNU Parallel to be installed. To install it on Ubuntu, for example, go to your terminal and type:

sudo apt-get -y install parallel

NB: I couldn't get the parallel shell command to work in Popen or subprocess, which I would have preferred, so I hacked together an export to a bash script and ran that with Popen instead.

Here is the specific shell command using parallel that I wrapped in Python:

parallel ogr2ogr -skipfailures -clipsrc tile_{1}.shp output_{1}.shp input.shp ::: {1..400}

Each {1} gets swapped out for a number from the {1..400} range, and GNU Parallel then manages the four hundred shell commands so that they concurrently use all the cores of my i7 :).

Here is the actual Python code I wrote to solve the example problem I posted. One could paste it in directly after the end of the code in the question.

import os
import stat
from subprocess import Popen, PIPE
feature_count=tile_layer.dataProvider().featureCount()
# GNU Parallel replaces {1} with each value listed after the ::: separator
subprocess_args=["parallel",
    "ogr2ogr","-skipfailures","-clipsrc",
    os.path.join(output_folder,"tile_{1}.shp"),
    os.path.join(output_folder,"output_{1}.shp"),
    input_file,
    ":::","{1.."+str(feature_count)+"}"]
#Hacky part where I write the shell command to a script file
temp_script=os.path.join(output_folder,"parallelclip.sh")
with open(temp_script,'w') as f:
    f.write("#!/bin/bash\n")
    f.write(" ".join(subprocess_args)+'\n')
st = os.stat(temp_script)
os.chmod(temp_script, st.st_mode | stat.S_IEXEC)
#End of hacky bash script export
p = Popen([os.path.join(output_folder,"parallelclip.sh")],\
stdin=PIPE, stdout=PIPE, stderr=PIPE)
#Below is the commented out Popen line I couldn't get to work
#p = Popen(subprocess_args, stdin=PIPE, stdout=PIPE, stderr=PIPE)
output, err = p.communicate(b"input data that is passed to subprocess' stdin")
rc = p.returncode
# The parenthesised form works in both Python 2 and Python 3
print(output)
print(err)

#Delete script and old clip files
os.remove(os.path.join(output_folder,"parallelclip.sh"))
for i in range(feature_count):
    delete_file = os.path.join(output_folder,"tile_"+str(i+1)+".shp")
    nosuff=os.path.splitext(delete_file)[0]
    # Remove every sidecar file that may accompany the tile shapefile
    for suffix in ['.shx','.dbf','.qpj','.prj','.shp','.cpg']:
        try:
            os.remove(nosuff+suffix)
        except OSError:
            # Not every tile has every sidecar file; skip the missing ones
            pass
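
A plausible reason the commented-out Popen call failed (not verified on the original setup): {1..N} is expanded by bash, not by GNU Parallel, so without a shell the braces arrive as one literal argument, and the separator has to be passed as exactly ::: with no surrounding spaces. A sketch that expands the range in Python instead, so that no script file is needed. The function name is made up for illustration:

```python
import os


def build_parallel_args(input_file, output_folder, feature_count):
    """Build an argument list that can be handed to Popen without a shell."""
    args = ["parallel",
            "ogr2ogr", "-skipfailures", "-clipsrc",
            os.path.join(output_folder, "tile_{1}.shp"),
            os.path.join(output_folder, "output_{1}.shp"),
            input_file,
            ":::"]  # must be its own argument, with no padding spaces
    # Do in Python what bash would do for {1..feature_count}
    args.extend(str(i) for i in range(1, feature_count + 1))
    return args


# Possible use in place of the script export (untested):
# p = Popen(build_parallel_args(input_file, output_folder, feature_count),
#           stdout=PIPE, stderr=PIPE)
```

Passing a list also sidesteps shell quoting, which matters if the output folder or input path contains spaces.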

Let me tell you, it's really something when you see all the cores fire up to full noise :). Special thanks to Ole and the team that built GNU Parallel.

It would be nice to have a cross-platform solution, and it would be nice if I could have figured out the multiprocessing module for the Python embedded in QGIS, but alas it was not to be.

Regardless, this solution will serve me, and maybe you, nicely.