Understanding Parallelize methods

Suppose that you are using ParallelMap on a list of 16 elements, such as Range[16], and that you have 4 subkernels.
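
If you want to follow along, here is a minimal sketch for launching and checking the subkernels (assuming none are running yet):

LaunchKernels[4];  (* launch 4 subkernels *)
$KernelCount
(* 4 *)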

Items from the list are sent to subkernels in groups. CoarsestGrained means that the group sizes are maximized. Thus, the groups would be sent as

  • {1,2,3,4} to kernel 1
  • {5,6,7,8} to kernel 2
  • {9,10,11,12} to kernel 3
  • {13,14,15,16} to kernel 4

Now suppose that processing small numbers (1,2,...) is very fast, but processing larger numbers (..., 15, 16) is very slow. Then kernel 1 (which has all the quick computations) would finish quickly, and kernel 4 (which has all the slow ones) would run for a long time. The time during which all 4 kernels are simultaneously active would be short.

The solution is to minimize the group sizes, i.e. use FinestGrained. First, we send

  • {1} to kernel 1
  • {2} to kernel 2
  • {3} to kernel 3
  • {4} to kernel 4

Then, as soon as a kernel finishes its initially assigned group, we send it the next group, i.e. {5}, and so on.

With FinestGrained, all available kernels will be working in parallel for a longer duration.

Why would you ever want to use CoarsestGrained then? Because in this example CoarsestGrained requires only 4 communications between the main kernel and the subkernels, while FinestGrained requires 16. Each communication carries an overhead. If this overhead is comparable to (or larger than) the time needed to process a single element of the list, it will significantly increase the total computation time.
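
To get a feel for the size of this overhead, one rough way to estimate it is to time repeated round trips to a single subkernel (each ParallelEvaluate call is one communication) and divide by the number of repetitions; the exact value depends on your machine:

(* crude per-communication overhead estimate *)
First@AbsoluteTiming[Do[ParallelEvaluate[Null, First[Kernels[]]], {100}]]/100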

To strike a good balance, we can also set the group sizes explicitly, using Method -> "ItemsPerEvaluation" -> groupSize.
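
For example, a group size of 250 splits a 10000-element list into 40 groups (the value 250 is an arbitrary choice for illustration, not a tuned recommendation):

ParallelMap[#^2 &, RandomReal[1, 10000], 
   Method -> "ItemsPerEvaluation" -> 250]; // AbsoluteTiming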


Examples

I ran the following examples on a 4-core CPU with 4 subkernels.

When to use FinestGrained

When each item takes relatively long to evaluate (much longer than the communication overhead), and some items take much longer to evaluate than others, then it makes sense to use small group sizes. The following example will also annotate each item with the ID of the kernel on which it was processed.

ParallelMap[(Pause[#]; Labeled[Framed[#], $KernelID]) &, Range[16], 
  Method -> "CoarsestGrained"] // AbsoluteTiming

(output: the framed items, each labeled with the ID of the subkernel that processed it, plus the total timing)

ParallelMap[(Pause[#]; Labeled[Framed[#], $KernelID]) &, Range[16], 
  Method -> "FinestGrained"] // AbsoluteTiming

(output: the framed items, each labeled with the ID of the subkernel that processed it, plus the total timing)

When to use CoarsestGrained

When each evaluation is very fast, taking less time than the communication overhead, then it makes sense to use large group sizes.

The following example is quite extreme, because squaring a number takes much, much less time than sending an expression to a subkernel.

array = RandomReal[1, 10000];

ParallelMap[#^2 &, array, 
   Method -> "CoarsestGrained"]; // AbsoluteTiming
(* {0.016309, Null} *)

ParallelMap[#^2 &, array, 
   Method -> "FinestGrained"]; // AbsoluteTiming
(* {4.85315, Null} *)

As far as I understand it, Method -> "CoarsestGrained" divides the overall task into a small number of jobs (e.g. one job per parallel kernel). The data required by each kernel is sent at the beginning, and the kernels can then work independently of each other (at least in an optimal configuration). At the end, the parallel kernels send their results to the main kernel, which assembles the output. This strategy minimizes communication overhead during the computation, but if the jobs need different amounts of time (this can happen, e.g., when you compute different integrals whose local refinements stop after a varying number of iterations), it can waste time: if all other jobs are finished and only the last job is still running on a single kernel/core, the main kernel (and thus the user) has to wait for that last parallel kernel to finish.
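
A quick way to see this block dispatch is to record the kernel ID for each item; with CoarsestGrained each subkernel reports the same ID for a whole contiguous block (the particular IDs and their order may differ on your machine):

ParallelTable[$KernelID, {i, 16}, Method -> "CoarsestGrained"]
(* e.g. {1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4} *)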

In contrast, Method -> "FinestGrained" probably uses something like a queue: many small jobs are created (e.g., one for each iterator value in a ParallelDo loop). The jobs are then distributed one by one in queue order; when a parallel kernel has finished a job, it asks the main kernel for a new one, is sent the required data, starts the new job, and so on, until all jobs have been finished. The problem here is that a lot of communication happens, which also creates a lot of potential for delay.
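
The same check with FinestGrained shows items being handed out one at a time, in whatever order the kernels come back to ask for work (the exact pattern varies from run to run):

ParallelTable[$KernelID, {i, 16}, Method -> "FinestGrained"]
(* e.g. {1, 2, 3, 4, 1, 2, 3, 4, 2, 1, ...} -- the interleaving is not deterministic *)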

Which strategy works best really depends on the problem at hand.