Effective parallel processing of large items

Serialization of the data to a ByteArray object seems to overcome the data transfer bottleneck. The necessary functions, BinarySerialize and BinaryDeserialize, were introduced in version 11.1.
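As a quick illustration (not part of the benchmark), BinarySerialize turns an expression into a compact ByteArray and BinaryDeserialize restores it exactly:

data = RandomReal[1, 5];
ba = BinarySerialize[data]      (* a ByteArray object *)
BinaryDeserialize[ba] === data  (* True *)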

Here is a simple function implementing a ParallelMap variant that serializes each item before the transfer to the subkernels and has the subkernels deserialize it before processing:

ParallelMapSerialized[f_, data_, opts___] := ParallelMap[
  f[BinaryDeserialize@#] &,
  BinarySerialize /@ data,
  opts
]
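To make the benchmark below self-contained, the test data could be set up along these lines; the exact difficulty value and data dimensions of the original benchmark are assumptions here:

LaunchKernels[];
difficulty = 1000;  (* assumed; the original benchmark's setting is not shown *)
randomValues = RandomReal[1, {100, difficulty, 2}];  (* 100 lists of 2D points for FindCurvePath *)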

Running the benchmark again:

map = Map[
    FindCurvePath[#[[1 ;; difficulty]]] &,
    randomValues
    ]; // AbsoluteTiming

(* {9.60715, Null} *)

pmap = ParallelMap[
    FindCurvePath[#[[1 ;; difficulty]]] &,
    randomValues,
    Method -> "ItemsPerEvaluation" -> 10
    ]; // AbsoluteTiming

(* {17.5937, Null} *)

pmapserialized = ParallelMapSerialized[
    FindCurvePath[#[[1 ;; difficulty]]] &,
    randomValues,
    Method -> "ItemsPerEvaluation" -> 10
    ]; // AbsoluteTiming

(* {1.85387, Null} *)

map === pmap === pmapserialized
(* True *)

Serialization led to almost a 10-fold speedup over plain ParallelMap (17.6 s vs. 1.9 s) and about a 5-fold speedup over serial Map (9.6 s vs. 1.9 s).


Sometimes it also helps to copy the shared data into a local variable on the subkernel first, so that only the indexed part is touched inside the parallel function.

pmap = ParallelMap[
    FindCurvePath[#[[1 ;; difficulty]]] &,
    randomValues
    ]; // AbsoluteTiming

(* {3.51073, Null} *)

index = Range[Length[randomValues]];
pmap3 = ParallelMap[
    Module[{r = randomValues[[#]]},
      FindCurvePath[r[[1 ;; difficulty]]]] &,
    index
    ]; // AbsoluteTiming

(* {1.13677, Null} *)

In this case it is enough to do the part extraction inside the parallel function, indexing into the shared data directly, rather than mapping over the data itself:

index = Range[Length[randomValues]];
pmap4 = ParallelMap[
    FindCurvePath[randomValues[[#, 1 ;; difficulty]]] &,
    index]; // AbsoluteTiming

(* {1.13, Null} *)