Could the "reduce" function be parallelized in Functional Programming?

One of the most common reductions is the sum of all elements.

((a+b) + c) + d == (a + b) + (c+d)  # associative
a+b == b+a                          # commutative

That equality works for integers, so you can change the order of operations from one long dependency chain to multiple shorter dependency chains, allowing multithreading and SIMD parallelism.

It's also true for mathematical real numbers, but not for floating point numbers. In many cases, catastrophic cancellation is not expected, so the final result will be close enough to be worth the massive performance gain. For C/C++ compilers, this is one of the optimizations enabled by the -ffast-math option. (There's a -fassociative-math option for just this part of -ffast-math, without the assumptions about lack of infinities and NaNs.)

It's hard to get much SIMD speedup if one wide load can't scoop up multiple useful values. Intel's AVX2 added "gathered" loads, but the overhead is very high. With Haswell, it's typically faster to just use scalar code, but later microarchitectures do have faster gathers. So SIMD reduction is much more effective on arrays, or other data that is stored contiguously.

Modern SIMD hardware works by loading 2 consecutive double-precision floats into a vector register (for example, with 16B vectors like x86's sse). There is a packed-FP-add instruction that adds the corresponding elements of two vectors. So-called "vertical" vector operations (where the same operation happens between corresponding elements in two vectors) are much cheaper than "horizontal" operations (adding the two doubles in one vector to each other).

So at the asm level, you have a loop that sums all the even-numbered elements into one half of a vector accumulator, and all the odd-numbered elements into the other half. Then one horizontal operation at the end combines them. So even without multithreading, using SIMD requires associative operations (or at least, close enough to associative, like floating point usually is). If there's an approximate pattern in your input, like +1.001, -0.999, the cancellation errors from adding one big positive to one big negative number could be much worse than if each cancellation had happened separately.

With wider vectors, or narrower elements, a vector accumulator will hold more elements, increasing the benefit of SIMD.

Modern hardware has pipelined execution units that can sustain one (or sometimes two) FP vector-adds per clock, but the result of each one isn't ready for 5 cycles. Saturating the hardware's throughput capabilities requires using multiple accumulators in the loop, so there are 5 or 10 separate loop-carried dependency chains. (To be concrete, Intel Skylake does vector-FP multiply, add, or FMA (fused multiply-add) with 4c latency and one per 0.5c throughput. 4c/0.5c = 8 FP additions in flight at once to saturate Skylake's FP math unit. Each operation can be a 32B vector of eight single-precision floats, four double-precision floats, a 16B vector, or a scalar. (Keeping multiple operations in flight can speed up scalar stuff, too, but if there's any data-level parallelism available, you can probably vectorize it as well as use multiple accumulators.) See http://agner.org/optimize/ for x86 instruction timings, pipeline descriptions, and asm optimization stuff. But note that everything here applies to ARM with NEON, PPC Altivec, and other SIMD architectures. They all have vector registers and similar vector instructions.

For a concrete example, here's how gcc 5.3 auto-vectorizes a FP sum reduction. It only uses a single accumulator, so it's missing out on a factor of 8 throughput for Skylake. clang is a bit more clever, and uses two accumulators, but not as many as the loop unroll factor to get 1/4 of Skylake's max throughput. Note that if you take out -ffast-math from the compile options, the FP loop uses addss (add scalar single) rather than addps (add packed single). The integer loop still auto-vectorizes, because integer math is associative.

In practice, memory bandwidth is the limiting factor most of the time. Haswell and later Intel CPUs can sustain two 32B loads per cycle from L1 cache. In theory, they could sustain that from L2 cache. The shared L3 cache is another story: it's a lot faster than main memory, but its bandwidth is shared by all cores. This makes cache-blocking (aka loop tiling) for L1 or L2 a very important optimization when it can be done cheaply, when working with more than 256k of data. Rather than producing and then reducing 10MiB of data, produce in 128k chunks and reduce them while they're still in L2 cache instead of the producer having to push them to main memory and the reducer having to bring them back in. When working in a higher level language, your best bet may be to hope that the implementation does this for you. This is what you ideally want to happen in terms of what the CPU actually does, though.

Note that all the SIMD speedup stuff applies within a single thread operating on a contiguous chunk of memory. You (or the compiler for your functional language!) can and should use both techniques, to have multiple threads each saturating the execution units on the core they're running on.

Sorry for the lack of functional-programming in this answer. You may have guessed that I saw this question because of the SIMD tag. :P

I'm not going to try to generalize from addition to other operations. IDK what kind of stuff you functional-programming guys get up to with reductions, but addition or compare (find min/max, count matches) are the ones that get used as SIMD-optimization examples.

There are some compilers for functional programming languages that parallelize the reduce and map functions. This is an example from the Futhark programming language, which compiles into parallel CUDA and OpenCL source code:

let main (x: []i32) (y: []i32): i32 =
  reduce (+) 0 (map2 (*) x y)

It may be possible to write a compiler that would translate a subset of Haskell into Futhark, though this hasn't been done yet. The Futhark language does not allow recursive functions, but they may be implemented in a future version of the language.

Could the "reduce" function be parallelized in Functional Programming?

Tags:

Functional Programming

Simd

Parallel Processing

Related

Recent Posts